On Wed, Sep 20, 2023 at 1:08 AM Suren Baghdasaryan <surenb@google.com> wrote:
On Thu, Sep 14, 2023 at 7:28 PM Jann Horn <jannh@google.com> wrote:
On Thu, Sep 14, 2023 at 5:26 PM Suren Baghdasaryan <surenb@google.com> wrote:
From: Andrea Arcangeli <aarcange@redhat.com>
This implements the uABI of UFFDIO_REMAP.
Notably, one mode bitflag is also forwarded to (and in turn known by) the low-level remap_pages method.
[...]
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
[...]
+int remap_pages_huge_pmd(struct mm_struct *dst_mm,
			 struct mm_struct *src_mm,
			 pmd_t *dst_pmd, pmd_t *src_pmd,
			 pmd_t dst_pmdval,
			 struct vm_area_struct *dst_vma,
			 struct vm_area_struct *src_vma,
			 unsigned long dst_addr,
			 unsigned long src_addr)
+{
pmd_t _dst_pmd, src_pmdval;
struct page *src_page;
struct anon_vma *src_anon_vma, *dst_anon_vma;
spinlock_t *src_ptl, *dst_ptl;
pgtable_t pgtable;
struct mmu_notifier_range range;
src_pmdval = *src_pmd;
src_ptl = pmd_lockptr(src_mm, src_pmd);
BUG_ON(!pmd_trans_huge(src_pmdval));
BUG_ON(!pmd_none(dst_pmdval));
Why can we assert that pmd_none(dst_pmdval) is true here? Can we not have concurrent faults (or userfaultfd operations) populating that PMD?
IIUC dst_pmdval is a copy of the value from dst_pmd, so that local copy should not change even if some concurrent operation changes dst_pmd. We can assert that it's pmd_none because we checked for that before calling remap_pages_huge_pmd. Later on we check if dst_pmd changed from under us (see pmd_same(*dst_pmd, dst_pmdval) check) and retry if that happened.
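For reference, a minimal sketch of the caller-side pattern I'm describing (hypothetical and simplified, not the exact code in remap_pages()):

	pmd_t dst_pmdval = pmdp_get_lockless(dst_pmd);

	/* Only call into the huge-PMD path when the destination is empty. */
	if (!pmd_none(dst_pmdval))
		return -EEXIST;	/* or split/handle the populated destination */

	/*
	 * dst_pmdval is a local snapshot; remap_pages_huge_pmd() re-checks
	 * pmd_same(*dst_pmd, dst_pmdval) under the PT locks and returns
	 * -EAGAIN so we can retry if the PMD changed concurrently.
	 */
	err = remap_pages_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
				   dst_pmdval, dst_vma, src_vma,
				   dst_addr, src_addr);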
Oh, right, I don't know what I was thinking when I typed that.
But now I wonder about the check directly above that: What does this code do for swap PMDs? It looks like that might splat on the BUG_ON(!pmd_trans_huge(src_pmdval)). All we've checked on the path to here is that the virtual memory area is aligned, that the destination PMD is empty, and that pmd_trans_huge_lock() succeeded; but pmd_trans_huge_lock() explicitly permits swap PMDs (which are the swapped-out form of transhuge PMDs):
static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
		struct vm_area_struct *vma)
{
	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
		return __pmd_trans_huge_lock(pmd, vma);
	else
		return NULL;
}
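One way to make that robust (just a sketch, not necessarily how the patch should handle it) would be to replace the BUG_ON() with a graceful bail-out, since the PT lock is already held at this point:

	/*
	 * Sketch: pmd_trans_huge_lock() in the caller also succeeds for
	 * swap PMDs, so bail out and let the caller retry or fall back
	 * instead of BUG_ON() on a non-present source PMD.
	 */
	if (unlikely(!pmd_trans_huge(src_pmdval))) {
		spin_unlock(src_ptl);
		return -EAGAIN;
	}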
BUG_ON(!spin_is_locked(src_ptl));
mmap_assert_locked(src_mm);
mmap_assert_locked(dst_mm);
BUG_ON(src_addr & ~HPAGE_PMD_MASK);
BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
src_page = pmd_page(src_pmdval);
BUG_ON(!PageHead(src_page));
BUG_ON(!PageAnon(src_page));
if (unlikely(page_mapcount(src_page) != 1)) {
spin_unlock(src_ptl);
return -EBUSY;
}
get_page(src_page);
spin_unlock(src_ptl);
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_mm, src_addr,
src_addr + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);
/* block all concurrent rmap walks */
lock_page(src_page);
/*
* split_huge_page walks the anon_vma chain without the page
* lock. Serialize against it with the anon_vma lock, the page
* lock is not enough.
*/
src_anon_vma = folio_get_anon_vma(page_folio(src_page));
if (!src_anon_vma) {
unlock_page(src_page);
put_page(src_page);
mmu_notifier_invalidate_range_end(&range);
return -EAGAIN;
}
anon_vma_lock_write(src_anon_vma);
dst_ptl = pmd_lockptr(dst_mm, dst_pmd);
double_pt_lock(src_ptl, dst_ptl);
if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
!pmd_same(*dst_pmd, dst_pmdval) ||
page_mapcount(src_page) != 1)) {
double_pt_unlock(src_ptl, dst_ptl);
anon_vma_unlock_write(src_anon_vma);
put_anon_vma(src_anon_vma);
unlock_page(src_page);
put_page(src_page);
mmu_notifier_invalidate_range_end(&range);
return -EAGAIN;
}
BUG_ON(!PageHead(src_page));
BUG_ON(!PageAnon(src_page));
/* the PT lock is enough to keep the page pinned now */
put_page(src_page);
dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
WRITE_ONCE(src_page->mapping, (struct address_space *) dst_anon_vma);
WRITE_ONCE(src_page->index, linear_page_index(dst_vma, dst_addr));
if (!pmd_same(pmdp_huge_clear_flush(src_vma, src_addr, src_pmd),
src_pmdval))
BUG_ON(1);
I'm not sure we can assert that the PMDs are exactly equal; the CPU might have changed the A/D bits under us?
Yes. I wonder if I can simply remove the BUG_ON here like this:
src_pmdval = pmdp_huge_clear_flush(src_vma, src_addr, src_pmd);
Technically we don't use src_pmdval after this, but keeping the assignment would stay correct for possible future use. If the A/D bits changed from under us we will still copy correct values into dst_pmd.
And when we set up the dst_pmd, we always mark it as dirty and accessed... so I guess that's fine.
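If a sanity check is still wanted here, one option (a sketch, assuming the surrounding code stays as posted) is to compare only the PFN, which cannot legitimately change while the PT locks are held, and tolerate A/D bit updates:

	src_pmdval = pmdp_huge_clear_flush(src_vma, src_addr, src_pmd);
	/* A PFN change here would be a real bug; A/D bit changes are fine. */
	WARN_ON_ONCE(pmd_pfn(src_pmdval) != page_to_pfn(src_page));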
_dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
_dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
set_pmd_at(dst_mm, dst_addr, dst_pmd, _dst_pmd);
pgtable = pgtable_trans_huge_withdraw(src_mm, src_pmd);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
Are we allowed to move page tables between mm_structs on all architectures? The first example I found that looks a bit dodgy, looking through various architectures' pte_alloc_one(), is s390's page_table_alloc() which looks like page tables are tied to per-MM lists sometimes. If that's not allowed, we might have to allocate a new deposit table and free the old one or something like that.
Hmm. Yeah, looks like in the case of !CONFIG_PGSTE the table can be linked to mm->context.pgtable_list, so it can't be moved to another mm. I guess I'll have to keep a pgtable allocated and ready to be deposited, and free the old one. Maybe it's worth having an arch-specific function indicating whether moving a pgtable between MMs is supported? Or do it separately as an optimization. WDYT?
Hm, dunno. I guess you could have architectures opt in with some config flag similar to how flags like ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH are wired up - define it in init/Kconfig, select it in the architectures that support it, and then gate the fast version on that with #ifdef?
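Roughly like this (the config symbol name and the new_pgtable variable are made up for illustration, and the slow path is only outlined):

in init/Kconfig:

	config ARCH_SUPPORTS_PGTABLE_MOVE_BETWEEN_MMS
		bool

selected by each architecture where handing a page table to another mm is safe, and then in remap_pages_huge_pmd():

#ifdef CONFIG_ARCH_SUPPORTS_PGTABLE_MOVE_BETWEEN_MMS
	/* Fast path: reuse the deposited page table in the destination mm. */
	pgtable = pgtable_trans_huge_withdraw(src_mm, src_pmd);
	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
#else
	/*
	 * Slow path sketch: new_pgtable would have to be allocated for
	 * dst_mm before the PT locks are taken; the table withdrawn from
	 * src_mm is freed instead of being handed over.
	 */
	pgtable_trans_huge_deposit(dst_mm, dst_pmd, new_pgtable);
	pte_free(src_mm, pgtable_trans_huge_withdraw(src_mm, src_pmd));
#endif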
if (dst_mm != src_mm) {
add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
add_mm_counter(src_mm, MM_ANONPAGES, -HPAGE_PMD_NR);
}
double_pt_unlock(src_ptl, dst_ptl);
anon_vma_unlock_write(src_anon_vma);
put_anon_vma(src_anon_vma);
/* unblock rmap walks */
unlock_page(src_page);
mmu_notifier_invalidate_range_end(&range);
return 0;
+}
+#endif /* CONFIG_USERFAULTFD */
/*
 * Returns page table lock pointer if a given pmd maps a thp, NULL otherwise.
[...]
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 96d9eae5c7cc..0cca60dfa8f8 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
[...]
+ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
unsigned long dst_start, unsigned long src_start,
unsigned long len, __u64 mode)
+{
[...]
if (pgprot_val(src_vma->vm_page_prot) !=
pgprot_val(dst_vma->vm_page_prot))
goto out;
Does this check intentionally allow moving pages from a PROT_READ|PROT_WRITE anonymous private VMA into a PROT_READ anonymous private VMA (on architectures like x86 and arm64 where CoW memory has the same protection flags as read-only memory), but forbid moving them from a PROT_READ|PROT_EXEC VMA into a PROT_READ VMA? I think this check needs at least a comment to explain what's going on here.
The check is simply to ensure the VMAs have the same access permissions, so we don't end up moving pages into a VMA where they should have different permissions. The fact that x86 and arm64 have the same protection bits for R/O and COW memory is a "side-effect" IMHO. I'm not sure what comment would be good here but I'm open to suggestions.
I'm not sure if you can do a meaningful security check on the ->vm_page_prot. I also don't think it matters for anonymous VMAs. I guess if you want to keep this check but make this behavior more consistent, you could put another check in front of this that rejects VMAs where vm_flags like VM_READ, VM_WRITE, VM_SHARED or VM_EXEC are different?
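For example, something roughly like this in front of the vm_page_prot comparison (just a sketch of the idea; the exact flag set and placement would need more thought):

	/*
	 * Sketch: refuse to move pages between VMAs whose basic access
	 * flags differ, independent of how the architecture encodes
	 * them in vm_page_prot.
	 */
	if ((src_vma->vm_flags ^ dst_vma->vm_flags) &
	    (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED))
		goto out;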
[...]
/*
 * Ensure the dst_vma has an anon_vma or this page
 * would get a NULL anon_vma when moved into the
 * dst_vma.
*/
err = -ENOMEM;
if (unlikely(anon_vma_prepare(dst_vma)))
goto out;
for (src_addr = src_start, dst_addr = dst_start;
src_addr < src_start + len;) {
spinlock_t *ptl;
pmd_t dst_pmdval;
BUG_ON(dst_addr >= dst_start + len);
src_pmd = mm_find_pmd(src_mm, src_addr);
(this would blow up pretty badly if we could have transparent huge PUD in the region but I think that's limited to file VMAs so it's fine as it currently is)
Should I add a comment here as a warning, in case we decide to implement support for file-backed pages in the future?
Hm, yeah, I guess that might be a good idea.
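Maybe something like this above the lookup (wording is only a suggestion):

	/*
	 * mm_find_pmd() is only safe here because anonymous private VMAs
	 * cannot contain transparent huge PUDs. If file-backed VMAs are
	 * supported later, huge PUD handling (or an explicit check) will
	 * be needed before walking to the PMD level.
	 */
	src_pmd = mm_find_pmd(src_mm, src_addr);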