On Wed, Sep 20, 2023 at 1:08 AM Suren Baghdasaryan <surenb@google.com> wrote:
On Thu, Sep 14, 2023 at 7:28 PM Jann Horn <jannh@google.com> wrote:
On Thu, Sep 14, 2023 at 5:26 PM Suren Baghdasaryan <surenb@google.com> wrote:
From: Andrea Arcangeli <aarcange@redhat.com>
This implements the uABI of UFFDIO_REMAP.
Notably, one mode bitflag is also forwarded to (and in turn known by) the low-level remap_pages method.
[...]
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
[...]
+int remap_pages_huge_pmd(struct mm_struct *dst_mm,
			 struct mm_struct *src_mm,
			 pmd_t *dst_pmd, pmd_t *src_pmd,
			 pmd_t dst_pmdval,
			 struct vm_area_struct *dst_vma,
			 struct vm_area_struct *src_vma,
			 unsigned long dst_addr,
			 unsigned long src_addr)
+{
pmd_t _dst_pmd, src_pmdval;
struct page *src_page;
struct anon_vma *src_anon_vma, *dst_anon_vma;
spinlock_t *src_ptl, *dst_ptl;
pgtable_t pgtable;
struct mmu_notifier_range range;
src_pmdval = *src_pmd;
src_ptl = pmd_lockptr(src_mm, src_pmd);
BUG_ON(!pmd_trans_huge(src_pmdval));
BUG_ON(!pmd_none(dst_pmdval));
Why can we assert that pmd_none(dst_pmdval) is true here? Can we not have concurrent faults (or userfaultfd operations) populating that PMD?
IIUC dst_pmdval is a copy of the value from dst_pmd, so that local copy should not change even if some concurrent operation changes dst_pmd. We can assert that it's pmd_none because we checked for that before calling remap_pages_huge_pmd. Later on we check if dst_pmd changed from under us (see pmd_same(*dst_pmd, dst_pmdval) check) and retry if that happened.
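For reference, a minimal sketch of the caller-side pattern I'm describing (hypothetical and simplified, not the exact code in remap_pages()):

	pmd_t dst_pmdval = pmdp_get_lockless(dst_pmd);

	/* Only call into the huge-PMD path when the destination is empty. */
	if (!pmd_none(dst_pmdval))
		return -EEXIST;	/* or split/handle the populated destination */

	/*
	 * dst_pmdval is a local snapshot; remap_pages_huge_pmd() re-checks
	 * pmd_same(*dst_pmd, dst_pmdval) under the PT locks and returns
	 * -EAGAIN so we can retry if the PMD changed concurrently.
	 */
	err = remap_pages_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
				   dst_pmdval, dst_vma, src_vma,
				   dst_addr, src_addr);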
Oh, right, I don't know what I was thinking when I typed that.
But now I wonder about the check directly above that: What does this code do for swap PMDs? It looks like that might splat on the BUG_ON(!pmd_trans_huge(src_pmdval)). All we've checked on the path to here is that the virtual memory area is aligned, that the destination PMD is empty, and that pmd_trans_huge_lock() succeeded; but pmd_trans_huge_lock() explicitly permits swap PMDs (which are the swapped-out form of transhuge PMDs):
static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
		struct vm_area_struct *vma)
{
	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
		return __pmd_trans_huge_lock(pmd, vma);
	else
		return NULL;
}
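One way to make that robust (just a sketch, not necessarily how the patch should handle it) would be to replace the BUG_ON() with a graceful bail-out, since the PT lock is already held at this point:

	/*
	 * Sketch: pmd_trans_huge_lock() in the caller also succeeds for
	 * swap PMDs, so bail out and let the caller retry or fall back
	 * instead of BUG_ON() on a non-present source PMD.
	 */
	if (unlikely(!pmd_trans_huge(src_pmdval))) {
		spin_unlock(src_ptl);
		return -EAGAIN;
	}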
BUG_ON(!spin_is_locked(src_ptl));
mmap_assert_locked(src_mm);
mmap_assert_locked(dst_mm);
BUG_ON(src_addr & ~HPAGE_PMD_MASK);
BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
src_page = pmd_page(src_pmdval);
BUG_ON(!PageHead(src_page));
BUG_ON(!PageAnon(src_page));
if (unlikely(page_mapcount(src_page) != 1)) {
spin_unlock(src_ptl);
return -EBUSY;
}
get_page(src_page);
spin_unlock(src_ptl);
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_mm, src_addr,
src_addr + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);
/* block all concurrent rmap walks */
lock_page(src_page);
/*
* split_huge_page walks the anon_vma chain without the page
* lock. Serialize against it with the anon_vma lock, the page
* lock is not enough.
*/
src_anon_vma = folio_get_anon_vma(page_folio(src_page));
if (!src_anon_vma) {
unlock_page(src_page);
put_page(src_page);
mmu_notifier_invalidate_range_end(&range);
return -EAGAIN;
}
anon_vma_lock_write(src_anon_vma);
dst_ptl = pmd_lockptr(dst_mm, dst_pmd);
double_pt_lock(src_ptl, dst_ptl);
if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
!pmd_same(*dst_pmd, dst_pmdval) ||
page_mapcount(src_page) != 1)) {
double_pt_unlock(src_ptl, dst_ptl);
anon_vma_unlock_write(src_anon_vma);
put_anon_vma(src_anon_vma);
unlock_page(src_page);
put_page(src_page);
mmu_notifier_invalidate_range_end(&range);
return -EAGAIN;
}
BUG_ON(!PageHead(src_page));
BUG_ON(!PageAnon(src_page));
/* the PT lock is enough to keep the page pinned now */
put_page(src_page);
dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
WRITE_ONCE(src_page->mapping, (struct address_space *) dst_anon_vma);
WRITE_ONCE(src_page->index, linear_page_index(dst_vma, dst_addr));
if (!pmd_same(pmdp_huge_clear_flush(src_vma, src_addr, src_pmd),
src_pmdval))
BUG_ON(1);
I'm not sure we can assert that the PMDs are exactly equal; the CPU might have changed the A/D bits under us?
Yes. I wonder if I can simply remove the BUG_ON here like this:
src_pmdval = pmdp_huge_clear_flush(src_vma, src_addr, src_pmd);
Technically we don't use src_pmdval after this, but keeping the assignment would stay correct for possible future use. If the A/D bits changed from under us we will still copy correct values into dst_pmd.
And when we set up the dst_pmd, we always mark it as dirty and accessed... so I guess that's fine.
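If a sanity check is still wanted here, one option (a sketch, assuming the surrounding code stays as posted) is to compare only the PFN, which cannot legitimately change while the PT locks are held, and tolerate A/D bit updates:

	src_pmdval = pmdp_huge_clear_flush(src_vma, src_addr, src_pmd);
	/* A PFN change here would be a real bug; A/D bit changes are fine. */
	WARN_ON_ONCE(pmd_pfn(src_pmdval) != page_to_pfn(src_page));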
_dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
_dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
set_pmd_at(dst_mm, dst_addr, dst_pmd, _dst_pmd);
pgtable = pgtable_trans_huge_withdraw(src_mm, src_pmd);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
Are we allowed to move page tables between mm_structs on all architectures? The first example I found that looks a bit dodgy, looking through various architectures' pte_alloc_one(), is s390's page_table_alloc() which looks like page tables are tied to per-MM lists sometimes. If that's not allowed, we might have to allocate a new deposit table and free the old one or something like that.
Hmm. Yeah, looks like in the case of !CONFIG_PGSTE the table can be linked to mm->context.pgtable_list, so it can't be moved to another mm. I guess I'll have to keep a pgtable allocated and ready to be deposited, and free the old one. Maybe it's worth having an arch-specific function indicating whether moving a pgtable between MMs is supported? Or do it separately as an optimization. WDYT?
Hm, dunno. I guess you could have architectures opt in with some config flag similar to how flags like ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH are wired up - define it in init/Kconfig, select it in the architectures that support it, and then gate the fast version on that with #ifdef?
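Roughly like this (the config symbol name and the new_pgtable variable are made up for illustration, and the slow path is only outlined):

in init/Kconfig:

	config ARCH_SUPPORTS_PGTABLE_MOVE_BETWEEN_MMS
		bool

selected by each architecture where handing a page table to another mm is safe, and then in remap_pages_huge_pmd():

#ifdef CONFIG_ARCH_SUPPORTS_PGTABLE_MOVE_BETWEEN_MMS
	/* Fast path: reuse the deposited page table in the destination mm. */
	pgtable = pgtable_trans_huge_withdraw(src_mm, src_pmd);
	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
#else
	/*
	 * Slow path sketch: new_pgtable would have to be allocated for
	 * dst_mm before the PT locks are taken; the table withdrawn from
	 * src_mm is freed instead of being handed over.
	 */
	pgtable_trans_huge_deposit(dst_mm, dst_pmd, new_pgtable);
	pte_free(src_mm, pgtable_trans_huge_withdraw(src_mm, src_pmd));
#endif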
if (dst_mm != src_mm) {
add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
add_mm_counter(src_mm, MM_ANONPAGES, -HPAGE_PMD_NR);
}
double_pt_unlock(src_ptl, dst_ptl);
anon_vma_unlock_write(src_anon_vma);
put_anon_vma(src_anon_vma);
/* unblock rmap walks */
unlock_page(src_page);
mmu_notifier_invalidate_range_end(&range);
return 0;
+}
+#endif /* CONFIG_USERFAULTFD */
/*
 * Returns page table lock pointer if a given pmd maps a thp, NULL otherwise.
[...]
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 96d9eae5c7cc..0cca60dfa8f8 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
[...]
+ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
unsigned long dst_start, unsigned long src_start,
unsigned long len, __u64 mode)
+{
[...]
if (pgprot_val(src_vma->vm_page_prot) !=
pgprot_val(dst_vma->vm_page_prot))
goto out;
Does this check intentionally allow moving pages from a PROT_READ|PROT_WRITE anonymous private VMA into a PROT_READ anonymous private VMA (on architectures like x86 and arm64 where CoW memory has the same protection flags as read-only memory), but forbid moving them from a PROT_READ|PROT_EXEC VMA into a PROT_READ VMA? I think this check needs at least a comment to explain what's going on here.
The check is simply to ensure the VMAs have the same access permissions, so we don't end up moving pages into a VMA where they should have different permissions. The fact that x86 and arm64 have the same protection bits for R/O and COW memory is a "side-effect" IMHO. I'm not sure what comment would be good here but I'm open to suggestions.
I'm not sure if you can do a meaningful security check on the ->vm_page_prot. I also don't think it matters for anonymous VMAs. I guess if you want to keep this check but make this behavior more consistent, you could put another check in front of this that rejects VMAs where vm_flags like VM_READ, VM_WRITE, VM_SHARED or VM_EXEC are different?
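For example, something roughly like this in front of the vm_page_prot comparison (just a sketch of the idea; the exact flag set and placement would need more thought):

	/*
	 * Sketch: refuse to move pages between VMAs whose basic access
	 * flags differ, independent of how the architecture encodes
	 * them in vm_page_prot.
	 */
	if ((src_vma->vm_flags ^ dst_vma->vm_flags) &
	    (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED))
		goto out;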
[...]
/*
 * Ensure the dst_vma has an anon_vma or this page
 * would get a NULL anon_vma when moved into the
 * dst_vma.
*/
err = -ENOMEM;
if (unlikely(anon_vma_prepare(dst_vma)))
goto out;
for (src_addr = src_start, dst_addr = dst_start;
src_addr < src_start + len;) {
spinlock_t *ptl;
pmd_t dst_pmdval;
BUG_ON(dst_addr >= dst_start + len);
src_pmd = mm_find_pmd(src_mm, src_addr);
(this would blow up pretty badly if we could have transparent huge PUD in the region but I think that's limited to file VMAs so it's fine as it currently is)
Should I add a comment here as a warning, in case we decide to implement support for file-backed pages in the future?
Hm, yeah, I guess that might be a good idea.
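Maybe something like this above the lookup (wording is only a suggestion):

	/*
	 * mm_find_pmd() is only safe here because anonymous private VMAs
	 * cannot contain transparent huge PUDs. If file-backed VMAs are
	 * supported later, huge PUD handling (or an explicit check) will
	 * be needed before walking to the PMD level.
	 */
	src_pmd = mm_find_pmd(src_mm, src_addr);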