Re: [PATCH v2 1/5] mm: userfaultfd: generic continue for non hugetlbfs

3 Apr 2025

On 02/04/2025 20:04, James Houghton wrote:
...
On Wed, Apr 2, 2025 at 9:07 AM Nikita Kalyazin kalyazin@amazon.com wrote:
...
Remove shmem-specific code from UFFDIO_CONTINUE implementation for
non-huge pages by calling vm_ops->fault().  A new VMF flag,
FAULT_FLAG_NO_USERFAULT_MINOR, is introduced to avoid recursive call to
handle_userfault().
Signed-off-by: Nikita Kalyazin kalyazin@amazon.com
include/linux/mm_types.h |  3 +++
  mm/hugetlb.c             |  2 +-
  mm/shmem.c               |  3 ++-
  mm/userfaultfd.c         | 25 ++++++++++++++++++-------
  4 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6..91a00f2cd565 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1429,6 +1429,8 @@ enum tlb_flush_reason {

@FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
                   We should only access orig_pte if this flag set.


@FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.



@FAULT_FLAG_NO_USERFAULT_MINOR: The fault handler must not call userfaultfd



                            minor handler.





Perhaps instead a flag that says to avoid the userfaultfd minor fault
handler, maybe we should have a flag to indicate that vm_ops->fault()
has been called by UFFDIO_CONTINUE. See below.
...


About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
whether we would allow page faults to retry by specifying these two

@@ -1467,6 +1469,7 @@ enum fault_flag {
         FAULT_FLAG_UNSHARE =            1 << 10,
         FAULT_FLAG_ORIG_PTE_VALID =     1 << 11,
         FAULT_FLAG_VMA_LOCK =           1 << 12,

  FAULT_FLAG_NO_USERFAULT_MINOR = 1 << 13,

};
typedef unsigned int __bitwise zap_flags_t;


diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 97930d44d460..ba90d48144fc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6228,7 +6228,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
                 }
             /* Check for page in userfault range. */


          if (userfaultfd_minor(vma)) {




          if (userfaultfd_minor(vma) && !(vmf->flags & FAULT_FLAG_NO_USERFAULT_MINOR)) {
                   folio_unlock(folio);
                   folio_put(folio);
                   /* See comment in userfaultfd_missing() block above */



diff --git a/mm/shmem.c b/mm/shmem.c
index 1ede0800e846..5e1911e39dec 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2467,7 +2467,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
         fault_mm = vma ? vma->vm_mm : NULL;
     folio = filemap_get_entry(inode->i_mapping, index);


  if (folio && vma && userfaultfd_minor(vma)) {




  if (folio && vma && userfaultfd_minor(vma) &&


      !(vmf->flags & FAULT_FLAG_NO_USERFAULT_MINOR)) {
           if (!xa_is_value(folio))
                   folio_put(folio);
           *fault_type = handle_userfault(vmf, VM_UFFD_MINOR);



diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d06453fa8aba..68a995216789 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -386,24 +386,35 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
                                      unsigned long dst_addr,
                                      uffd_flags_t flags)
  {

  struct inode *inode = file_inode(dst_vma->vm_file);


  pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
   struct folio *folio;
   struct page *page;
   int ret;




  struct vm_fault vmf = {


          .vma = dst_vma,


          .address = dst_addr,


          .flags = FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE |


              FAULT_FLAG_NO_USERFAULT_MINOR,


          .pte = NULL,


          .page = NULL,


          .pgoff = linear_page_index(dst_vma, dst_addr),


  };



  if (!dst_vma->vm_ops || !dst_vma->vm_ops->fault)


          return -EINVAL;





  ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);


  /* Our caller expects us to return -EFAULT if we failed to find folio */


  if (ret == -ENOENT)




  ret = dst_vma->vm_ops->fault(&vmf);



shmem_get_folio() was being called with SGP_NOALLOC, and now it is
being called with SGP_CACHE (by shmem_fault()). This will result in a
UAPI change: UFFDIO_CONTINUE for a VA without a page in the page cache
should result in EFAULT, but now the page will be allocated.
SGP_NOALLOC was carefully chosen[1], so I think a better way to do
this will be to:

Have a FAULT_FLAG_USERFAULT_CONTINUE (or something)
In shmem_fault(), if FAULT_FLAG_USERFAULT_CONTINUE, use SGP_NOALLOC

instead of SGP_CACHE (and make sure not to drop into
handle_userfault(), of course)
Yes, that is very true.  Thanks for pointing out.  Will apply in the 
next version.
...
...

  if (ret & VM_FAULT_ERROR) {
           ret = -EFAULT;




  if (ret)
           goto out;




  }



  page = vmf.page;


  folio = page_folio(page);
   if (!folio) {



What if ret == VM_FAULT_RETRY? I think we should retry instead instead
of returning -EFAULT.
True.  I'm going to retry the vm_ops->fault() call in this case.
...
And I'm not sure how VM_FAULT_NOPAGE should be handled, like if we
need special logic for it or not.
As far as I see, the only case it can come up in shmem is during a fault 
when a hole is being punched [1].  I'm thinking of returning EAGAIN to 
the userspace if this happens.
[1] https://elixir.bootlin.com/linux/v6.14-rc6/source/mm/shmem.c#L2670
...
...
             ret = -EFAULT;
             goto out;
     }



  page = folio_file_page(folio, pgoff);
   if (PageHWPoison(page)) {
           ret = -EIO;
           goto out_release;



--
2.47.1

    

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH v2 1/5] mm: userfaultfd: generic continue for non hugetlbfs

Signed-off-by: Nikita Kalyazin kalyazin@amazon.com