Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing

5 Mar 2025

On Mon, Mar 3, 2025 at 1:29 PM Peter Xu peterx@redhat.com wrote:
...
On Mon, Mar 03, 2025 at 01:30:06PM +0000, Nikita Kalyazin wrote:
...
This series is built on top of the v3 write syscall support [1].
With James's KVM userfault [2], it is possible to handle stage-2 faults
in guest_memfd in userspace.  However, KVM itself also triggers faults
in guest_memfd in some cases, for example: PV interfaces like kvmclock,
PV EOI and page table walking code when fetching the MMIO instruction on
x86.  It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
that KVM would be accessing those pages via userspace page tables.  In
order for such faults to be handled in userspace, guest_memfd needs to
support userfaultfd.
This series proposes a limited support for userfaultfd in guest_memfd:

userfaultfd support is conditional to `CONFIG_KVM_GMEM_SHARED_MEM`
(as is fault support in general)
Only `page missing` event is currently supported
Userspace is supposed to respond to the event with the `write`
syscall followed by `UFFDIO_CONTINUE` ioctl to unblock the faulting
process.   Note that we can't use `UFFDIO_COPY` here because
userfaulfd code does not know how to prepare guest_memfd pages, eg
remove them from direct map [4].

Not included in this series:

Proper interface for userfaultfd to recognise guest_memfd mappings
Proper handling of truncation cases after locking the page

Request for comments:

Is it a sensible workflow for guest_memfd to resolve a userfault
`page missing` event with `write` syscall + `UFFDIO_CONTINUE`?  One
of the alternatives is teaching `UFFDIO_COPY` how to deal with
guest_memfd pages.

Probably not..  I don't see what protects a thread fault concurrently
during write() happening, seeing partial data.  Since you check the page
cache it'll let it pass, but the partial page will be faulted in there.
+1 here.
I think the simplest way to make it work would be to also check
folio_test_uptodate() in the userfaultfd_missing() check[1]. It would
pair with kvm_gmem_mark_prepared() in the write() path[2].
I'm not sure if that's the "right" way, I think it would prevent
threads from reading data as it is written.
[1]: https://lore.kernel.org/kvm/20250303133011.44095-3-kalyazin@amazon.com/
[2]: https://lore.kernel.org/kvm/20250303130838.28812-2-kalyazin@amazon.com/
...
I think we may need to either go with full MISSING or full MINOR traps.
I agree, and just to clarify: you've basically implemented the MISSING
model, just using write() to resolve userfaults instead of
UFFDIO_COPY. The UFFDIO_CONTINUE implementation you have isn't really
doing much; when the page cache has a page, the fault handler will
populate the PTE for you.
I think it's probably simpler to implement the MINOR model, where
userspace can populate the page cache however it wants; write() is
perfectly fine/natural. UFFDIO_CONTINUE just needs to populate PTEs
for gmem, and the fault handler needs to check for the presence of
PTEs. The `struct vm_fault` you have should contain enough info.
...
One thing to mention is we probably need MINOR sooner or later to support
gmem huge pages.  The thing is for huge folios in gmem we can't rely on
missing in page cache, as we always need to allocate in hugetlb sizes.
...

What is a way forward to make userfaultfd code aware of guest_memfd?
I saw that Patrick hit a somewhat similar problem in [5] when trying
to use direct map manipulation functions in KVM and was pointed by
David at Elliot's guestmem library [6] that might include a shim for that.
Would the library be the right place to expose required interfaces like
`vma_is_gmem`?

Not sure what's the best to do, but IIUC the current way this series uses
may not work as long as one tries to reference a kvm symbol from core mm..
One trick I used so far is leveraging vm_ops and provide hook function to
report specialties when it's gmem.  In general, I did not yet dare to
overload vm_area_struct, but I'm thinking maybe vm_ops is more possible to
be accepted.  E.g. something like this:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5e742738240c..b068bb79fdbc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -653,8 +653,26 @@ struct vm_operations_struct {
         */
        struct page *(*find_special_page)(struct vm_area_struct *vma,
                                          unsigned long addr);

  /*


   * When set, return the allowed orders bitmask in faults of mmap()


   * ranges (e.g. for follow up huge_fault() processing).  Drivers


   * can use this to bypass THP setups for specific types of VMAs.


   */


  unsigned long (*get_supported_orders)(struct vm_area_struct *vma);



};
+static inline bool vma_has_supported_orders(struct vm_area_struct *vma)
+{

  return vma->vm_ops && vma->vm_ops->get_supported_orders;



+}



+static inline unsigned long vma_get_supported_orders(struct vm_area_struct *vma)
+{

  if (!vma_has_supported_orders(vma))


          return 0;


  return vma->vm_ops->get_supported_orders(vma);



+}



In my case I used that to allow gmem report huge page supports on faults.
Said that, above only existed in my own tree so far, so I also don't know
whether something like that could be accepted (even if it'll work for you).
I think it might be useful to implement an fs-generic MINOR mode. The
fault handler is already easy enough to do generically (though it
would become more difficult to determine if the "MINOR" fault is
actually a MISSING fault, but at least for my userspace, the
distinction isn't important. :)) So the question becomes: what should
UFFDIO_CONTINUE look like?
And I think it would be nice if UFFDIO_CONTINUE just called
vm_ops->fault() to get the page we want to map and then mapped it,
instead of having shmem-specific and hugetlb-specific versions (though
maybe we need to keep the hugetlb specialization...). That would avoid
putting kvm/gmem/etc. symbols in mm/userfaultfd code.
I've actually wanted to do this for a while but haven't had a good
reason to pursue it. I wonder if it can be done in a
backwards-compatible fashion...

    

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing