This series is built on top of the v3 write syscall support [1].
With James's KVM userfault [2], it is possible to handle stage-2 faults in guest_memfd in userspace. However, KVM itself also triggers faults in guest_memfd in some cases, for example: PV interfaces like kvmclock, PV EOI and page table walking code when fetching the MMIO instruction on x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3] that KVM would be accessing those pages via userspace page tables. In order for such faults to be handled in userspace, guest_memfd needs to support userfaultfd.
This series proposes a limited support for userfaultfd in guest_memfd: - userfaultfd support is conditional to `CONFIG_KVM_GMEM_SHARED_MEM` (as is fault support in general) - Only `page missing` event is currently supported - Userspace is supposed to respond to the event with the `write` syscall followed by `UFFDIO_CONTINUE` ioctl to unblock the faulting process. Note that we can't use `UFFDIO_COPY` here because userfaulfd code does not know how to prepare guest_memfd pages, eg remove them from direct map [4].
Not included in this series: - Proper interface for userfaultfd to recognise guest_memfd mappings - Proper handling of truncation cases after locking the page
Request for comments: - Is it a sensible workflow for guest_memfd to resolve a userfault `page missing` event with `write` syscall + `UFFDIO_CONTINUE`? One of the alternatives is teaching `UFFDIO_COPY` how to deal with guest_memfd pages. - What is a way forward to make userfaultfd code aware of guest_memfd? I saw that Patrick hit a somewhat similar problem in [5] when trying to use direct map manipulation functions in KVM and was pointed by David at Elliot's guestmem library [6] that might include a shim for that. Would the library be the right place to expose required interfaces like `vma_is_gmem`?
Nikita
[1] https://lore.kernel.org/kvm/20250303130838.28812-1-kalyazin@amazon.com/T/ [2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/T... [3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAos... [4] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/ [4] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/#... [5] https://lore.kernel.org/kvm/20241122-guestmem-library-v5-2-450e92951a15@quic...
Nikita Kalyazin (5): KVM: guest_memfd: add kvm_gmem_vma_is_gmem KVM: guest_memfd: add support for uffd missing mm: userfaultfd: allow to register userfaultfd for guest_memfd mm: userfaultfd: support continue for guest_memfd KVM: selftests: add uffd missing test for guest_memfd
include/linux/userfaultfd_k.h | 9 ++ mm/userfaultfd.c | 23 ++++- .../testing/selftests/kvm/guest_memfd_test.c | 88 +++++++++++++++++++ virt/kvm/guest_memfd.c | 17 +++- virt/kvm/kvm_mm.h | 1 + 5 files changed, 136 insertions(+), 2 deletions(-)
base-commit: 592e7531753dc4b711f96cd1daf808fd493d3223
It will be used to distinguish the vma type in userfaultfd code. This likely needs to be done in the guestmem library.
Signed-off-by: Nikita Kalyazin kalyazin@amazon.com --- virt/kvm/guest_memfd.c | 5 +++++ virt/kvm/kvm_mm.h | 1 + 2 files changed, 6 insertions(+)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index f93fe5835173..af825f7494ea 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -500,6 +500,11 @@ static ssize_t kvm_kmem_gmem_write(struct file *file, const char __user *buf, return ret && start == (*offset >> PAGE_SHIFT) ? ret : *offset - (start << PAGE_SHIFT); } + +bool kvm_gmem_vma_is_gmem(struct vm_area_struct *vma) +{ + return vma->vm_ops == &kvm_gmem_vm_ops; +} #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
static struct file_operations kvm_gmem_fops = { diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h index acef3f5c582a..09fcdf18a072 100644 --- a/virt/kvm/kvm_mm.h +++ b/virt/kvm/kvm_mm.h @@ -73,6 +73,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args); int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot, unsigned int fd, loff_t offset); void kvm_gmem_unbind(struct kvm_memory_slot *slot); +bool kvm_gmem_vma_is_gmem(struct vm_area_struct *vma); #else static inline void kvm_gmem_init(struct module *module) {
Add support for sending a pagefault event if userfaultfd is registered. Only page missing event is currently supported.
Signed-off-by: Nikita Kalyazin kalyazin@amazon.com --- virt/kvm/guest_memfd.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index af825f7494ea..358c3776ed66 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -4,6 +4,9 @@ #include <linux/kvm_host.h> #include <linux/pagemap.h> #include <linux/anon_inodes.h> +#ifdef CONFIG_KVM_PRIVATE_MEM +#include <linux/userfaultfd_k.h> +#endif /* CONFIG_KVM_PRIVATE_MEM */
#include "kvm_mm.h"
@@ -332,9 +335,16 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf) struct folio *folio; vm_fault_t ret = VM_FAULT_LOCKED;
+ folio = filemap_get_entry(inode->i_mapping, vmf->pgoff); + if (!folio && userfaultfd_missing(vmf->vma)) + return handle_userfault(vmf, VM_UFFD_MISSING); + if (folio) + folio_lock(folio); + filemap_invalidate_lock_shared(inode->i_mapping);
- folio = kvm_gmem_get_folio(inode, vmf->pgoff); + if (!folio) + folio = kvm_gmem_get_folio(inode, vmf->pgoff); if (IS_ERR(folio)) { switch (PTR_ERR(folio)) { case -EAGAIN:
Signed-off-by: Nikita Kalyazin kalyazin@amazon.com --- include/linux/userfaultfd_k.h | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 75342022d144..440d38903359 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -20,6 +20,10 @@ #include <asm-generic/pgtable_uffd.h> #include <linux/hugetlb_inline.h>
+#ifdef CONFIG_KVM_PRIVATE_MEM +bool kvm_gmem_vma_is_gmem(struct vm_area_struct *vma); +#endif + /* The set of all possible UFFD-related VM flags. */ #define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
@@ -242,6 +246,11 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma, return false; #endif
+#ifdef CONFIG_KVM_PRIVATE_MEM + if (kvm_gmem_vma_is_gmem(vma)) + return true; +#endif + /* By default, allow any of anon|shmem|hugetlb */ return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) || vma_is_shmem(vma);
When userspace receives a page missing event, it is supposed to populate the missing page in guest_memfd pagecache via the write syscall and unblock the faulting process via UFFDIO_CONTINUE.
Signed-off-by: Nikita Kalyazin kalyazin@amazon.com --- mm/userfaultfd.c | 23 ++++++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index af3dfc3633db..aaff66a7f15b 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -19,6 +19,10 @@ #include <asm/tlb.h> #include "internal.h"
+#ifdef CONFIG_KVM_PRIVATE_MEM +bool kvm_gmem_vma_is_gmem(struct vm_area_struct *vma); +#endif + static __always_inline bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end) { @@ -391,6 +395,16 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd, struct page *page; int ret;
+#ifdef CONFIG_KVM_PRIVATE_MEM + if (kvm_gmem_vma_is_gmem(dst_vma)) { + ret = 0; + folio = filemap_get_entry(inode->i_mapping, pgoff); + if (IS_ERR(folio)) + ret = PTR_ERR(folio); + else + folio_lock(folio); + } else +#endif ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC); /* Our caller expects us to return -EFAULT if we failed to find folio */ if (ret == -ENOENT) @@ -769,9 +783,16 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, return mfill_atomic_hugetlb(ctx, dst_vma, dst_start, src_start, len, flags);
- if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) + if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma) +#ifdef CONFIG_KVM_PRIVATE_MEM + && !kvm_gmem_vma_is_gmem(dst_vma) +#endif + ) goto out_unlock; if (!vma_is_shmem(dst_vma) && +#ifdef CONFIG_KVM_PRIVATE_MEM + !kvm_gmem_vma_is_gmem(dst_vma) && +#endif uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) goto out_unlock;
The test demonstrates how a page missing event can be resolved via write syscall followed by UFFDIO_CONTINUE ioctl.
Signed-off-by: Nikita Kalyazin kalyazin@amazon.com --- .../testing/selftests/kvm/guest_memfd_test.c | 88 +++++++++++++++++++ 1 file changed, 88 insertions(+)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c index b07221aa54c9..aea0e8627981 100644 --- a/tools/testing/selftests/kvm/guest_memfd_test.c +++ b/tools/testing/selftests/kvm/guest_memfd_test.c @@ -10,12 +10,16 @@ #include <errno.h> #include <stdio.h> #include <fcntl.h> +#include <pthread.h>
#include <linux/bitmap.h> #include <linux/falloc.h> +#include <linux/userfaultfd.h> #include <sys/mman.h> #include <sys/types.h> #include <sys/stat.h> +#include <sys/syscall.h> +#include <sys/ioctl.h>
#include "kvm_util.h" #include "test_util.h" @@ -278,6 +282,88 @@ static void test_create_guest_memfd_multiple(struct kvm_vm *vm) close(fd1); }
+struct fault_args { + char *addr; + volatile char value; +}; + +static void *fault_thread_fn(void *arg) +{ + struct fault_args *args = arg; + + /* Trigger page fault */ + args->value = *args->addr; + return NULL; +} + +static void test_uffd_missing(int fd, size_t page_size, size_t total_size) +{ + struct uffdio_register uffd_reg; + struct uffdio_continue uffd_cont; + struct uffd_msg msg; + struct fault_args args; + pthread_t fault_thread; + void *mem, *buf = NULL; + int uffd, ret; + off_t offset = page_size; + void *fault_addr; + + ret = posix_memalign(&buf, page_size, total_size); + TEST_ASSERT_EQ(ret, 0); + + uffd = syscall(__NR_userfaultfd, O_CLOEXEC); + TEST_ASSERT(uffd != -1, "userfaultfd creation should succeed"); + + struct uffdio_api uffdio_api = { + .api = UFFD_API, + .features = UFFD_FEATURE_MISSING_SHMEM, + }; + ret = ioctl(uffd, UFFDIO_API, &uffdio_api); + TEST_ASSERT(ret != -1, "ioctl(UFFDIO_API) should succeed"); + + mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + TEST_ASSERT(mem != MAP_FAILED, "mmap should succeed"); + + uffd_reg.range.start = (unsigned long)mem; + uffd_reg.range.len = total_size; + uffd_reg.mode = UFFDIO_REGISTER_MODE_MISSING; + ret = ioctl(uffd, UFFDIO_REGISTER, &uffd_reg); + TEST_ASSERT(ret != -1, "ioctl(UFFDIO_REGISTER) should succeed"); + + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, + offset, page_size); + TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) should succeed"); + + fault_addr = mem + offset; + args.addr = fault_addr; + + ret = pthread_create(&fault_thread, NULL, fault_thread_fn, &args); + TEST_ASSERT(ret == 0, "pthread_create should succeed"); + + ret = read(uffd, &msg, sizeof(msg)); + TEST_ASSERT(ret != -1, "read from userfaultfd should succeed"); + TEST_ASSERT(msg.event == UFFD_EVENT_PAGEFAULT, "event type should be pagefault"); + TEST_ASSERT((void *)(msg.arg.pagefault.address & ~(page_size - 1)) == fault_addr, + "pagefault should occur at expected address"); + + ret = pwrite(fd, buf + offset, page_size, offset); + TEST_ASSERT(ret == page_size, "write should succeed"); + + uffd_cont.range.start = (unsigned long)fault_addr; + uffd_cont.range.len = page_size; + uffd_cont.mode = 0; + ret = ioctl(uffd, UFFDIO_CONTINUE, &uffd_cont); + TEST_ASSERT(ret != -1, "ioctl(UFFDIO_CONTINUE) should succeed"); + + ret = pthread_join(fault_thread, NULL); + TEST_ASSERT(ret == 0, "pthread_join should succeed"); + + ret = munmap(mem, total_size); + TEST_ASSERT(!ret, "munmap should succeed"); + free(buf); + close(uffd); +} + unsigned long get_shared_type(void) { #ifdef __x86_64__ @@ -316,6 +402,8 @@ void test_vm_type(unsigned long type, bool is_shared) test_file_size(fd, page_size, total_size); test_fallocate(fd, page_size, total_size); test_invalid_punch_hole(fd, page_size, total_size); + if (is_shared) + test_uffd_missing(fd, page_size, total_size);
close(fd); kvm_vm_release(vm);
On Mon, Mar 03, 2025 at 01:30:06PM +0000, Nikita Kalyazin wrote:
This series is built on top of the v3 write syscall support [1].
With James's KVM userfault [2], it is possible to handle stage-2 faults in guest_memfd in userspace. However, KVM itself also triggers faults in guest_memfd in some cases, for example: PV interfaces like kvmclock, PV EOI and page table walking code when fetching the MMIO instruction on x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3] that KVM would be accessing those pages via userspace page tables. In order for such faults to be handled in userspace, guest_memfd needs to support userfaultfd.
This series proposes a limited support for userfaultfd in guest_memfd:
- userfaultfd support is conditional to `CONFIG_KVM_GMEM_SHARED_MEM` (as is fault support in general)
- Only `page missing` event is currently supported
- Userspace is supposed to respond to the event with the `write` syscall followed by `UFFDIO_CONTINUE` ioctl to unblock the faulting process. Note that we can't use `UFFDIO_COPY` here because userfaulfd code does not know how to prepare guest_memfd pages, eg remove them from direct map [4].
Not included in this series:
- Proper interface for userfaultfd to recognise guest_memfd mappings
- Proper handling of truncation cases after locking the page
Request for comments:
- Is it a sensible workflow for guest_memfd to resolve a userfault `page missing` event with `write` syscall + `UFFDIO_CONTINUE`? One of the alternatives is teaching `UFFDIO_COPY` how to deal with guest_memfd pages.
Probably not.. I don't see what protects a thread fault concurrently during write() happening, seeing partial data. Since you check the page cache it'll let it pass, but the partial page will be faulted in there.
I think we may need to either go with full MISSING or full MINOR traps.
One thing to mention is we probably need MINOR sooner or later to support gmem huge pages. The thing is for huge folios in gmem we can't rely on missing in page cache, as we always need to allocate in hugetlb sizes.
- What is a way forward to make userfaultfd code aware of guest_memfd? I saw that Patrick hit a somewhat similar problem in [5] when trying to use direct map manipulation functions in KVM and was pointed by David at Elliot's guestmem library [6] that might include a shim for that. Would the library be the right place to expose required interfaces like `vma_is_gmem`?
Not sure what's the best to do, but IIUC the current way this series uses may not work as long as one tries to reference a kvm symbol from core mm..
One trick I used so far is leveraging vm_ops and provide hook function to report specialties when it's gmem. In general, I did not yet dare to overload vm_area_struct, but I'm thinking maybe vm_ops is more possible to be accepted. E.g. something like this:
diff --git a/include/linux/mm.h b/include/linux/mm.h index 5e742738240c..b068bb79fdbc 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -653,8 +653,26 @@ struct vm_operations_struct { */ struct page *(*find_special_page)(struct vm_area_struct *vma, unsigned long addr); + /* + * When set, return the allowed orders bitmask in faults of mmap() + * ranges (e.g. for follow up huge_fault() processing). Drivers + * can use this to bypass THP setups for specific types of VMAs. + */ + unsigned long (*get_supported_orders)(struct vm_area_struct *vma); };
+static inline bool vma_has_supported_orders(struct vm_area_struct *vma) +{ + return vma->vm_ops && vma->vm_ops->get_supported_orders; +} + +static inline unsigned long vma_get_supported_orders(struct vm_area_struct *vma) +{ + if (!vma_has_supported_orders(vma)) + return 0; + return vma->vm_ops->get_supported_orders(vma); +} +
In my case I used that to allow gmem report huge page supports on faults.
Said that, above only existed in my own tree so far, so I also don't know whether something like that could be accepted (even if it'll work for you).
Thanks,
On Mon, Mar 3, 2025 at 1:29 PM Peter Xu peterx@redhat.com wrote:
On Mon, Mar 03, 2025 at 01:30:06PM +0000, Nikita Kalyazin wrote:
This series is built on top of the v3 write syscall support [1].
With James's KVM userfault [2], it is possible to handle stage-2 faults in guest_memfd in userspace. However, KVM itself also triggers faults in guest_memfd in some cases, for example: PV interfaces like kvmclock, PV EOI and page table walking code when fetching the MMIO instruction on x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3] that KVM would be accessing those pages via userspace page tables. In order for such faults to be handled in userspace, guest_memfd needs to support userfaultfd.
This series proposes a limited support for userfaultfd in guest_memfd:
- userfaultfd support is conditional to `CONFIG_KVM_GMEM_SHARED_MEM` (as is fault support in general)
- Only `page missing` event is currently supported
- Userspace is supposed to respond to the event with the `write` syscall followed by `UFFDIO_CONTINUE` ioctl to unblock the faulting process. Note that we can't use `UFFDIO_COPY` here because userfaulfd code does not know how to prepare guest_memfd pages, eg remove them from direct map [4].
Not included in this series:
- Proper interface for userfaultfd to recognise guest_memfd mappings
- Proper handling of truncation cases after locking the page
Request for comments:
- Is it a sensible workflow for guest_memfd to resolve a userfault `page missing` event with `write` syscall + `UFFDIO_CONTINUE`? One of the alternatives is teaching `UFFDIO_COPY` how to deal with guest_memfd pages.
Probably not.. I don't see what protects a thread fault concurrently during write() happening, seeing partial data. Since you check the page cache it'll let it pass, but the partial page will be faulted in there.
+1 here.
I think the simplest way to make it work would be to also check folio_test_uptodate() in the userfaultfd_missing() check[1]. It would pair with kvm_gmem_mark_prepared() in the write() path[2].
I'm not sure if that's the "right" way, I think it would prevent threads from reading data as it is written.
[1]: https://lore.kernel.org/kvm/20250303133011.44095-3-kalyazin@amazon.com/ [2]: https://lore.kernel.org/kvm/20250303130838.28812-2-kalyazin@amazon.com/
I think we may need to either go with full MISSING or full MINOR traps.
I agree, and just to clarify: you've basically implemented the MISSING model, just using write() to resolve userfaults instead of UFFDIO_COPY. The UFFDIO_CONTINUE implementation you have isn't really doing much; when the page cache has a page, the fault handler will populate the PTE for you.
I think it's probably simpler to implement the MINOR model, where userspace can populate the page cache however it wants; write() is perfectly fine/natural. UFFDIO_CONTINUE just needs to populate PTEs for gmem, and the fault handler needs to check for the presence of PTEs. The `struct vm_fault` you have should contain enough info.
One thing to mention is we probably need MINOR sooner or later to support gmem huge pages. The thing is for huge folios in gmem we can't rely on missing in page cache, as we always need to allocate in hugetlb sizes.
- What is a way forward to make userfaultfd code aware of guest_memfd? I saw that Patrick hit a somewhat similar problem in [5] when trying to use direct map manipulation functions in KVM and was pointed by David at Elliot's guestmem library [6] that might include a shim for that. Would the library be the right place to expose required interfaces like `vma_is_gmem`?
Not sure what's the best to do, but IIUC the current way this series uses may not work as long as one tries to reference a kvm symbol from core mm..
One trick I used so far is leveraging vm_ops and provide hook function to report specialties when it's gmem. In general, I did not yet dare to overload vm_area_struct, but I'm thinking maybe vm_ops is more possible to be accepted. E.g. something like this:
diff --git a/include/linux/mm.h b/include/linux/mm.h index 5e742738240c..b068bb79fdbc 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -653,8 +653,26 @@ struct vm_operations_struct { */ struct page *(*find_special_page)(struct vm_area_struct *vma, unsigned long addr);
/*
* When set, return the allowed orders bitmask in faults of mmap()
* ranges (e.g. for follow up huge_fault() processing). Drivers
* can use this to bypass THP setups for specific types of VMAs.
*/
unsigned long (*get_supported_orders)(struct vm_area_struct *vma);
};
+static inline bool vma_has_supported_orders(struct vm_area_struct *vma) +{
return vma->vm_ops && vma->vm_ops->get_supported_orders;
+}
+static inline unsigned long vma_get_supported_orders(struct vm_area_struct *vma) +{
if (!vma_has_supported_orders(vma))
return 0;
return vma->vm_ops->get_supported_orders(vma);
+}
In my case I used that to allow gmem report huge page supports on faults.
Said that, above only existed in my own tree so far, so I also don't know whether something like that could be accepted (even if it'll work for you).
I think it might be useful to implement an fs-generic MINOR mode. The fault handler is already easy enough to do generically (though it would become more difficult to determine if the "MINOR" fault is actually a MISSING fault, but at least for my userspace, the distinction isn't important. :)) So the question becomes: what should UFFDIO_CONTINUE look like?
And I think it would be nice if UFFDIO_CONTINUE just called vm_ops->fault() to get the page we want to map and then mapped it, instead of having shmem-specific and hugetlb-specific versions (though maybe we need to keep the hugetlb specialization...). That would avoid putting kvm/gmem/etc. symbols in mm/userfaultfd code.
I've actually wanted to do this for a while but haven't had a good reason to pursue it. I wonder if it can be done in a backwards-compatible fashion...
On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
I think it might be useful to implement an fs-generic MINOR mode. The fault handler is already easy enough to do generically (though it would become more difficult to determine if the "MINOR" fault is actually a MISSING fault, but at least for my userspace, the distinction isn't important. :)) So the question becomes: what should UFFDIO_CONTINUE look like?
And I think it would be nice if UFFDIO_CONTINUE just called vm_ops->fault() to get the page we want to map and then mapped it, instead of having shmem-specific and hugetlb-specific versions (though maybe we need to keep the hugetlb specialization...). That would avoid putting kvm/gmem/etc. symbols in mm/userfaultfd code.
I've actually wanted to do this for a while but haven't had a good reason to pursue it. I wonder if it can be done in a backwards-compatible fashion...
Yes I also thought about that. :)
When Axel added minor fault, it's not a major concern as it's the only fs that will consume the feature anyway in the do_fault() path - hugetlbfs has its own path to take care of.. even until now.
And there's some valid points too if someone would argue to put it there especially on folio lock - do that in shmem.c can avoid taking folio lock when generating minor fault message. It might make some difference when the faults are heavy and when folio lock is frequently taken elsewhere too.
It might boil down to how many more FSes would support minor fault, and whether we would care about such difference at last to shmem users. If gmem is the only one after existing ones, IIUC there's still option we implement it in gmem code. After all, I expect the change should be very under control (<20 LOCs?)..
On 05/03/2025 20:29, Peter Xu wrote:
On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
I think it might be useful to implement an fs-generic MINOR mode. The fault handler is already easy enough to do generically (though it would become more difficult to determine if the "MINOR" fault is actually a MISSING fault, but at least for my userspace, the distinction isn't important. :)) So the question becomes: what should UFFDIO_CONTINUE look like?
And I think it would be nice if UFFDIO_CONTINUE just called vm_ops->fault() to get the page we want to map and then mapped it, instead of having shmem-specific and hugetlb-specific versions (though maybe we need to keep the hugetlb specialization...). That would avoid putting kvm/gmem/etc. symbols in mm/userfaultfd code.
I've actually wanted to do this for a while but haven't had a good reason to pursue it. I wonder if it can be done in a backwards-compatible fashion...
Yes I also thought about that. :)
Hi Peter, hi James. Thanks for pointing at the race condition!
I did some experimentation and it indeed looks possible to call vm_ops->fault() from userfault_continue() to make it generic and decouple from KVM, at least for non-hugetlb cases. One thing is we'd need to prevent a recursive handle_userfault() invocation, which I believe can be solved by adding a new VMF flag to ignore the userfault path when the fault handler is called from userfault_continue(). I'm open to a more elegant solution though.
Regarding usage of the MINOR notification, in what case do you recommend sending it? If following the logic implemented in shmem and hugetlb, ie if the page is _present_ in the pagecache, I can't see how it is going to work with the write syscall, as we'd like to know when the page is _missing_ in order to respond with the population via the write. If going against shmem/hugetlb logic, and sending the MINOR event when the page is missing from the pagecache, how would it solve the race condition problem?
Also, where would the check for the folio_test_uptodate() mentioned by James fit into here? Would it only be used for fortifying the MINOR (present) against the race?
When Axel added minor fault, it's not a major concern as it's the only fs that will consume the feature anyway in the do_fault() path - hugetlbfs has its own path to take care of.. even until now.
And there's some valid points too if someone would argue to put it there especially on folio lock - do that in shmem.c can avoid taking folio lock when generating minor fault message. It might make some difference when the faults are heavy and when folio lock is frequently taken elsewhere too.
Peter, could you expand on this? Are you referring to the following (shmem_get_folio_gfp)?
if (folio) { folio_lock(folio);
/* Has the folio been truncated or swapped out? */ if (unlikely(folio->mapping != inode->i_mapping)) { folio_unlock(folio); folio_put(folio); goto repeat; } if (sgp == SGP_WRITE) folio_mark_accessed(folio); if (folio_test_uptodate(folio)) goto out; /* fallocated folio */ if (sgp != SGP_READ) goto clear; folio_unlock(folio); folio_put(folio); }
Could you explain in what case the lock can be avoided? AFAIC, the function is called by both the shmem fault handler and userfault_continue().
It might boil down to how many more FSes would support minor fault, and whether we would care about such difference at last to shmem users. If gmem is the only one after existing ones, IIUC there's still option we implement it in gmem code. After all, I expect the change should be very under control (<20 LOCs?)..
-- Peter Xu
On Mon, Mar 10, 2025 at 06:12:22PM +0000, Nikita Kalyazin wrote:
On 05/03/2025 20:29, Peter Xu wrote:
On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
I think it might be useful to implement an fs-generic MINOR mode. The fault handler is already easy enough to do generically (though it would become more difficult to determine if the "MINOR" fault is actually a MISSING fault, but at least for my userspace, the distinction isn't important. :)) So the question becomes: what should UFFDIO_CONTINUE look like?
And I think it would be nice if UFFDIO_CONTINUE just called vm_ops->fault() to get the page we want to map and then mapped it, instead of having shmem-specific and hugetlb-specific versions (though maybe we need to keep the hugetlb specialization...). That would avoid putting kvm/gmem/etc. symbols in mm/userfaultfd code.
I've actually wanted to do this for a while but haven't had a good reason to pursue it. I wonder if it can be done in a backwards-compatible fashion...
Yes I also thought about that. :)
Hi Peter, hi James. Thanks for pointing at the race condition!
I did some experimentation and it indeed looks possible to call vm_ops->fault() from userfault_continue() to make it generic and decouple from KVM, at least for non-hugetlb cases. One thing is we'd need to prevent a recursive handle_userfault() invocation, which I believe can be solved by adding a new VMF flag to ignore the userfault path when the fault handler is called from userfault_continue(). I'm open to a more elegant solution though.
It sounds working to me. Adding fault flag can also be seen as part of extension of vm_operations_struct ops. So we could consider reusing fault() API indeed.
Regarding usage of the MINOR notification, in what case do you recommend sending it? If following the logic implemented in shmem and hugetlb, ie if the page is _present_ in the pagecache, I can't see how it is going to work
It could be confusing when reading that chunk of code, because it looks like it notifies minor fault when cache hit. But the critical part here is that we rely on the pgtable missing causing the fault() to trigger first. So it's more like "cache hit && pgtable missing" for minor fault.
with the write syscall, as we'd like to know when the page is _missing_ in order to respond with the population via the write. If going against shmem/hugetlb logic, and sending the MINOR event when the page is missing from the pagecache, how would it solve the race condition problem?
Should be easier we stick with mmap() rather than write(). E.g. for shmem case of current code base:
if (folio && vma && userfaultfd_minor(vma)) { if (!xa_is_value(folio)) folio_put(folio); *fault_type = handle_userfault(vmf, VM_UFFD_MINOR); return 0; }
vma is only availble if vmf!=NULL, aka in fault context. With that, in write() to shmem inodes, nothing will generate a message, because minor fault so far is only about pgtable missing. It needs to be mmap()ed first, and has nothing yet to do with write() syscalls.
Also, where would the check for the folio_test_uptodate() mentioned by James fit into here? Would it only be used for fortifying the MINOR (present) against the race?
When Axel added minor fault, it's not a major concern as it's the only fs that will consume the feature anyway in the do_fault() path - hugetlbfs has its own path to take care of.. even until now.
And there's some valid points too if someone would argue to put it there especially on folio lock - do that in shmem.c can avoid taking folio lock when generating minor fault message. It might make some difference when the faults are heavy and when folio lock is frequently taken elsewhere too.
Peter, could you expand on this? Are you referring to the following (shmem_get_folio_gfp)?
if (folio) { folio_lock(folio);
/* Has the folio been truncated or swapped out? */ if (unlikely(folio->mapping != inode->i_mapping)) { folio_unlock(folio); folio_put(folio); goto repeat; } if (sgp == SGP_WRITE) folio_mark_accessed(folio); if (folio_test_uptodate(folio)) goto out; /* fallocated folio */ if (sgp != SGP_READ) goto clear; folio_unlock(folio); folio_put(folio);
}
Could you explain in what case the lock can be avoided? AFAIC, the function is called by both the shmem fault handler and userfault_continue().
I think you meant the UFFDIO_CONTINUE side of things. I agree with you, we always need the folio lock.
What I was saying is the trapping side, where the minor fault message can be generated without the folio lock now in case of shmem. It's about whether we could generalize the trapping side, so handle_mm_fault() can generate the minor fault message instead of by shmem.c.
If the only concern is "referring to a module symbol from core mm", then indeed the trapping side should be less of a concern anyway, because the trapping side (when in the module codes) should always be able to reference mm functions.
Actually.. if we have a fault() flag introduced above, maybe we can generalize the trap side altogether without the folio lock overhead. When the flag set, if we can always return the folio unlocked (as long as refcount held), then in UFFDIO_CONTINUE ioctl we can lock it.
It might boil down to how many more FSes would support minor fault, and whether we would care about such difference at last to shmem users. If gmem is the only one after existing ones, IIUC there's still option we implement it in gmem code. After all, I expect the change should be very under control (<20 LOCs?)..
-- Peter Xu
On 10/03/2025 19:57, Peter Xu wrote:
On Mon, Mar 10, 2025 at 06:12:22PM +0000, Nikita Kalyazin wrote:
On 05/03/2025 20:29, Peter Xu wrote:
On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
I think it might be useful to implement an fs-generic MINOR mode. The fault handler is already easy enough to do generically (though it would become more difficult to determine if the "MINOR" fault is actually a MISSING fault, but at least for my userspace, the distinction isn't important. :)) So the question becomes: what should UFFDIO_CONTINUE look like?
And I think it would be nice if UFFDIO_CONTINUE just called vm_ops->fault() to get the page we want to map and then mapped it, instead of having shmem-specific and hugetlb-specific versions (though maybe we need to keep the hugetlb specialization...). That would avoid putting kvm/gmem/etc. symbols in mm/userfaultfd code.
I've actually wanted to do this for a while but haven't had a good reason to pursue it. I wonder if it can be done in a backwards-compatible fashion...
Yes I also thought about that. :)
Hi Peter, hi James. Thanks for pointing at the race condition!
I did some experimentation and it indeed looks possible to call vm_ops->fault() from userfault_continue() to make it generic and decouple from KVM, at least for non-hugetlb cases. One thing is we'd need to prevent a recursive handle_userfault() invocation, which I believe can be solved by adding a new VMF flag to ignore the userfault path when the fault handler is called from userfault_continue(). I'm open to a more elegant solution though.
It sounds working to me. Adding fault flag can also be seen as part of extension of vm_operations_struct ops. So we could consider reusing fault() API indeed.
Great!
Regarding usage of the MINOR notification, in what case do you recommend sending it? If following the logic implemented in shmem and hugetlb, ie if the page is _present_ in the pagecache, I can't see how it is going to work
It could be confusing when reading that chunk of code, because it looks like it notifies minor fault when cache hit. But the critical part here is that we rely on the pgtable missing causing the fault() to trigger first. So it's more like "cache hit && pgtable missing" for minor fault.
Right, but the cache hit still looks like a precondition for the minor fault event?
with the write syscall, as we'd like to know when the page is _missing_ in order to respond with the population via the write. If going against shmem/hugetlb logic, and sending the MINOR event when the page is missing from the pagecache, how would it solve the race condition problem?
Should be easier we stick with mmap() rather than write(). E.g. for shmem case of current code base:
if (folio && vma && userfaultfd_minor(vma)) { if (!xa_is_value(folio)) folio_put(folio); *fault_type = handle_userfault(vmf, VM_UFFD_MINOR); return 0; }
vma is only availble if vmf!=NULL, aka in fault context. With that, in write() to shmem inodes, nothing will generate a message, because minor fault so far is only about pgtable missing. It needs to be mmap()ed first, and has nothing yet to do with write() syscalls.
Yes, that's true that write() itself isn't going to generate a message. My idea was to _respond_ to a message generated by the fault handler (vmf != NULL) with a write(). I didn't mean to generate it from write().
What I wanted to achieve was send a message on fault + cache miss and respond to the message with a write() to fill the cache followed by a UFFDIO_CONTINUE to set up pagetables. I understand that a MINOR trap (MINOR + UFFDIO_CONTINUE) is preferable, but how does it fit into this model? What/how will guarantee a cache hit that would trigger the MINOR message?
To clarify, I would like to be able to populate pages _on-demand_, not only proactively (like in the original UFFDIO_CONTINUE cover letter [1]). Do you think the MINOR trap could still be applicable or would it necessarily require the MISSING trap?
[1] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@...
Also, where would the check for the folio_test_uptodate() mentioned by James fit into here? Would it only be used for fortifying the MINOR (present) against the race?
When Axel added minor fault, it's not a major concern as it's the only fs that will consume the feature anyway in the do_fault() path - hugetlbfs has its own path to take care of.. even until now.
And there's some valid points too if someone would argue to put it there especially on folio lock - do that in shmem.c can avoid taking folio lock when generating minor fault message. It might make some difference when the faults are heavy and when folio lock is frequently taken elsewhere too.
Peter, could you expand on this? Are you referring to the following (shmem_get_folio_gfp)?
if (folio) { folio_lock(folio); /* Has the folio been truncated or swapped out? */ if (unlikely(folio->mapping != inode->i_mapping)) { folio_unlock(folio); folio_put(folio); goto repeat; } if (sgp == SGP_WRITE) folio_mark_accessed(folio); if (folio_test_uptodate(folio)) goto out; /* fallocated folio */ if (sgp != SGP_READ) goto clear; folio_unlock(folio); folio_put(folio); }
Could you explain in what case the lock can be avoided? AFAIC, the function is called by both the shmem fault handler and userfault_continue().
I think you meant the UFFDIO_CONTINUE side of things. I agree with you, we always need the folio lock.
What I was saying is the trapping side, where the minor fault message can be generated without the folio lock now in case of shmem. It's about whether we could generalize the trapping side, so handle_mm_fault() can generate the minor fault message instead of by shmem.c.
If the only concern is "referring to a module symbol from core mm", then indeed the trapping side should be less of a concern anyway, because the trapping side (when in the module codes) should always be able to reference mm functions.
Actually.. if we have a fault() flag introduced above, maybe we can generalize the trap side altogether without the folio lock overhead. When the flag set, if we can always return the folio unlocked (as long as refcount held), then in UFFDIO_CONTINUE ioctl we can lock it.
Where does this locking happen exactly during trapping? I was thinking it was only done when the page was allocated. The trapping part (quoted by you above) only looks up the page in the cache and calls handle_userfault(). Am I missing something?
It might boil down to how many more FSes would support minor fault, and whether we would care about such difference at last to shmem users. If gmem is the only one after existing ones, IIUC there's still option we implement it in gmem code. After all, I expect the change should be very under control (<20 LOCs?)..
-- Peter Xu
-- Peter Xu
On Tue, Mar 11, 2025 at 04:56:47PM +0000, Nikita Kalyazin wrote:
On 10/03/2025 19:57, Peter Xu wrote:
On Mon, Mar 10, 2025 at 06:12:22PM +0000, Nikita Kalyazin wrote:
On 05/03/2025 20:29, Peter Xu wrote:
On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
I think it might be useful to implement an fs-generic MINOR mode. The fault handler is already easy enough to do generically (though it would become more difficult to determine if the "MINOR" fault is actually a MISSING fault, but at least for my userspace, the distinction isn't important. :)) So the question becomes: what should UFFDIO_CONTINUE look like?
And I think it would be nice if UFFDIO_CONTINUE just called vm_ops->fault() to get the page we want to map and then mapped it, instead of having shmem-specific and hugetlb-specific versions (though maybe we need to keep the hugetlb specialization...). That would avoid putting kvm/gmem/etc. symbols in mm/userfaultfd code.
I've actually wanted to do this for a while but haven't had a good reason to pursue it. I wonder if it can be done in a backwards-compatible fashion...
Yes I also thought about that. :)
Hi Peter, hi James. Thanks for pointing at the race condition!
I did some experimentation and it indeed looks possible to call vm_ops->fault() from userfault_continue() to make it generic and decouple from KVM, at least for non-hugetlb cases. One thing is we'd need to prevent a recursive handle_userfault() invocation, which I believe can be solved by adding a new VMF flag to ignore the userfault path when the fault handler is called from userfault_continue(). I'm open to a more elegant solution though.
It sounds working to me. Adding fault flag can also be seen as part of extension of vm_operations_struct ops. So we could consider reusing fault() API indeed.
Great!
Regarding usage of the MINOR notification, in what case do you recommend sending it? If following the logic implemented in shmem and hugetlb, ie if the page is _present_ in the pagecache, I can't see how it is going to work
It could be confusing when reading that chunk of code, because it looks like it notifies minor fault when cache hit. But the critical part here is that we rely on the pgtable missing causing the fault() to trigger first. So it's more like "cache hit && pgtable missing" for minor fault.
Right, but the cache hit still looks like a precondition for the minor fault event?
Yes.
with the write syscall, as we'd like to know when the page is _missing_ in order to respond with the population via the write. If going against shmem/hugetlb logic, and sending the MINOR event when the page is missing from the pagecache, how would it solve the race condition problem?
Should be easier we stick with mmap() rather than write(). E.g. for shmem case of current code base:
if (folio && vma && userfaultfd_minor(vma)) { if (!xa_is_value(folio)) folio_put(folio); *fault_type = handle_userfault(vmf, VM_UFFD_MINOR); return 0; }
vma is only availble if vmf!=NULL, aka in fault context. With that, in write() to shmem inodes, nothing will generate a message, because minor fault so far is only about pgtable missing. It needs to be mmap()ed first, and has nothing yet to do with write() syscalls.
Yes, that's true that write() itself isn't going to generate a message. My idea was to _respond_ to a message generated by the fault handler (vmf != NULL) with a write(). I didn't mean to generate it from write().
What I wanted to achieve was send a message on fault + cache miss and respond to the message with a write() to fill the cache followed by a UFFDIO_CONTINUE to set up pagetables. I understand that a MINOR trap (MINOR
- UFFDIO_CONTINUE) is preferable, but how does it fit into this model?
What/how will guarantee a cache hit that would trigger the MINOR message?
To clarify, I would like to be able to populate pages _on-demand_, not only proactively (like in the original UFFDIO_CONTINUE cover letter [1]). Do you think the MINOR trap could still be applicable or would it necessarily require the MISSING trap?
I think MINOR can also achieve similar things. MINOR traps the pgtable missing event (let's imagine page cache is already populated, or at least when MISSING mode not registered, it'll auto-populate on 1st access). So as long as the content can only be accessed from the pgtable (either via mmap() or GUP on top of it), then afaiu it could work similarly like MISSING faults, because anything trying to access it will be trapped.
Said that, we can also choose to implement MISSING first. In that case write() is definitely not enough, because MISSING is at least so far based on top of whether the page cache present, and write() won't be atomic on update a page. We need to implement UFFDIO_COPY for gmemfd MISSING.
Either way looks ok to me.
[1] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@...
Also, where would the check for the folio_test_uptodate() mentioned by James fit into here? Would it only be used for fortifying the MINOR (present) against the race?
When Axel added minor fault, it's not a major concern as it's the only fs that will consume the feature anyway in the do_fault() path - hugetlbfs has its own path to take care of.. even until now.
And there's some valid points too if someone would argue to put it there especially on folio lock - do that in shmem.c can avoid taking folio lock when generating minor fault message. It might make some difference when the faults are heavy and when folio lock is frequently taken elsewhere too.
Peter, could you expand on this? Are you referring to the following (shmem_get_folio_gfp)?
if (folio) { folio_lock(folio); /* Has the folio been truncated or swapped out? */ if (unlikely(folio->mapping != inode->i_mapping)) { folio_unlock(folio); folio_put(folio); goto repeat; } if (sgp == SGP_WRITE) folio_mark_accessed(folio); if (folio_test_uptodate(folio)) goto out; /* fallocated folio */ if (sgp != SGP_READ) goto clear; folio_unlock(folio); folio_put(folio); }
[1]
Could you explain in what case the lock can be avoided? AFAIC, the function is called by both the shmem fault handler and userfault_continue().
I think you meant the UFFDIO_CONTINUE side of things. I agree with you, we always need the folio lock.
What I was saying is the trapping side, where the minor fault message can be generated without the folio lock now in case of shmem. It's about whether we could generalize the trapping side, so handle_mm_fault() can generate the minor fault message instead of by shmem.c.
If the only concern is "referring to a module symbol from core mm", then indeed the trapping side should be less of a concern anyway, because the trapping side (when in the module codes) should always be able to reference mm functions.
Actually.. if we have a fault() flag introduced above, maybe we can generalize the trap side altogether without the folio lock overhead. When the flag set, if we can always return the folio unlocked (as long as refcount held), then in UFFDIO_CONTINUE ioctl we can lock it.
Where does this locking happen exactly during trapping? I was thinking it was only done when the page was allocated. The trapping part (quoted by you above) only looks up the page in the cache and calls handle_userfault(). Am I missing something?
That's only what I worry if we want to reuse fault() to generalize the trap code in core mm, because fault() by default takes the folio lock at least for shmem. I agree the folio doesn't need locking when trapping the fault and sending the message.
Thanks,
It might boil down to how many more FSes would support minor fault, and whether we would care about such difference at last to shmem users. If gmem is the only one after existing ones, IIUC there's still option we implement it in gmem code. After all, I expect the change should be very under control (<20 LOCs?)..
-- Peter Xu
-- Peter Xu
On 12/03/2025 15:45, Peter Xu wrote:
On Tue, Mar 11, 2025 at 04:56:47PM +0000, Nikita Kalyazin wrote:
On 10/03/2025 19:57, Peter Xu wrote:
On Mon, Mar 10, 2025 at 06:12:22PM +0000, Nikita Kalyazin wrote:
On 05/03/2025 20:29, Peter Xu wrote:
On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
I think it might be useful to implement an fs-generic MINOR mode. The fault handler is already easy enough to do generically (though it would become more difficult to determine if the "MINOR" fault is actually a MISSING fault, but at least for my userspace, the distinction isn't important. :)) So the question becomes: what should UFFDIO_CONTINUE look like?
And I think it would be nice if UFFDIO_CONTINUE just called vm_ops->fault() to get the page we want to map and then mapped it, instead of having shmem-specific and hugetlb-specific versions (though maybe we need to keep the hugetlb specialization...). That would avoid putting kvm/gmem/etc. symbols in mm/userfaultfd code.
I've actually wanted to do this for a while but haven't had a good reason to pursue it. I wonder if it can be done in a backwards-compatible fashion...
Yes I also thought about that. :)
Hi Peter, hi James. Thanks for pointing at the race condition!
I did some experimentation and it indeed looks possible to call vm_ops->fault() from userfault_continue() to make it generic and decouple from KVM, at least for non-hugetlb cases. One thing is we'd need to prevent a recursive handle_userfault() invocation, which I believe can be solved by adding a new VMF flag to ignore the userfault path when the fault handler is called from userfault_continue(). I'm open to a more elegant solution though.
It sounds working to me. Adding fault flag can also be seen as part of extension of vm_operations_struct ops. So we could consider reusing fault() API indeed.
Great!
Regarding usage of the MINOR notification, in what case do you recommend sending it? If following the logic implemented in shmem and hugetlb, ie if the page is _present_ in the pagecache, I can't see how it is going to work
It could be confusing when reading that chunk of code, because it looks like it notifies minor fault when cache hit. But the critical part here is that we rely on the pgtable missing causing the fault() to trigger first. So it's more like "cache hit && pgtable missing" for minor fault.
Right, but the cache hit still looks like a precondition for the minor fault event?
Yes.
with the write syscall, as we'd like to know when the page is _missing_ in order to respond with the population via the write. If going against shmem/hugetlb logic, and sending the MINOR event when the page is missing from the pagecache, how would it solve the race condition problem?
Should be easier we stick with mmap() rather than write(). E.g. for shmem case of current code base:
if (folio && vma && userfaultfd_minor(vma)) { if (!xa_is_value(folio)) folio_put(folio); *fault_type = handle_userfault(vmf, VM_UFFD_MINOR); return 0; }
vma is only availble if vmf!=NULL, aka in fault context. With that, in write() to shmem inodes, nothing will generate a message, because minor fault so far is only about pgtable missing. It needs to be mmap()ed first, and has nothing yet to do with write() syscalls.
Yes, that's true that write() itself isn't going to generate a message. My idea was to _respond_ to a message generated by the fault handler (vmf != NULL) with a write(). I didn't mean to generate it from write().
What I wanted to achieve was send a message on fault + cache miss and respond to the message with a write() to fill the cache followed by a UFFDIO_CONTINUE to set up pagetables. I understand that a MINOR trap (MINOR
- UFFDIO_CONTINUE) is preferable, but how does it fit into this model?
What/how will guarantee a cache hit that would trigger the MINOR message?
To clarify, I would like to be able to populate pages _on-demand_, not only proactively (like in the original UFFDIO_CONTINUE cover letter [1]). Do you think the MINOR trap could still be applicable or would it necessarily require the MISSING trap?
I think MINOR can also achieve similar things. MINOR traps the pgtable missing event (let's imagine page cache is already populated, or at least when MISSING mode not registered, it'll auto-populate on 1st access). So
However if MISSING is not registered, the kernel will auto-populate with a clear page, ie there is no way to inject custom content from userspace. To explain my use case a bit more, the population thread will be trying to copy all guest memory proactively, but there will inevitably be cases where a page is accessed through pgtables _before_ it gets populated. It is not desirable for such access to result in a clear page provided by the kernel.
as long as the content can only be accessed from the pgtable (either via mmap() or GUP on top of it), then afaiu it could work similarly like MISSING faults, because anything trying to access it will be trapped.
Said that, we can also choose to implement MISSING first. In that case write() is definitely not enough, because MISSING is at least so far based on top of whether the page cache present, and write() won't be atomic on update a page. We need to implement UFFDIO_COPY for gmemfd MISSING.
Either way looks ok to me.
Yes, I understand that write() doesn't provide an atomic way of alloc + add + install PTE. Supporting UFFDIO_COPY is much more involved as it currently provides implementations specific to anonymous and shared memory, and adding guest_memfd to it brings the problem of the dependency on KVM back. I suppose it's possible to abstract those by introducing extra callbacks in vm_ops somehow and make the code generic, but it would be a significant change. If this is the only right way to address my use case, I will work on it.
[1] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@...
Also, where would the check for the folio_test_uptodate() mentioned by James fit into here? Would it only be used for fortifying the MINOR (present) against the race?
When Axel added minor fault, it's not a major concern as it's the only fs that will consume the feature anyway in the do_fault() path - hugetlbfs has its own path to take care of.. even until now.
And there's some valid points too if someone would argue to put it there especially on folio lock - do that in shmem.c can avoid taking folio lock when generating minor fault message. It might make some difference when the faults are heavy and when folio lock is frequently taken elsewhere too.
Peter, could you expand on this? Are you referring to the following (shmem_get_folio_gfp)?
if (folio) { folio_lock(folio); /* Has the folio been truncated or swapped out? */ if (unlikely(folio->mapping != inode->i_mapping)) { folio_unlock(folio); folio_put(folio); goto repeat; } if (sgp == SGP_WRITE) folio_mark_accessed(folio); if (folio_test_uptodate(folio)) goto out; /* fallocated folio */ if (sgp != SGP_READ) goto clear; folio_unlock(folio); folio_put(folio); }
[1]
Could you explain in what case the lock can be avoided? AFAIC, the function is called by both the shmem fault handler and userfault_continue().
I think you meant the UFFDIO_CONTINUE side of things. I agree with you, we always need the folio lock.
What I was saying is the trapping side, where the minor fault message can be generated without the folio lock now in case of shmem. It's about whether we could generalize the trapping side, so handle_mm_fault() can generate the minor fault message instead of by shmem.c.
If the only concern is "referring to a module symbol from core mm", then indeed the trapping side should be less of a concern anyway, because the trapping side (when in the module codes) should always be able to reference mm functions.
Actually.. if we have a fault() flag introduced above, maybe we can generalize the trap side altogether without the folio lock overhead. When the flag set, if we can always return the folio unlocked (as long as refcount held), then in UFFDIO_CONTINUE ioctl we can lock it.
Where does this locking happen exactly during trapping? I was thinking it was only done when the page was allocated. The trapping part (quoted by you above) only looks up the page in the cache and calls handle_userfault(). Am I missing something?
That's only what I worry if we want to reuse fault() to generalize the trap code in core mm, because fault() by default takes the folio lock at least for shmem. I agree the folio doesn't need locking when trapping the fault and sending the message.
Ok, I think I understand what you mean now. Thanks for explaining that.
Thanks,
It might boil down to how many more FSes would support minor fault, and whether we would care about such difference at last to shmem users. If gmem is the only one after existing ones, IIUC there's still option we implement it in gmem code. After all, I expect the change should be very under control (<20 LOCs?)..
-- Peter Xu
-- Peter Xu
-- Peter Xu
On Wed, Mar 12, 2025 at 05:07:25PM +0000, Nikita Kalyazin wrote:
However if MISSING is not registered, the kernel will auto-populate with a clear page, ie there is no way to inject custom content from userspace. To explain my use case a bit more, the population thread will be trying to copy all guest memory proactively, but there will inevitably be cases where a page is accessed through pgtables _before_ it gets populated. It is not desirable for such access to result in a clear page provided by the kernel.
IMHO populating with a zero page in the page cache is fine. It needs to make sure all accesses will go via the pgtable, as discussed below in my previous email [1], then nobody will be able to see the zero page, not until someone updates the content then follow up with a CONTINUE to install the pgtable entry.
If there is any way that the page can be accessed without the pgtable installation, minor faults won't work indeed.
as long as the content can only be accessed from the pgtable (either via mmap() or GUP on top of it), then afaiu it could work similarly like MISSING faults, because anything trying to access it will be trapped.
[1]
On 12/03/2025 19:32, Peter Xu wrote:
On Wed, Mar 12, 2025 at 05:07:25PM +0000, Nikita Kalyazin wrote:
However if MISSING is not registered, the kernel will auto-populate with a clear page, ie there is no way to inject custom content from userspace. To explain my use case a bit more, the population thread will be trying to copy all guest memory proactively, but there will inevitably be cases where a page is accessed through pgtables _before_ it gets populated. It is not desirable for such access to result in a clear page provided by the kernel.
IMHO populating with a zero page in the page cache is fine. It needs to make sure all accesses will go via the pgtable, as discussed below in my previous email [1], then nobody will be able to see the zero page, not until someone updates the content then follow up with a CONTINUE to install the pgtable entry.
If there is any way that the page can be accessed without the pgtable installation, minor faults won't work indeed.
I think I see what you mean now. I agree, it isn't the end of the world if the kernel clears the page and then userspace overwrites it.
The way I see it is:
@@ -400,20 +401,26 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf) if (WARN_ON_ONCE(folio_test_large(folio))) { ret = VM_FAULT_SIGBUS; goto out_folio; }
if (!folio_test_uptodate(folio)) { clear_highpage(folio_page(folio, 0)); kvm_gmem_mark_prepared(folio); }
+ if (userfaultfd_minor(vmf->vma)) { + folio_unlock(folio); + filemap_invalidate_unlock_shared(inode->i_mapping); + return handle_userfault(vmf, VM_UFFD_MISSING); + } + vmf->page = folio_file_page(folio, vmf->pgoff);
out_folio: if (ret != VM_FAULT_LOCKED) { folio_unlock(folio); folio_put(folio); }
On the first fault (cache miss), the kernel will allocate/add/clear the page (as there is no MISSING trap now), and once the page is in the cache, a MINOR event will be sent for userspace to copy its content. Please let me know if this is an acceptable semantics.
Since userspace is getting notified after KVM calls kvm_gmem_mark_prepared(), which removes the page from the direct map [1], userspace can't use write() to populate the content because write() relies on direct map [2]. However userspace can do a plain memcpy that would use user pagetables instead. This forces userspace to respond to stage-2 and VMA faults in guest_memfd differently, via write() and memcpy respectively. It doesn't seem like a significant problem though.
I believe, with this approach the original race condition is gone because UFFD messages are only sent on cache hit and it is up to userspace to serialise writes. Please correct me if I'm wrong here.
[1] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/#... [2] https://lore.kernel.org/kvm/20241129123929.64790-1-kalyazin@amazon.com/T/#mf...
as long as the content can only be accessed from the pgtable (either via mmap() or GUP on top of it), then afaiu it could work similarly like MISSING faults, because anything trying to access it will be trapped.
[1]
-- Peter Xu
On Thu, Mar 13, 2025 at 03:25:16PM +0000, Nikita Kalyazin wrote:
On 12/03/2025 19:32, Peter Xu wrote:
On Wed, Mar 12, 2025 at 05:07:25PM +0000, Nikita Kalyazin wrote:
However if MISSING is not registered, the kernel will auto-populate with a clear page, ie there is no way to inject custom content from userspace. To explain my use case a bit more, the population thread will be trying to copy all guest memory proactively, but there will inevitably be cases where a page is accessed through pgtables _before_ it gets populated. It is not desirable for such access to result in a clear page provided by the kernel.
IMHO populating with a zero page in the page cache is fine. It needs to make sure all accesses will go via the pgtable, as discussed below in my previous email [1], then nobody will be able to see the zero page, not until someone updates the content then follow up with a CONTINUE to install the pgtable entry.
If there is any way that the page can be accessed without the pgtable installation, minor faults won't work indeed.
I think I see what you mean now. I agree, it isn't the end of the world if the kernel clears the page and then userspace overwrites it.
The way I see it is:
@@ -400,20 +401,26 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf) if (WARN_ON_ONCE(folio_test_large(folio))) { ret = VM_FAULT_SIGBUS; goto out_folio; }
if (!folio_test_uptodate(folio)) { clear_highpage(folio_page(folio, 0)); kvm_gmem_mark_prepared(folio); }
if (userfaultfd_minor(vmf->vma)) {
folio_unlock(folio);
filemap_invalidate_unlock_shared(inode->i_mapping);
return handle_userfault(vmf, VM_UFFD_MISSING);
}
I suppose you meant s/MISSING/MINOR/.
vmf->page = folio_file_page(folio, vmf->pgoff);
out_folio: if (ret != VM_FAULT_LOCKED) { folio_unlock(folio); folio_put(folio); }
On the first fault (cache miss), the kernel will allocate/add/clear the page (as there is no MISSING trap now), and once the page is in the cache, a MINOR event will be sent for userspace to copy its content. Please let me know if this is an acceptable semantics.
Since userspace is getting notified after KVM calls kvm_gmem_mark_prepared(), which removes the page from the direct map [1], userspace can't use write() to populate the content because write() relies on direct map [2]. However userspace can do a plain memcpy that would use user pagetables instead. This forces userspace to respond to stage-2 and VMA faults in guest_memfd differently, via write() and memcpy respectively. It doesn't seem like a significant problem though.
It looks ok in general, but could you remind me why you need to stick with write() syscall?
IOW, if gmemfd will always need mmap() and it's fully accessible from userspace in your use case, wouldn't mmap()+memcpy() always work already, and always better than write()?
Thanks,
I believe, with this approach the original race condition is gone because UFFD messages are only sent on cache hit and it is up to userspace to serialise writes. Please correct me if I'm wrong here.
[1] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/#... [2] https://lore.kernel.org/kvm/20241129123929.64790-1-kalyazin@amazon.com/T/#mf...
as long as the content can only be accessed from the pgtable (either via mmap() or GUP on top of it), then afaiu it could work similarly like MISSING faults, because anything trying to access it will be trapped.
[1]
-- Peter Xu
On 13/03/2025 19:12, Peter Xu wrote:
On Thu, Mar 13, 2025 at 03:25:16PM +0000, Nikita Kalyazin wrote:
On 12/03/2025 19:32, Peter Xu wrote:
On Wed, Mar 12, 2025 at 05:07:25PM +0000, Nikita Kalyazin wrote:
However if MISSING is not registered, the kernel will auto-populate with a clear page, ie there is no way to inject custom content from userspace. To explain my use case a bit more, the population thread will be trying to copy all guest memory proactively, but there will inevitably be cases where a page is accessed through pgtables _before_ it gets populated. It is not desirable for such access to result in a clear page provided by the kernel.
IMHO populating with a zero page in the page cache is fine. It needs to make sure all accesses will go via the pgtable, as discussed below in my previous email [1], then nobody will be able to see the zero page, not until someone updates the content then follow up with a CONTINUE to install the pgtable entry.
If there is any way that the page can be accessed without the pgtable installation, minor faults won't work indeed.
I think I see what you mean now. I agree, it isn't the end of the world if the kernel clears the page and then userspace overwrites it.
The way I see it is:
@@ -400,20 +401,26 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf) if (WARN_ON_ONCE(folio_test_large(folio))) { ret = VM_FAULT_SIGBUS; goto out_folio; }
if (!folio_test_uptodate(folio)) { clear_highpage(folio_page(folio, 0)); kvm_gmem_mark_prepared(folio); }
if (userfaultfd_minor(vmf->vma)) {
folio_unlock(folio);
filemap_invalidate_unlock_shared(inode->i_mapping);
return handle_userfault(vmf, VM_UFFD_MISSING);
}
I suppose you meant s/MISSING/MINOR/.
Yes, that's what I meant, thank you.
vmf->page = folio_file_page(folio, vmf->pgoff);
out_folio: if (ret != VM_FAULT_LOCKED) { folio_unlock(folio); folio_put(folio); }
On the first fault (cache miss), the kernel will allocate/add/clear the page (as there is no MISSING trap now), and once the page is in the cache, a MINOR event will be sent for userspace to copy its content. Please let me know if this is an acceptable semantics.
Since userspace is getting notified after KVM calls kvm_gmem_mark_prepared(), which removes the page from the direct map [1], userspace can't use write() to populate the content because write() relies on direct map [2]. However userspace can do a plain memcpy that would use user pagetables instead. This forces userspace to respond to stage-2 and VMA faults in guest_memfd differently, via write() and memcpy respectively. It doesn't seem like a significant problem though.
It looks ok in general, but could you remind me why you need to stick with write() syscall?
IOW, if gmemfd will always need mmap() and it's fully accessible from userspace in your use case, wouldn't mmap()+memcpy() always work already, and always better than write()?
Yes, that's right, mmap() + memcpy() is functionally sufficient. write() is an optimisation. Most of the pages in guest_memfd are only ever accessed by the vCPU (not userspace) via TDP (stage-2 pagetables) so they don't need userspace pagetables set up. By using write() we can avoid VMA faults, installing corresponding PTEs and double page initialisation we discussed earlier. The optimised path only contains pagecache population via write(). Even TDP faults can be avoided if using KVM prefaulting API [1].
[1] https://docs.kernel.org/virt/kvm/api.html#kvm-pre-fault-memory
Thanks,
I believe, with this approach the original race condition is gone because UFFD messages are only sent on cache hit and it is up to userspace to serialise writes. Please correct me if I'm wrong here.
[1] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/#... [2] https://lore.kernel.org/kvm/20241129123929.64790-1-kalyazin@amazon.com/T/#mf...
as long as the content can only be accessed from the pgtable (either via mmap() or GUP on top of it), then afaiu it could work similarly like MISSING faults, because anything trying to access it will be trapped.
[1]
-- Peter Xu
-- Peter Xu
On Thu, Mar 13, 2025 at 10:13:23PM +0000, Nikita Kalyazin wrote:
Yes, that's right, mmap() + memcpy() is functionally sufficient. write() is an optimisation. Most of the pages in guest_memfd are only ever accessed by the vCPU (not userspace) via TDP (stage-2 pagetables) so they don't need userspace pagetables set up. By using write() we can avoid VMA faults, installing corresponding PTEs and double page initialisation we discussed earlier. The optimised path only contains pagecache population via write(). Even TDP faults can be avoided if using KVM prefaulting API [1].
[1] https://docs.kernel.org/virt/kvm/api.html#kvm-pre-fault-memory
Could you elaborate why VMA faults matters in perf?
If we're talking about postcopy-like migrations on top of KVM guest-memfd, IIUC the VMAs can be pre-faulted too just like the TDP pgtables, e.g. with MADV_POPULATE_WRITE.
Normally, AFAIU userapp optimizes IOs the other way round.. to change write()s into mmap()s, which at least avoids one round of copy.
For postcopy using minor traps (and since guest-memfd is always shared and non-private..), it's also possible to feed the mmap()ed VAs to NIC as buffers (e.g. in recvmsg(), for example, as part of iovec[]), and as long as the mmap()ed ranges are not registered by KVM memslots, there's no concern on non-atomic copy.
Thanks,
On 13/03/2025 22:38, Peter Xu wrote:
On Thu, Mar 13, 2025 at 10:13:23PM +0000, Nikita Kalyazin wrote:
Yes, that's right, mmap() + memcpy() is functionally sufficient. write() is an optimisation. Most of the pages in guest_memfd are only ever accessed by the vCPU (not userspace) via TDP (stage-2 pagetables) so they don't need userspace pagetables set up. By using write() we can avoid VMA faults, installing corresponding PTEs and double page initialisation we discussed earlier. The optimised path only contains pagecache population via write(). Even TDP faults can be avoided if using KVM prefaulting API [1].
[1] https://docs.kernel.org/virt/kvm/api.html#kvm-pre-fault-memory
Could you elaborate why VMA faults matters in perf?
Based on my experiments, I can populate 3GiB of guest_memfd with write() in 980 ms, while memcpy takes 2140 ms. When I was profiling it, I saw ~63% of memcpy time spent in the exception handler, which made me think VMA faults mattered.
If we're talking about postcopy-like migrations on top of KVM guest-memfd, IIUC the VMAs can be pre-faulted too just like the TDP pgtables, e.g. with MADV_POPULATE_WRITE.
Yes, I was thinking about MADV_POPULATE_WRITE as well, but AFAIK it isn't available in guest_memfd, at least with direct map removed due to [1] being updated in [2]:
diff --git a/mm/gup.c b/mm/gup.c index 3883b307780e..7ddaf93c5b6a 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1283,7 +1283,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) return -EOPNOTSUPP;
- if (vma_is_secretmem(vma)) + if (vma_is_secretmem(vma) || vma_is_no_direct_map(vma)) return -EFAULT;
if (write) {
[1] https://elixir.bootlin.com/linux/v6.13.6/source/mm/gup.c#L1286 [2] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/#...
Normally, AFAIU userapp optimizes IOs the other way round.. to change write()s into mmap()s, which at least avoids one round of copy.
For postcopy using minor traps (and since guest-memfd is always shared and non-private..), it's also possible to feed the mmap()ed VAs to NIC as buffers (e.g. in recvmsg(), for example, as part of iovec[]), and as long as the mmap()ed ranges are not registered by KVM memslots, there's no concern on non-atomic copy.
Yes, I see what you mean. It may be faster depending on the setup, if it's possible to remove one copy.
Anyway, it looks like the solution we discussed allows to choose between memcpy-only and memcpy/write-combined userspace implementations. I'm going to work on the next version of the series that would include MINOR trap and avoiding KVM dependency in mm via calling vm_ops->fault() in UFFDIO_CONTINUE.
Thanks,
-- Peter Xu
On Fri, Mar 14, 2025 at 05:12:35PM +0000, Nikita Kalyazin wrote:
Yes, I was thinking about MADV_POPULATE_WRITE as well, but AFAIK it isn't available in guest_memfd, at least with direct map removed due to [1] being updated in [2]:
I see, so GUP is no-go. IIUC the userapp can also prefault by writing zeros in a loop after mmap().
Thanks,
On Fri, Mar 14, 2025 at 05:12:35PM +0000, Nikita Kalyazin wrote:
Anyway, it looks like the solution we discussed allows to choose between memcpy-only and memcpy/write-combined userspace implementations. I'm going to work on the next version of the series that would include MINOR trap and avoiding KVM dependency in mm via calling vm_ops->fault() in UFFDIO_CONTINUE.
I'll attach some more context, not directly relevant to this series, but just FYI.
One thing I am not yet sure is whether ultimately we still need to register userfaultfd with another fd using offset ranges. The problem is whether there will be userfaultfd trapping demand on the pure private CoCo use case later. The only thing I'm not sure is if all guest-memfd use cases allow mmap(). If true, then maybe we can stick with the current UFFDIO_REGISTER on VA ranges.
In all cases, I think you can proceed with whatever you plan to do to add initial guest-memfd userfaultfd supports, as long as acceptable from KVM list.
The other thing is, what you're looking for indeed looks very close to what we may need. We want to have purely shared guest-memfd working just like vanilla memfd_create(), not only for 4K but for huge pages. We also want GUP working, so it can replace the old hugetlbfs use case.
I had a feeling that all the directions of guest-memfd recently happening on the list will ultimately need huge pages. It would be the same for you maybe, but only that your use case does not allow any permanant mapping that is visible to the kernel. Probably that's why GUP is forbidden but kmap isn't in your write()s; please bare with me if I made things wrong, I don't understand your use case well.
Just in case helpful, I have some PoC branches ready allowing 1G pages to be mapped to userspace.
https://github.com/xzpeter/linux/commits/peter-gmem-v0.2/
The work is based on Ackerley's 1G series, which contains most of the folio management part (but I fixed quite a few bugs in my tree; I believe Ackerley should have them fixed in his to-be-posted too). I also have a QEMU branch ready that can boot with it (I didn't yet test more things).
https://github.com/xzpeter/qemu/commits/peter-gmem-v0.2/
For example, besides guest-memfd alone, we definitely also need guest-memfd being trappable by userfaultfd, as what you are trying to do here, one way or another.
Thanks,
linux-kselftest-mirror@lists.linaro.org