This series is built on top of the v3 write syscall support [1].
With James's KVM userfault [2], it is possible to handle stage-2 faults in guest_memfd in userspace. However, KVM itself also triggers faults in guest_memfd in some cases, for example: PV interfaces like kvmclock and PV EOI, and the page table walking code when fetching the MMIO instruction on x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3] that KVM would be accessing those pages via userspace page tables. In order for such faults to be handled in userspace, guest_memfd needs to support userfaultfd.
This series proposes limited support for userfaultfd in guest_memfd:
- userfaultfd support is conditional on `CONFIG_KVM_GMEM_SHARED_MEM` (as is fault support in general)
- Only the `page missing` event is currently supported
- Userspace is supposed to respond to the event with the `write` syscall followed by the `UFFDIO_CONTINUE` ioctl to unblock the faulting process. Note that we can't use `UFFDIO_COPY` here because userfaultfd code does not know how to prepare guest_memfd pages, eg remove them from the direct map [4].
Not included in this series:
- Proper interface for userfaultfd to recognise guest_memfd mappings
- Proper handling of truncation cases after locking the page
Request for comments:
- Is it a sensible workflow for guest_memfd to resolve a userfault `page missing` event with `write` syscall + `UFFDIO_CONTINUE`? One of the alternatives is teaching `UFFDIO_COPY` how to deal with guest_memfd pages.
- What is the way forward to make userfaultfd code aware of guest_memfd? I saw that Patrick hit a somewhat similar problem in [5] when trying to use direct map manipulation functions in KVM and was pointed by David at Elliot's guestmem library [6], which might include a shim for that. Would the library be the right place to expose the required interfaces like `vma_is_gmem`?
Nikita
[1] https://lore.kernel.org/kvm/20250303130838.28812-1-kalyazin@amazon.com/T/
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/T...
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAos...
[4] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/
[5] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/#...
[6] https://lore.kernel.org/kvm/20241122-guestmem-library-v5-2-450e92951a15@quic...
Nikita Kalyazin (5):
  KVM: guest_memfd: add kvm_gmem_vma_is_gmem
  KVM: guest_memfd: add support for uffd missing
  mm: userfaultfd: allow to register userfaultfd for guest_memfd
  mm: userfaultfd: support continue for guest_memfd
  KVM: selftests: add uffd missing test for guest_memfd
 include/linux/userfaultfd_k.h                 |  9 ++
 mm/userfaultfd.c                              | 23 ++++-
 .../testing/selftests/kvm/guest_memfd_test.c  | 88 +++++++++++++++++++
 virt/kvm/guest_memfd.c                        | 17 +++-
 virt/kvm/kvm_mm.h                             |  1 +
 5 files changed, 136 insertions(+), 2 deletions(-)
base-commit: 592e7531753dc4b711f96cd1daf808fd493d3223
It will be used to distinguish the vma type in userfaultfd code. This likely needs to be done in the guestmem library.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 virt/kvm/guest_memfd.c | 5 +++++
 virt/kvm/kvm_mm.h      | 1 +
 2 files changed, 6 insertions(+)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index f93fe5835173..af825f7494ea 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -500,6 +500,11 @@ static ssize_t kvm_kmem_gmem_write(struct file *file, const char __user *buf,
 	return ret && start == (*offset >> PAGE_SHIFT) ?
 		ret : *offset - (start << PAGE_SHIFT);
 }
+
+bool kvm_gmem_vma_is_gmem(struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &kvm_gmem_vm_ops;
+}
 #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
 
 static struct file_operations kvm_gmem_fops = {
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index acef3f5c582a..09fcdf18a072 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -73,6 +73,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args);
 int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
 		  unsigned int fd, loff_t offset);
 void kvm_gmem_unbind(struct kvm_memory_slot *slot);
+bool kvm_gmem_vma_is_gmem(struct vm_area_struct *vma);
 #else
 static inline void kvm_gmem_init(struct module *module)
 {
Add support for sending a pagefault event if userfaultfd is registered. Only the page missing event is currently supported.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 virt/kvm/guest_memfd.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index af825f7494ea..358c3776ed66 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,9 @@
 #include <linux/kvm_host.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
+#ifdef CONFIG_KVM_PRIVATE_MEM
+#include <linux/userfaultfd_k.h>
+#endif /* CONFIG_KVM_PRIVATE_MEM */
 
 #include "kvm_mm.h"
 
@@ -332,9 +335,16 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 	struct folio *folio;
 	vm_fault_t ret = VM_FAULT_LOCKED;
 
+	folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
+	if (!folio && userfaultfd_missing(vmf->vma))
+		return handle_userfault(vmf, VM_UFFD_MISSING);
+	if (folio)
+		folio_lock(folio);
+
 	filemap_invalidate_lock_shared(inode->i_mapping);
 
-	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+	if (!folio)
+		folio = kvm_gmem_get_folio(inode, vmf->pgoff);
 	if (IS_ERR(folio)) {
 		switch (PTR_ERR(folio)) {
 		case -EAGAIN:
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 include/linux/userfaultfd_k.h | 9 +++++++++
 1 file changed, 9 insertions(+)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 75342022d144..440d38903359 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -20,6 +20,10 @@
 #include <asm-generic/pgtable_uffd.h>
 #include <linux/hugetlb_inline.h>
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+bool kvm_gmem_vma_is_gmem(struct vm_area_struct *vma);
+#endif
+
 /* The set of all possible UFFD-related VM flags. */
 #define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
 
@@ -242,6 +246,11 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
 		return false;
 #endif
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+	if (kvm_gmem_vma_is_gmem(vma))
+		return true;
+#endif
+
 	/* By default, allow any of anon|shmem|hugetlb */
 	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
 	       vma_is_shmem(vma);
When userspace receives a page missing event, it is supposed to populate the missing page in the guest_memfd page cache via the write syscall and unblock the faulting process via UFFDIO_CONTINUE.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 mm/userfaultfd.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index af3dfc3633db..aaff66a7f15b 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -19,6 +19,10 @@
 #include <asm/tlb.h>
 #include "internal.h"
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+bool kvm_gmem_vma_is_gmem(struct vm_area_struct *vma);
+#endif
+
 static __always_inline bool validate_dst_vma(struct vm_area_struct *dst_vma,
 					     unsigned long dst_end)
 {
@@ -391,6 +395,16 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 	struct page *page;
 	int ret;
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+	if (kvm_gmem_vma_is_gmem(dst_vma)) {
+		ret = 0;
+		folio = filemap_get_entry(inode->i_mapping, pgoff);
+		if (IS_ERR(folio))
+			ret = PTR_ERR(folio);
+		else
+			folio_lock(folio);
+	} else
+#endif
 	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
 	/* Our caller expects us to return -EFAULT if we failed to find folio */
 	if (ret == -ENOENT)
@@ -769,9 +783,16 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 		return mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
 					    src_start, len, flags);
 
-	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
+	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)
+#ifdef CONFIG_KVM_PRIVATE_MEM
+	    && !kvm_gmem_vma_is_gmem(dst_vma)
+#endif
+	   )
 		goto out_unlock;
 
 	if (!vma_is_shmem(dst_vma) &&
+#ifdef CONFIG_KVM_PRIVATE_MEM
+	    !kvm_gmem_vma_is_gmem(dst_vma) &&
+#endif
 	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
 		goto out_unlock;
The test demonstrates how a page missing event can be resolved via the write syscall followed by the UFFDIO_CONTINUE ioctl.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 .../testing/selftests/kvm/guest_memfd_test.c | 88 +++++++++++++++++++
 1 file changed, 88 insertions(+)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index b07221aa54c9..aea0e8627981 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -10,12 +10,16 @@
 #include <errno.h>
 #include <stdio.h>
 #include <fcntl.h>
+#include <pthread.h>
 
 #include <linux/bitmap.h>
 #include <linux/falloc.h>
+#include <linux/userfaultfd.h>
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
 
 #include "kvm_util.h"
 #include "test_util.h"
 
@@ -278,6 +282,88 @@ static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
 	close(fd1);
 }
 
+struct fault_args {
+	char *addr;
+	volatile char value;
+};
+
+static void *fault_thread_fn(void *arg)
+{
+	struct fault_args *args = arg;
+
+	/* Trigger page fault */
+	args->value = *args->addr;
+	return NULL;
+}
+
+static void test_uffd_missing(int fd, size_t page_size, size_t total_size)
+{
+	struct uffdio_register uffd_reg;
+	struct uffdio_continue uffd_cont;
+	struct uffd_msg msg;
+	struct fault_args args;
+	pthread_t fault_thread;
+	void *mem, *buf = NULL;
+	int uffd, ret;
+	off_t offset = page_size;
+	void *fault_addr;
+
+	ret = posix_memalign(&buf, page_size, total_size);
+	TEST_ASSERT_EQ(ret, 0);
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
+	TEST_ASSERT(uffd != -1, "userfaultfd creation should succeed");
+
+	struct uffdio_api uffdio_api = {
+		.api = UFFD_API,
+		.features = UFFD_FEATURE_MISSING_SHMEM,
+	};
+	ret = ioctl(uffd, UFFDIO_API, &uffdio_api);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_API) should succeed");
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap should succeed");
+
+	uffd_reg.range.start = (unsigned long)mem;
+	uffd_reg.range.len = total_size;
+	uffd_reg.mode = UFFDIO_REGISTER_MODE_MISSING;
+	ret = ioctl(uffd, UFFDIO_REGISTER, &uffd_reg);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_REGISTER) should succeed");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			offset, page_size);
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) should succeed");
+
+	fault_addr = mem + offset;
+	args.addr = fault_addr;
+
+	ret = pthread_create(&fault_thread, NULL, fault_thread_fn, &args);
+	TEST_ASSERT(ret == 0, "pthread_create should succeed");
+
+	ret = read(uffd, &msg, sizeof(msg));
+	TEST_ASSERT(ret != -1, "read from userfaultfd should succeed");
+	TEST_ASSERT(msg.event == UFFD_EVENT_PAGEFAULT, "event type should be pagefault");
+	TEST_ASSERT((void *)(msg.arg.pagefault.address & ~(page_size - 1)) == fault_addr,
+		    "pagefault should occur at expected address");
+
+	ret = pwrite(fd, buf + offset, page_size, offset);
+	TEST_ASSERT(ret == page_size, "write should succeed");
+
+	uffd_cont.range.start = (unsigned long)fault_addr;
+	uffd_cont.range.len = page_size;
+	uffd_cont.mode = 0;
+	ret = ioctl(uffd, UFFDIO_CONTINUE, &uffd_cont);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_CONTINUE) should succeed");
+
+	ret = pthread_join(fault_thread, NULL);
+	TEST_ASSERT(ret == 0, "pthread_join should succeed");
+
+	ret = munmap(mem, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+	free(buf);
+	close(uffd);
+}
+
 unsigned long get_shared_type(void)
 {
 #ifdef __x86_64__
@@ -316,6 +402,8 @@ void test_vm_type(unsigned long type, bool is_shared)
 	test_file_size(fd, page_size, total_size);
 	test_fallocate(fd, page_size, total_size);
 	test_invalid_punch_hole(fd, page_size, total_size);
+	if (is_shared)
+		test_uffd_missing(fd, page_size, total_size);
 
 	close(fd);
 	kvm_vm_release(vm);
On Mon, Mar 03, 2025 at 01:30:06PM +0000, Nikita Kalyazin wrote:
> This series is built on top of the v3 write syscall support [1].
> 
> With James's KVM userfault [2], it is possible to handle stage-2 faults in guest_memfd in userspace. However, KVM itself also triggers faults in guest_memfd in some cases, for example: PV interfaces like kvmclock, PV EOI and page table walking code when fetching the MMIO instruction on x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3] that KVM would be accessing those pages via userspace page tables. In order for such faults to be handled in userspace, guest_memfd needs to support userfaultfd.
> 
> This series proposes a limited support for userfaultfd in guest_memfd:
> - userfaultfd support is conditional to `CONFIG_KVM_GMEM_SHARED_MEM` (as is fault support in general)
> - Only `page missing` event is currently supported
> - Userspace is supposed to respond to the event with the `write` syscall followed by `UFFDIO_CONTINUE` ioctl to unblock the faulting process. Note that we can't use `UFFDIO_COPY` here because userfaultfd code does not know how to prepare guest_memfd pages, eg remove them from direct map [4].
> 
> Not included in this series:
> - Proper interface for userfaultfd to recognise guest_memfd mappings
> - Proper handling of truncation cases after locking the page
> 
> Request for comments:
> - Is it a sensible workflow for guest_memfd to resolve a userfault `page missing` event with `write` syscall + `UFFDIO_CONTINUE`? One of the alternatives is teaching `UFFDIO_COPY` how to deal with guest_memfd pages.
Probably not. I don't see what protects against a thread faulting concurrently while the write() is in progress and seeing partial data. Since you check the page cache, the fault will be let through, and the partially populated page will be faulted in.
I think we may need to either go with full MISSING or full MINOR traps.
One thing to mention is that we probably need MINOR sooner or later to support gmem huge pages. The thing is, for huge folios in gmem we can't rely on a missing page cache entry, as we always need to allocate in hugetlb sizes.
> - What is a way forward to make userfaultfd code aware of guest_memfd? I saw that Patrick hit a somewhat similar problem in [5] when trying to use direct map manipulation functions in KVM and was pointed by David at Elliot's guestmem library [6] that might include a shim for that. Would the library be the right place to expose required interfaces like `vma_is_gmem`?
Not sure what's best to do, but IIUC the approach this series takes may not work as long as it references a KVM symbol from core mm code..
One trick I have used so far is leveraging vm_ops and providing a hook function to report the specialties when it's gmem. In general, I did not yet dare to overload vm_area_struct, but I'm thinking maybe vm_ops is more likely to be accepted. E.g. something like this:
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5e742738240c..b068bb79fdbc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -653,8 +653,26 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+	/*
+	 * When set, return the allowed orders bitmask in faults of mmap()
+	 * ranges (e.g. for follow up huge_fault() processing).  Drivers
+	 * can use this to bypass THP setups for specific types of VMAs.
+	 */
+	unsigned long (*get_supported_orders)(struct vm_area_struct *vma);
 };
 
+static inline bool vma_has_supported_orders(struct vm_area_struct *vma)
+{
+	return vma->vm_ops && vma->vm_ops->get_supported_orders;
+}
+
+static inline unsigned long vma_get_supported_orders(struct vm_area_struct *vma)
+{
+	if (!vma_has_supported_orders(vma))
+		return 0;
+
+	return vma->vm_ops->get_supported_orders(vma);
+}
+
In my case I used that to allow gmem to report huge page support on faults.

That said, the above has only existed in my own tree so far, so I also don't know whether something like that could be accepted (even if it works for you).
Thanks,