On Thu, Mar 13, 2025 at 10:13:23PM +0000, Nikita Kalyazin wrote:
Yes, that's right, mmap() + memcpy() is functionally sufficient. write() is an optimisation. Most of the pages in guest_memfd are only ever accessed by the vCPU (not userspace) via TDP (stage-2 pagetables) so they don't need userspace pagetables set up. By using write() we can avoid VMA faults, installing corresponding PTEs and double page initialisation we discussed earlier. The optimised path only contains pagecache population via write(). Even TDP faults can be avoided if using KVM prefaulting API [1].
[1] https://docs.kernel.org/virt/kvm/api.html#kvm-pre-fault-memory
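For reference, the two population paths being compared above can be sketched roughly as follows. This is only an illustration, not code from the series: an unlinked temporary file stands in for a guest_memfd fd (write() on guest_memfd is what is being proposed, so it cannot be assumed to exist), and open_scratch(), populate_write() and populate_mmap() are hypothetical helper names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: an unlinked temp file stands in for a guest_memfd fd. */
static int open_scratch(size_t len)
{
	char path[] = "/tmp/populate-XXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Path 1: write()-style population. Fills the page cache only;
 * no VMA is created and no userspace PTEs are installed. */
static int populate_write(int fd, const void *src, size_t len, off_t off)
{
	const char *p = src;

	assert(src != NULL);
	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);

		if (n <= 0)
			return -1;
		p += n;
		off += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Path 2: mmap() + memcpy(). Functionally equivalent, but every page
 * touched takes a VMA fault and gets a userspace PTE.
 * off must be page-aligned. */
static int populate_mmap(int fd, const void *src, size_t len, off_t off)
{
	void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);

	if (dst == MAP_FAILED)
		return -1;
	memcpy(dst, src, len);
	return munmap(dst, len);
}
```

Both helpers leave the same bytes in the file; the difference is only in which kernel paths are exercised on the way.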
Could you elaborate on why VMA faults matter for performance?
If we're talking about postcopy-like migrations on top of KVM guest-memfd, IIUC the VMAs can be pre-faulted too just like the TDP pgtables, e.g. with MADV_POPULATE_WRITE.
Normally, AFAIU, userspace applications optimize I/O the other way round: they change write()s into mmap()s, which avoids at least one round of copying.
For postcopy using minor traps (and since guest-memfd is always shared and non-private), it's also possible to hand the mmap()ed VAs to the NIC as buffers (e.g. as part of the iovec[] passed to recvmsg()). As long as the mmap()ed ranges are not registered in KVM memslots, there's no concern about non-atomic copies.
Thanks,