Thanks for doing the dirty work!
On Fri, Jul 12, 2024, Ackerley Tng wrote:
Here’s an update from the Linux MM Alignment Session on July 10, 2024, 9-10am PDT:
The current direction is:
- Allow mmap() of ranges that cover both shared and private memory, but disallow faulting in of private pages
- On access to private pages, userspace will get some error, perhaps SIGBUS
- On shared to private conversions, unmap the page and decrease refcounts
Note, I would strike the "decrease refcounts" part, as putting references is a natural consequence of unmapping memory, not an explicit action guest_memfd will take when converting from shared=>private.
And more importantly, guest_memfd will wait for the refcount to hit zero (or whatever the baseline refcount is).
- To support huge pages, guest_memfd will take ownership of the hugepages, and provide interested parties (userspace, KVM, iommu) with pages to be used.
- guest_memfd will track usage of (sub)pages, for both private and shared memory
- Pages will be broken into smaller (probably 4K) chunks at creation time to simplify implementation (as opposed to splitting at runtime when private to shared conversion is requested by the guest)
FWIW, I doubt we'll ever release a version with mmap()+guest_memfd support that shatters pages at creation. I can see it being an intermediate step, e.g. to prove correctness and provide a bisection point, but shattering hugepages at creation would effectively make hugepage support useless.
I don't think we need to sort this out now though, as when the shattering (and potential reconstitution) occurs doesn't affect the overall direction in any way (AFAIK). I'm chiming in purely to stave off complaints that this would break hugepage support :-)
  + Core MM infrastructure will still be used to track page table mappings in mapcounts and other references (refcounts) per subpage
  + HugeTLB Vmemmap Optimization (HVO) is lost when pages are broken up - to be optimized later. Suggestions:
    + Use a tracking data structure other than struct page
    + Remove the memory for struct pages backing private memory from the vmemmap, and re-populate the vmemmap on conversion from private to shared
Implementation pointers for huge page support
- Consensus was that getting core MM to do tracking seems wrong
- Maintaining special page refcounts for guest_memfd pages is difficult to get working and requires weird special casing in many places. This was tried for FS DAX pages and did not work out: [1]
Implementation suggestion: use infrastructure similar to what ZONE_DEVICE uses to provide the huge pages to interested parties
- TBD: how to actually get huge pages into guest_memfd
- TBD: how to provide/convert the huge pages to ZONE_DEVICE
- Perhaps reserve them at boot time like in HugeTLB
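For reference, HugeTLB's existing reservation interfaces (documented in Documentation/admin-guide/mm/hugetlbpage.rst) look like the following; whether guest_memfd would reuse these knobs or grow its own is exactly the TBD above:

```shell
# Boot-time: reserve eight 1G pages on the kernel command line.
#   hugepagesz=1G hugepages=8

# Runtime: 1G pages can also be reserved via sysfs, though this may
# fail once memory is fragmented - hence boot-time reservation.
echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

# Check the pool.
grep HugePages_Total /proc/meminfo
```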
Line of sight to compaction/migration:
- Compaction here means making memory contiguous
- Compaction/migration scope:
- In scope for 4K pages
- Out of scope for 1G pages and anything managed through ZONE_DEVICE
- Out of scope for an initial implementation
- Ideas for future implementations
- Reuse the non-LRU page migration framework used by memory ballooning
- Have userspace drive compaction/migration via ioctls
- Having line of sight to optimizing lost HVO means avoiding being locked in to any implementation requiring struct pages
- Without struct pages, it is hard to reuse core MM’s compaction/migration infrastructure
More details will be discussed at LPC in Sep 2024, such as how huge pages will be used, shared/private conversion, and huge page splitting
This addresses the prerequisites set out by Fuad and Elliott at the beginning of the session, which were:
- Non-destructive shared/private conversion
- Through having guest_memfd manage and track both shared/private memory
- Huge page support with the option of converting individual subpages
- Splitting of pages will be managed by guest_memfd
- Line of sight to compaction/migration of private memory
- Possibly driven by userspace using guest_memfd ioctls
- Loading binaries into guest (private) memory before VM starts
- This was identified as a special case of (1.) above
- Non-protected guests in pKVM
- Not discussed during session, but this is a goal of guest_memfd, for all VM types [2]
David Hildenbrand summarized this during the meeting at t=47m25s [3].