On Wed, Jun 19, 2024, Fuad Tabba wrote:
> Hi Jason,
>
> On Wed, Jun 19, 2024 at 12:51 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > On Wed, Jun 19, 2024 at 10:11:35AM +0100, Fuad Tabba wrote:
> > > To be honest, personally (speaking only for myself, not necessarily for Elliot and not for anyone else in the pKVM team), I still would prefer to use guest_memfd(). I think that having one solution for confidential computing that rules them all would be best. But we do need to be able to share memory in place, have a plan for supporting huge pages in the near future, and migration in the not-too-distant future.
> > I think using a FD to control this special lifetime stuff is dramatically better than trying to force the MM to do it with struct page hacks.
> >
> > If you can't agree with the guest_memfd people on how to get there then maybe you need a guest_memfd2 for this slightly different special stuff instead of intruding on the core mm so much. (though that would be sad)
> >
> > We really need to be thinking more about containing these special things and not just sprinkling them everywhere.
>
> I agree that we need to agree :) This discussion has been going on since before LPC last year, and the consensus from the guest_memfd() folks (if I understood it correctly) is that guest_memfd() is what it is: designed for a specific type of confidential computing, in the style of TDX and CCA perhaps, and that it cannot (or will not) perform the role of being a general solution for all confidential computing.
That isn't remotely accurate. I have stated multiple times that I want guest_memfd to be a vehicle for all VM types, i.e. not just CoCo VMs, and most definitely not just TDX/SNP/CCA VMs.
What I am staunchly against is piling features onto guest_memfd that will cause it to eventually become virtually indistinguishable from any other file-based backing store. I.e. while I want to make guest_memfd usable for all VM *types*, making guest_memfd the preferred backing store for all *VMs* and use cases is very much a non-goal.
From an earlier conversation[1]:
 : In other words, ditch the complexity for features that are well served by existing
 : general purpose solutions, so that guest_memfd can take on a bit of complexity to
 : serve use cases that are unique to KVM guests, without becoming an unmaintainable
 : mess due to cross-products.
> > > Also, since pin is already overloading the refcount, having the exclusive pin there helps in ensuring atomic accesses and avoiding races.
> >
> > Yeah, but every time someone does this and then links it to a uAPI it becomes utterly baked in concrete for the MM forever.
>
> I agree. But if we can't modify guest_memfd() to fit our needs (pKVM, Gunyah), then we don't really have that many other options.
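(Purely to illustrate the refcount-overloading idea being debated above: the scheme can be modeled as one bit of the counter word marking an exclusive pin. The bit position and helper names here are hypothetical, not anything in the kernel; the point is that because the pin shares a word with the count, a single atomic cmpxchg can check "no other references" and take the pin without racing.)

```python
# Illustrative model only (plain Python, not kernel code) of an exclusive
# pin overloading a page refcount. Names and bit layout are hypothetical.
EXCLUSIVE = 1 << 31          # high bit of the counter marks an exclusive pin

def try_exclusive_pin(refcount):
    """Return the new counter value, or None if other refs forbid the pin."""
    if refcount != 1:        # only the owner's reference may be present;
        return None          # in the kernel this would be a failed cmpxchg
    return 1 | EXCLUSIVE     # check and set happen on the same word

def try_get_ref(refcount):
    """An ordinary reference cannot be taken while the exclusive pin is held."""
    if refcount & EXCLUSIVE:
        return None
    return refcount + 1
```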
What _are_ your needs? There are multiple unanswered questions from our last conversation[2]. And by "needs" I don't mean "what changes do you want to make to guest_memfd?", I mean "what are the use cases, patterns, and scenarios that you want to support?".
 : What's "hypervisor-assisted page migration"? More specifically, what's the
 : mechanism that drives it?

 : Do you happen to have a list of exactly what you mean by "normal mm stuff"? I
 : am not at all opposed to supporting .mmap(), because long term I also want to
 : use guest_memfd for non-CoCo VMs. But I want to be very conservative with respect
 : to what is allowed for guest_memfd. E.g. host userspace can map guest_memfd,
 : and do operations that are directly related to its mapping, but that's about it.
That distinction matters, because as I have stated in that thread, I am not opposed to page migration itself:
 : I am not opposed to page migration itself, what I am opposed to is adding deep
 : integration with core MM to do some of the fancy/complex things that lead to page
 : migration.
I am generally aware of the core pKVM use cases, but AFAIK I haven't seen a complete picture of everything you want to do, and _why_.
E.g. if one of your requirements is that guest memory is managed by core-mm the same as all other memory in the system, then yeah, guest_memfd isn't for you. Integrating guest_memfd deeply into core-mm simply isn't realistic, at least not without *massive* changes to core-mm, as the whole point of guest_memfd is that it is guest-first memory, i.e. it is NOT memory that is managed by core-mm (primary MMU) and optionally mapped into KVM (secondary MMU).
Again from that thread, one of the most important aspects of guest_memfd is that VMAs are not required. Stating the obvious, lack of VMAs makes it really hard to drive swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
 : More broadly, no VMAs are required. The lack of stage-1 page tables are nice to
 : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g.
 : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
[1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com
[2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com