Yes, and I think we might have to revive that discussion, unfortunately. I started thinking about this, but did not reach a conclusion. Sharing my thoughts.
The minimum we might need to make use of guest_memfd (v1 or v2 ;) ) not just for private memory should be:
(1) Have private + shared parts backed by guest_memfd. Either the same fd, or an fd pair.
(2) Allow mmap of only the "shared" parts.
(3) Allow in-place conversion between "shared" and "private" parts.
These three were covered (modulo bugs) in the guest_memfd() RFC I'd sent a while back:
https://lore.kernel.org/all/20240222161047.402609-1-tabba@google.com/
I remember there was a catch to it (either around mmap or pinning detection -- or around support for huge pages in the future; maybe these count as BUGs :) ).
I should probably go back and revisit the whole thing, I was only CCed on some part of it back then.
(4) Allow migration of the "shared" parts.
We would really like that too, if they allow us :)
A) Convert shared -> private?
- Must not be GUP-pinned
- Must not be mapped
- Must not reside on ZONE_MOVABLE/MIGRATE_CMA
- (must rule out any other problematic folio references that could read/write memory, might be feasible for guest_memfd)
B) Convert private -> shared?
- Nothing to consider
C) Map something?
- Must not be private
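To make rules A) to C) concrete, here is a rough user-space model of the checks. It is only a sketch: struct gmem_page, the gmem_*() helpers and the error codes are made up for illustration and are not taken from the RFC or from any kernel code. Each page is treated independently, which ignores the large-folio case.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names exist in the kernel. */
enum gmem_state { GMEM_SHARED, GMEM_PRIVATE };

struct gmem_page {
	enum gmem_state state;
	unsigned int gup_pins;	/* models outstanding GUP pins */
	unsigned int mapcount;	/* models user-space mappings */
	bool movable_or_cma;	/* models ZONE_MOVABLE/MIGRATE_CMA placement */
};

/* A) shared -> private: refuse while pinned, mapped, or on movable/CMA. */
static int gmem_convert_to_private(struct gmem_page *p)
{
	assert(p != NULL);
	if (p->state == GMEM_PRIVATE)
		return 0;
	if (p->gup_pins || p->mapcount || p->movable_or_cma)
		return -EBUSY;
	p->state = GMEM_PRIVATE;
	return 0;
}

/* B) private -> shared: nothing to check in this model. */
static int gmem_convert_to_shared(struct gmem_page *p)
{
	assert(p != NULL);
	p->state = GMEM_SHARED;
	return 0;
}

/* C) map: only shared pages may be mapped. */
static int gmem_map(struct gmem_page *p)
{
	assert(p != NULL);
	if (p->state == GMEM_PRIVATE)
		return -EPERM;
	p->mapcount++;
	return 0;
}
```

In this model a page that is still mapped cannot be made private, and a private page cannot be mapped until it has been converted back to shared.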
A, B and C were covered (again, modulo bugs) in the RFC.
For ordinary (small) pages, that might be feasible. (ZONE_MOVABLE/MIGRATE_CMA might be feasible, but maybe we could just not support them initially)
The real fun begins once we want to support huge pages/large folios and can end up having a mixture of "private" and "shared" per huge page. But really, that's what we want in the end I think.
I agree.
Unless we can teach the VM to not convert arbitrary physical memory ranges on a 4k basis to a mixture of private/shared ... but I've been told we don't want that. Hm.
There are two big problems with that that I can see:
- References/GUP-pins are per folio
What if some shared part of the folio is pinned but another shared part that we want to convert to private is not? Core-mm will not provide the answer to that: the folio may be pinned, that's it. *Disallowing* at least long-term GUP-pins might be an option.
Right.
To get stuff into an IOMMU, maybe a per-fd interface could work, and guest_memfd itself would track which parts are currently "handed out", and with which "semantics" (shared vs. private).
[IOMMU + private parts might require that either way? Because, if we disallow mmap, how should that ever work with an IOMMU otherwise?]
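As a rough illustration of the per-fd tracking idea, the sketch below keeps a "handed out" count per 4k part of one large folio, so a conversion is blocked only by users of that specific part rather than by a folio-wide pin. Everything here is hypothetical (struct gmem_folio_track, the track_*() helpers, the 512-part folio size); it models the bookkeeping only and says nothing about how an actual IOMMU interface would look.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumption: a 2 MiB folio made of 4 KiB parts. */
#define PARTS_PER_FOLIO 512

enum part_state { PART_SHARED, PART_PRIVATE };

/* Hypothetical per-fd bookkeeping for one large folio. */
struct gmem_folio_track {
	enum part_state state[PARTS_PER_FOLIO];
	unsigned int handed_out[PARTS_PER_FOLIO];
};

/* Hand out one part with the requested semantics (shared vs. private). */
static int track_hand_out(struct gmem_folio_track *t, size_t idx,
			  enum part_state sem)
{
	assert(idx < PARTS_PER_FOLIO);
	if (t->state[idx] != sem)
		return -EINVAL;
	t->handed_out[idx]++;
	return 0;
}

/* The user is done with the part. */
static void track_take_back(struct gmem_folio_track *t, size_t idx)
{
	assert(idx < PARTS_PER_FOLIO && t->handed_out[idx] > 0);
	t->handed_out[idx]--;
}

/* Convert one part; only users of this part block the conversion. */
static int track_convert(struct gmem_folio_track *t, size_t idx,
			 enum part_state to)
{
	assert(idx < PARTS_PER_FOLIO);
	if (t->state[idx] == to)
		return 0;
	if (t->handed_out[idx])
		return -EBUSY;
	t->state[idx] = to;
	return 0;
}
```

With this kind of tracking, handing out part 3 as shared would not prevent converting part 4 of the same folio to private, which is exactly the answer a per-folio pin count cannot give.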
Not sure if IOMMU + private makes that much sense really, but I think I might not really understand what you mean by this.
A device might be able to access private memory. In the TDX world, this would mean that a device "speaks" encrypted memory.
At the same time, a device might be able to access shared memory. Maybe devices can do both?
What to do when converting between private and shared? I think it depends on various factors (e.g., device capabilities).
[...]
I recall quite some details with memory renting or so on pKVM ... and I have to refresh my memory on that.
I really would like to get to a place where we could investigate and sort out all of these issues. It would be good to know though, what, in principle (and not due to any technical limitations), we might be allowed to do and expand guest_memfd() to do, and what out of principle is off the table.
As Jason said, maybe we need a revised model that can handle [...] private+shared properly.