On Thu, Jun 20, 2024, David Hildenbrand wrote:
On 20.06.24 22:30, Sean Christopherson wrote:
On Thu, Jun 20, 2024, David Hildenbrand wrote:
On 20.06.24 18:36, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:45:08PM +0200, David Hildenbrand wrote:
If we could disallow pinning any shared pages, that would make life a lot easier, but I think there were reasons for why we might require it. To convert shared->private, simply unmap that folio (only the shared parts could possibly be mapped) from all user page tables.
IMHO it should be reasonable to make it work like ZONE_MOVABLE and FOLL_LONGTERM. Making a shared page private is really no different from moving it.
And if you have built a VMM that uses VMA mapped shared pages and short-term pinning then you should really also ensure that the VM is aware when the pins go away. For instance if you are doing some virtio thing with O_DIRECT pinning then the guest will know the pins are gone when it observes virtio completions.
In this way making private is just like moving, we unmap the page and then drive the refcount to zero, then move it.
Yes, but here is the catch: what if a single shared subpage of a large folio is (validly) longterm pinned and you want to convert another shared subpage to private?
Sure, we can unmap the whole large folio (including all shared parts) before the conversion, just like we would do for migration. But we cannot detect that nobody pinned that subpage that we want to convert to private.
Core-mm does not, and will not, track pins per subpage.
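To make the limitation concrete, here is a toy userspace model in C. It is not kernel code: `struct model_folio` and the `model_*` helpers are made up for illustration, and the only assumption carried over from the discussion is that pins are counted once per folio rather than per subpage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_SUBPAGES 512	/* e.g. a 2 MiB folio made of 4 KiB pages */

/* Hypothetical model of a large folio; not the kernel's struct folio. */
struct model_folio {
	int pincount;				/* one counter for the whole folio */
	bool shared[MODEL_NR_SUBPAGES];		/* per-subpage shared/private state */
};

/* Pinning a shared subpage only bumps the folio-wide counter. */
static void model_pin_subpage(struct model_folio *f, size_t idx)
{
	assert(idx < MODEL_NR_SUBPAGES && f->shared[idx]);
	f->pincount++;	/* the index is not recorded anywhere */
}

/* The only question the model can answer: is anything in the folio pinned? */
static bool model_folio_maybe_pinned(const struct model_folio *f)
{
	return f->pincount > 0;
}

/*
 * Converting one subpage has to be refused while any pin exists, because
 * the pin might be on exactly that subpage. The index cannot narrow it down.
 */
static bool model_can_convert(const struct model_folio *f, size_t idx)
{
	(void)idx;
	return !model_folio_maybe_pinned(f);
}
```

In this model, a long-term pin on subpage 3 makes `model_can_convert()` fail for subpage 5 as well, for as long as the pin is held.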
So I only see two options:
a) Disallow long-term pinning. That means we can, with a bit of waiting, always convert subpages shared->private after unmapping them and waiting for the short-term pins to go away. Not too bad, and we already have other mechanisms that disallow long-term pinning (especially writable fs ones!).
I don't think disallowing _just_ long-term GUP will suffice. If we go the "disallow GUP" route, then I think it needs to disallow GUP, period. Like the whole "GUP writes to file-backed memory" issue[*], which I think you're alluding to, short-term GUP is also problematic. But unlike file-backed memory, for TDX and SNP (and I think pKVM), a single rogue access has a high probability of being fatal to the entire system.
Disallowing short-term should work, in theory, because the
By "short-term", I assume you mean "long-term"? Or am I more lost than I realize?
writes-to-fileback has different issues (the PIN is not the problem but the dirtying).
It's more related to us not allowing long-term pins for FSDAX pages, because the lifetime of these pages is determined by the FS.
What we would do is
- Unmap the large folio completely and make any refaults block.
-> No new pins can pop up
- If the folio is pinned, busy-wait until all the short-term pins are gone.
This is the step that concerns me. "Relatively short time" is, well, relative. Hmm, though I suppose if userspace managed to map a shared page into something that pins the page, and can't force an unpin, e.g. by stopping I/O?, then either there's a host userspace bug or a guest bug, and so effectively hanging the vCPU that is waiting for the conversion to complete is ok.
- Safely convert the relevant subpage from shared -> private
Not saying it's the best approach, but it should be doable.
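As a rough sketch of the sequence above, again a single-threaded userspace model in C and not kernel code: `struct sketch_folio` and the `sketch_*` helpers are hypothetical, and the busy-wait is modelled as a step function that the caller retries until the pin count drops to zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_SUBPAGES 8

/* Hypothetical model of a large folio with a single, folio-wide pin count. */
struct sketch_folio {
	int pincount;
	bool mapped;			/* mapped into user page tables */
	bool faults_blocked;		/* refaults are held off during conversion */
	bool shared[SKETCH_NR_SUBPAGES];
};

/*
 * A new pin needs the page to be mapped. While refaults are blocked the
 * page cannot be mapped again, so no new pin can appear. (A real fault
 * would block; the model just reports failure.)
 */
static bool sketch_try_pin(struct sketch_folio *f, size_t idx)
{
	if (idx >= SKETCH_NR_SUBPAGES || !f->shared[idx])
		return false;
	if (!f->mapped) {
		if (f->faults_blocked)
			return false;
		f->mapped = true;	/* refault succeeded */
	}
	f->pincount++;
	return true;
}

static void sketch_unpin(struct sketch_folio *f)
{
	assert(f->pincount > 0);
	f->pincount--;
}

/*
 * One attempt at shared -> private. Returns false if pins are still
 * outstanding; the caller is expected to wait and call again.
 */
static bool sketch_convert_step(struct sketch_folio *f, size_t idx)
{
	assert(idx < SKETCH_NR_SUBPAGES);

	/* Unmap the whole folio and make refaults block. */
	f->mapped = false;
	f->faults_blocked = true;

	/* Existing (short-term) pins must drain first. */
	if (f->pincount > 0)
		return false;

	/* Nothing is mapped or pinned: convert, then let faults proceed. */
	f->shared[idx] = false;
	f->faults_blocked = false;
	return true;
}
```

A caller would loop on `sketch_convert_step()` until it returns true. With a long-term pin that loop never terminates, which is why the approach depends on only short-term pins being possible.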