On 19.12.21 18:44, Linus Torvalds wrote:
David, you said that you were working on some alternative model. Is it perhaps along these same lines below?
I was thinking that a bit in the page tables to say "this page is exclusive to this VM" would be a really simple thing to deal with for fork() and swapout and friends.
But we don't have such a bit in general, since many architectures have very limited sets of SW bits, and even when they exist we've spent them on things like UDDF_WP.,
But the more I think about the "bit doesn't even have to be in the page tables", the more I think maybe that's the solution.
A bit in the 'struct page' itself.
Exactly what I am prototyping right now.
For hugepages, you'd have to distribute said bit when you split the hugepage.
Yes, that's one tricky part ...
But other than that it looks quite simple: anybody who does a virtual copy will inevitably be messing with the page refcount, so clearing the "exclusive ownership" bit wouldn't be costly: the 'struct page' cacheline is already getting dirtied.
Or what was your model you were implying you were thinking about in your other email? You said
I'm playing with the idea of not setting the bit always during COW but only on GUP request to set the bit (either manually if possible or via FOLL_UNSHARE). That's a bit more tricky but allows for decoupling that approach completely from the page_pin() counter.
fork() is allowed to clear the bit if page_count() == 1 and share the page. So no GUP->no fork() performance changes (!) . Otherwise the bit can only vanish if we swapout/migrate the page: in which case there are no additional GUP/references on the page that rely on it!
The bit can be set directly if we have to copy the page in the fault handler (COW or unshare). Outside of COW/Unshare code, the bit can only be set if page_count() == 1 and we sync against fork(). (and that's the problem for gup-fast-only that I'm investigating right now, because it would then always have to fallback to the slow variant if the bit isn't already set)
So the bit can "vanish" whenever there is no additional reference on the page. GUP syncs against fork() and can thereby set the bit/request to set the bit.
I'm trying to decouple it completely from the page_pin() counter to also be able to handle FOLL_GET (O_DIRECT reproducers unfortunately) correctly.
Not set it stone, just an idea what I'm playing with right now ... and I have to tripple-check if * page is PTE mapped in the page table I'm walking * page_count() == 1 Really means that "this is the only reference.". I do strongly believe so .. :)