On 21.12.21 02:03, Jason Gunthorpe wrote:
On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote:
handler (COW or unshare). Outside of COW/Unshare code, the bit can only be set if page_count() == 1 and we sync against fork(). (and that's the problem for gup-fast-only that I'm investigating right now, because it would then always have to fallback to the slow variant if the bit isn't already set)
[in the meantime I figured out which pageflag we can reuse for anon pages, which is at least one step into the right direction]
I'm having a hard time imagining how gup_fast can maintain any sort of bit - it lacks all forms of locks so how can we do an atomic test and set between two pieces of data?
And exactly that is to be figured out.
Note that I am trying to make also any kind of R/O pins on an anonymous page work as expected as well, to fix any kind of GUP after fork() and GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page similarly has to make sure that the page is exclusive -- even if it's mapped R/O (!).
In the pagefault handler we can then always reuse a PageAnonExclusive() page, because we know it's exclusive and it will stay exclusive because concurrent fork() isn't possible.
I think the point of Linus's plan really is that the new bit is derived from page_count, we get the set the new bit when we observe page_count==1 in various situations and we clear the new bit whenever we write protect with the intent to copy.
Here are is one problems I'm fighting with:
Assume we set the bit whenever we create a new anon page (either due to COW, ordinary fault, unsharing request, ..., even if it's mapped R/O first). We know the page is exclusive at that point because we created it and fork() could not happen yet.
fork() is the only code that can share the page between processes and turn it non-exclusive.
We can only clear the bit during fork() -- to turn the share page exclusive and map it R/O into both processes -- when we are sure that *nobody* concurrently takes a reference on the page the would be problematic (-> GUP).
So to clear the bit during fork, we have to (1) Check against page_count == 1 (2) Synchronize against GUP
(2) is easy using the mmap_lock and the mm->write_protect_seq
BUT, it would mean that whenever we fork() and there is one additional reference on a page (even if it's from the swapcache), we would slow down fork() even if there was never any GUP. This would apply to any process out there that does a fork() ...
So the idea is to mark a page only exclusive as soon as someone needs the page to be exclusive and stay exclusive (-> e.g., GUP with FOLL_PIN or selected FOLL_GET like O_DIRECT). This can happen in my current approach using two ways:
(1) Set the bit when we know we are the only users
We can set PageAnonExclusive() in case *we sync against fork* and the page cannot get unmapped (pt lock) when: * The page is mapped writable * The page is mapped readable and page_count == 1
This should work during ordinary GUP in many cases.
If we cannot set the page exclusive, we have to trigger a page fault.
(2) During pagefaults when FOLL_FAULT_UNSHARE is set.
GUP will set FOLL_FAULT_UNSHARE for a pagefault when required (again e.g., FOLL_PIN or selected FOLL_GET users), and manual setting of the bit failed. The page fault will then try once again to set the bit if there is a page mapped, and if that fails, do the COW/unshare and set the bit.
The above should work fairly reliable with GUP. But indeed, gup-fast-only is the problem. I'm still investigating what kind of lightweight synchronization we could do against fork() such that we wouldn't try setting a page PageAnonExclusive() while fork() concurrently shares the page.
We could eventually use the page lock and do a try_lock(), both in fork() and in gup-fast-only. fork() would only clear the bit if the try_lock() succeeded. gup-fast-only would only be able to set the bit and not fallback to the slow path if try_lock() succeeded.
But I'm still investigating if there are better alternatives ...
GUP doesn't interact with this bit. A writable page would still be the second way you can say "you clearly have full ownership of the page', so GUP just checks writability and bumps the refcount. The challenge of GUP reamins to sanely sequence it with things that are doing WP.
The elevated GUP refcount prevents the page from getting the bit set again, regardless of what happens to it.
Then on the WP sides.. Obviously we clear the bit when applying a WP for copy. So all the bad GUP cases are halted now, as with a cleared bit and a != 1 refcount COW must happen.
Then, it is also the case that most often a read-only page will have this bit cleared, UFFWD WP being the exception.
UFFWD WP works fine as it will have the bit set in the cases we care about and COW will not happen.
If the bit is not set then everything works as is today and you get extra COWs. We still have to fix the things that are missing the extra COWs to check the page ref to fix this.
It seems this new bit is acting as a 'COW disable', so, the accuracy of COW vs GUP&speculative pagerefs now relies on setting the bit as aggressively as possible when it is safe and cheap to do so.
But we really want to avoid degrading fork() for everybody that doesn't do heavy GUP ...
If I got it right this is why it is not just mapcount reduced to 1 bit. It is quite different, even though "this VM owns the page outright" sure sounds like "mapcount == 1"..
It seems like an interesting direction - the security properties seem good as we only have to take care of sites applying WP to decide what to do with the extra bit, and all places that set the bit to 1 do so after testing refcount under various locks preventing PTE WP.
That just leave the THP splitting.. I suppose we get the PTL, then compute the current value of the new bit based on refcount and diffuse it to all tail pages, then update the PMD and release the PTL. Safe against concurrent WP - don't need DoubleMap horrors because it isn't a counter.
Not set it stone, just an idea what I'm playing with right now ... and I have to tripple-check if
- page is PTE mapped in the page table I'm walking
- page_count() == 1
Really means that "this is the only reference.". I do strongly believe so .. :)
AFAIK the only places that can break this are places putting struct page memory into special PTEs. Which is horrific and is just bugs, but I think I've seen it from time to time :(
As we only care about anon pages, I think that doesn't apply. At least that's what I hope.