On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote:
handler (COW or unshare). Outside of COW/Unshare code, the bit can only be set if page_count() == 1 and we sync against fork(). (and that's the problem for gup-fast-only that I'm investigating right now, because it would then always have to fallback to the slow variant if the bit isn't already set)
I'm having a hard time imagining how gup_fast can maintain any sort of bit - it lacks all forms of locks so how can we do an atomic test and set between two pieces of data?
I think the point of Linus's plan really is that the new bit is derived from page_count, we get the set the new bit when we observe page_count==1 in various situations and we clear the new bit whenever we write protect with the intent to copy.
GUP doesn't interact with this bit. A writable page would still be the second way you can say "you clearly have full ownership of the page', so GUP just checks writability and bumps the refcount. The challenge of GUP reamins to sanely sequence it with things that are doing WP.
The elevated GUP refcount prevents the page from getting the bit set again, regardless of what happens to it.
Then on the WP sides.. Obviously we clear the bit when applying a WP for copy. So all the bad GUP cases are halted now, as with a cleared bit and a != 1 refcount COW must happen.
Then, it is also the case that most often a read-only page will have this bit cleared, UFFWD WP being the exception.
UFFWD WP works fine as it will have the bit set in the cases we care about and COW will not happen.
If the bit is not set then everything works as is today and you get extra COWs. We still have to fix the things that are missing the extra COWs to check the page ref to fix this.
It seems this new bit is acting as a 'COW disable', so, the accuracy of COW vs GUP&speculative pagerefs now relies on setting the bit as aggressively as possible when it is safe and cheap to do so.
If I got it right this is why it is not just mapcount reduced to 1 bit. It is quite different, even though "this VM owns the page outright" sure sounds like "mapcount == 1"..
It seems like an interesting direction - the security properties seem good as we only have to take care of sites applying WP to decide what to do with the extra bit, and all places that set the bit to 1 do so after testing refcount under various locks preventing PTE WP.
That just leave the THP splitting.. I suppose we get the PTL, then compute the current value of the new bit based on refcount and diffuse it to all tail pages, then update the PMD and release the PTL. Safe against concurrent WP - don't need DoubleMap horrors because it isn't a counter.
Not set it stone, just an idea what I'm playing with right now ... and I have to tripple-check if
- page is PTE mapped in the page table I'm walking
- page_count() == 1
Really means that "this is the only reference.". I do strongly believe so .. :)
AFAIK the only places that can break this are places putting struct page memory into special PTEs. Which is horrific and is just bugs, but I think I've seen it from time to time :(
ZONE_DEVICE is also messed up, of course, but that is just more reasons ZONE_DEVICE refcounting needs fixing and you should ignore it.
Jason