On 17.12.21 20:22, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds torvalds@linux-foundation.org wrote:
If we are doing a COW, we need an *exclusive* access to the page. That is not mapcount, that is the page ref.
mapcount is insane, and I think this is making this worse again.
Maybe I'm misreading this, but afaik
get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
This just increments the page count, because mapcount == 1.
fork()
unmap in the original
child now has "mapcount == 1" on a page again, but refcount is
elevated, and child HAS TO COW before writing.
Hi Linus,
This is just GUP before fork(), which is in general problematic/incompatible with sharing. What we're concerned about in the context of this series (see the security issue) is GUP after fork(). And we're not changing GUP before fork() or even the COW logic in the context of this series.
I agree that GUP before fork() has to be handled differently: during fork(): don't share something that cannot possibly be shared in a safe way. Don't allow COW semantics for something that is just broken with COW.
Notice? "mapcount" is complete BS. The number of times a page is mapped is irrelevant for COW. All that matters is that we get an exclusive access to the page before we can write to it.
We have to be very careful about the two sides of the story: GUP before fork and GUP after fork.
Anybody who takes mapcount into account at COW time is broken, and it worries me how this is all mixing up with the COW logic.
Now, maybe this "unshare" case is sufficiently different from COW that it's ok to look at mapcount for FAULT_FLAG_UNSHARE, as long as it doesn't happen for a real COW.
But honestly, for "unshare", I still don't see that the mapcount matters. What does "mapcount == 1" mean? Why is it meaningful?
I'll reply to your first mail in a sec.
GUP is the problem with COW, not ordinary processes mapping a page (mapcount), where you will only get new sharers during fork() -- in a very controlled way. So GUP has to take care to unshare *before* taking a reference, such that we can never reach the point of missed COW. GUP really is the problematic bit with it all.
Without GUP, we'd be living in a wonderful world in regards to COW.
Because if COW does things right, and always breaks a COW based on refcount, then what's the problem with taking a read-only ref to the page whether it is mapped multiple times or mapped just once? Anybody who already had write access to the page can write to it regardless, and any new writers go through COW and get a new page.
Let's just take a look at what refcount does *wrong*. Let's use an adjusted version of your example above, because it's a perfect fit:
1. mem = mmap(pagesize, MAP_PRIVATE) -> refcount == 1
2. memset(mem, 0, pagesize); /* Page is mapped R/W */
3. fork() /* Page gets mapped R/O */ -> refcount > 1
4. child quits -> refcount == 1
5. Take a R/O pin (RDMA, VFIO, ...) -> refcount > 1
6. memset(mem, 0xff, pagesize); -> Write fault -> COW
And GUP sees something different than our MM -- and this is perfectly valid, the R/O pin is just reading page content we might be modifying afterwrds. Take out 3. and 4. and it works as expected. This wouldn't happen when relying on the mapcount.
And 3+4 can really be anything that results in a R/O mapping of an anonymous page, even if it's just swapout followed by read fault that maps the page R/O.
I must be missing something realyl fundamental here, but to me it really reads like "mapcount can fundamentally never be relevant for COW, and if it's not relevant for COW, how can it be relevant for a read-only copy?"
It really is the right value to use. Only GUP is the problematic bit that has to trigger unsharing to not mess up COW logic later. Take GUP out of the equation and COW just works as expected with the mapcount -- as long as we can read an atomic value and synchronize against fork. (again, still composing the other mail :) )