On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
If we are doing a COW, we need an *exclusive* access to the page. That is not mapcount, that is the page ref.
mapcount is insane, and I think this is making this worse again.
Maybe I'm misreading this, but afaik
- get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
This just increments the page count, because mapcount == 1.
- fork()
- unmap in the original
- child now has "mapcount == 1" on a page again, but refcount is elevated, and child HAS TO COW before writing.
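To make the counter arithmetic in the steps above concrete, here is a minimal toy model in C. It is only a sketch: `struct toy_page` and every `toy_*` helper are hypothetical names invented for illustration, not the kernel's `struct page` or any real kernel API. It assumes that mapping a page bumps both counters and that a read-only reference bumps only the refcount.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a page. NOT the kernel's struct page; names are hypothetical. */
struct toy_page {
    int refcount;   /* every reference: mappings plus extra read-only refs */
    int mapcount;   /* page-table mappings only */
};

/* Mapping the page into an address space raises both counters. */
static void toy_map(struct toy_page *p)
{
    p->refcount++;
    p->mapcount++;
}

/* Unmapping drops both counters again. */
static void toy_unmap(struct toy_page *p)
{
    assert(p->mapcount > 0 && p->refcount > 0);
    p->refcount--;
    p->mapcount--;
}

/* A read-only reference raises the refcount only (assumption of this model). */
static void toy_take_ro_ref(struct toy_page *p)
{
    p->refcount++;
}

/* The two candidate "is this page exclusively mine?" tests. */
static bool toy_exclusive_by_mapcount(const struct toy_page *p)
{
    return p->mapcount == 1;
}

static bool toy_exclusive_by_refcount(const struct toy_page *p)
{
    return p->refcount == 1;
}

/* Replay the four steps and return the final state of the page. */
static struct toy_page toy_replay_scenario(void)
{
    struct toy_page page = { 0, 0 };

    toy_map(&page);          /* parent's private mapping: ref 1, map 1 */
    toy_take_ro_ref(&page);  /* "readonly" copy:          ref 2, map 1 */
    toy_map(&page);          /* fork(), child maps it:    ref 3, map 2 */
    toy_unmap(&page);        /* unmap in the original:    ref 2, map 1 */

    return page;
}
```

In this model the page ends with mapcount 1 and refcount 2, so the mapcount test would call it exclusive while the refcount test would not, which is the mismatch described above.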
Notice? "mapcount" is complete BS. The number of times a page is mapped is irrelevant for COW. All that matters is that we get an exclusive access to the page before we can write to it.
Anybody who takes mapcount into account at COW time is broken, and it worries me how this is all mixing up with the COW logic.
Now, maybe this "unshare" case is sufficiently different from COW that it's ok to look at mapcount for FAULT_FLAG_UNSHARE, as long as it doesn't happen for a real COW.
But honestly, for "unshare", I still don't see that the mapcount matters. What does "mapcount == 1" mean? Why is it meaningful?
Because if COW does things right, and always breaks a COW based on refcount, then what's the problem with taking a read-only ref to the page whether it is mapped multiple times or mapped just once? Anybody who already had write access to the page can write to it regardless, and any new writers go through COW and get a new page.
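The argument above can be sketched as a toy write-fault handler that decides purely on refcount. This is an illustration under stated assumptions, not the kernel's fault path: `struct cow_page` and `cow_write_fault()` are hypothetical names, and the caller supplies the spare page that a real allocator would provide.

```c
#include <assert.h>

/* Toy page with a payload so that isolation is observable. Hypothetical type. */
struct cow_page {
    int refcount;   /* all references to this page */
    int data;       /* stand-in for the page contents */
};

/*
 * Toy write fault: return the page the faulting writer may write to.
 * The decision looks only at refcount; mapcount is never consulted.
 */
static struct cow_page *cow_write_fault(struct cow_page *old,
                                        struct cow_page *spare)
{
    assert(old->refcount >= 1);

    if (old->refcount == 1)
        return old;              /* sole reference: reuse in place */

    /* Someone else holds a reference: break COW by copying. */
    spare->data = old->data;
    spare->refcount = 1;
    old->refcount--;             /* writer gives up its ref on the old page */
    return spare;
}
```

With this rule, a holder of a read-only reference keeps seeing the old contents, because any new writer is moved to a fresh copy no matter how many times the old page is mapped.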
I must be missing something really fundamental here, but to me it really reads like "mapcount can fundamentally never be relevant for COW, and if it's not relevant for COW, how can it be relevant for a read-only copy?"
Linus