On Sat, Dec 18, 2021 at 10:02 PM Nadav Amit namit@vmware.com wrote:
I found my old messy code for the software-PTE thing.
I see that eventually I decided to hold a pointer to the “extra PTEs” of each page in the PMD-page-struct. [ I also implemented the 2-adjacent pages approach but this code is long gone. ]
Ok, I understand why that ends up being the choice, but it makes it too ugly and messy to look up to be worth it, I think.
I still don’t know what exactly you have in mind for making use out of it for the COW issue.
So the truly fundamental question for COW (and for a long-term GUP) is fairly simple:
- Is the page I have truly owned exclusively by this VM?
If that _isn't_ the case, you absolutely have to COW.
If that _is_ the case, you can re-use the page.
That is really it, boiled down to the pure basics.
And if you aren't sure whether you are the ultimate and only authority over the page, then COW is the "safer" option, in that breaking sharing is fundamentally better than over-sharing.
Now, the reason I like "page_count()==1" is that it is a 100% certain way to know that you own the page absolutely and clearly.
There is no question what-so-ever about it.
And the reason I hate "page_mapcount()==1" with a passion is that it is NOTHING OF THE KIND. It is an entirely meaningless number. It doesn't mean anything at all.
Even if the page mapcount is exactly right, it could easily and trivially be a result of "fork, then unmap in either parent or child".
Now that page_mapcount() is unquestionably 1, but despite that, at some point the page was shared by another VM, and you can not know whether you really have exclusive access.
And that "even if page mapcount is exactly right" is a big issue in itself, as I hope I've explained.
It requires page locking, it requires that you take swapcache users into account, it is just a truly messy and messed up thing.
There really is absolutely no reason for page_mapcount to exist. It's a mistake. We have it for completely broken historical reasons.
It's WRONG.
Now, if "page_count()==1" is so great, what is the issue? Problem solved.
No, while page_count()==1 is one really fundamental marker (unlike the mapcount), it does have problems too.
Because yes, "page_count()==1" does mean that you have truly exclusive ownership of the page, but the reverse is not true.
The way the current regular VM code handles that "the reverse is not true" is by making "the page is writable" be the second way you can say "you clearly have full ownership of the page".
So that's why you then have the "maybe_pinned()" thing in fork() and in swap cache creation that keeps such a page writable, and doesn't do the virtual copy and make it read-only again.
But that's also why it has problems with write-protect (whether mprotect or uddf_wp).
Anyway, that was a long explanation to make the thinking clear, and finally come to the actual answer to your question:
Adding another bit in the page tables - *purely* to say "this VM owns the page outright" - would be fairly powerful. And fairly simple.
Then any COW event will set that bit - because when you actually COW, the page you install is *yours*. No questions asked.
And fork() would simply clear that bit (unless the page was one of the pinned pages that we simply copy).
See how simple that kind of concept is.
And please, see how INCREDIBLY BROKEN page_mapcount() is. It really fundamentally is pure and utter garbage. It in no way says "I have exclusive ownership of this page", because even if the mapcount is 1 *now*, it could have been something else earlier, and some other VM could have gotten a reference to it before the current VM did so.
This is why I will categoricall NAK any stupid attempt to re-introduce page_mapcount() for COW or GUP handling. It's unacceptably fundamentally broken.
Btw, the extra bit doesn't really have to be in the page tables. It could be a bit in the page itself. We could add another page bit that we just clear when we do the "add ref to page as you make a virtual copy during fork() etc".
And no, we can't use "pincount" either, because it's not exact. The fact that the page count is so elevated that we think it's pinned is a _heuristic_, and that's ok when you have the opposite problem, and ask "*might* this page be pinned". You want to never get a false negative, but it can get a false positive.
Linus