On Wed 22-12-21 10:58:36, David Hildenbrand wrote:
On 22.12.21 09:51, David Hildenbrand wrote:
On 21.12.21 20:07, Jason Gunthorpe wrote:
On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
- is certainly the cherry on top. But it just means that R/O pins don't
have to be the weird kid. And yes, achieving 2) would require FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do what existing COW logic does, just bypass the "map writable" and "trigger write fault" semantics.
I still don't agree with this - when you come to patches can you have this work at the end and under a good cover letter? Maybe it will make more sense then.
Yes. But really, I think it's the logical consequence of what Linus said [1]:
"And then all GUP-fast would need to do is to refuse to look up a page that isn't exclusive to that VM. We already have the situation that GUP-fast can fail for non-writable pages etc, so it's just another test."
We must not FOLL_PIN a page that is not exclusive (not only on gup-fast, but really, on any gup). If we special case R/O FOLL_PIN, we cannot enable the sanity check on unpin as suggested by Linus [2]:
"If we only set the exclusive VM bit on pages that get mapped into user space, and we guarantee that GUP only looks up such pages, then we can also add a debug test to the "unpin" case that the bit is still set."
There are really only two feasible options I see when we want to take a R/O FOLL_PIN on a !PageAnonExclusive() anon page
(1) Fail the pinning completely. This implies that we'll have to fail O_DIRECT once converted to FOLL_PIN. (2) Request to mark the page PageAnonExclusive() via a FAULT_FLAG_UNSHARE and let it succeed.
Anything else would require additional accounting that we already discussed in the past is hard -- for example, to differentiate R/O from R/W pins requiring two pin counters.
The only impact would be that FOLL_PIN after fork() has to go via a FAULT_FLAG_UNSHARE once, to turn the page PageAnonExclusive. IMHO this is the right thing to do for FOLL_LONGTERM. For !FOLL_LONGTERM it would be nice to optimize this, to *not* do that, but again ... this would require even more counters I think, for example, to differentiate between "R/W short/long-term or R/O long-term pin" and "R/O short-term pin".
So unless we discover a way to do additional accounting for ordinary 4k pages, I think we really can only do (1) or (2) to make sure we never ever pin a !PageAnonExclusive() page.
BTW, I just wondered if the optimization should actually be that R/O short-term FOLL_PIN users should actually be using FOLL_GET instead. So O_DIRECT with R/O would already be doing the right thing.
And it somewhat aligns with what we found: only R/W short-term FOLL_GET is problematic, where we can lose writes to the page from the device via O_DIRECT.
IIUC, our COW logic makes sure that a shared anonymous page that might still be used by a R/O FOLL_GET cannot be modified, because any attempt to modify it would result in a copy.
Well, we defined FOLL_PIN to mean the intent that the caller wants to access not only page state (for which is enough FOLL_GET and there are some users - mostly inside mm - who need this) but also page data. Eventually, we even wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
For file pages we need this data vs no-data access distinction so that filesystems can detect when someone can be accessing page data although the page is unmapped. Practically, filesystems care most about when someone can be *modifying* page data (we need to make sure data is stable e.g. when writing back data to disk or doing data checksumming or other operations) so using FOLL_GET when wanting to only read page data should be OK for filesystems but honestly I would be reluctant to break the rule of "use FOLL_PIN when wanting to access page data" to keep things simple and reasonably easy to understand for parties such as filesystem developers or driver developers who all need to interact with pinned pages...
Honza