On 26.08.22 23:37, Peter Xu wrote:
On Fri, Aug 26, 2022 at 06:46:02PM +0200, David Hildenbrand wrote:
On 26.08.22 17:55, Peter Xu wrote:
On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
To me anon exclusive only shows this mm exclusively owns this page. I didn't quickly figure out why that requires different handling on tlb flushs. Did I perhaps miss something?
GUP-fast is the magic bit, we have to make sure that we won't see new GUP pins, thus the TLB flush.
include/linux/mm.h:gup_must_unshare() contains documentation.
Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already guarantees that no other process/thread will see this pte anymore afterwards?
You could have a GUP-fast thread that just looked up the PTE and is going to pin the page afterwards, after the ptep_get_and_clear() returned. You'll have to wait until that thread finished.
Good that we're talking about it, very helpful! If that's actually not required -- good.
What I learned how GUP-fast and TLB flushes interact is the following:
GUP-fast disables local interrupts. A TLB flush will send an IPI and wait until it has been processed. This implies, that once the TLB flush succeeded, that the interrupt was handled and GUP-fast cannot be running anymore.
BUT, there is the new RCU variant nowadays, and the TLB flush itself should not actually perform such a sync. They merely protect the page tables from getting freed and the THP from getting split IIUC. And you're correct that that wouldn't help.
IIUC the early tlb flush won't protect concurrent fast-gup from happening, but I think it's safe because fast-gup will check pte after pinning, so either:
(1) fast-gup runs before ptep_get_and_clear(), then page_try_share_anon_rmap() will fail properly, or,
(2) fast-gup runs during or after ptep_get_and_clear(), then fast-gup will see that either the pte is none or changed, then it'll fail the fast-gup itself.
I think you're right and I might have managed to confuse myself with the write_protect_page() comment. I placed the gup_must_unshare() check explicitly after the "pte changed" check for this reason. So once the PTE was cleared, GUP-fast would undo any pin performed on a now-stale PTE.
Another user that relies on this interaction between GUP-fast and TLB flushing is for example mm/ksm.c:write_protect_page()
There is a comment in there explaining the interaction a bit more detailed.
Maybe we'll be able to handle this differently in the future (maybe once this turns out to be an actual performance problem). Unfortunately, mm->write_protect_seq isn't easily usable because we'd need have to make sure we're the exclusive writer.
For now, it's not too complicated. For PTEs:
- try_to_migrate_one() already uses ptep_clear_flush().
- try_to_unmap_one() already conditionally used ptep_clear_flush().
- migrate_vma_collect_pmd() was the one case that didn't use it already
(and I wonder why it's different than try_to_migrate_one()).
I'm not sure whether I fully get the point, but here one major difference is all the rest handles one page, so a tlb flush alongside with the pte clear sounds reasonable. Even if so try_to_unmap_one() was modified to use tlb batching, but then I see that anon exclusive made that batching conditional. I also have question there on whether we can keep using the tlb batching even with anon exclusive pages there.
In general, I still don't see how stall tlb could affect anon exclusive pages on racing with fast-gup, because the only side effect of a stall tlb is unwanted page update iiuc, the problem is fast-gup doesn't even use tlb, afaict..
I have the gut feeling that the comment in write_protect_page() is indeed stale, and that clearing PageAnonExclusive doesn't strictly need the TLB flush.
I'll try to refresh my memory if there was any other case that I had to handle over the weekend.
Thanks!