On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
Peter Zijlstra's on May 28, 2019 12:25 am:
On Mon, May 27, 2019 at 06:59:08PM +0530, Aneesh Kumar K.V wrote:
On 5/27/19 4:31 PM, Peter Zijlstra wrote:
On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@linux-foundation.org wrote:
--- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
+++ a/mm/mmu_gather.c
@@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
 {
 	/*
 	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
+	 * flush by batching, one thread may end up seeing inconsistent PTEs
+	 * and result in having stale TLB entries.  So flush TLB forcefully
+	 * if we detect parallel PTE batching threads.
+	 *
+	 * However, some syscalls, e.g. munmap(), may free page tables, this
+	 * needs force flush everything in the given range. Otherwise this
+	 * may result in having stale TLB entries for some architectures,
+	 * e.g. aarch64, that could specify flush what level TLB.
 	 */
 	if (mm_tlb_flush_nested(tlb->mm)) {
+		/*
+		 * The aarch64 yields better performance with fullmm by
+		 * avoiding multiple CPUs spamming TLBI messages at the
+		 * same time.
+		 *
+		 * On x86 non-fullmm doesn't yield significant difference
+		 * against fullmm.
+		 */
+		tlb->fullmm = 1;
 		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+		tlb->freed_tables = 1;
 	}
 
 	tlb_flush_mmu(tlb);
Maybe, but given the patch that went into -mm, PPC will never hit that branch I killed anymore -- and that really shouldn't be in architecture code anyway.
Yeah well if mm/ does this then sure it's dead and can go.
I don't think it's very nice to set fullmm and freed_tables for this case though. Is this concurrent zapping an important fast path? It must have been, in order to justify all this complexity to the mm, so we don't want to tie this boat anchor to it AFAIKS?
I'm not convinced it's an important fast path; afaict it is an unfortunate correctness issue caused by allowing concurrent frees.
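For reference, the "parallel PTE batching threads" detection is just a check of the mm's pending-flush count; roughly as below, paraphrased from memory of the ~v5.2 include/linux/mm_types.h rather than quoted verbatim:

static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
{
	/*
	 * Every concurrent zapper bumps mm->tlb_flush_pending when it
	 * starts gathering (inc_tlb_flush_pending()), so a count greater
	 * than 1 means another thread is batching TLB flushes on the
	 * same mm right now.
	 */
	return atomic_read(&mm->tlb_flush_pending) > 1;
}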
Is the problem just that the freed-page-table flags get cleared by __tlb_reset_range()? Why not just remove that then, so the bits are set properly for the munmap?
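For context, __tlb_reset_range() throws away more than just the range; a rough sketch from memory of the ~v5.2 generic mmu_gather code (not a verbatim quote):

static inline void __tlb_reset_range(struct mmu_gather *tlb)
{
	if (tlb->fullmm) {
		tlb->start = tlb->end = ~0;
	} else {
		tlb->start = TASK_SIZE;
		tlb->end = 0;
	}
	/*
	 * These unconditional clears are the information loss being
	 * discussed: freed_tables and the per-level cleared_* bits
	 * gathered so far are dropped along with the range.
	 */
	tlb->freed_tables = 0;
	tlb->cleared_ptes = 0;
	tlb->cleared_pmds = 0;
	tlb->cleared_puds = 0;
	tlb->cleared_p4ds = 0;
}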
That's insufficient; as argued in my initial suggestion:
https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass....
Since we don't know what was flushed by the concurrent flushes, we must flush all state (page sizes, tables, etc.).
But it looks like benchmarks (for the one test-case we have) seem to favour flushing the world over flushing a smaller range.
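To make the trade-off concrete, the alternative to going fullmm would be something like the sketch below; this is purely illustrative, not any patch posted in this thread, and the helper name is made up for the example:

/*
 * Hypothetical illustration only: keep the gathered range, but assume
 * the worst about whatever the concurrent flusher may have skipped.
 */
static inline void tlb_force_range_flush(struct mmu_gather *tlb,
					 unsigned long start,
					 unsigned long end)
{
	if (!mm_tlb_flush_nested(tlb->mm))
		return;

	/* We cannot trust the gathered bits; flush every level. */
	tlb->freed_tables = 1;
	tlb->cleared_ptes = 1;
	tlb->cleared_pmds = 1;
	tlb->cleared_puds = 1;
	tlb->cleared_p4ds = 1;
	__tlb_adjust_range(tlb, start, end - start);
}

Per the aarch64 comment in the quoted diff and the benchmark result above, the fullmm variant avoids multiple CPUs spamming TLBI messages, which appears to be why flushing the world won for the one test-case we have.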