The patch titled
     Subject: mm: mmu_gather: remove __tlb_reset_range() for force flush
has been added to the -mm tree.  Its filename is
     mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch
This patch should soon appear at http://ozlabs.org/~akpm/mmots/broken-out/mm-mmu_gather-remove-__tlb_reset_ra... and later at http://ozlabs.org/~akpm/mmotm/broken-out/mm-mmu_gather-remove-__tlb_reset_ra...
Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated there every 3-4 working days
------------------------------------------------------
From: Yang Shi yang.shi@linux.alibaba.com
Subject: mm: mmu_gather: remove __tlb_reset_range() for force flush
A few new fields were added to mmu_gather to make TLB flush smarter for huge pages, by tracking which levels of the page table have been changed.
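For context, the mmu_gather state involved looks roughly like the sketch below (trimmed for illustration; the real structure lives in include/asm-generic/tlb.h and has more members, and the cleared_* names other than cleared_pmds are inferred from the same series rather than quoted in this thread):

	struct mmu_gather_sketch {
		struct mm_struct *mm;
		unsigned long start, end;       /* range gathered for flushing */
		unsigned int fullmm : 1;        /* flushing the whole mm */
		unsigned int freed_tables : 1;  /* page tables were freed */
		unsigned int cleared_ptes : 1;  /* which levels were changed */
		unsigned int cleared_pmds : 1;
		unsigned int cleared_puds : 1;
		unsigned int cleared_p4ds : 1;
	};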
__tlb_reset_range() is used to reset all this page table state to "unchanged"; it is called by the TLB flush path to handle parallel mapping changes for the same range under a non-exclusive lock (i.e. read mmap_sem).  Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"), the syscalls that may update PTEs in parallel (e.g. madvise() with MADV_DONTNEED or MADV_FREE) did not remove page tables.  But the aforementioned commit may do munmap() under read mmap_sem and free page tables.  This may result in a program hang on aarch64, as reported by Jan Stancek.  The problem can be reproduced with his test program, slightly modified below.
---8<---
static int map_size = 4096;
static int num_iter = 500;
static long threads_total;

static void *distant_area;

void *map_write_unmap(void *ptr)
{
	int *fd = ptr;
	unsigned char *map_address;
	int i, j = 0;

	for (i = 0; i < num_iter; i++) {
		map_address = mmap(distant_area, (size_t) map_size,
			PROT_WRITE | PROT_READ,
			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (map_address == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}

		for (j = 0; j < map_size; j++)
			map_address[j] = 'b';

		if (munmap(map_address, map_size) == -1) {
			perror("munmap");
			exit(1);
		}
	}

	return NULL;
}

void *dummy(void *ptr)
{
	return NULL;
}

int main(void)
{
	pthread_t thid[2];

	/* hint for mmap in map_write_unmap() */
	distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
	distant_area += DISTANT_MMAP_SIZE / 2;

	while (1) {
		pthread_create(&thid[0], NULL, map_write_unmap, NULL);
		pthread_create(&thid[1], NULL, dummy, NULL);

		pthread_join(thid[0], NULL);
		pthread_join(thid[1], NULL);
	}
}
---8<---
The program may result in parallel execution like the following:
          t1                                  t2
    munmap(map_address)
      downgrade_write(&mm->mmap_sem);
      unmap_region()
      tlb_gather_mmu()
        inc_tlb_flush_pending(tlb->mm);
      free_pgtables()
        tlb->freed_tables = 1
        tlb->cleared_pmds = 1

                                        pthread_exit()
                                        madvise(thread_stack, 8M, MADV_DONTNEED)
                                          zap_page_range()
                                            tlb_gather_mmu()
                                              inc_tlb_flush_pending(tlb->mm);

      tlb_finish_mmu()
        if (mm_tlb_flush_nested(tlb->mm))
          __tlb_reset_range()
__tlb_reset_range() would reset the freed_tables and cleared_* bits, but this causes an inconsistency for munmap(), which does free page tables.  As a result, some architectures, e.g. aarch64, may not flush the TLB as completely as expected, leaving stale TLB entries behind.

Use a fullmm flush instead, since it yields much better performance on aarch64 and non-fullmm doesn't yield a significant difference on x86.
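For reference, after this patch the nested-flush branch in tlb_finish_mmu() effectively does the following (the same change as in the diff below, shown here without diff markers):

	if (mm_tlb_flush_nested(tlb->mm)) {
		/* force a fullmm flush and treat page tables as freed */
		tlb->fullmm = 1;
		__tlb_reset_range(tlb);
		tlb->freed_tables = 1;
	}

	tlb_flush_mmu(tlb);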
The originally proposed fix came from Jan Stancek, who did most of the debugging of this issue; I just wrapped everything up together.
Link: http://lkml.kernel.org/r/1558322252-113575-1-git-send-email-yang.shi@linux.a...
Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Signed-off-by: Yang Shi yang.shi@linux.alibaba.com
Signed-off-by: Jan Stancek jstancek@redhat.com
Reported-by: Jan Stancek jstancek@redhat.com
Tested-by: Jan Stancek jstancek@redhat.com
Suggested-by: Will Deacon will.deacon@arm.com
Cc: Peter Zijlstra peterz@infradead.org
Cc: Nick Piggin npiggin@gmail.com
Cc: "Aneesh Kumar K.V" aneesh.kumar@linux.ibm.com
Cc: Nadav Amit namit@vmware.com
Cc: Minchan Kim minchan@kernel.org
Cc: Mel Gorman mgorman@suse.de
Cc: stable@vger.kernel.org	[4.20+]
Signed-off-by: Andrew Morton akpm@linux-foundation.org
---
 mm/mmu_gather.c |   24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)
--- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
+++ a/mm/mmu_gather.c
@@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
 {
 	/*
 	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
+	 * flush by batching, one thread may end up seeing inconsistent PTEs
+	 * and result in having stale TLB entries.  So flush TLB forcefully
+	 * if we detect parallel PTE batching threads.
+	 *
+	 * However, some syscalls, e.g. munmap(), may free page tables, this
+	 * needs force flush everything in the given range. Otherwise this
+	 * may result in having stale TLB entries for some architectures,
+	 * e.g. aarch64, that could specify flush what level TLB.
 	 */
 	if (mm_tlb_flush_nested(tlb->mm)) {
+		/*
+		 * The aarch64 yields better performance with fullmm by
+		 * avoiding multiple CPUs spamming TLBI messages at the
+		 * same time.
+		 *
+		 * On x86 non-fullmm doesn't yield significant difference
+		 * against fullmm.
+		 */
+		tlb->fullmm = 1;
 		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+		tlb->freed_tables = 1;
 	}

 	tlb_flush_mmu(tlb);
_
Patches currently in -mm which might be from yang.shi@linux.alibaba.com are
mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch
On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@linux-foundation.org wrote:
--- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
+++ a/mm/mmu_gather.c
@@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
{
/*
* If there are parallel threads are doing PTE changes on same range
* under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
* flush by batching, a thread has stable TLB entry can fail to flush
* the TLB by observing pte_none|!pte_dirty, for example so flush TLB
* forcefully if we detect parallel PTE batching threads.
* under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
* flush by batching, one thread may end up seeing inconsistent PTEs
* and result in having stale TLB entries. So flush TLB forcefully
* if we detect parallel PTE batching threads.
*
* However, some syscalls, e.g. munmap(), may free page tables, this
* needs force flush everything in the given range. Otherwise this
* may result in having stale TLB entries for some architectures,
* e.g. aarch64, that could specify flush what level TLB.
*/
if (mm_tlb_flush_nested(tlb->mm)) {
/*
* The aarch64 yields better performance with fullmm by
* avoiding multiple CPUs spamming TLBI messages at the
* same time.
*
* On x86 non-fullmm doesn't yield significant difference
* against fullmm.
*/
tlb->fullmm = 1;
__tlb_reset_range(tlb);
- __tlb_adjust_range(tlb, start, end - start);
tlb->freed_tables = 1;
}
tlb_flush_mmu(tlb);
Nick, Aneesh, can we now do this?
---
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 4d841369399f..8d28b83914cb 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -881,39 +881,6 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 	 */
 	if (tlb->fullmm) {
 		__flush_all_mm(mm, true);
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
-	} else if (mm_tlb_flush_nested(mm)) {
-		/*
-		 * If there is a concurrent invalidation that is clearing ptes,
-		 * then it's possible this invalidation will miss one of those
-		 * cleared ptes and miss flushing the TLB. If this invalidate
-		 * returns before the other one flushes TLBs, that can result
-		 * in it returning while there are still valid TLBs inside the
-		 * range to be invalidated.
-		 *
-		 * See mm/memory.c:tlb_finish_mmu() for more details.
-		 *
-		 * The solution to this is ensure the entire range is always
-		 * flushed here. The problem for powerpc is that the flushes
-		 * are page size specific, so this "forced flush" would not
-		 * do the right thing if there are a mix of page sizes in
-		 * the range to be invalidated. So use __flush_tlb_range
-		 * which invalidates all possible page sizes in the range.
-		 *
-		 * PWC flush probably is not be required because the core code
-		 * shouldn't free page tables in this path, but accounting
-		 * for the possibility makes us a bit more robust.
-		 *
-		 * need_flush_all is an uncommon case because page table
-		 * teardown should be done with exclusive locks held (but
-		 * after locks are dropped another invalidate could come
-		 * in), it could be optimized further if necessary.
-		 */
-		if (!tlb->need_flush_all)
-			__radix__flush_tlb_range(mm, start, end, true);
-		else
-			radix__flush_all_mm(mm);
-#endif
 	} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
 		if (!tlb->need_flush_all)
 			radix__flush_tlb_mm(mm);
On 5/27/19 4:31 PM, Peter Zijlstra wrote:
On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@linux-foundation.org wrote:
--- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
+++ a/mm/mmu_gather.c
@@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
{
/*
* If there are parallel threads are doing PTE changes on same range
* under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
* flush by batching, a thread has stable TLB entry can fail to flush
* the TLB by observing pte_none|!pte_dirty, for example so flush TLB
* forcefully if we detect parallel PTE batching threads.
* under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
* flush by batching, one thread may end up seeing inconsistent PTEs
* and result in having stale TLB entries. So flush TLB forcefully
* if we detect parallel PTE batching threads.
*
* However, some syscalls, e.g. munmap(), may free page tables, this
* needs force flush everything in the given range. Otherwise this
* may result in having stale TLB entries for some architectures,
* e.g. aarch64, that could specify flush what level TLB.
*/
if (mm_tlb_flush_nested(tlb->mm)) {
/*
* The aarch64 yields better performance with fullmm by
* avoiding multiple CPUs spamming TLBI messages at the
* same time.
*
* On x86 non-fullmm doesn't yield significant difference
* against fullmm.
*/
tlb->fullmm = 1;
__tlb_reset_range(tlb);
- __tlb_adjust_range(tlb, start, end - start);
tlb->freed_tables = 1;
}
tlb_flush_mmu(tlb);
Nick, Aneesh, can we now do this?
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 4d841369399f..8d28b83914cb 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -881,39 +881,6 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 */
if (tlb->fullmm) {
__flush_all_mm(mm, true);
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
- } else if (mm_tlb_flush_nested(mm)) {
/*
* If there is a concurrent invalidation that is clearing ptes,
* then it's possible this invalidation will miss one of those
* cleared ptes and miss flushing the TLB. If this invalidate
* returns before the other one flushes TLBs, that can result
* in it returning while there are still valid TLBs inside the
* range to be invalidated.
*
* See mm/memory.c:tlb_finish_mmu() for more details.
*
* The solution to this is ensure the entire range is always
* flushed here. The problem for powerpc is that the flushes
* are page size specific, so this "forced flush" would not
* do the right thing if there are a mix of page sizes in
* the range to be invalidated. So use __flush_tlb_range
* which invalidates all possible page sizes in the range.
*
* PWC flush probably is not be required because the core code
* shouldn't free page tables in this path, but accounting
* for the possibility makes us a bit more robust.
*
* need_flush_all is an uncommon case because page table
* teardown should be done with exclusive locks held (but
* after locks are dropped another invalidate could come
* in), it could be optimized further if necessary.
*/
if (!tlb->need_flush_all)
__radix__flush_tlb_range(mm, start, end, true);
else
radix__flush_all_mm(mm);
-#endif
} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
if (!tlb->need_flush_all)
radix__flush_tlb_mm(mm);
I guess we can revert most of commit 02390f66bd2362df114a0a0770d80ec33061f6d1. That is the only place we flush multiple page sizes? But should we evaluate the performance impact of that fullmm flush on ppc64?
-aneesh
On Mon, May 27, 2019 at 06:59:08PM +0530, Aneesh Kumar K.V wrote:
On 5/27/19 4:31 PM, Peter Zijlstra wrote:
On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@linux-foundation.org wrote:
--- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
+++ a/mm/mmu_gather.c
@@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
{
/*
* If there are parallel threads are doing PTE changes on same range
* under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
* flush by batching, a thread has stable TLB entry can fail to flush
* the TLB by observing pte_none|!pte_dirty, for example so flush TLB
* forcefully if we detect parallel PTE batching threads.
* under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
* flush by batching, one thread may end up seeing inconsistent PTEs
* and result in having stale TLB entries. So flush TLB forcefully
* if we detect parallel PTE batching threads.
*
* However, some syscalls, e.g. munmap(), may free page tables, this
* needs force flush everything in the given range. Otherwise this
* may result in having stale TLB entries for some architectures,
* e.g. aarch64, that could specify flush what level TLB.
*/
if (mm_tlb_flush_nested(tlb->mm)) {
/*
* The aarch64 yields better performance with fullmm by
* avoiding multiple CPUs spamming TLBI messages at the
* same time.
*
* On x86 non-fullmm doesn't yield significant difference
* against fullmm.
*/
tlb->fullmm = 1;
__tlb_reset_range(tlb);
- __tlb_adjust_range(tlb, start, end - start);
tlb->freed_tables = 1;
}

tlb_flush_mmu(tlb);
Nick, Aneesh, can we now do this?
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 4d841369399f..8d28b83914cb 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -881,39 +881,6 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 */
if (tlb->fullmm) {
__flush_all_mm(mm, true);
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
- } else if (mm_tlb_flush_nested(mm)) {
/*
* If there is a concurrent invalidation that is clearing ptes,
* then it's possible this invalidation will miss one of those
* cleared ptes and miss flushing the TLB. If this invalidate
* returns before the other one flushes TLBs, that can result
* in it returning while there are still valid TLBs inside the
* range to be invalidated.
*
* See mm/memory.c:tlb_finish_mmu() for more details.
*
* The solution to this is ensure the entire range is always
* flushed here. The problem for powerpc is that the flushes
* are page size specific, so this "forced flush" would not
* do the right thing if there are a mix of page sizes in
* the range to be invalidated. So use __flush_tlb_range
* which invalidates all possible page sizes in the range.
*
* PWC flush probably is not be required because the core code
* shouldn't free page tables in this path, but accounting
* for the possibility makes us a bit more robust.
*
* need_flush_all is an uncommon case because page table
* teardown should be done with exclusive locks held (but
* after locks are dropped another invalidate could come
* in), it could be optimized further if necessary.
*/
if (!tlb->need_flush_all)
__radix__flush_tlb_range(mm, start, end, true);
else
radix__flush_all_mm(mm);
-#endif
} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
if (!tlb->need_flush_all)
radix__flush_tlb_mm(mm);
I guess we can revert most of commit 02390f66bd2362df114a0a0770d80ec33061f6d1. That is the only place we flush multiple page sizes? But should we evaluate the performance impact of that fullmm flush on ppc64?
Maybe, but given the patch that went into -mm, PPC will never hit that branch I killed anymore -- and that really shouldn't be in architecture code anyway.
Also; as I noted last time: __radix___flush_tlb_range() and __radix__flush_tlb_range_psize() look similar enough that they might want to be a single function (and instead of @flush_all_sizes, have it take @gflush, @hflush, @flush and @pwc).
----- Original Message -----
On Mon, May 27, 2019 at 06:59:08PM +0530, Aneesh Kumar K.V wrote:
On 5/27/19 4:31 PM, Peter Zijlstra wrote:
On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@linux-foundation.org wrote:
--- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
+++ a/mm/mmu_gather.c
@@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
{
/*
* If there are parallel threads are doing PTE changes on same range
* under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
* flush by batching, a thread has stable TLB entry can fail to flush
* the TLB by observing pte_none|!pte_dirty, for example so flush TLB
* forcefully if we detect parallel PTE batching threads.
* under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
* flush by batching, one thread may end up seeing inconsistent PTEs
* and result in having stale TLB entries. So flush TLB forcefully
* if we detect parallel PTE batching threads.
*
* However, some syscalls, e.g. munmap(), may free page tables, this
* needs force flush everything in the given range. Otherwise this
* may result in having stale TLB entries for some architectures,
* e.g. aarch64, that could specify flush what level TLB.
*/
if (mm_tlb_flush_nested(tlb->mm)) {
/*
* The aarch64 yields better performance with fullmm by
* avoiding multiple CPUs spamming TLBI messages at the
* same time.
*
* On x86 non-fullmm doesn't yield significant difference
* against fullmm.
*/
tlb->fullmm = 1;
__tlb_reset_range(tlb);
- __tlb_adjust_range(tlb, start, end - start);
tlb->freed_tables = 1;
}

tlb_flush_mmu(tlb);
Nick, Aneesh, can we now do this?
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 4d841369399f..8d28b83914cb 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -881,39 +881,6 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 */
if (tlb->fullmm) {
__flush_all_mm(mm, true);
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
- } else if (mm_tlb_flush_nested(mm)) {
/*
* If there is a concurrent invalidation that is clearing ptes,
* then it's possible this invalidation will miss one of those
* cleared ptes and miss flushing the TLB. If this invalidate
* returns before the other one flushes TLBs, that can result
* in it returning while there are still valid TLBs inside the
* range to be invalidated.
*
* See mm/memory.c:tlb_finish_mmu() for more details.
*
* The solution to this is ensure the entire range is always
* flushed here. The problem for powerpc is that the flushes
* are page size specific, so this "forced flush" would not
* do the right thing if there are a mix of page sizes in
* the range to be invalidated. So use __flush_tlb_range
* which invalidates all possible page sizes in the range.
*
* PWC flush probably is not be required because the core code
* shouldn't free page tables in this path, but accounting
* for the possibility makes us a bit more robust.
*
* need_flush_all is an uncommon case because page table
* teardown should be done with exclusive locks held (but
* after locks are dropped another invalidate could come
* in), it could be optimized further if necessary.
*/
if (!tlb->need_flush_all)
__radix__flush_tlb_range(mm, start, end, true);
else
radix__flush_all_mm(mm);
-#endif
} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
if (!tlb->need_flush_all)
radix__flush_tlb_mm(mm);
I guess we can revert most of commit 02390f66bd2362df114a0a0770d80ec33061f6d1. That is the only place we flush multiple page sizes? But should we evaluate the performance impact of that fullmm flush on ppc64?
Hi,
I ran a test on ppc64le, using [1] as the benchmark (it's the same one that produced the aarch64 numbers in [2]):
16 threads, each looping 100k times over:
  mmap(16M)
  touch 1 page
  madvise(DONTNEED)
  munmap(16M)
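For illustration, each iteration described above corresponds roughly to the loop body below (a minimal sketch, not the actual reproducer from [1]; MAP_LEN and one_iteration() are made-up names):

	#include <err.h>
	#include <sys/mman.h>

	#define MAP_LEN (16UL * 1024 * 1024)    /* the 16M mapping */

	static void one_iteration(void)
	{
		char *p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			err(1, "mmap");

		p[0] = 1;                               /* touch 1 page */
		if (madvise(p, MAP_LEN, MADV_DONTNEED)) /* PTE zap under read mmap_sem */
			err(1, "madvise");
		if (munmap(p, MAP_LEN))                 /* unmap, may free page tables */
			err(1, "munmap");
	}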
10 runs averaged with and without v3 patch:
v5.2-rc2-24-gbec7550cca10
--------------------------
          mean     stddev
real    37.382      2.780
user     1.420      0.078
sys     54.658      1.855
v5.2-rc2-24-gbec7550cca10 + "mm: mmu_gather: remove __tlb_reset_range() for force flush"
-----------------------------------------------------------------------------------------
          mean     stddev
real    37.119      2.105
user     1.548      0.087
sys     55.698      1.357
Hardware
--------
Power9 Model 8335-GTH (ibm,witherspoon,128CPUs,256GB RAM,6 nodes).
cpu        : POWER9, altivec supported
clock      : 3300.000000MHz
revision   : 2.2 (pvr 004e 1202)
# numactl -H
available: 6 nodes (0,8,252-255)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 130722 MB
node 0 free: 127327 MB
node 8 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 8 size: 130784 MB
node 8 free: 129864 MB
node 252 cpus:
node 252 size: 0 MB
node 252 free: 0 MB
node 253 cpus:
node 253 size: 0 MB
node 253 free: 0 MB
node 254 cpus:
node 254 size: 0 MB
node 254 free: 0 MB
node 255 cpus:
node 255 size: 0 MB
node 255 free: 0 MB
node distances:
node   0   8  252  253  254  255
  0:  10  40  80  80  80  80
  8:  40  10  80  80  80  80
 252:  80  80  10  80  80  80
 253:  80  80  80  10  80  80
 254:  80  80  80  80  10  80
 255:  80  80  80  80  80  10
Regards, Jan
[1] https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/... [2] https://lore.kernel.org/linux-mm/1158926942.23199905.1558020575293.JavaMail....
Maybe, but given the patch that went into -mm, PPC will never hit that branch I killed anymore -- and that really shouldn't be in architecture code anyway.
Also; as I noted last time: __radix___flush_tlb_range() and __radix__flush_tlb_range_psize() look similar enough that they might want to be a single function (and instead of @flush_all_sizes, have it take @gflush, @hflush, @flush and @pwc).
Peter Zijlstra's on May 28, 2019 12:25 am:
On Mon, May 27, 2019 at 06:59:08PM +0530, Aneesh Kumar K.V wrote:
On 5/27/19 4:31 PM, Peter Zijlstra wrote:
On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@linux-foundation.org wrote:
--- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
+++ a/mm/mmu_gather.c
@@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
{
/*
* If there are parallel threads are doing PTE changes on same range
* under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
* flush by batching, a thread has stable TLB entry can fail to flush
* the TLB by observing pte_none|!pte_dirty, for example so flush TLB
* forcefully if we detect parallel PTE batching threads.
* under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
* flush by batching, one thread may end up seeing inconsistent PTEs
* and result in having stale TLB entries. So flush TLB forcefully
* if we detect parallel PTE batching threads.
*
* However, some syscalls, e.g. munmap(), may free page tables, this
* needs force flush everything in the given range. Otherwise this
* may result in having stale TLB entries for some architectures,
* e.g. aarch64, that could specify flush what level TLB.
*/
if (mm_tlb_flush_nested(tlb->mm)) {
/*
* The aarch64 yields better performance with fullmm by
* avoiding multiple CPUs spamming TLBI messages at the
* same time.
*
* On x86 non-fullmm doesn't yield significant difference
* against fullmm.
*/
tlb->fullmm = 1;
__tlb_reset_range(tlb);
- __tlb_adjust_range(tlb, start, end - start);
tlb->freed_tables = 1;
}

tlb_flush_mmu(tlb);
Nick, Aneesh, can we now do this?
Sorry I meant to get to that but forgot.
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 4d841369399f..8d28b83914cb 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -881,39 +881,6 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 */
if (tlb->fullmm) {
__flush_all_mm(mm, true);
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
- } else if (mm_tlb_flush_nested(mm)) {
/*
* If there is a concurrent invalidation that is clearing ptes,
* then it's possible this invalidation will miss one of those
* cleared ptes and miss flushing the TLB. If this invalidate
* returns before the other one flushes TLBs, that can result
* in it returning while there are still valid TLBs inside the
* range to be invalidated.
*
* See mm/memory.c:tlb_finish_mmu() for more details.
*
* The solution to this is ensure the entire range is always
* flushed here. The problem for powerpc is that the flushes
* are page size specific, so this "forced flush" would not
* do the right thing if there are a mix of page sizes in
* the range to be invalidated. So use __flush_tlb_range
* which invalidates all possible page sizes in the range.
*
* PWC flush probably is not be required because the core code
* shouldn't free page tables in this path, but accounting
* for the possibility makes us a bit more robust.
*
* need_flush_all is an uncommon case because page table
* teardown should be done with exclusive locks held (but
* after locks are dropped another invalidate could come
* in), it could be optimized further if necessary.
*/
if (!tlb->need_flush_all)
__radix__flush_tlb_range(mm, start, end, true);
else
radix__flush_all_mm(mm);
-#endif
} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
if (!tlb->need_flush_all)
radix__flush_tlb_mm(mm);
I guess we can revert most of commit 02390f66bd2362df114a0a0770d80ec33061f6d1. That is the only place we flush multiple page sizes? But should we evaluate the performance impact of that fullmm flush on ppc64?
Maybe, but given the patch that went into -mm, PPC will never hit that branch I killed anymore -- and that really shouldn't be in architecture code anyway.
Yeah well if mm/ does this then sure it's dead and can go.
I don't think it's very nice to set fullmm and freed_tables for this case though. Is this concurrent zapping an important fast path? It must have been, in order to justify all this complexity to the mm, so we don't want to tie this boat anchor to it AFAIKS?
Is the problem just that the freed page tables flags get cleared by __tlb_reset_range()? Why not just remove that then, so the bits are set properly for the munmap?
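For concreteness, a rough sketch of the alternative being suggested here (keep the freed_tables/cleared_* bits and only widen the range, rather than resetting everything) -- illustration only, not the patch that went into -mm, and the follow-ups below argue why it is not sufficient on its own:

	if (mm_tlb_flush_nested(tlb->mm)) {
		/*
		 * Don't call __tlb_reset_range() here, so a nested
		 * munmap() keeps tlb->freed_tables and tlb->cleared_*
		 * set; just make sure the whole range is covered.
		 */
		__tlb_adjust_range(tlb, start, end - start);
	}

	tlb_flush_mmu(tlb);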
Also; as I noted last time: __radix___flush_tlb_range() and __radix__flush_tlb_range_psize() look similar enough that they might want to be a single function (and instead of @flush_all_sizes, have it take @gflush, @hflush, @flush and @pwc).
Yeah, it could possibly use a cleanup pass. I have a few patches sitting around to make a few more optimisations around the place, was going to look at some refactoring after that.
Thanks, Nick
On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
Peter Zijlstra's on May 28, 2019 12:25 am:
On Mon, May 27, 2019 at 06:59:08PM +0530, Aneesh Kumar K.V wrote:
On 5/27/19 4:31 PM, Peter Zijlstra wrote:
On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@linux-foundation.org wrote:
--- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
+++ a/mm/mmu_gather.c
@@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
{
/*
* If there are parallel threads are doing PTE changes on same range
* under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
* flush by batching, a thread has stable TLB entry can fail to flush
* the TLB by observing pte_none|!pte_dirty, for example so flush TLB
* forcefully if we detect parallel PTE batching threads.
* under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
* flush by batching, one thread may end up seeing inconsistent PTEs
* and result in having stale TLB entries. So flush TLB forcefully
* if we detect parallel PTE batching threads.
*
* However, some syscalls, e.g. munmap(), may free page tables, this
* needs force flush everything in the given range. Otherwise this
* may result in having stale TLB entries for some architectures,
* e.g. aarch64, that could specify flush what level TLB.
*/
if (mm_tlb_flush_nested(tlb->mm)) {
/*
* The aarch64 yields better performance with fullmm by
* avoiding multiple CPUs spamming TLBI messages at the
* same time.
*
* On x86 non-fullmm doesn't yield significant difference
* against fullmm.
*/
tlb->fullmm = 1;
__tlb_reset_range(tlb);
- __tlb_adjust_range(tlb, start, end - start);
tlb->freed_tables = 1;
}

tlb_flush_mmu(tlb);
Maybe, but given the patch that went into -mm, PPC will never hit that branch I killed anymore -- and that really shouldn't be in architecture code anyway.
Yeah well if mm/ does this then sure it's dead and can go.
I don't think it's very nice to set fullmm and freed_tables for this case though. Is this concurrent zapping an important fast path? It must have been, in order to justify all this complexity to the mm, so we don't want to tie this boat anchor to it AFAIKS?
I'm not convinced it's an important fast path; afaict it is an unfortunate correctness issue caused by allowing concurrent frees.
Is the problem just that the freed page tables flags get cleared by __tlb_reset_range()? Why not just remove that then, so the bits are set properly for the munmap?
That's insufficient; as argued in my initial suggestion:
https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass....
Since we don't know what was flushed by the concurrent flushes, we must flush all state (page sizes, tables, etc.).
But it looks like benchmarks (for the one test-case we have) seem to favour flushing the world over flushing a smaller range.
Peter Zijlstra's on May 31, 2019 7:49 pm:
On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
Peter Zijlstra's on May 28, 2019 12:25 am:
On Mon, May 27, 2019 at 06:59:08PM +0530, Aneesh Kumar K.V wrote:
On 5/27/19 4:31 PM, Peter Zijlstra wrote:
On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@linux-foundation.org wrote:
--- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
+++ a/mm/mmu_gather.c
@@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
{
/*
* If there are parallel threads are doing PTE changes on same range
* under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
* flush by batching, a thread has stable TLB entry can fail to flush
* the TLB by observing pte_none|!pte_dirty, for example so flush TLB
* forcefully if we detect parallel PTE batching threads.
* under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
* flush by batching, one thread may end up seeing inconsistent PTEs
* and result in having stale TLB entries. So flush TLB forcefully
* if we detect parallel PTE batching threads.
*
* However, some syscalls, e.g. munmap(), may free page tables, this
* needs force flush everything in the given range. Otherwise this
* may result in having stale TLB entries for some architectures,
* e.g. aarch64, that could specify flush what level TLB.
*/
if (mm_tlb_flush_nested(tlb->mm)) {
/*
* The aarch64 yields better performance with fullmm by
* avoiding multiple CPUs spamming TLBI messages at the
* same time.
*
* On x86 non-fullmm doesn't yield significant difference
* against fullmm.
*/
tlb->fullmm = 1;
__tlb_reset_range(tlb);
- __tlb_adjust_range(tlb, start, end - start);
tlb->freed_tables = 1;
}

tlb_flush_mmu(tlb);
Maybe, but given the patch that went into -mm, PPC will never hit that branch I killed anymore -- and that really shouldn't be in architecture code anyway.
Yeah well if mm/ does this then sure it's dead and can go.
I don't think it's very nice to set fullmm and freed_tables for this case though. Is this concurrent zapping an important fast path? It must have been, in order to justify all this complexity to the mm, so we don't want to tie this boat anchor to it AFAIKS?
I'm not convinced it's an important fast path; afaict it is an unfortunate correctness issue caused by allowing concurrent frees.
I mean -- concurrent freeing was an important fastpath, right? And concurrent freeing means that you hit this case. So this case itself should be important too.
Is the problem just that the freed page tables flags get cleared by __tlb_reset_range()? Why not just remove that then, so the bits are set properly for the munmap?
That's insufficient; as argued in my initial suggestion:
https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass....
Since we don't know what was flushed by the concurrent flushes, we must flush all state (page sizes, tables, etc.).
Page tables should not be concurrently freed I think. Just don't clear those page table free flags and it should be okay. Page sizes yes, but we accommodated for that in the arch code. I could see reason to add a flag to the gather struct like "concurrent_free" and set that from the generic code, which the arch has to take care of.
But it looks like benchmarks (for the one test-case we have) seem to favour flushing the world over flushing a smaller range.
Testing on 16MB unmap is possibly not a good benchmark, I didn't run it exactly but it looks likely to go beyond the range flush threshold and flush the entire PID anyway.
Thanks, Nick
Hi Nick,
On Mon, Jun 03, 2019 at 12:11:38PM +1000, Nicholas Piggin wrote:
Peter Zijlstra's on May 31, 2019 7:49 pm:
On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
I don't think it's very nice to set fullmm and freed_tables for this case though. Is this concurrent zapping an important fast path? It must have been, in order to justify all this complexity to the mm, so we don't want to tie this boat anchor to it AFAIKS?
I'm not convinced it's an important fast path; afaict it is an unfortunate correctness issue caused by allowing concurrent frees.
I mean -- concurrent freeing was an important fastpath, right? And concurrent freeing means that you hit this case. So this case itself should be important too.
I honestly don't think we (Peter and I) know. Our first involvement with this was because TLBs were found to contain stale user entries:
https://lore.kernel.org/linux-arm-kernel/1817839533.20996552.1557065445233.J...
so the initial work to support the concurrent freeing was done separately and, I assume, motivated by some real workloads. I would also very much like to know more about that, since nothing remotely realistic has surfaced in this discussion, other than some historical glibc thing which has long since been fixed.
Is the problem just that the freed page tables flags get cleared by __tlb_reset_range()? Why not just remove that then, so the bits are set properly for the munmap?
That's insufficient; as argued in my initial suggestion:
https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass....
Since we don't know what was flushed by the concurrent flushes, we must flush all state (page sizes, tables, etc.).
Page tables should not be concurrently freed I think. Just don't clear those page table free flags and it should be okay. Page sizes yes, but we accommodated for that in the arch code. I could see reason to add a flag to the gather struct like "concurrent_free" and set that from the generic code, which the arch has to take care of.
I think you're correct that two CPUs cannot free the page tables concurrently (I misunderstood this initially), although I also think there may be some subtle issues if tlb->freed_tables is not set, depending on the architecture. Roughly speaking, if one CPU is clearing a PMD as part of munmap() and another CPU in madvise() does only last-level TLB invalidation, then I think there's the potential for the invalidation to be ineffective if observing a cleared PMD doesn't imply that the last level has been unmapped from the perspective of the page-table walker.
But it looks like benchmarks (for the one test-case we have) seem to favour flushing the world over flushing a smaller range.
Testing on 16MB unmap is possibly not a good benchmark, I didn't run it exactly but it looks likely to go beyond the range flush threshold and flush the entire PID anyway.
If we can get a better idea of what a "good benchmark" might look like (i.e. something that is representative of the cases in which real workloads are likely to trigger this path) then we can definitely try to optimise around that.
In the meantime, I would really like to see this patch land in mainline since it fixes a regression.
Will
Will Deacon's on June 3, 2019 8:30 pm:
Hi Nick,
On Mon, Jun 03, 2019 at 12:11:38PM +1000, Nicholas Piggin wrote:
Peter Zijlstra's on May 31, 2019 7:49 pm:
On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
I don't think it's very nice to set fullmm and freed_tables for this case though. Is this concurrent zapping an important fast path? It must have been, in order to justify all this complexity to the mm, so we don't want to tie this boat anchor to it AFAIKS?
I'm not convinced it's an important fast path; afaict it is an unfortunate correctness issue caused by allowing concurrent frees.
I mean -- concurrent freeing was an important fastpath, right? And concurrent freeing means that you hit this case. So this case itself should be important too.
I honestly don't think we (Peter and I) know. Our first involvement with this was because TLBs were found to contain stale user entries:
https://lore.kernel.org/linux-arm-kernel/1817839533.20996552.1557065445233.J...
so the initial work to support the concurrent freeing was done separately and, I assume, motivated by some real workloads. I would also very much like to know more about that, since nothing remotely realistic has surfaced in this discussion, other than some historical glibc thing which has long since been fixed.
Well, it seems like it is important. While the complexity is carried in the mm, we should not skimp on this last small piece.
Is the problem just that the freed page tables flags get cleared by __tlb_reset_range()? Why not just remove that then, so the bits are set properly for the munmap?
That's insufficient; as argued in my initial suggestion:
https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass....
Since we don't know what was flushed by the concurrent flushes, we must flush all state (page sizes, tables, etc.).
Page tables should not be concurrently freed I think. Just don't clear those page table free flags and it should be okay. Page sizes yes, but we accommodated for that in the arch code. I could see reason to add a flag to the gather struct like "concurrent_free" and set that from the generic code, which the arch has to take care of.
I think you're correct that two CPUs cannot free the page tables concurrently (I misunderstood this initially), although I also think there may be some subtle issues if tlb->freed_tables is not set, depending on the architecture. Roughly speaking, if one CPU is clearing a PMD as part of munmap() and another CPU in madvise() does only last-level TLB invalidation, then I think there's the potential for the invalidation to be ineffective if observing a cleared PMD doesn't imply that the last level has been unmapped from the perspective of the page-table walker.
That should not be the case because the last level table should have had all entries cleared before the pointer to it has been cleared.
So the page table walker could begin from the now-freed page table, but it would never instantiate a valid TLB entry from there. So a TLB invalidation would behave properly despite not flushing page tables.
Powerpc at least would want to avoid over flushing here, AFAIKS.
But it looks like benchmarks (for the one test-case we have) seem to favour flushing the world over flushing a smaller range.
Testing on 16MB unmap is possibly not a good benchmark, I didn't run it exactly but it looks likely to go beyond the range flush threshold and flush the entire PID anyway.
If we can get a better idea of what a "good benchmark" might look like (i.e. something that is representative of the cases in which real workloads are likely to trigger this path) then we can definitely try to optimise around that.
Hard to say unfortunately. A smaller unmap range to start with, but even then when you have a TLB over-flushing case, then an unmap micro benchmark is not a great test because you'd like to see more impact of other useful entries being flushed (e.g., you need an actual working set).
In the meantime, I would really like to see this patch land in mainline since it fixes a regression.
Sorry I didn't provide input earlier. I would like to improve the fix or at least make an option for archs to provide an optimised way to flush this case, so it would be nice not to fix archs this way and then have to change the fix significantly right away.
But the bug does need to be fixed of course, if there needs to be more thought about it maybe it's best to take this fix for next release.
Thanks, Nick
Hi Nick,
On Tue, Jun 04, 2019 at 12:10:37AM +1000, Nicholas Piggin wrote:
Will Deacon's on June 3, 2019 8:30 pm:
On Mon, Jun 03, 2019 at 12:11:38PM +1000, Nicholas Piggin wrote:
Peter Zijlstra's on May 31, 2019 7:49 pm:
On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
I don't think it's very nice to set fullmm and freed_tables for this case though. Is this concurrent zapping an important fast path? It must have been, in order to justify all this complexity to the mm, so we don't want to tie this boat anchor to it AFAIKS?
I'm not convinced it's an important fast path; afaict it is an unfortunate correctness issue caused by allowing concurrent frees.
I mean -- concurrent freeing was an important fastpath, right? And concurrent freeing means that you hit this case. So this case itself should be important too.
I honestly don't think we (Peter and I) know. Our first involvement with this was because TLBs were found to contain stale user entries:
https://lore.kernel.org/linux-arm-kernel/1817839533.20996552.1557065445233.J...
so the initial work to support the concurrent freeing was done separately and, I assume, motivated by some real workloads. I would also very much like to know more about that, since nothing remotely realistic has surfaced in this discussion, other than some historical glibc thing which has long since been fixed.
Well, it seems like it is important. While the complexity is carried in the mm, we should not skimp on this last small piece.
As I say, I really don't know. But yes, if we can do something better we should.
Is the problem just that the freed page tables flags get cleared by __tlb_reset_range()? Why not just remove that then, so the bits are set properly for the munmap?
That's insufficient; as argued in my initial suggestion:
https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass....
Since we don't know what was flushed by the concurrent flushes, we must flush all state (page sizes, tables, etc.).
Page tables should not be concurrently freed I think. Just don't clear those page table free flags and it should be okay. Page sizes yes, but we accommodated for that in the arch code. I could see reason to add a flag to the gather struct like "concurrent_free" and set that from the generic code, which the arch has to take care of.
I think you're correct that two CPUs cannot free the page tables concurrently (I misunderstood this initially), although I also think there may be some subtle issues if tlb->freed_tables is not set, depending on the architecture. Roughly speaking, if one CPU is clearing a PMD as part of munmap() and another CPU in madvise() does only last-level TLB invalidation, then I think there's the potential for the invalidation to be ineffective if observing a cleared PMD doesn't imply that the last level has been unmapped from the perspective of the page-table walker.
That should not be the case because the last level table should have had all entries cleared before the pointer to it has been cleared.
The devil is in the detail here, and I think specifically it depends what you mean by "before". Does that mean memory barrier, or special page-table walker barrier, or TLB invalidation or ...?
So the page table walker could begin from the now-freed page table, but it would never instantiate a valid TLB entry from there. So a TLB invalidation would behave properly despite not flushing page tables.
Powerpc at least would want to avoid over flushing here, AFAIKS.
For arm64 it really depends how often this hits. Simply not setting tlb->freed_tables would also break things for us, because we have an optimisation where we elide invalidation in the fullmm && !freed_tables case, since this is indicative of the mm going away and so we simply avoid reallocating its ASID.
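To make that concrete, the decision being described looks roughly like the sketch below (illustration only, not the actual arm64 tlb_flush() implementation; flush_range_last_level() is a made-up stand-in for the ranged invalidation helper):

	static void tlb_flush_sketch(struct mmu_gather *tlb)
	{
		/* last-level-only invalidation is OK if no tables were freed */
		bool last_level = !tlb->freed_tables;

		if (tlb->fullmm) {
			/*
			 * The mm is going away and its ASID won't be reused
			 * without a full invalidation, so only the walk-cache
			 * needs flushing, and only if page tables were freed.
			 */
			if (!last_level)
				flush_tlb_mm(tlb->mm);
			return;
		}

		flush_range_last_level(tlb->mm, tlb->start, tlb->end, last_level);
	}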
But it looks like benchmarks (for the one test-case we have) seem to favour flushing the world over flushing a smaller range.
Testing on 16MB unmap is possibly not a good benchmark, I didn't run it exactly but it looks likely to go beyond the range flush threshold and flush the entire PID anyway.
If we can get a better idea of what a "good benchmark" might look like (i.e. something that is representative of the cases in which real workloads are likely to trigger this path) then we can definitely try to optimise around that.
Hard to say unfortunately. A smaller unmap range to start with, but even then when you have a TLB over-flushing case, then an unmap micro benchmark is not a great test because you'd like to see more impact of other useful entries being flushed (e.g., you need an actual working set).
Right, sounds like somebody needs to do some better analysis than what's been done so far.
In the meantime, I would really like to see this patch land in mainline since it fixes a regression.
Sorry I didn't provide input earlier. I would like to improve the fix or at least make an option for archs to provide an optimised way to flush this case, so it would be nice not to fix archs this way and then have to change the fix significantly right away.
Please send patches ;)
But the bug does need to be fixed of course, if there needs to be more thought about it maybe it's best to take this fix for next release.
Agreed.
Will
Will Deacon's on June 4, 2019 3:57 am:
Hi Nick,
On Tue, Jun 04, 2019 at 12:10:37AM +1000, Nicholas Piggin wrote:
Will Deacon's on June 3, 2019 8:30 pm:
On Mon, Jun 03, 2019 at 12:11:38PM +1000, Nicholas Piggin wrote:
Peter Zijlstra's on May 31, 2019 7:49 pm:
On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
I don't think it's very nice to set fullmm and freed_tables for this case though. Is this concurrent zapping an important fast path? It must have been, in order to justify all this complexity to the mm, so we don't want to tie this boat anchor to it AFAIKS?
I'm not convinced it's an important fast path; afaict it is an unfortunate correctness issue caused by allowing concurrent frees.
I mean -- concurrent freeing was an important fastpath, right? And concurrent freeing means that you hit this case. So this case itself should be important too.
I honestly don't think we (Peter and I) know. Our first involvement with this was because TLBs were found to contain stale user entries:
https://lore.kernel.org/linux-arm-kernel/1817839533.20996552.1557065445233.J...
so the initial work to support the concurrent freeing was done separately and, I assume, motivated by some real workloads. I would also very much like to know more about that, since nothing remotely realistic has surfaced in this discussion, other than some historical glibc thing which has long since been fixed.
Well, it seems like it is important. While the complexity is carried in the mm, we should not skimp on this last small piece.
As I say, I really don't know. But yes, if we can do something better we should.
Is the problem just that the freed page tables flags get cleared by __tlb_reset_range()? Why not just remove that then, so the bits are set properly for the munmap?
That's insufficient; as argued in my initial suggestion:
https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass....
Since we don't know what was flushed by the concurrent flushes, we must flush all state (page sizes, tables, etc.).
Page tables should not be concurrently freed I think. Just don't clear those page table free flags and it should be okay. Page sizes yes, but we accommodated for that in the arch code. I could see reason to add a flag to the gather struct like "concurrent_free" and set that from the generic code, which the arch has to take care of.
I think you're correct that two CPUs cannot free the page tables concurrently (I misunderstood this initially), although I also think there may be some subtle issues if tlb->freed_tables is not set, depending on the architecture. Roughly speaking, if one CPU is clearing a PMD as part of munmap() and another CPU in madvise() does only last-level TLB invalidation, then I think there's the potential for the invalidation to be ineffective if observing a cleared PMD doesn't imply that the last level has been unmapped from the perspective of the page-table walker.
That should not be the case because the last level table should have had all entries cleared before the pointer to it has been cleared.
The devil is in the detail here, and I think specifically it depends what you mean by "before". Does that mean memory barrier, or special page-table walker barrier, or TLB invalidation or ...?
I don't know that matters. It is a complicating factor, but would not be a new one to page table freeing. The issue really is the TLB entries (not page walk entry) which need to be flushed by the concurrent unmaps. Even without any page table freeing activity (so you would still have the page table connected), the ordering and flushing needs to be right such that it can not re-instantiate a new TLB from the page table with the old PTEs.
If that is solved, then the subsequent step of freeing the page table page doesn't introduce a new window where old PTEs could be read from the same table via page walk cache after it is disconnected.
So the page table walker could begin from the now-freed page table, but it would never instantiate a valid TLB entry from there. So a TLB invalidation would behave properly despite not flushing page tables.
Powerpc at least would want to avoid over flushing here, AFAIKS.
For arm64 it really depends how often this hits. Simply not setting tlb->freed_tables would also break things for us, because we have an optimisation where we elide invalidation in the fullmm && !freed_tables case, since this is indicative of the mm going away and so we simply avoid reallocating its ASID.
It wouldn't be not setting it, but rather not clearing it.
Not sure you have to worry about concurrent unmaps in the fullmm case either.
But it looks like benchmarks (for the one test-case we have) seem to favour flushing the world over flushing a smaller range.
Testing on 16MB unmap is possibly not a good benchmark, I didn't run it exactly but it looks likely to go beyond the range flush threshold and flush the entire PID anyway.
If we can get a better idea of what a "good benchmark" might look like (i.e. something that is representative of the cases in which real workloads are likely to trigger this path) then we can definitely try to optimise around that.
Hard to say unfortunately. A smaller unmap range to start with, but even then when you have a TLB over-flushing case, then an unmap micro benchmark is not a great test because you'd like to see more impact of other useful entries being flushed (e.g., you need an actual working set).
Right, sounds like somebody needs to do some better analysis than what's been done so far.
In the meantime, I would really like to see this patch land in mainline since it fixes a regression.
Sorry I didn't provide input earlier. I would like to improve the fix or at least make an option for archs to provide an optimised way to flush this case, so it would be nice not to fix archs this way and then have to change the fix significantly right away.
Please send patches ;)
I have a few things lying around I put on hold until after all the recent nice refactoring and improvements. I will rebase and try to look at this issue as well and see if there is any ways to improve. Likely to need help with arch code and analysis of races.
But the bug does need to be fixed of course, if there needs to be more thought about it maybe it's best to take this fix for next release.
Agreed.
Well, I can't argue against it anymore then.
Thanks, Nick