A few new fields were added to mmu_gather to make TLB flushing smarter for huge pages, by recording which level of the page table has changed.
__tlb_reset_range() is used to reset all this page table state to "unchanged". It is called by the TLB flush path when there are parallel mapping changes on the same range under a non-exclusive lock (i.e. read mmap_sem). Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"), MADV_DONTNEED was the only operation that could zap pages in parallel, and it does not remove page tables. But the aforementioned commit may run munmap() under read mmap_sem and free page tables. This causes a bug [1] reported by Jan Stancek, since __tlb_reset_range() may pass the wrong page table state to the architecture-specific TLB flush operations.
So removing __tlb_reset_range() sounds sane. This may cause more TLB flushes for MADV_DONTNEED, but it should not be triggered very often, hence the impact should be negligible.
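(For context, the state being reset is roughly the following; this is a paraphrase of __tlb_reset_range() from mm/mmu_gather.c of that era, so treat field names and details as approximate:)

	static void __tlb_reset_range(struct mmu_gather *tlb)
	{
		if (tlb->fullmm) {
			tlb->start = tlb->end = ~0UL;
		} else {
			tlb->start = TASK_SIZE;
			tlb->end = 0;
		}
		/*
		 * Forget which levels were touched and that page tables were
		 * freed -- exactly the hints that a concurrent free_pgtables()
		 * needs to have preserved.
		 */
		tlb->freed_tables = 0;
		tlb->cleared_ptes = 0;
		tlb->cleared_pmds = 0;
		tlb->cleared_puds = 0;
		tlb->cleared_p4ds = 0;
	}

With that cleared, the architecture flush sees a gather that claims no page tables were freed, even though the parallel munmap() path just freed them.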
The original proposed fix came from Jan Stancek, who did most of the debugging of this issue; I just wrapped everything up together.
[1] https://lore.kernel.org/linux-mm/342bf1fd-f1bf-ed62-1127-e911b5032274@linux....
Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
---
 mm/mmu_gather.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1..9fd5272 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 	 * flush by batching, a thread has stable TLB entry can fail to flush
 	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
 	 * forcefully if we detect parallel PTE batching threads.
+	 *
+	 * munmap() may change mapping under non-excluse lock and also free
+	 * page tables.  Do not call __tlb_reset_range() for it.
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
+	if (mm_tlb_flush_nested(tlb->mm))
 		__tlb_adjust_range(tlb, start, end - start);
-	}
tlb_flush_mmu(tlb);
Hi all, [+Peter]
Apologies for the delay; I'm attending a conference this week so it's tricky to keep up with email.
On Wed, May 08, 2019 at 05:34:49AM +0800, Yang Shi wrote:
A few new fields were added to mmu_gather to make TLB flush smarter for huge page by telling what level of page table is changed.
__tlb_reset_range() is used to reset all these page table state to unchanged, which is called by TLB flush for parallel mapping changes for the same range under non-exclusive lock (i.e. read mmap_sem). Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"), MADV_DONTNEED is the only one who may do page zapping in parallel and it doesn't remove page tables. But, the forementioned commit may do munmap() under read mmap_sem and free page tables. This causes a bug [1] reported by Jan Stancek since __tlb_reset_range() may pass the wrong page table state to architecture specific TLB flush operations.
Yikes. Is it actually safe to run free_pgtables() concurrently for a given mm?
So, removing __tlb_reset_range() sounds sane. This may cause more TLB flush for MADV_DONTNEED, but it should be not called very often, hence the impact should be negligible.
The original proposed fix came from Jan Stancek who mainly debugged this issue, I just wrapped up everything together.
I'm still paging the nested flush logic back in, but I have some comments on the patch below.
[1] https://lore.kernel.org/linux-mm/342bf1fd-f1bf-ed62-1127-e911b5032274@linux....
Reported-by: Jan Stancek jstancek@redhat.com Tested-by: Jan Stancek jstancek@redhat.com Cc: Will Deacon will.deacon@arm.com Cc: stable@vger.kernel.org Signed-off-by: Yang Shi yang.shi@linux.alibaba.com Signed-off-by: Jan Stancek jstancek@redhat.com
mm/mmu_gather.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1..9fd5272 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 	 * flush by batching, a thread has stable TLB entry can fail to flush
Urgh, we should rewrite this comment while we're here so that it makes sense...
 	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
 	 * forcefully if we detect parallel PTE batching threads.
+	 *
+	 * munmap() may change mapping under non-excluse lock and also free
+	 * page tables.  Do not call __tlb_reset_range() for it.
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
+	if (mm_tlb_flush_nested(tlb->mm))
 		__tlb_adjust_range(tlb, start, end - start);
-	}
I don't think we can elide the call to __tlb_reset_range() entirely, since I think we do want to clear the freed_pXX bits to ensure that we walk the range with the smallest mapping granule that we have. Otherwise couldn't we have a problem if we hit a PMD that had been cleared, but the TLB invalidation for the PTEs that used to be linked below it was still pending?
Perhaps we should just set fullmm if we see that there's a concurrent unmapper rather than do a worst-case range invalidation. Do you have a feeling for how often mm_tlb_flush_nested() triggers in practice?
Will
On Thu, May 09, 2019 at 09:37:26AM +0100, Will Deacon wrote:
Hi all, [+Peter]
Right, mm/mmu_gather.c has a MAINTAINERS entry; use it.
Also added Nadav and Minchan who've poked at this issue before. And Mel, because he loves these things :-)
Apologies for the delay; I'm attending a conference this week so it's tricky to keep up with email.
On Wed, May 08, 2019 at 05:34:49AM +0800, Yang Shi wrote:
A few new fields were added to mmu_gather to make TLB flush smarter for huge page by telling what level of page table is changed.
__tlb_reset_range() is used to reset all these page table state to unchanged, which is called by TLB flush for parallel mapping changes for the same range under non-exclusive lock (i.e. read mmap_sem). Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"), MADV_DONTNEED is the only one who may do page zapping in parallel and it doesn't remove page tables. But, the forementioned commit may do munmap() under read mmap_sem and free page tables. This causes a bug [1] reported by Jan Stancek since __tlb_reset_range() may pass the
Please don't _EVER_ refer to external sources to describe the actual bug a patch is fixing. That is the primary purpose of the Changelog.
Worse, the email you reference does _NOT_ describe the actual problem. Nor do you.
wrong page table state to architecture specific TLB flush operations.
Yikes. Is it actually safe to run free_pgtables() concurrently for a given mm?
Yeah.. sorta.. it's been a source of 'interesting' things. This really isn't the first issue here.
Also, change_protection_range() is 'fun' too.
So, removing __tlb_reset_range() sounds sane. This may cause more TLB flush for MADV_DONTNEED, but it should be not called very often, hence the impact should be negligible.
The original proposed fix came from Jan Stancek who mainly debugged this issue, I just wrapped up everything together.
I'm still paging the nested flush logic back in, but I have some comments on the patch below.
[1] https://lore.kernel.org/linux-mm/342bf1fd-f1bf-ed62-1127-e911b5032274@linux....
Reported-by: Jan Stancek jstancek@redhat.com Tested-by: Jan Stancek jstancek@redhat.com Cc: Will Deacon will.deacon@arm.com Cc: stable@vger.kernel.org Signed-off-by: Yang Shi yang.shi@linux.alibaba.com Signed-off-by: Jan Stancek jstancek@redhat.com
mm/mmu_gather.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1..9fd5272 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 	 * flush by batching, a thread has stable TLB entry can fail to flush
Urgh, we should rewrite this comment while we're here so that it makes sense...
Yeah, that's atrocious. We should put the actual race in there.
 	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
 	 * forcefully if we detect parallel PTE batching threads.
+	 *
+	 * munmap() may change mapping under non-excluse lock and also free
+	 * page tables.  Do not call __tlb_reset_range() for it.
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
+	if (mm_tlb_flush_nested(tlb->mm))
 		__tlb_adjust_range(tlb, start, end - start);
-	}
I don't think we can elide the call __tlb_reset_range() entirely, since I think we do want to clear the freed_pXX bits to ensure that we walk the range with the smallest mapping granule that we have. Otherwise couldn't we have a problem if we hit a PMD that had been cleared, but the TLB invalidation for the PTEs that used to be linked below it was still pending?
That's tlb->cleared_p*, and yes agreed. That is, right until some architecture has level dependent TLBI instructions, at which point we'll need to have them all set instead of cleared.
Perhaps we should just set fullmm if we see that here's a concurrent unmapper rather than do a worst-case range invalidation. Do you have a feeling for often the mm_tlb_flush_nested() triggers in practice?
Quite a bit for certain workloads I imagine, that was the whole point of doing it.
Anyway; am I correct in understanding that the actual problem is that we've cleared freed_tables and the ARM64 tlb_flush() will then not invalidate the cache and badness happens?
Because so far nobody has actually provided a coherent description of the actual problem we're trying to solve. But I'm thinking something like the below ought to do.
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..fe768f8d612e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
 	/*
-	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * Sensible comment goes here..
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
+		/*
+		 * Since we're can't tell what we actually should have
+		 * flushed flush everything in the given range.
+		 */
+		tlb->start = start;
+		tlb->end = end;
+		tlb->freed_tables = 1;
+		tlb->cleared_ptes = 1;
+		tlb->cleared_pmds = 1;
+		tlb->cleared_puds = 1;
+		tlb->cleared_p4ds = 1;
 	}
tlb_flush_mmu(tlb);
On Thu, May 09, 2019 at 12:38:13PM +0200, Peter Zijlstra wrote:
That's tlb->cleared_p*, and yes agreed. That is, right until some architecture has level dependent TLBI instructions, at which point we'll need to have them all set instead of cleared.
Anyway; am I correct in understanding that the actual problem is that we've cleared freed_tables and the ARM64 tlb_flush() will then not invalidate the cache and badness happens?
Because so far nobody has actually provided a coherent description of the actual problem we're trying to solve. But I'm thinking something like the below ought to do.
There's another 'fun' issue I think. For architectures like ARM that have range invalidation and care about VM_EXEC for I$ invalidation, the below doesn't quite work right either.
I suspect we also have to force: tlb->vma_exec = 1.
And I don't think there's an architecture that cares, but depending on details I can construct cases where any setting of tlb->vm_hugetlb is wrong, so that is _awesome_. But I suspect the sane thing for now is to force it 0.
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..fe768f8d612e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
 	/*
-	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * Sensible comment goes here..
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
+		/*
+		 * Since we're can't tell what we actually should have
+		 * flushed flush everything in the given range.
+		 */
+		tlb->start = start;
+		tlb->end = end;
+		tlb->freed_tables = 1;
+		tlb->cleared_ptes = 1;
+		tlb->cleared_pmds = 1;
+		tlb->cleared_puds = 1;
+		tlb->cleared_p4ds = 1;
 	}
tlb_flush_mmu(tlb);
On 5/9/19 3:54 AM, Peter Zijlstra wrote:
On Thu, May 09, 2019 at 12:38:13PM +0200, Peter Zijlstra wrote:
That's tlb->cleared_p*, and yes agreed. That is, right until some architecture has level dependent TLBI instructions, at which point we'll need to have them all set instead of cleared. Anyway; am I correct in understanding that the actual problem is that we've cleared freed_tables and the ARM64 tlb_flush() will then not invalidate the cache and badness happens?
Because so far nobody has actually provided a coherent description of the actual problem we're trying to solve. But I'm thinking something like the below ought to do.
There's another 'fun' issue I think. For architectures like ARM that have range invalidation and care about VM_EXEC for I$ invalidation, the below doesn't quite work right either.
I suspect we also have to force: tlb->vma_exec = 1.
Isn't the below code in tlb_flush enough to guarantee this?
	...
	} else if (tlb->end) {
		struct vm_area_struct vma = {
			.vm_mm = tlb->mm,
			.vm_flags = (tlb->vma_exec ? VM_EXEC    : 0) |
				    (tlb->vma_huge ? VM_HUGETLB : 0),
		};
	...
And I don't think there's an architecture that cares, but depending on details I can construct cases where any setting of tlb->vm_hugetlb is wrong, so that is _awesome_. But I suspect the sane thing for now is to force it 0.
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..fe768f8d612e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
 	/*
-	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * Sensible comment goes here..
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
+		/*
+		 * Since we're can't tell what we actually should have
+		 * flushed flush everything in the given range.
+		 */
+		tlb->start = start;
+		tlb->end = end;
+		tlb->freed_tables = 1;
+		tlb->cleared_ptes = 1;
+		tlb->cleared_pmds = 1;
+		tlb->cleared_puds = 1;
+		tlb->cleared_p4ds = 1;
 	}
tlb_flush_mmu(tlb);
On Thu, May 09, 2019 at 11:35:55AM -0700, Yang Shi wrote:
On 5/9/19 3:54 AM, Peter Zijlstra wrote:
On Thu, May 09, 2019 at 12:38:13PM +0200, Peter Zijlstra wrote:
That's tlb->cleared_p*, and yes agreed. That is, right until some architecture has level dependent TLBI instructions, at which point we'll need to have them all set instead of cleared. Anyway; am I correct in understanding that the actual problem is that we've cleared freed_tables and the ARM64 tlb_flush() will then not invalidate the cache and badness happens?
Because so far nobody has actually provided a coherent description of the actual problem we're trying to solve. But I'm thinking something like the below ought to do.
There's another 'fun' issue I think. For architectures like ARM that have range invalidation and care about VM_EXEC for I$ invalidation, the below doesn't quite work right either.
I suspect we also have to force: tlb->vma_exec = 1.
Isn't the below code in tlb_flush enough to guarantee this?
	...
	} else if (tlb->end) {
		struct vm_area_struct vma = {
			.vm_mm = tlb->mm,
			.vm_flags = (tlb->vma_exec ? VM_EXEC    : 0) |
				    (tlb->vma_huge ? VM_HUGETLB : 0),
		};
Only when vma_exec is actually set... and there is no guarantee of that in the concurrent path (the last VMA we iterate might not be executable, but the TLBI we've missed might have been).
More specifically, the 'fun' case is if we have no present pages left in the whole executable mapping: in that case tlb->end == 0 and we never call into the arch code, never giving it a chance to flush the I$.
I don't think we can elide the call __tlb_reset_range() entirely, since I think we do want to clear the freed_pXX bits to ensure that we walk the range with the smallest mapping granule that we have. Otherwise couldn't we have a problem if we hit a PMD that had been cleared, but the TLB invalidation for the PTEs that used to be linked below it was still pending?
That's tlb->cleared_p*, and yes agreed. That is, right until some architecture has level dependent TLBI instructions, at which point we'll need to have them all set instead of cleared.
Perhaps we should just set fullmm if we see that here's a concurrent unmapper rather than do a worst-case range invalidation. Do you have a feeling for often the mm_tlb_flush_nested() triggers in practice?
Quite a bit for certain workloads I imagine, that was the whole point of doing it.
Anyway; am I correct in understanding that the actual problem is that we've cleared freed_tables and the ARM64 tlb_flush() will then not invalidate the cache and badness happens?
That is my understanding: only the last level is flushed, which is not enough.
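(To make "only the last level" concrete: arm64 derives a last-level-only hint from freed_tables, roughly along these lines. This is an illustrative sketch of arch/arm64/include/asm/tlb.h from that era, not the literal source, and the exact __flush_tlb_range() signature varies between kernel versions:)

	static inline void tlb_flush(struct mmu_gather *tlb)
	{
		struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0);
		/*
		 * If the gather claims no page tables were freed, only leaf
		 * (last-level) entries are invalidated and the walk cache is
		 * left alone.  With freed_tables wrongly reset to 0, stale
		 * intermediate (table) entries can survive the flush.
		 */
		bool last_level = !tlb->freed_tables;

		__flush_tlb_range(&vma, tlb->start, tlb->end, last_level);
	}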
Because so far nobody has actually provided a coherent description of the actual problem we're trying to solve. But I'm thinking something like the below ought to do.
I applied it (and fixed a small typo: s/tlb->full_mm/tlb->fullmm/). It fixes the problem for me.
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..fe768f8d612e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
 	/*
-	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * Sensible comment goes here..
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
+		/*
+		 * Since we're can't tell what we actually should have
+		 * flushed flush everything in the given range.
+		 */
+		tlb->start = start;
+		tlb->end = end;
+		tlb->freed_tables = 1;
+		tlb->cleared_ptes = 1;
+		tlb->cleared_pmds = 1;
+		tlb->cleared_puds = 1;
+		tlb->cleared_p4ds = 1;
 	}
tlb_flush_mmu(tlb);
On May 9, 2019, at 3:38 AM, Peter Zijlstra peterz@infradead.org wrote:
On Thu, May 09, 2019 at 09:37:26AM +0100, Will Deacon wrote:
Hi all, [+Peter]
Right, mm/mmu_gather.c has a MAINTAINERS entry; use it.
Also added Nadav and Minchan who've poked at this issue before. And Mel, because he loves these things :-)
Apologies for the delay; I'm attending a conference this week so it's tricky to keep up with email.
On Wed, May 08, 2019 at 05:34:49AM +0800, Yang Shi wrote:
A few new fields were added to mmu_gather to make TLB flush smarter for huge page by telling what level of page table is changed.
__tlb_reset_range() is used to reset all these page table state to unchanged, which is called by TLB flush for parallel mapping changes for the same range under non-exclusive lock (i.e. read mmap_sem). Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"), MADV_DONTNEED is the only one who may do page zapping in parallel and it doesn't remove page tables. But, the forementioned commit may do munmap() under read mmap_sem and free page tables. This causes a bug [1] reported by Jan Stancek since __tlb_reset_range() may pass the
Please don't _EVER_ refer to external sources to describe the actual bug a patch is fixing. That is the primary purpose of the Changelog.
Worse, the email you reference does _NOT_ describe the actual problem. Nor do you.
wrong page table state to architecture specific TLB flush operations.
Yikes. Is it actually safe to run free_pgtables() concurrently for a given mm?
Yeah.. sorta.. it's been a source of 'interesting' things. This really isn't the first issue here.
Also, change_protection_range() is 'fun' too.
So, removing __tlb_reset_range() sounds sane. This may cause more TLB flush for MADV_DONTNEED, but it should be not called very often, hence the impact should be negligible.
The original proposed fix came from Jan Stancek who mainly debugged this issue, I just wrapped up everything together.
I'm still paging the nested flush logic back in, but I have some comments on the patch below.
[1] https://lore.kernel.org/linux-mm/342bf1fd-f1bf-ed62-1127-e911b5032274@linux....
Reported-by: Jan Stancek jstancek@redhat.com Tested-by: Jan Stancek jstancek@redhat.com Cc: Will Deacon will.deacon@arm.com Cc: stable@vger.kernel.org Signed-off-by: Yang Shi yang.shi@linux.alibaba.com Signed-off-by: Jan Stancek jstancek@redhat.com
mm/mmu_gather.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1..9fd5272 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 	 * flush by batching, a thread has stable TLB entry can fail to flush
Urgh, we should rewrite this comment while we're here so that it makes sense...
Yeah, that's atrocious. We should put the actual race in there.
 	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
 	 * forcefully if we detect parallel PTE batching threads.
+	 *
+	 * munmap() may change mapping under non-excluse lock and also free
+	 * page tables.  Do not call __tlb_reset_range() for it.
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
+	if (mm_tlb_flush_nested(tlb->mm))
 		__tlb_adjust_range(tlb, start, end - start);
-	}
I don't think we can elide the call __tlb_reset_range() entirely, since I think we do want to clear the freed_pXX bits to ensure that we walk the range with the smallest mapping granule that we have. Otherwise couldn't we have a problem if we hit a PMD that had been cleared, but the TLB invalidation for the PTEs that used to be linked below it was still pending?
That's tlb->cleared_p*, and yes agreed. That is, right until some architecture has level dependent TLBI instructions, at which point we'll need to have them all set instead of cleared.
Perhaps we should just set fullmm if we see that here's a concurrent unmapper rather than do a worst-case range invalidation. Do you have a feeling for often the mm_tlb_flush_nested() triggers in practice?
Quite a bit for certain workloads I imagine, that was the whole point of doing it.
Anyway; am I correct in understanding that the actual problem is that we've cleared freed_tables and the ARM64 tlb_flush() will then not invalidate the cache and badness happens?
Because so far nobody has actually provided a coherent description of the actual problem we're trying to solve. But I'm thinking something like the below ought to do.
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..fe768f8d612e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
 	/*
-	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * Sensible comment goes here..
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
+		/*
+		 * Since we're can't tell what we actually should have
+		 * flushed flush everything in the given range.
+		 */
+		tlb->start = start;
+		tlb->end = end;
+		tlb->freed_tables = 1;
+		tlb->cleared_ptes = 1;
+		tlb->cleared_pmds = 1;
+		tlb->cleared_puds = 1;
+		tlb->cleared_p4ds = 1;
 	}
tlb_flush_mmu(tlb);
As a simple optimization, I think it is possible to hold multiple nesting counters in the mm, similar to tlb_flush_pending, for freed_tables, cleared_ptes, etc.
The first time you set tlb->freed_tables, you also atomically increase mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use mm->tlb_flush_freed_tables instead of tlb->freed_tables.
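(Purely to make that proposal concrete, a hypothetical sketch: the mm-side counter and the helper names below do not exist in the kernel and are invented for illustration only, and the race Peter points out in the reply applies to exactly this kind of scheme.)

	/*
	 * Hypothetical: mm->tlb_flush_freed_tables stands in for the per-mm
	 * counters Nadav describes, mirroring the per-gather hint.
	 */
	static inline void tlb_gather_freed_tables(struct mmu_gather *tlb)
	{
		if (!tlb->freed_tables) {
			tlb->freed_tables = 1;
			atomic_inc(&tlb->mm->tlb_flush_freed_tables);
		}
	}

	static inline bool tlb_mm_freed_tables(struct mmu_gather *tlb)
	{
		/* Any concurrent gather that freed tables is visible here. */
		return atomic_read(&tlb->mm->tlb_flush_freed_tables) != 0;
	}

	static inline void tlb_ungather_freed_tables(struct mmu_gather *tlb)
	{
		/* Called from tlb_finish_mmu() after the final flush. */
		if (tlb->freed_tables)
			atomic_dec(&tlb->mm->tlb_flush_freed_tables);
	}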
On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote:
On May 9, 2019, at 3:38 AM, Peter Zijlstra peterz@infradead.org wrote:
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..fe768f8d612e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
 	/*
-	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * Sensible comment goes here..
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
+		/*
+		 * Since we're can't tell what we actually should have
+		 * flushed flush everything in the given range.
+		 */
+		tlb->start = start;
+		tlb->end = end;
+		tlb->freed_tables = 1;
+		tlb->cleared_ptes = 1;
+		tlb->cleared_pmds = 1;
+		tlb->cleared_puds = 1;
+		tlb->cleared_p4ds = 1;
 	}
tlb_flush_mmu(tlb);
As a simple optimization, I think it is possible to hold multiple nesting counters in the mm, similar to tlb_flush_pending, for freed_tables, cleared_ptes, etc.
The first time you set tlb->freed_tables, you also atomically increase mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use mm->tlb_flush_freed_tables instead of tlb->freed_tables.
That sounds fraught with races and expensive; I would much prefer to not go there for this arguably rare case.
Consider such fun cases as where CPU-0 sees and clears a PTE while CPU-1 races and doesn't see that PTE. Therefore CPU-0 sets and counts cleared_ptes. Then, if CPU-1 flushes while CPU-0 is still in mmu_gather, it will see the cleared_ptes count increased and flush at that granularity; OTOH, if CPU-1 flushes after CPU-0 completes, it will not, and will potentially miss an invalidate it should have had.
This whole concurrent mmu_gather stuff is horrible.
/me ponders more....
So I think the fundamental race here is this:
  CPU-0                             CPU-1

  tlb_gather_mmu(.start=1,          tlb_gather_mmu(.start=2,
                 .end=3);                          .end=4);

  ptep_get_and_clear_full(2)
  tlb_remove_tlb_entry(2);
  __tlb_remove_page();
                                    if (pte_present(2)) // nope

                                    tlb_finish_mmu();

                                    // continue without TLBI(2)
                                    // whoopsie

  tlb_finish_mmu();
    tlb_flush() -> TLBI(2)
And we can fix that by having tlb_finish_mmu() sync up. Never let a concurrent tlb_finish_mmu() complete until all concurrent mmu_gathers have completed.
This should not be too hard to make happen.
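(A hypothetical sketch of that "sync up", only to show the shape of the idea. The helpers used here -- tlb_flush_mmu(), dec_tlb_flush_pending(), mm->tlb_flush_pending -- exist; the wait loop is the invented part, it is not a tested patch, and it deliberately ignores the scalability question raised below.)

	void tlb_finish_mmu(struct mmu_gather *tlb,
			    unsigned long start, unsigned long end)
	{
		...
		tlb_flush_mmu(tlb);
		dec_tlb_flush_pending(tlb->mm);

		/*
		 * Don't return while any other mmu_gather on this mm is still
		 * in flight; its pending TLBI may cover PTEs we observed as
		 * already clear and therefore skipped.
		 */
		while (atomic_read(&tlb->mm->tlb_flush_pending))
			cpu_relax();
		...
	}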
On 5/9/19 11:24 AM, Peter Zijlstra wrote:
On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote:
On May 9, 2019, at 3:38 AM, Peter Zijlstra peterz@infradead.org wrote:

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..fe768f8d612e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
 	/*
-	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * Sensible comment goes here..
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
+		/*
+		 * Since we're can't tell what we actually should have
+		 * flushed flush everything in the given range.
+		 */
+		tlb->start = start;
+		tlb->end = end;
+		tlb->freed_tables = 1;
+		tlb->cleared_ptes = 1;
+		tlb->cleared_pmds = 1;
+		tlb->cleared_puds = 1;
+		tlb->cleared_p4ds = 1;
 	}
tlb_flush_mmu(tlb);
As a simple optimization, I think it is possible to hold multiple nesting counters in the mm, similar to tlb_flush_pending, for freed_tables, cleared_ptes, etc.
The first time you set tlb->freed_tables, you also atomically increase mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use mm->tlb_flush_freed_tables instead of tlb->freed_tables.
That sounds fraught with races and expensive; I would much prefer to not go there for this arguably rare case.
Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1 races and doesn't see that PTE. Therefore CPU-0 sets and counts cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather, it will see cleared_ptes count increased and flush that granularity, OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentiall miss an invalidate it should have had.
This whole concurrent mmu_gather stuff is horrible.
/me ponders more....
So I think the fundamental race here is this:
  CPU-0                             CPU-1

  tlb_gather_mmu(.start=1,          tlb_gather_mmu(.start=2,
                 .end=3);                          .end=4);

  ptep_get_and_clear_full(2)
  tlb_remove_tlb_entry(2);
  __tlb_remove_page();
                                    if (pte_present(2)) // nope

                                    tlb_finish_mmu();

                                    // continue without TLBI(2)
                                    // whoopsie

  tlb_finish_mmu();
    tlb_flush() -> TLBI(2)
I'm not quite sure if this is the exact case Jan hit. But, according to his test, once the correct tlb->freed_tables and tlb->cleared_* are set, his test works well.
And we can fix that by having tlb_finish_mmu() sync up. Never let a concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers have completed.
Not sure if this will scale well.
This should not be too hard to make happen.
----- Original Message -----
On 5/9/19 11:24 AM, Peter Zijlstra wrote:
On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote:
On May 9, 2019, at 3:38 AM, Peter Zijlstra peterz@infradead.org wrote:

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..fe768f8d612e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
 	/*
-	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * Sensible comment goes here..
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
+		/*
+		 * Since we're can't tell what we actually should have
+		 * flushed flush everything in the given range.
+		 */
+		tlb->start = start;
+		tlb->end = end;
+		tlb->freed_tables = 1;
+		tlb->cleared_ptes = 1;
+		tlb->cleared_pmds = 1;
+		tlb->cleared_puds = 1;
+		tlb->cleared_p4ds = 1;
 	}
tlb_flush_mmu(tlb);
As a simple optimization, I think it is possible to hold multiple nesting counters in the mm, similar to tlb_flush_pending, for freed_tables, cleared_ptes, etc.
The first time you set tlb->freed_tables, you also atomically increase mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use mm->tlb_flush_freed_tables instead of tlb->freed_tables.
That sounds fraught with races and expensive; I would much prefer to not go there for this arguably rare case.
Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1 races and doesn't see that PTE. Therefore CPU-0 sets and counts cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather, it will see cleared_ptes count increased and flush that granularity, OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentiall miss an invalidate it should have had.
This whole concurrent mmu_gather stuff is horrible.
/me ponders more....
So I think the fundamental race here is this:
  CPU-0                             CPU-1

  tlb_gather_mmu(.start=1,          tlb_gather_mmu(.start=2,
                 .end=3);                          .end=4);

  ptep_get_and_clear_full(2)
  tlb_remove_tlb_entry(2);
  __tlb_remove_page();
                                    if (pte_present(2)) // nope

                                    tlb_finish_mmu();

                                    // continue without TLBI(2)
                                    // whoopsie

  tlb_finish_mmu();
    tlb_flush() -> TLBI(2)
I'm not quite sure if this is the case Jan really met. But, according to his test, once correct tlb->freed_tables and tlb->cleared_* are set, his test works well.
My theory was the following sequence:
t1: map_write_unmap()                    t2: dummy()

    map_address = mmap()
    map_address[i] = 'b'
    munmap(map_address)
      downgrade_write(&mm->mmap_sem);
      unmap_region()
        tlb_gather_mmu()
          inc_tlb_flush_pending(tlb->mm);
        free_pgtables()
          tlb->freed_tables = 1
          tlb->cleared_pmds = 1

                                             pthread_exit()
                                               madvise(thread_stack, 8M, MADV_DONTNEED)
                                                 zap_page_range()
                                                   tlb_gather_mmu()
                                                     inc_tlb_flush_pending(tlb->mm);

        tlb_finish_mmu()
          if (mm_tlb_flush_nested(tlb->mm))
            __tlb_reset_range()
              tlb->freed_tables = 0
              tlb->cleared_pmds = 0
          __flush_tlb_range(last_level = 0)
          ...
    map_address = mmap()
    map_address[i] = 'b'
      <page fault loop>

    # PTE appeared valid to me,
    # so I suspected stale TLB entry at higher level as result of "freed_tables = 0"
I'm happy to apply/run any debug patches to get more data that would help.
And we can fix that by having tlb_finish_mmu() sync up. Never let a concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers have completed.
Not sure if this will scale well.
This should not be too hard to make happen.
On 5/9/19 2:06 PM, Jan Stancek wrote:
----- Original Message -----
On 5/9/19 11:24 AM, Peter Zijlstra wrote:
On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote:
On May 9, 2019, at 3:38 AM, Peter Zijlstra peterz@infradead.org wrote:

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..fe768f8d612e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
 	/*
-	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * Sensible comment goes here..
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
+		/*
+		 * Since we're can't tell what we actually should have
+		 * flushed flush everything in the given range.
+		 */
+		tlb->start = start;
+		tlb->end = end;
+		tlb->freed_tables = 1;
+		tlb->cleared_ptes = 1;
+		tlb->cleared_pmds = 1;
+		tlb->cleared_puds = 1;
+		tlb->cleared_p4ds = 1;
 	}
tlb_flush_mmu(tlb);
As a simple optimization, I think it is possible to hold multiple nesting counters in the mm, similar to tlb_flush_pending, for freed_tables, cleared_ptes, etc.
The first time you set tlb->freed_tables, you also atomically increase mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use mm->tlb_flush_freed_tables instead of tlb->freed_tables.
That sounds fraught with races and expensive; I would much prefer to not go there for this arguably rare case.
Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1 races and doesn't see that PTE. Therefore CPU-0 sets and counts cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather, it will see cleared_ptes count increased and flush that granularity, OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentiall miss an invalidate it should have had.
This whole concurrent mmu_gather stuff is horrible.
/me ponders more....
So I think the fundamental race here is this:
  CPU-0                             CPU-1

  tlb_gather_mmu(.start=1,          tlb_gather_mmu(.start=2,
                 .end=3);                          .end=4);

  ptep_get_and_clear_full(2)
  tlb_remove_tlb_entry(2);
  __tlb_remove_page();
                                    if (pte_present(2)) // nope

                                    tlb_finish_mmu();

                                    // continue without TLBI(2)
                                    // whoopsie

  tlb_finish_mmu();
    tlb_flush() -> TLBI(2)
I'm not quite sure if this is the case Jan really met. But, according to his test, once correct tlb->freed_tables and tlb->cleared_* are set, his test works well.
My theory was following sequence:
t1: map_write_unmap()                    t2: dummy()

    map_address = mmap()
    map_address[i] = 'b'
    munmap(map_address)
      downgrade_write(&mm->mmap_sem);
      unmap_region()
        tlb_gather_mmu()
          inc_tlb_flush_pending(tlb->mm);
        free_pgtables()
          tlb->freed_tables = 1
          tlb->cleared_pmds = 1

                                             pthread_exit()
                                               madvise(thread_stack, 8M, MADV_DONTNEED)
I'm not quite familiar with the implementation details of pthread_exit(); does pthread_exit() call MADV_DONTNEED all the time? I don't see your test call it directly. If so, this pattern is definitely possible.
                                                 zap_page_range()
                                                   tlb_gather_mmu()
                                                     inc_tlb_flush_pending(tlb->mm);

        tlb_finish_mmu()
          if (mm_tlb_flush_nested(tlb->mm))
            __tlb_reset_range()
              tlb->freed_tables = 0
              tlb->cleared_pmds = 0
          __flush_tlb_range(last_level = 0)
          ...
    map_address = mmap()
    map_address[i] = 'b'
      <page fault loop>

    # PTE appeared valid to me,
    # so I suspected stale TLB entry at higher level as result of "freed_tables = 0"
I'm happy to apply/run any debug patches to get more data that would help.
And we can fix that by having tlb_finish_mmu() sync up. Never let a concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers have completed.
Not sure if this will scale well.
This should not be too hard to make happen.
----- Original Message -----
On 5/9/19 2:06 PM, Jan Stancek wrote:
----- Original Message -----
On 5/9/19 11:24 AM, Peter Zijlstra wrote:
On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote:
On May 9, 2019, at 3:38 AM, Peter Zijlstra peterz@infradead.org wrote:

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..fe768f8d612e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
 	/*
-	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * Sensible comment goes here..
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
+		/*
+		 * Since we're can't tell what we actually should have
+		 * flushed flush everything in the given range.
+		 */
+		tlb->start = start;
+		tlb->end = end;
+		tlb->freed_tables = 1;
+		tlb->cleared_ptes = 1;
+		tlb->cleared_pmds = 1;
+		tlb->cleared_puds = 1;
+		tlb->cleared_p4ds = 1;
 	}
tlb_flush_mmu(tlb);
As a simple optimization, I think it is possible to hold multiple nesting counters in the mm, similar to tlb_flush_pending, for freed_tables, cleared_ptes, etc.
The first time you set tlb->freed_tables, you also atomically increase mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use mm->tlb_flush_freed_tables instead of tlb->freed_tables.
That sounds fraught with races and expensive; I would much prefer to not go there for this arguably rare case.
Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1 races and doesn't see that PTE. Therefore CPU-0 sets and counts cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather, it will see cleared_ptes count increased and flush that granularity, OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentiall miss an invalidate it should have had.
This whole concurrent mmu_gather stuff is horrible.
/me ponders more....
So I think the fundamental race here is this:
  CPU-0                             CPU-1

  tlb_gather_mmu(.start=1,          tlb_gather_mmu(.start=2,
                 .end=3);                          .end=4);

  ptep_get_and_clear_full(2)
  tlb_remove_tlb_entry(2);
  __tlb_remove_page();
                                    if (pte_present(2)) // nope

                                    tlb_finish_mmu();

                                    // continue without TLBI(2)
                                    // whoopsie

  tlb_finish_mmu();
    tlb_flush() -> TLBI(2)
I'm not quite sure if this is the case Jan really met. But, according to his test, once correct tlb->freed_tables and tlb->cleared_* are set, his test works well.
My theory was following sequence:
t1: map_write_unmap()                    t2: dummy()

    map_address = mmap()
    map_address[i] = 'b'
    munmap(map_address)
      downgrade_write(&mm->mmap_sem);
      unmap_region()
        tlb_gather_mmu()
          inc_tlb_flush_pending(tlb->mm);
        free_pgtables()
          tlb->freed_tables = 1
          tlb->cleared_pmds = 1

                                             pthread_exit()
                                               madvise(thread_stack, 8M, MADV_DONTNEED)
I'm not quite familiar with the implementation detail of pthread_exit(), does pthread_exit() call MADV_DONTNEED all the time? I don't see your test call it.
It's called by glibc: https://sourceware.org/git/?p=glibc.git%3Ba=blob%3Bf=nptl/allocatestack.c%3B... https://sourceware.org/git/?p=glibc.git%3Ba=blob%3Bf=nptl/pthread_create.c%3...
(gdb) bt
#0  madvise () at ../sysdeps/unix/syscall-template.S:78
#1  0x0000ffffbe7679f8 in advise_stack_range (guardsize=<optimized out>, pd=281474976706191,
        size=<optimized out>, mem=0xffffbddd0000) at allocatestack.c:392
#2  start_thread (arg=0xffffffffee8f) at pthread_create.c:576
#3  0x0000ffffbe6b157c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
Dump of assembler code for function madvise:
=> 0x0000ffffbe6adaf0 <+0>:	mov	x8, #0xe9			// #233
   0x0000ffffbe6adaf4 <+4>:	svc	#0x0
   0x0000ffffbe6adaf8 <+8>:	cmn	x0, #0xfff
   0x0000ffffbe6adafc <+12>:	b.cs	0xffffbe6adb04 <madvise+20>	// b.hs, b.nlast
   0x0000ffffbe6adb00 <+16>:	ret
   0x0000ffffbe6adb04 <+20>:	b	0xffffbe600e18 <__GI___syscall_error>
If so this pattern is definitely possible.
                                                 zap_page_range()
                                                   tlb_gather_mmu()
                                                     inc_tlb_flush_pending(tlb->mm);

        tlb_finish_mmu()
          if (mm_tlb_flush_nested(tlb->mm))
            __tlb_reset_range()
              tlb->freed_tables = 0
              tlb->cleared_pmds = 0
          __flush_tlb_range(last_level = 0)
          ...
    map_address = mmap()
    map_address[i] = 'b'
      <page fault loop>

    # PTE appeared valid to me,
    # so I suspected stale TLB entry at higher level as result of "freed_tables = 0"
I'm happy to apply/run any debug patches to get more data that would help.
And we can fix that by having tlb_finish_mmu() sync up. Never let a concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers have completed.
Not sure if this will scale well.
This should not be too hard to make happen.
On 5/9/19 3:38 AM, Peter Zijlstra wrote:
On Thu, May 09, 2019 at 09:37:26AM +0100, Will Deacon wrote:
Hi all, [+Peter]
Right, mm/mmu_gather.c has a MAINTAINERS entry; use it.
Sorry for that, I didn't realize we have mmu_gather maintainers. I should have run get_maintainer.pl.
Also added Nadav and Minchan who've poked at this issue before. And Mel, because he loves these things :-)
Apologies for the delay; I'm attending a conference this week so it's tricky to keep up with email.
On Wed, May 08, 2019 at 05:34:49AM +0800, Yang Shi wrote:
A few new fields were added to mmu_gather to make TLB flush smarter for huge page by telling what level of page table is changed.
__tlb_reset_range() is used to reset all these page table state to unchanged, which is called by TLB flush for parallel mapping changes for the same range under non-exclusive lock (i.e. read mmap_sem). Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"), MADV_DONTNEED is the only one who may do page zapping in parallel and it doesn't remove page tables. But, the forementioned commit may do munmap() under read mmap_sem and free page tables. This causes a bug [1] reported by Jan Stancek since __tlb_reset_range() may pass the
Please don't _EVER_ refer to external sources to describe the actual bug a patch is fixing. That is the primary purpose of the Changelog.
Worse, the email you reference does _NOT_ describe the actual problem. Nor do you.
Sure, will articulate the real bug in the commit log.
wrong page table state to architecture specific TLB flush operations.
Yikes. Is it actually safe to run free_pgtables() concurrently for a given mm?
Yeah.. sorta.. it's been a source of 'interesting' things. This really isn't the first issue here.
Also, change_protection_range() is 'fun' too.
So, removing __tlb_reset_range() sounds sane. This may cause more TLB flush for MADV_DONTNEED, but it should be not called very often, hence the impact should be negligible.
The original proposed fix came from Jan Stancek who mainly debugged this issue, I just wrapped up everything together.
I'm still paging the nested flush logic back in, but I have some comments on the patch below.
[1] https://lore.kernel.org/linux-mm/342bf1fd-f1bf-ed62-1127-e911b5032274@linux....
Reported-by: Jan Stancek jstancek@redhat.com Tested-by: Jan Stancek jstancek@redhat.com Cc: Will Deacon will.deacon@arm.com Cc: stable@vger.kernel.org Signed-off-by: Yang Shi yang.shi@linux.alibaba.com Signed-off-by: Jan Stancek jstancek@redhat.com
mm/mmu_gather.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1..9fd5272 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 	 * flush by batching, a thread has stable TLB entry can fail to flush
Urgh, we should rewrite this comment while we're here so that it makes sense...
Yeah, that's atrocious. We should put the actual race in there.
 	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
 	 * forcefully if we detect parallel PTE batching threads.
+	 *
+	 * munmap() may change mapping under non-excluse lock and also free
+	 * page tables.  Do not call __tlb_reset_range() for it.
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
+	if (mm_tlb_flush_nested(tlb->mm))
 		__tlb_adjust_range(tlb, start, end - start);
-	}
I don't think we can elide the call __tlb_reset_range() entirely, since I think we do want to clear the freed_pXX bits to ensure that we walk the range with the smallest mapping granule that we have. Otherwise couldn't we have a problem if we hit a PMD that had been cleared, but the TLB invalidation for the PTEs that used to be linked below it was still pending?
That's tlb->cleared_p*, and yes agreed. That is, right until some architecture has level dependent TLBI instructions, at which point we'll need to have them all set instead of cleared.
Perhaps we should just set fullmm if we see that here's a concurrent unmapper rather than do a worst-case range invalidation. Do you have a feeling for often the mm_tlb_flush_nested() triggers in practice?
Quite a bit for certain workloads I imagine, that was the whole point of doing it.
Anyway; am I correct in understanding that the actual problem is that we've cleared freed_tables and the ARM64 tlb_flush() will then not invalidate the cache and badness happens?
Yes.
Because so far nobody has actually provided a coherent description of the actual problem we're trying to solve. But I'm thinking something like the below ought to do.
Thanks for the suggestion, will do in v2.
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..fe768f8d612e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
 	/*
-	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * Sensible comment goes here..
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
+		/*
+		 * Since we're can't tell what we actually should have
+		 * flushed flush everything in the given range.
+		 */
+		tlb->start = start;
+		tlb->end = end;
+		tlb->freed_tables = 1;
+		tlb->cleared_ptes = 1;
+		tlb->cleared_pmds = 1;
+		tlb->cleared_puds = 1;
+		tlb->cleared_p4ds = 1;
 	}
tlb_flush_mmu(tlb);
On Thu, May 09, 2019 at 12:38:13PM +0200, Peter Zijlstra wrote:
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..fe768f8d612e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
 	/*
-	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * Sensible comment goes here..
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
+		/*
+		 * Since we're can't tell what we actually should have
+		 * flushed flush everything in the given range.
+		 */
+		tlb->start = start;
+		tlb->end = end;
+		tlb->freed_tables = 1;
+		tlb->cleared_ptes = 1;
+		tlb->cleared_pmds = 1;
+		tlb->cleared_puds = 1;
+		tlb->cleared_p4ds = 1;
 	}
tlb_flush_mmu(tlb);
So PPC-radix has page-size dependent TLBI, but the above doesn't work for them, because they use the tlb_change_page_size() interface and don't look at tlb->cleared_p*.
Consequently, they have their own special magic :/
Nick, how about you use the tlb_change_page_size() interface to find/flush on the page-size boundaries, but otherwise use the tlb->cleared_p* flags to select which actual sizes to flush?
AFAICT that should work just fine for you guys. Maybe something like so?
(fwiw, there's an awful lot of almost-identical functions there)
---
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 6a23b9ebd2a1..efc99ef78db6 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -692,7 +692,7 @@ static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_
 
 static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 					unsigned long start, unsigned long end,
-					bool flush_all_sizes)
+					bool pflush, bool hflush, bool gflush)
 {
 	unsigned long pid;
@@ -734,14 +734,9 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 			_tlbie_pid(pid, RIC_FLUSH_TLB);
 		}
 	} else {
-		bool hflush = flush_all_sizes;
-		bool gflush = flush_all_sizes;
 		unsigned long hstart, hend;
 		unsigned long gstart, gend;
 
-		if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
-			hflush = true;
-
 		if (hflush) {
 			hstart = (start + PMD_SIZE - 1) & PMD_MASK;
 			hend = end & PMD_MASK;
@@ -758,7 +753,9 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 
 		asm volatile("ptesync": : :"memory");
 		if (local) {
-			__tlbiel_va_range(start, end, pid, page_size, mmu_virtual_psize);
+			if (pflush)
+				__tlbiel_va_range(start, end, pid,
+						  page_size, mmu_virtual_psize);
 			if (hflush)
 				__tlbiel_va_range(hstart, hend, pid,
 						PMD_SIZE, MMU_PAGE_2M);
@@ -767,7 +764,9 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 						PUD_SIZE, MMU_PAGE_1G);
 			asm volatile("ptesync": : :"memory");
 		} else {
-			__tlbie_va_range(start, end, pid, page_size, mmu_virtual_psize);
+			if (pflush)
+				__tlbie_va_range(start, end, pid,
+						  page_size, mmu_virtual_psize);
 			if (hflush)
 				__tlbie_va_range(hstart, hend, pid,
 						PMD_SIZE, MMU_PAGE_2M);
@@ -785,12 +784,17 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 		     unsigned long end)
 {
+	bool hflush = false;
+
 #ifdef CONFIG_HUGETLB_PAGE
 	if (is_vm_hugetlb_page(vma))
 		return radix__flush_hugetlb_tlb_range(vma, start, end);
 #endif
 
-	__radix__flush_tlb_range(vma->vm_mm, start, end, false);
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		hflush = true;
+
+	__radix__flush_tlb_range(vma->vm_mm, start, end, true, hflush, false);
 }
 EXPORT_SYMBOL(radix__flush_tlb_range);
 
@@ -881,49 +885,14 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 	 */
 	if (tlb->fullmm) {
 		__flush_all_mm(mm, true);
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
-	} else if (mm_tlb_flush_nested(mm)) {
-		/*
-		 * If there is a concurrent invalidation that is clearing ptes,
-		 * then it's possible this invalidation will miss one of those
-		 * cleared ptes and miss flushing the TLB. If this invalidate
-		 * returns before the other one flushes TLBs, that can result
-		 * in it returning while there are still valid TLBs inside the
-		 * range to be invalidated.
-		 *
-		 * See mm/memory.c:tlb_finish_mmu() for more details.
-		 *
-		 * The solution to this is ensure the entire range is always
-		 * flushed here. The problem for powerpc is that the flushes
-		 * are page size specific, so this "forced flush" would not
-		 * do the right thing if there are a mix of page sizes in
-		 * the range to be invalidated. So use __flush_tlb_range
-		 * which invalidates all possible page sizes in the range.
-		 *
-		 * PWC flush probably is not be required because the core code
-		 * shouldn't free page tables in this path, but accounting
-		 * for the possibility makes us a bit more robust.
-		 *
-		 * need_flush_all is an uncommon case because page table
-		 * teardown should be done with exclusive locks held (but
-		 * after locks are dropped another invalidate could come
-		 * in), it could be optimized further if necessary.
-		 */
-		if (!tlb->need_flush_all)
-			__radix__flush_tlb_range(mm, start, end, true);
-		else
-			radix__flush_all_mm(mm);
-#endif
-	} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
-		if (!tlb->need_flush_all)
-			radix__flush_tlb_mm(mm);
-		else
-			radix__flush_all_mm(mm);
 	} else {
 		if (!tlb->need_flush_all)
-			radix__flush_tlb_range_psize(mm, start, end, psize);
+			__radix__flush_tlb_range(mm, start, end,
+					tlb->cleared_pte,
+					tlb->cleared_pmd,
+					tlb->cleared_pud);
 		else
-			radix__flush_tlb_pwc_range_psize(mm, start, end, psize);
+			radix__flush_all_mm(mm);
 	}
 	tlb->need_flush_all = 0;
 }