On the following path, flush_tlb_range() can be used for zapping normal PMD entries (PMD entries that point to page tables) together with the PTE entries in the pointed-to page table:
  collapse_pte_mapped_thp
    pmdp_collapse_flush
      flush_tlb_range
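For context, the generic pmdp_collapse_flush() in mm/pgtable-generic.c
looks roughly like the following (simplified sketch; the exact code
differs between kernel versions):

pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
			  pmd_t *pmdp)
{
	pmd_t pmd;

	/* Clear the PMD entry that points to the PTE page table. */
	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
	/*
	 * Flush the covered range; since a non-leaf (page table) entry
	 * was zapped, this flush must also invalidate paging-structure
	 * caches, not just last-level translations.
	 */
	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	return pmd;
}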
The arm64 version of flush_tlb_range() has a comment describing that it can be used for page table removal, and does not use any last-level invalidation optimizations. Fix the X86 version by making it behave the same way.
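(For comparison, arm64's __flush_tlb_range() selects the TLBI operation
based on a last_level flag; simplified sketch, the exact macros vary
across kernel versions:

	/* inside the per-page loop of arm64's __flush_tlb_range(): */
	if (last_level)
		__tlbi(vale1is, addr);	/* leaf entries only; walk caches survive */
	else
		__tlbi(vae1is, addr);	/* also invalidates walk-cache entries */

flush_tlb_range() passes last_level == false, so it is safe to use when
page tables are removed.)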
Currently, X86 only uses this information for the following two purposes, which I think means the issue doesn't have much impact:
- In native_flush_tlb_multi(), for checking whether lazy TLB CPUs need
  to be IPI'd to avoid issues with speculative page table walks (see
  the sketch below).
- In Hyper-V TLB paravirtualization, again for lazy TLB stuff.
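For illustration, the relevant logic in native_flush_tlb_multi() looks
roughly like this (simplified sketch; the helper names vary across
kernel versions):

static void native_flush_tlb_multi(const struct cpumask *cpumask,
				   const struct flush_tlb_info *info)
{
	/*
	 * If page tables were freed, CPUs in lazy TLB mode must be IPI'd
	 * too: they can still speculatively walk the page tables that are
	 * about to be (or have been) freed.
	 */
	if (info->freed_tables)
		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
	else
		on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func,
				      (void *)info, 1, cpumask);
}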
The patch "x86/mm: only invalidate final translations with INVLPGB" which is currently under review (see https://lore.kernel.org/all/20241230175550.4046587-13-riel@surriel.com/) would probably be making the impact of this a lot worse.
Cc: stable@vger.kernel.org
Fixes: 016c4d92cd16 ("x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range")
Signed-off-by: Jann Horn <jannh@google.com>
---
 arch/x86/include/asm/tlbflush.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 02fc2aa06e9e0ecdba3fe948cafe5892b72e86c0..3da645139748538daac70166618d8ad95116eb74 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -242,7 +242,7 @@ void flush_tlb_multi(const struct cpumask *cpumask,
 	flush_tlb_mm_range((vma)->vm_mm, start, end,		\
 			((vma)->vm_flags & VM_HUGETLB)		\
 				? huge_page_shift(hstate_vma(vma))	\
-				: PAGE_SHIFT, false)
+				: PAGE_SHIFT, true)
 
 extern void flush_tlb_all(void);
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
---
base-commit: aa135d1d0902c49ed45bec98c61c1b4022652b7e
change-id: 20250103-x86-collapse-flush-fix-fa87ac4d5834
On Fri, 2025-01-03 at 19:39 +0100, Jann Horn wrote:
> index 02fc2aa06e9e0ecdba3fe948cafe5892b72e86c0..3da645139748538daac70166618d8ad95116eb74 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -242,7 +242,7 @@ void flush_tlb_multi(const struct cpumask *cpumask,
> 	flush_tlb_mm_range((vma)->vm_mm, start, end,		\
> 			((vma)->vm_flags & VM_HUGETLB)		\
> 				? huge_page_shift(hstate_vma(vma))	\
> -				: PAGE_SHIFT, false)
> +				: PAGE_SHIFT, true)

The code looks good, but should this macro get a comment indicating
that code that only frees pages, but not page tables, should be
calling flush_tlb() instead?
On Fri, Jan 3, 2025 at 10:55 PM Rik van Riel <riel@surriel.com> wrote:
> On Fri, 2025-01-03 at 19:39 +0100, Jann Horn wrote:
> > index 02fc2aa06e9e0ecdba3fe948cafe5892b72e86c0..3da645139748538daac70166618d8ad95116eb74 100644
> > --- a/arch/x86/include/asm/tlbflush.h
> > +++ b/arch/x86/include/asm/tlbflush.h
> > @@ -242,7 +242,7 @@ void flush_tlb_multi(const struct cpumask *cpumask,
> > 	flush_tlb_mm_range((vma)->vm_mm, start, end,		\
> > 			((vma)->vm_flags & VM_HUGETLB)		\
> > 				? huge_page_shift(hstate_vma(vma))	\
> > -				: PAGE_SHIFT, false)
> > +				: PAGE_SHIFT, true)
>
> The code looks good, but should this macro get a comment indicating
> that code that only frees pages, but not page tables, should be
> calling flush_tlb() instead?
Documentation/core-api/cachetlb.rst seems to be the common place that's
supposed to document the rules - the macro I'm touching is just the x86
implementation. (The arm64 implementation also has some fairly extensive
comments that say flush_tlb_range() "also invalidates any walk-cache
entries associated with translations for the specified address range"
while flush_tlb_page() "only invalidates a single, last-level page-table
entry and therefore does not affect any walk-caches".)

I wouldn't want to add yet more documentation for this API inside the
x86 code. I guess it would make sense to add pointers from the x86 code
to the documentation (and copy the details about last-level TLBs from
the arm64 code into the docs).
I don't see a function flush_tlb() outside of some (non-x86) arch code.
I don't know if it makes sense to tell developers to not use flush_tlb_range() for freeing pages. If the performance of flush_tlb_range() actually is an issue, I guess one fix would be to refactor this and add a parameter or something?
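Something like this, maybe (purely hypothetical sketch; flush_tlb_range_ex()
is a made-up name, not an existing kernel interface):

/* Hypothetical: let callers state whether page tables were freed. */
#define flush_tlb_range_ex(vma, start, end, freed_tables)		\
	flush_tlb_mm_range((vma)->vm_mm, start, end,			\
			((vma)->vm_flags & VM_HUGETLB)			\
				? huge_page_shift(hstate_vma(vma))	\
				: PAGE_SHIFT, freed_tables)

/* Plain flush_tlb_range() would then stay conservative: */
#define flush_tlb_range(vma, start, end)				\
	flush_tlb_range_ex(vma, start, end, true)

Callers that are known to only zap leaf PTEs could then pass
freed_tables=false explicitly.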
On Fri, 2025-01-03 at 23:11 +0100, Jann Horn wrote:
> On Fri, Jan 3, 2025 at 10:55 PM Rik van Riel <riel@surriel.com> wrote:
> > On Fri, 2025-01-03 at 19:39 +0100, Jann Horn wrote:
> > > index 02fc2aa06e9e0ecdba3fe948cafe5892b72e86c0..3da645139748538daac70166618d8ad95116eb74 100644
> > > --- a/arch/x86/include/asm/tlbflush.h
> > > +++ b/arch/x86/include/asm/tlbflush.h
> > > @@ -242,7 +242,7 @@ void flush_tlb_multi(const struct cpumask *cpumask,
> > > 	flush_tlb_mm_range((vma)->vm_mm, start, end,		\
> > > 			((vma)->vm_flags & VM_HUGETLB)		\
> > > 				? huge_page_shift(hstate_vma(vma))	\
> > > -				: PAGE_SHIFT, false)
> > > +				: PAGE_SHIFT, true)
> >
> > The code looks good, but should this macro get a comment indicating
> > that code that only frees pages, but not page tables, should be
> > calling flush_tlb() instead?
>
> Documentation/core-api/cachetlb.rst seems to be the common place
> that's supposed to document the rules - the macro I'm touching is just
> the x86 implementation. (The arm64 implementation also has some fairly
> extensive comments that say flush_tlb_range() "also invalidates any
> walk-cache entries associated with translations for the specified
> address range" while flush_tlb_page() "only invalidates a single,
> last-level page-table entry and therefore does not affect any
> walk-caches".) I wouldn't want to add yet more documentation for this
> API inside the x86 code. I guess it would make sense to add pointers
> from the x86 code to the documentation (and copy the details about
> last-level TLBs from the arm64 code into the docs).
>
> I don't see a function flush_tlb() outside of some (non-x86) arch code.
I see zap_pte_range() calling tlb_flush_mmu(), which calls tlb_flush_mmu_tlbonly() in include/asm-generic/tlb.h, which in turn calls tlb_flush().
The asm-generic version of tlb_flush() goes through flush_tlb_mm(),
which on x86 would call flush_tlb_mm_range() with freed_tables = true.
Luckily x86 seems to have its own implementation of tlb_flush(), which avoids that issue.
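For reference, the x86 tlb_flush() in arch/x86/include/asm/tlb.h is
roughly the following (simplified sketch; details differ between kernel
versions):

static inline void tlb_flush(struct mmu_gather *tlb)
{
	unsigned long start = 0UL, end = TLB_FLUSH_ALL;
	unsigned int stride_shift = tlb_get_unmap_shift(tlb);

	if (!tlb->fullmm && !tlb->need_flush_all) {
		start = tlb->start;
		end = tlb->end;
	}

	/* Pass the tracked state instead of unconditionally saying true. */
	flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
			   tlb->freed_tables);
}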
> I don't know if it makes sense to tell developers to not use
> flush_tlb_range() for freeing pages. If the performance of
> flush_tlb_range() actually is an issue, I guess one fix would be to
> refactor this and add a parameter or something?
I don't know whether this is a real issue on architectures other than x86.
For now it looks like the code does the right thing when only pages are being freed, so we may not need that parameter.