There are several issues in pud_free_pmd_page() and pmd_free_pte_page(), which are used by vmalloc:
1. flush_tlb_kernel_range() does not flush paging-structure caches for
   inactive PCIDs, but we're using it when freeing kernel page tables.

2. flush_tlb_kernel_range() only flushes paging-structure cache entries
   that intersect with the flush range, and pud_free_pmd_page() passes in
   the first PAGE_SIZE bytes of the area covered by the PMD table; but
   pud_free_pmd_page() is actually also designed to remove PTE tables that
   were installed in the PMD table, so we need to flush those as well.

3. [theoretical issue] invlpgb_kernel_range_flush() expects an exclusive
   range; it does:

     nr = (info->end - addr) >> PAGE_SHIFT;

   But pmd_free_pte_page() and pud_free_pmd_page() provide inclusive
   ranges (addr, addr + PAGE_SIZE-1). See the sketch below for the
   resulting off-by-one.
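To make issue 3 concrete, here is a small user-space demonstration of the
off-by-one (a standalone sketch; PAGE_SHIFT and PAGE_SIZE are redefined
locally purely for illustration):

  #include <stdio.h>

  #define PAGE_SHIFT 12
  #define PAGE_SIZE  (1UL << PAGE_SHIFT)

  int main(void)
  {
  	/* an arbitrary example address in the vmalloc area */
  	unsigned long addr = 0xffffc90000000000UL;

  	/* the computation invlpgb_kernel_range_flush() performs: */
  	unsigned long nr_inclusive = ((addr + PAGE_SIZE - 1) - addr) >> PAGE_SHIFT;
  	unsigned long nr_exclusive = ((addr + PAGE_SIZE) - addr) >> PAGE_SHIFT;

  	printf("inclusive end: nr = %lu\n", nr_inclusive); /* prints 0 */
  	printf("exclusive end: nr = %lu\n", nr_exclusive); /* prints 1 */
  	return 0;
  }

With the inclusive end that the callers currently pass, nr comes out as 0
pages instead of 1.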
To fix it:
1. Add a new helper flush_tlb_kernel_pgtable_range() for flushing
   paging-structure caches for kernel page tables, and use that.

2. Flush PUD_SIZE instead of PAGE_SIZE in pud_free_pmd_page().

3. Remove the `-1` in the end address calculations.
Note that since I'm not touching invlpgb_kernel_range_flush() or invlpgb_flush_addr_nosync() in this patch, the flush on PMD table deletion with INVLPGB might be a bit slow due to using PTE_STRIDE instead of PMD_STRIDE. I don't think that matters.
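If that ever becomes a problem, a follow-up could add a PMD-stride variant
along these lines (purely a hypothetical sketch; invlpgb_flush_addr_pmd_nosync()
is an invented name mirroring the invlpgb_flush_addr_nosync() touched below):

  /* Flush addr with PMD granularity, including globals, for all PCIDs. */
  static inline void invlpgb_flush_addr_pmd_nosync(unsigned long addr, u16 nr)
  {
  	/* PMD_STRIDE: each of the nr entries covers a PMD-sized region */
  	__invlpgb(0, 0, addr, nr, PMD_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
  }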
Backport notes: Kernels before 6.16 don't have invlpgb_kernel_range_flush(); for them, kernel_tlb_flush_pgtable() should just unconditionally call flush_tlb_all(). I am marking this as fixing commit 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces"); that is technically incorrect, since back then no paging-structure cache invalidations were performed at all, but probably makes the most sense here.
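For such a backport, the INVLPGB branch of the new helper simply goes away;
a sketch of what the pre-6.16 version would look like (untested, and it
assumes the rest of the patch applies):

  static void kernel_tlb_flush_pgtable(struct flush_tlb_info *info)
  {
  	/*
  	 * Without invlpgb_kernel_range_flush(), there is no way to target
  	 * paging-structure caches across all PCIDs; throw everything away.
  	 */
  	flush_tlb_all();
  }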
Cc: stable@vger.kernel.org
Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
Signed-off-by: Jann Horn <jannh@google.com>
---
 arch/x86/include/asm/tlb.h      |  4 ++++
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/pgtable.c           | 13 +++++++++----
 arch/x86/mm/tlb.c               | 37 +++++++++++++++++++++++++++++++++++++
 4 files changed, 51 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..56a4b93a0742 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -153,6 +153,10 @@ static inline void invlpgb_flush_all(void)
 /* Flush addr, including globals, for all PCIDs. */
 static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
 {
+	/*
+	 * Don't set INVLPGB_FLAG_FINAL_ONLY here without adjusting
+	 * kernel_tlb_flush_pgtable().
+	 */
 	__invlpgb(0, 0, addr, nr, PTE_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
 }
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index e9b81876ebe4..06a4a2b7a9f5 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -318,6 +318,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+void flush_tlb_kernel_pgtable_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 62777ba4de1a..0779ed02226c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -745,8 +745,13 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
 
 	pud_clear(pud);
 
-	/* INVLPG to clear all paging-structure caches */
-	flush_tlb_kernel_range(addr, addr + PAGE_SIZE-1);
+	/*
+	 * Clear paging-structure caches.
+	 * Note that since this function can remove a PMD table together with the
+	 * PTE tables it points to, we can't just flush the first PAGE_SIZE, we
+	 * must flush PUD_SIZE!
+	 */
+	flush_tlb_kernel_pgtable_range(addr, addr + PUD_SIZE);
 
 	for (i = 0; i < PTRS_PER_PMD; i++) {
 		if (!pmd_none(pmd_sv[i])) {
@@ -778,8 +783,8 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 	pte = (pte_t *)pmd_page_vaddr(*pmd);
 	pmd_clear(pmd);
 
-	/* INVLPG to clear all paging-structure caches */
-	flush_tlb_kernel_range(addr, addr + PAGE_SIZE-1);
+	/* clear paging-structure caches */
+	flush_tlb_kernel_pgtable_range(addr, addr + PAGE_SIZE);
 
 	free_page((unsigned long)pte);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f1..26b78451a7ed 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1525,6 +1525,22 @@ static void kernel_tlb_flush_range(struct flush_tlb_info *info)
 		on_each_cpu(do_kernel_range_flush, info, 1);
 }
 
+static void kernel_tlb_flush_pgtable(struct flush_tlb_info *info)
+{
+	/*
+	 * The special thing about removing kernel page tables is that we can't
+	 * use a flush that just removes global TLB entries; we need one that
+	 * flushes paging structure caches across all PCIDs.
+	 * With INVLPGB, that's explicitly supported.
+	 * Otherwise, there's no good way (INVLPG and INVPCID can't specifically
+	 * target one address across all PCIDs), just throw everything away.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		invlpgb_kernel_range_flush(info);
+	else
+		flush_tlb_all();
+}
+
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
 	struct flush_tlb_info *info;
@@ -1542,6 +1558,27 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 	put_flush_tlb_info();
 }
 
+/*
+ * Flush paging-structure cache entries for page tables covering the specified
+ * range and synchronize with concurrent hardware page table walks, in
+ * preparation for freeing page tables in the region.
+ * This must not be used for flushing translations to data pages.
+ */
+void flush_tlb_kernel_pgtable_range(unsigned long start, unsigned long end)
+{
+	struct flush_tlb_info *info;
+
+	/* We don't synchronize against GUP-fast. */
+	VM_WARN_ON(start < TASK_SIZE_MAX);
+
+	guard(preempt)();
+
+	info = get_flush_tlb_info(NULL, start, end, PMD_SHIFT, true,
+				  TLB_GENERATION_INVALID);
+	kernel_tlb_flush_pgtable(info);
+	put_flush_tlb_info();
+}
+
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
---
base-commit: b1456f6dc167f7f101746e495bede2bac3d0e19f
change-id: 20250528-x86-fix-vmap-pt-del-flush-f9fa4f287c93
On Wed, 2025-05-28 at 22:30 +0200, Jann Horn wrote:
> Note that since I'm not touching invlpgb_kernel_range_flush() or
> invlpgb_flush_addr_nosync() in this patch, the flush on PMD table deletion
> with INVLPGB might be a bit slow due to using PTE_STRIDE instead of
> PMD_STRIDE. I don't think that matters.
Agreed that this is probably fine.
The performance-sensitive kernel flushes mostly seem to be small, related to static key toggling, etc.
> Cc: stable@vger.kernel.org
> Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
> Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Rik van Riel <riel@surriel.com>