As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race") we can end up in some situations where we perform so many IPI broadcasts when unsharing hugetlb PMD page tables that it severely regresses some workloads.
In particular, when we fork()+exit(), or when we munmap() a large area backed by many shared PMD tables, we perform one IPI broadcast per unshared PMD table.
There are two optimizations to be had:
(1) When we process (unshare) multiple such PMD tables, such as during exit(), it is sufficient to send a single IPI broadcast (as long as we respect locking rules) instead of one per PMD table.
Locking prevents any of these PMD tables from getting reused before we drop the lock.
(2) When we are not the last sharer (> 2 users including us), there is no need to send the IPI broadcast. The shared PMD tables cannot become exclusive (fully unshared) before an IPI is broadcast by the last sharer.
Concurrent GUP-fast could walk into a PMD table just before we unshared it. It could then succeed in grabbing a page from the shared page table even after munmap() etc. succeeded (and suppressed an IPI). But there is no difference compared to GUP-fast simply sleeping for a while after grabbing the page and re-enabling IRQs.
Most importantly, GUP-fast will never walk into page tables that are no longer shared, because the last sharer will issue an IPI broadcast.
(if ever required, checking whether the PUD changed in GUP-fast after grabbing the page like we do in the PTE case could handle this)
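If that recheck ever became necessary, a minimal sketch of what it could look like (purely hypothetical and not part of this patch; the helper name and its placement are assumptions, mirroring the existing leaf-level rechecks in gup.c):

/*
 * Hypothetical sketch only: recheck the PUD in GUP-fast after grabbing the
 * folio, analogous to the existing PTE/PMD-level rechecks.
 */
static bool gup_fast_pud_unchanged(pud_t orig, pud_t *pudp,
				   struct folio *folio, int refs,
				   unsigned int flags)
{
	/* Did the PUD change under us, e.g., due to PMD table unsharing? */
	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
		gup_put_folio(folio, refs, flags);
		return false;	/* caller would fall back to the slow path */
	}
	return true;
}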
So let's rework PMD sharing TLB flushing + IPI sync to use the mmu_gather infrastructure so we can implement these optimizations and demystify the code at least a bit. Extend the mmu_gather infrastructure to be able to deal with our special hugetlb PMD table sharing implementation.
We'll consolidate the handling for (full) unsharing of PMD tables in tlb_unshare_pmd_ptdesc() and tlb_flush_unshared_tables(), and track in "struct mmu_gather" whether we had (full) unsharing of PMD tables.
Because locking is very special (concurrent unsharing+reuse must be prevented), we disallow deferring flushing to tlb_finish_mmu() and instead require an explicit earlier call to tlb_flush_unshared_tables().
From hugetlb code, we call huge_pmd_unshare_flush() where we make sure that the expected lock protecting us from concurrent unsharing+reuse is still held.
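For clarity, the resulting caller pattern looks roughly like this (condensed from the hugetlb_unshare_pmds() changes below; mmu notifier calls, conditional locking and error handling omitted):

struct mmu_gather tlb;

tlb_gather_mmu(&tlb, mm);
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(mapping);

for (address = start; address < end; address += PUD_SIZE) {
	ptep = hugetlb_walk(vma, address, sz);
	if (!ptep)
		continue;
	ptl = huge_pte_lock(h, mm, ptep);
	/* Only records the unshare in the mmu_gather; no TLB flush / IPI yet. */
	huge_pmd_unshare(&tlb, vma, address, ptep);
	spin_unlock(ptl);
}

/* Must happen before dropping i_mmap_rwsem: TLB flush + IPI sync if needed. */
huge_pmd_unshare_flush(&tlb, vma);

i_mmap_unlock_write(mapping);
hugetlb_vma_unlock_write(vma);
tlb_finish_mmu(&tlb);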
Check with a VM_WARN_ON_ONCE() in tlb_finish_mmu() that tlb_flush_unshared_tables() was properly called earlier.
Document it all properly.
Notes about tlb_remove_table_sync_one() interaction with unsharing:
There are two fairly tricky things:
(1) tlb_remove_table_sync_one() is a NOP on architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE.
Here, the assumption is that the previous TLB flush would send an IPI to all relevant CPUs. Careful: some architectures like x86 only send IPIs to all relevant CPUs when tlb->freed_tables is set.
The relevant architectures should be selecting MMU_GATHER_RCU_TABLE_FREE, but x86 might not do that in stable kernels and it might have been problematic before this patch.
Also, the arch flushing behavior (independent of IPIs) is different when tlb->freed_tables is set. Do we have to enlighten them to also take care of tlb->unshared_tables? So far we didn't care, so hopefully we are fine. Of course, we could be setting tlb->freed_tables as well, but that might then unnecessarily flush too much, because the semantics of tlb->freed_tables are a bit fuzzy.
This patch changes nothing in this regard.
(2) tlb_remove_table_sync_one() is not a NOP on architectures with CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync.
Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB) we still issue IPIs during TLB flushes and don't actually need the second tlb_remove_table_sync_one().
This optimization can be implemented on top of this, by checking, e.g., in tlb_remove_table_sync_one() whether we really need IPIs. But as described in (1), it really must honor tlb->freed_tables then to send IPIs to all relevant CPUs.
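A rough sketch of that potential follow-up (purely illustrative; arch_flush_uses_ipis() is a made-up hook, and as noted above the real thing would also have to take tlb->freed_tables into account):

void tlb_remove_table_sync_one(void)
{
	/*
	 * If the preceding TLB flush already broadcast IPIs to all relevant
	 * CPUs (e.g., x86 without paravirt / X86_FEATURE_INVLPGB), concurrent
	 * GUP-fast was already waited for and the extra broadcast could be
	 * skipped.
	 */
	if (arch_flush_uses_ipis())	/* hypothetical arch hook */
		return;
	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
}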
Further note that the ptdesc_pmd_pts_dec() in huge_pmd_share() is not a concern, as we are holding the i_mmap_lock the whole time, preventing concurrent unsharing. That ptdesc_pmd_pts_dec() usage will be removed separately as a cleanup later.
There are plenty more cleanups to be had, but they have to wait until this is fixed.
Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
Reported-by: "Uschakow, Stanislav" <suschako@amazon.de>
Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/
Tested-by: Laurence Oberman <loberman@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
---
 include/asm-generic/tlb.h |  74 ++++++++++++++++++++++-
 include/linux/hugetlb.h   |  19 +++---
 mm/hugetlb.c              | 121 ++++++++++++++++++++++----------------
 mm/mmu_gather.c           |   7 +++
 mm/mprotect.c             |   2 +-
 mm/rmap.c                 |  25 +++++---
 6 files changed, 179 insertions(+), 69 deletions(-)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 1fff717cae510..706416babb3d6 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -364,6 +364,20 @@ struct mmu_gather { unsigned int vma_huge : 1; unsigned int vma_pfn : 1;
+ /* + * Did we unshare (unmap) any shared page tables? For now only + * used for hugetlb PMD table sharing. + */ + unsigned int unshared_tables : 1; + + /* + * Did we unshare any page tables such that they are now exclusive + * and could get reused+modified by the new owner? When setting this + * flag, "unshared_tables" will be set as well. For now only used + * for hugetlb PMD table sharing. + */ + unsigned int fully_unshared_tables : 1; + unsigned int batch_count;
#ifndef CONFIG_MMU_GATHER_NO_GATHER @@ -400,6 +414,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb) tlb->cleared_pmds = 0; tlb->cleared_puds = 0; tlb->cleared_p4ds = 0; + tlb->unshared_tables = 0; /* * Do not reset mmu_gather::vma_* fields here, we do not * call into tlb_start_vma() again to set them if there is an @@ -484,7 +499,7 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) * these bits. */ if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds || - tlb->cleared_puds || tlb->cleared_p4ds)) + tlb->cleared_puds || tlb->cleared_p4ds || tlb->unshared_tables)) return;
tlb_flush(tlb); @@ -773,6 +788,63 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd) } #endif
+#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING +static inline void tlb_unshare_pmd_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt, + unsigned long addr) +{ + /* + * The caller must make sure that concurrent unsharing + exclusive + * reuse is impossible until tlb_flush_unshared_tables() was called. + */ + VM_WARN_ON_ONCE(!ptdesc_pmd_is_shared(pt)); + ptdesc_pmd_pts_dec(pt); + + /* Clearing a PUD pointing at a PMD table with PMD leaves. */ + tlb_flush_pmd_range(tlb, addr & PUD_MASK, PUD_SIZE); + + /* + * If the page table is now exclusively owned, we fully unshared + * a page table. + */ + if (!ptdesc_pmd_is_shared(pt)) + tlb->fully_unshared_tables = true; + tlb->unshared_tables = true; +} + +static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb) +{ + /* + * As soon as the caller drops locks to allow for reuse of + * previously-shared tables, these tables could get modified and + * even reused outside of hugetlb context, so we have to make sure that + * any page table walkers (incl. TLB, GUP-fast) are aware of that + * change. + * + * Even if we are not fully unsharing a PMD table, we must + * flush the TLB for the unsharer now. + */ + if (tlb->unshared_tables) + tlb_flush_mmu_tlbonly(tlb); + + /* + * Similarly, we must make sure that concurrent GUP-fast will not + * walk previously-shared page tables that are getting modified+reused + * elsewhere. So broadcast an IPI to wait for any concurrent GUP-fast. + * + * We only perform this when we are the last sharer of a page table, + * as the IPI will reach all CPUs: any GUP-fast. + * + * Note that on configs where tlb_remove_table_sync_one() is a NOP, + * the expectation is that the tlb_flush_mmu_tlbonly() would have issued + * required IPIs already for us. + */ + if (tlb->fully_unshared_tables) { + tlb_remove_table_sync_one(); + tlb->fully_unshared_tables = false; + } +} +#endif /* CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */ + #endif /* CONFIG_MMU */
#endif /* _ASM_GENERIC__TLB_H */ diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 03c8725efa289..63b248c6bfd47 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -240,8 +240,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long sz); unsigned long hugetlb_mask_last_page(struct hstate *h); -int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long addr, pte_t *ptep); +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep); +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma); void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma, unsigned long *start, unsigned long *end);
@@ -271,7 +272,7 @@ void hugetlb_vma_unlock_write(struct vm_area_struct *vma); int hugetlb_vma_trylock_write(struct vm_area_struct *vma); void hugetlb_vma_assert_locked(struct vm_area_struct *vma); void hugetlb_vma_lock_release(struct kref *kref); -long hugetlb_change_protection(struct vm_area_struct *vma, +long hugetlb_change_protection(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long address, unsigned long end, pgprot_t newprot, unsigned long cp_flags); void hugetlb_unshare_all_pmds(struct vm_area_struct *vma); @@ -300,13 +301,17 @@ static inline struct address_space *hugetlb_folio_mapping_lock_write( return NULL; }
-static inline int huge_pmd_unshare(struct mm_struct *mm, - struct vm_area_struct *vma, - unsigned long addr, pte_t *ptep) +static inline int huge_pmd_unshare(struct mmu_gather *tlb, + struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) { return 0; }
+static inline void huge_pmd_unshare_flush(struct mmu_gather *tlb, + struct vm_area_struct *vma) +{ +} + static inline void adjust_range_if_pmd_sharing_possible( struct vm_area_struct *vma, unsigned long *start, unsigned long *end) @@ -432,7 +437,7 @@ static inline void move_hugetlb_state(struct folio *old_folio, { }
-static inline long hugetlb_change_protection( +static inline long hugetlb_change_protection(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long address, unsigned long end, pgprot_t newprot, unsigned long cp_flags) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 3c77cdef12a32..7fef0b94b5d1e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5096,8 +5096,9 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma, unsigned long last_addr_mask; pte_t *src_pte, *dst_pte; struct mmu_notifier_range range; - bool shared_pmd = false; + struct mmu_gather tlb;
+ tlb_gather_mmu(&tlb, vma->vm_mm); mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, old_addr, old_end); adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end); @@ -5122,12 +5123,12 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma, if (huge_pte_none(huge_ptep_get(mm, old_addr, src_pte))) continue;
- if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) { - shared_pmd = true; + if (huge_pmd_unshare(&tlb, vma, old_addr, src_pte)) { old_addr |= last_addr_mask; new_addr |= last_addr_mask; continue; } + tlb_remove_huge_tlb_entry(h, &tlb, src_pte, old_addr);
dst_pte = huge_pte_alloc(mm, new_vma, new_addr, sz); if (!dst_pte) @@ -5136,13 +5137,13 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma, move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte, sz); }
- if (shared_pmd) - flush_hugetlb_tlb_range(vma, range.start, range.end); - else - flush_hugetlb_tlb_range(vma, old_end - len, old_end); + tlb_flush_mmu_tlbonly(&tlb); + huge_pmd_unshare_flush(&tlb, vma); + mmu_notifier_invalidate_range_end(&range); i_mmap_unlock_write(mapping); hugetlb_vma_unlock_write(vma); + tlb_finish_mmu(&tlb);
return len + old_addr - old_end; } @@ -5161,7 +5162,6 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long sz = huge_page_size(h); bool adjust_reservation; unsigned long last_addr_mask; - bool force_flush = false;
WARN_ON(!is_vm_hugetlb_page(vma)); BUG_ON(start & ~huge_page_mask(h)); @@ -5184,10 +5184,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, }
ptl = huge_pte_lock(h, mm, ptep); - if (huge_pmd_unshare(mm, vma, address, ptep)) { + if (huge_pmd_unshare(tlb, vma, address, ptep)) { spin_unlock(ptl); - tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE); - force_flush = true; address |= last_addr_mask; continue; } @@ -5303,14 +5301,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, } tlb_end_vma(tlb, vma);
- /* - * There is nothing protecting a previously-shared page table that we - * unshared through huge_pmd_unshare() from getting freed after we - * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare() - * succeeded, flush the range corresponding to the pud. - */ - if (force_flush) - tlb_flush_mmu_tlbonly(tlb); + huge_pmd_unshare_flush(tlb, vma); }
void __hugetlb_zap_begin(struct vm_area_struct *vma, @@ -6399,7 +6390,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, } #endif /* CONFIG_USERFAULTFD */
-long hugetlb_change_protection(struct vm_area_struct *vma, +long hugetlb_change_protection(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long address, unsigned long end, pgprot_t newprot, unsigned long cp_flags) { @@ -6409,7 +6400,6 @@ long hugetlb_change_protection(struct vm_area_struct *vma, pte_t pte; struct hstate *h = hstate_vma(vma); long pages = 0, psize = huge_page_size(h); - bool shared_pmd = false; struct mmu_notifier_range range; unsigned long last_addr_mask; bool uffd_wp = cp_flags & MM_CP_UFFD_WP; @@ -6452,7 +6442,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma, } } ptl = huge_pte_lock(h, mm, ptep); - if (huge_pmd_unshare(mm, vma, address, ptep)) { + if (huge_pmd_unshare(tlb, vma, address, ptep)) { /* * When uffd-wp is enabled on the vma, unshare * shouldn't happen at all. Warn about it if it @@ -6461,7 +6451,6 @@ long hugetlb_change_protection(struct vm_area_struct *vma, WARN_ON_ONCE(uffd_wp || uffd_wp_resolve); pages++; spin_unlock(ptl); - shared_pmd = true; address |= last_addr_mask; continue; } @@ -6522,22 +6511,16 @@ long hugetlb_change_protection(struct vm_area_struct *vma, pte = huge_pte_clear_uffd_wp(pte); huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte); pages++; + tlb_remove_huge_tlb_entry(h, tlb, ptep, address); }
next: spin_unlock(ptl); cond_resched(); } - /* - * There is nothing protecting a previously-shared page table that we - * unshared through huge_pmd_unshare() from getting freed after we - * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare() - * succeeded, flush the range corresponding to the pud. - */ - if (shared_pmd) - flush_hugetlb_tlb_range(vma, range.start, range.end); - else - flush_hugetlb_tlb_range(vma, start, end); + + tlb_flush_mmu_tlbonly(tlb); + huge_pmd_unshare_flush(tlb, vma); /* * No need to call mmu_notifier_arch_invalidate_secondary_tlbs() we are * downgrading page table protection not changing it to point to a new @@ -6904,18 +6887,27 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma, return pte; }
-/* - * unmap huge page backed by shared pte. +/** + * huge_pmd_unshare - Unmap a pmd table if it is shared by multiple users + * @tlb: the current mmu_gather. + * @vma: the vma covering the pmd table. + * @addr: the address we are trying to unshare. + * @ptep: pointer into the (pmd) page table. + * + * Called with the page table lock held, the i_mmap_rwsem held in write mode + * and the hugetlb vma lock held in write mode. * - * Called with page table lock held. + * Note: The caller must call huge_pmd_unshare_flush() before dropping the + * i_mmap_rwsem. * - * returns: 1 successfully unmapped a shared pte page - * 0 the underlying pte page is not shared, or it is the last user + * Returns: 1 if it was a shared PMD table and it got unmapped, or 0 if it + * was not a shared PMD table. */ -int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long addr, pte_t *ptep) +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep) { unsigned long sz = huge_page_size(hstate_vma(vma)); + struct mm_struct *mm = vma->vm_mm; pgd_t *pgd = pgd_offset(mm, addr); p4d_t *p4d = p4d_offset(pgd, addr); pud_t *pud = pud_offset(p4d, addr); @@ -6927,18 +6919,36 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma, i_mmap_assert_write_locked(vma->vm_file->f_mapping); hugetlb_vma_assert_locked(vma); pud_clear(pud); - /* - * Once our caller drops the rmap lock, some other process might be - * using this page table as a normal, non-hugetlb page table. - * Wait for pending gup_fast() in other threads to finish before letting - * that happen. - */ - tlb_remove_table_sync_one(); - ptdesc_pmd_pts_dec(virt_to_ptdesc(ptep)); + + tlb_unshare_pmd_ptdesc(tlb, virt_to_ptdesc(ptep), addr); + mm_dec_nr_pmds(mm); return 1; }
+/* + * huge_pmd_unshare_flush - Complete a sequence of huge_pmd_unshare() calls + * @tlb: the current mmu_gather. + * @vma: the vma covering the pmd table. + * + * Perform necessary TLB flushes or IPI broadcasts to synchronize PMD table + * unsharing with concurrent page table walkers. + * + * This function must be called after a sequence of huge_pmd_unshare() + * calls while still holding the i_mmap_rwsem. + */ +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma) +{ + /* + * We must synchronize page table unsharing such that nobody will + * try reusing a previously-shared page table while it might still + * be in use by previous sharers (TLB, GUP_fast). + */ + i_mmap_assert_write_locked(vma->vm_file->f_mapping); + + tlb_flush_unshared_tables(tlb); +} + #else /* !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma, @@ -6947,12 +6957,16 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma, return NULL; }
-int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long addr, pte_t *ptep) +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep) { return 0; }
+void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma) +{ +} + void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma, unsigned long *start, unsigned long *end) { @@ -7219,6 +7233,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma, unsigned long sz = huge_page_size(h); struct mm_struct *mm = vma->vm_mm; struct mmu_notifier_range range; + struct mmu_gather tlb; unsigned long address; spinlock_t *ptl; pte_t *ptep; @@ -7229,6 +7244,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma, if (start >= end) return;
+ tlb_gather_mmu(&tlb, mm); flush_cache_range(vma, start, end); /* * No need to call adjust_range_if_pmd_sharing_possible(), because @@ -7248,10 +7264,10 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma, if (!ptep) continue; ptl = huge_pte_lock(h, mm, ptep); - huge_pmd_unshare(mm, vma, address, ptep); + huge_pmd_unshare(&tlb, vma, address, ptep); spin_unlock(ptl); } - flush_hugetlb_tlb_range(vma, start, end); + huge_pmd_unshare_flush(&tlb, vma); if (take_locks) { i_mmap_unlock_write(vma->vm_file->f_mapping); hugetlb_vma_unlock_write(vma); @@ -7261,6 +7277,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma, * Documentation/mm/mmu_notifier.rst. */ mmu_notifier_invalidate_range_end(&range); + tlb_finish_mmu(&tlb); }
/* diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index 247e3f9db6c7a..030a162a263ba 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -426,6 +426,7 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, #endif tlb->vma_pfn = 0;
+ tlb->fully_unshared_tables = 0; __tlb_reset_range(tlb); inc_tlb_flush_pending(tlb->mm); } @@ -468,6 +469,12 @@ void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm) */ void tlb_finish_mmu(struct mmu_gather *tlb) { + /* + * We expect an earlier huge_pmd_unshare_flush() call to sort this out, + * due to complicated locking requirements with page table unsharing. + */ + VM_WARN_ON_ONCE(tlb->fully_unshared_tables); + /* * If there are parallel threads are doing PTE changes on same range * under non-exclusive lock (e.g., mmap_lock read-side) but defer TLB diff --git a/mm/mprotect.c b/mm/mprotect.c index 283889e4f1cec..5c330e817129e 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -652,7 +652,7 @@ long change_protection(struct mmu_gather *tlb, #endif
if (is_vm_hugetlb_page(vma)) - pages = hugetlb_change_protection(vma, start, end, newprot, + pages = hugetlb_change_protection(tlb, vma, start, end, newprot, cp_flags); else pages = change_protection_range(tlb, vma, start, end, newprot, diff --git a/mm/rmap.c b/mm/rmap.c index 748f48727a162..d6799afe11147 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -76,7 +76,7 @@ #include <linux/mm_inline.h> #include <linux/oom.h>
-#include <asm/tlbflush.h> +#include <asm/tlb.h>
#define CREATE_TRACE_POINTS #include <trace/events/migrate.h> @@ -2008,13 +2008,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, * if unsuccessful. */ if (!anon) { + struct mmu_gather tlb; + VM_BUG_ON(!(flags & TTU_RMAP_LOCKED)); if (!hugetlb_vma_trylock_write(vma)) goto walk_abort; - if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) { + + tlb_gather_mmu(&tlb, mm); + if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) { hugetlb_vma_unlock_write(vma); - flush_tlb_range(vma, - range.start, range.end); + huge_pmd_unshare_flush(&tlb, vma); + tlb_finish_mmu(&tlb); /* * The PMD table was unmapped, * consequently unmapping the folio. @@ -2022,6 +2026,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, goto walk_done; } hugetlb_vma_unlock_write(vma); + tlb_finish_mmu(&tlb); } pteval = huge_ptep_clear_flush(vma, address, pvmw.pte); if (pte_dirty(pteval)) @@ -2398,17 +2403,20 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, * fail if unsuccessful. */ if (!anon) { + struct mmu_gather tlb; + VM_BUG_ON(!(flags & TTU_RMAP_LOCKED)); if (!hugetlb_vma_trylock_write(vma)) { page_vma_mapped_walk_done(&pvmw); ret = false; break; } - if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) { - hugetlb_vma_unlock_write(vma); - flush_tlb_range(vma, - range.start, range.end);
+ tlb_gather_mmu(&tlb, mm); + if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) { + hugetlb_vma_unlock_write(vma); + huge_pmd_unshare_flush(&tlb, vma); + tlb_finish_mmu(&tlb); /* * The PMD table was unmapped, * consequently unmapping the folio. @@ -2417,6 +2425,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, break; } hugetlb_vma_unlock_write(vma); + tlb_finish_mmu(&tlb); } /* Nuke the hugetlb page table entry */ pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
On Fri, Dec 12, 2025 at 08:10:19AM +0100, David Hildenbrand (Red Hat) wrote:
After discussion on v1 4/4, and running a git range-diff between the two, this LGTM, so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
On Fri, Dec 12, 2025 at 08:10:19AM +0100, David Hildenbrand (Red Hat) wrote:
As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race") we can end up in some situations where we perform so many IPI broadcasts when unsharing hugetlb PMD page tables that it severely regresses some workloads.
In particular, when we fork()+exit(), or when we munmap() a large area backed by many shared PMD tables, we perform one IPI broadcast per unshared PMD table.
[...snip...]
Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race") Reported-by: Uschakow, Stanislav" suschako@amazon.de Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/ Tested-by: Laurence Oberman loberman@redhat.com Cc: stable@vger.kernel.org Signed-off-by: David Hildenbrand (Red Hat) david@kernel.org
include/asm-generic/tlb.h | 74 ++++++++++++++++++++++- include/linux/hugetlb.h | 19 +++--- mm/hugetlb.c | 121 ++++++++++++++++++++++---------------- mm/mmu_gather.c | 7 +++ mm/mprotect.c | 2 +- mm/rmap.c | 25 +++++--- 6 files changed, 179 insertions(+), 69 deletions(-)
@@ -6522,22 +6511,16 @@ long hugetlb_change_protection(struct vm_area_struct *vma, pte = huge_pte_clear_uffd_wp(pte); huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte); pages++;
}tlb_remove_huge_tlb_entry(h, tlb, ptep, address);next: spin_unlock(ptl); cond_resched(); }
- /*
* There is nothing protecting a previously-shared page table that we* unshared through huge_pmd_unshare() from getting freed after we* release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()* succeeded, flush the range corresponding to the pud.*/- if (shared_pmd)
flush_hugetlb_tlb_range(vma, range.start, range.end);- else
flush_hugetlb_tlb_range(vma, start, end);
- tlb_flush_mmu_tlbonly(tlb);
- huge_pmd_unshare_flush(tlb, vma);
Shouldn't we teach mmu_gather that it has to call flush_hugetlb_tlb_range() instead of the ordinary TLB flush routine, otherwise it will break ARCHes that have "special requirements" for evicting hugetlb backing TLB entries?
/* * No need to call mmu_notifier_arch_invalidate_secondary_tlbs() we are * downgrading page table protection not changing it to point to a new
On 12/19/25 13:37, Harry Yoo wrote:
On Fri, Dec 12, 2025 at 08:10:19AM +0100, David Hildenbrand (Red Hat) wrote:
As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race") we can end up in some situations where we perform so many IPI broadcasts when unsharing hugetlb PMD page tables that it severely regresses some workloads.
In particular, when we fork()+exit(), or when we munmap() a large area backed by many shared PMD tables, we perform one IPI broadcast per unshared PMD table.
[...snip...]
Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race") Reported-by: Uschakow, Stanislav" suschako@amazon.de Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/ Tested-by: Laurence Oberman loberman@redhat.com Cc: stable@vger.kernel.org Signed-off-by: David Hildenbrand (Red Hat) david@kernel.org
include/asm-generic/tlb.h | 74 ++++++++++++++++++++++- include/linux/hugetlb.h | 19 +++--- mm/hugetlb.c | 121 ++++++++++++++++++++++---------------- mm/mmu_gather.c | 7 +++ mm/mprotect.c | 2 +- mm/rmap.c | 25 +++++--- 6 files changed, 179 insertions(+), 69 deletions(-)
@@ -6522,22 +6511,16 @@ long hugetlb_change_protection(struct vm_area_struct *vma, pte = huge_pte_clear_uffd_wp(pte); huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte); pages++;
}tlb_remove_huge_tlb_entry(h, tlb, ptep, address);next: spin_unlock(ptl); cond_resched(); }
- /*
* There is nothing protecting a previously-shared page table that we* unshared through huge_pmd_unshare() from getting freed after we* release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()* succeeded, flush the range corresponding to the pud.*/- if (shared_pmd)
flush_hugetlb_tlb_range(vma, range.start, range.end);- else
flush_hugetlb_tlb_range(vma, start, end);
- tlb_flush_mmu_tlbonly(tlb);
- huge_pmd_unshare_flush(tlb, vma);
Shouldn't we teach mmu_gather that it has to call
I hope not :) In the worst case we could keep the flush_hugetlb_tlb_range() in the !shared case in. Suboptimal but I am sick and tired of dealing with this hugetlb mess.
Let me CC Ryan and Catalin for the arm64 pieces and Christophe on the ppc pieces: See [1] where we convert away from some flush_hugetlb_tlb_range() users to operate on mmu_gather using

* tlb_remove_huge_tlb_entry() for mremap() and mprotect(). Before we would only use it in __unmap_hugepage_range().

* tlb_flush_pmd_range() for unsharing of shared PMD tables. We already used that in one call path.
[1] https://lore.kernel.org/all/20251212071019.471146-5-david@kernel.org/
flush_hugetlb_tlb_range() instead of the ordinary TLB flush routine, otherwise it will break ARCHes that have "special requirements" for evicting hugetlb backing TLB entries?
Yeah, I was briefly wondering about that myself (and the inconsistency we had in the code). I would hope that we're good, but maybe there are some nasty corner cases we're missing. So thanks for raising that.
Given tlb_remove_huge_tlb_entry() exists (and is already getting used), I would assume that it does the right thing.
In tlb_unshare_pmd_ptdesc(), I am now using tlb_flush_pmd_range(), because we know that we are dealing with PMD-sized hugetlb folios.
And in fact, we were already doing that in case of __unmap_hugepage_range(), where we did exactly what I do now:
tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
So, again, something would already be broken there unless I am missing something important.
Looking at it, I wonder whether we must do the tlb_remove_huge_tlb_entry() in move_hugetlb_page_tables() after the move_huge_pte(). Looks like tlb_remove_huge_tlb_entry() might do some flushing on ppc (and not just updating the mmu_gather) through __tlb_remove_tlb_entry(). But it's a bit confusing.
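For reference, tlb_remove_huge_tlb_entry() is roughly the following (paraphrased from include/asm-generic/tlb.h, so double-check the in-tree definition; __tlb_remove_tlb_entry() is a no-op by default, and powerpc overriding it is where the extra flushing would come from):

#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
	do {							\
		unsigned long _sz = huge_page_size(h);		\
		if (_sz >= PUD_SIZE)				\
			tlb_flush_pud_range(tlb, address, _sz);	\
		else if (_sz >= PMD_SIZE)			\
			tlb_flush_pmd_range(tlb, address, _sz);	\
		else						\
			tlb_flush_pte_range(tlb, address, _sz);	\
		/* no-op unless the architecture overrides it */ \
		__tlb_remove_tlb_entry(tlb, ptep, address);	\
	} while (0)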
On 12/19/25 14:52, David Hildenbrand (Red Hat) wrote:
Let me CC Ryan and Catalin for the arm64 pieces and Christophe on the ppc pieces: See [1] where we convert away from some flush_hugetlb_tlb_range() users to operate on mmu_gather using
- tlb_remove_huge_tlb_entry() for mremap() and mprotect(). Before we would only use it in __unmap_hugepage_range().
- tlb_flush_pmd_range() for unsharing of shared PMD tables. We already used that in one call path.
To clarify, powerpc does not select ARCH_WANT_HUGE_PMD_SHARE, so the second change does not apply to ppc.
On 12/19/25 14:59, David Hildenbrand (Red Hat) wrote:
Okay, the existing hugetlb mmu_gather integration is hell on earth.
I *think* to get everything right (work around all the hacks we have) we might have to do a
tlb_change_page_size(tlb, sz); tlb_start_vma(tlb, vma);
before adding something to the tlb and a tlb_end_vma(tlb, vma) if we don't immediately call tlb_finish_mmu() already.
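(So, per VMA, the bracketing would roughly be — just a sketch of the ordering, assuming h is the VMA's hstate, not a literal hunk:)

	tlb_gather_mmu(&tlb, mm);
	tlb_change_page_size(&tlb, huge_page_size(h));	/* may flush if the size changed */
	tlb_start_vma(&tlb, vma);

	/* ... tlb_remove_huge_tlb_entry() / huge_pmd_unshare() etc. ... */

	tlb_end_vma(&tlb, vma);
	tlb_finish_mmu(&tlb);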
tlb_change_page_size() will set page_size accordingly (as required for ppc IIUC).
tlb_start_vma()->tlb_update_vma_flags() will set tlb->vma_huge for ... some very good reason I am sure.
So something like the following might do the trick:
From b0b854c2f91ce0931e1462774c92015183fb5b52 Mon Sep 17 00:00:00 2001
From: "David Hildenbrand (Red Hat)" <david@kernel.org>
Date: Sun, 21 Dec 2025 12:57:43 +0100
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
---
 mm/hugetlb.c | 12 +++++++++++-
 mm/rmap.c    |  4 ++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7fef0b94b5d1e..14521210181c9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5113,6 +5113,9 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 	/* Prevent race with file truncation */
 	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(mapping);
+
+	tlb_change_page_size(&tlb, sz);
+	tlb_start_vma(&tlb, vma);
 	for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
 		src_pte = hugetlb_walk(vma, old_addr, sz);
 		if (!src_pte) {
@@ -5128,13 +5131,13 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 			new_addr |= last_addr_mask;
 			continue;
 		}
-		tlb_remove_huge_tlb_entry(h, &tlb, src_pte, old_addr);

 		dst_pte = huge_pte_alloc(mm, new_vma, new_addr, sz);
 		if (!dst_pte)
 			break;

 		move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte, sz);
+		tlb_remove_huge_tlb_entry(h, &tlb, src_pte, old_addr);
 	}

 	tlb_flush_mmu_tlbonly(&tlb);
@@ -6416,6 +6419,8 @@ long hugetlb_change_protection(struct mmu_gather *tlb, struct vm_area_struct *vm

 	BUG_ON(address >= end);
 	flush_cache_range(vma, range.start, range.end);
+	tlb_change_page_size(tlb, psize);
+	tlb_start_vma(tlb, vma);

 	mmu_notifier_invalidate_range_start(&range);
 	hugetlb_vma_lock_write(vma);
@@ -6532,6 +6537,8 @@ long hugetlb_change_protection(struct mmu_gather *tlb, struct vm_area_struct *vm
 	hugetlb_vma_unlock_write(vma);
 	mmu_notifier_invalidate_range_end(&range);

+	tlb_end_vma(tlb, vma);
+
 	return pages > 0 ? (pages << h->order) : pages;
 }

@@ -7259,6 +7266,9 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
 	} else {
 		i_mmap_assert_write_locked(vma->vm_file->f_mapping);
 	}
+
+	tlb_change_page_size(&tlb, sz);
+	tlb_start_vma(&tlb, vma);
 	for (address = start; address < end; address += PUD_SIZE) {
 		ptep = hugetlb_walk(vma, address, sz);
 		if (!ptep)
diff --git a/mm/rmap.c b/mm/rmap.c
index d6799afe11147..27210bc6fb489 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2015,6 +2015,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			goto walk_abort;

 		tlb_gather_mmu(&tlb, mm);
+		tlb_change_page_size(&tlb, huge_page_size(hstate_vma(vma)));
+		tlb_start_vma(&tlb, vma);
 		if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
 			hugetlb_vma_unlock_write(vma);
 			huge_pmd_unshare_flush(&tlb, vma);
@@ -2413,6 +2415,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		}

 		tlb_gather_mmu(&tlb, mm);
+		tlb_change_page_size(&tlb, huge_page_size(hstate_vma(vma)));
+		tlb_start_vma(&tlb, vma);
 		if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
 			hugetlb_vma_unlock_write(vma);
 			huge_pmd_unshare_flush(&tlb, vma);
-- 
2.52.0
On Sun, Dec 21, 2025 at 01:24:44PM +0100, David Hildenbrand (Red Hat) wrote:
On 12/19/25 14:59, David Hildenbrand (Red Hat) wrote:
On 12/19/25 14:52, David Hildenbrand (Red Hat) wrote:
On 12/19/25 13:37, Harry Yoo wrote:
On Fri, Dec 12, 2025 at 08:10:19AM +0100, David Hildenbrand (Red Hat) wrote:
[...snip...]
Okay, the existing hugetlb mmu_gather integration is hell on earth.
I *think* to get everything right (work around all the hacks we have) we might have to do a
tlb_change_page_size(tlb, sz); tlb_start_vma(tlb, vma);
before adding something to the tlb and a tlb_end_vma(tlb, vma) if we don't immediately call tlb_finish_mmu() already.
Good point, indeed!
tlb_change_page_size() will set page_size accordingly (as required for ppc IIUC).
Right. PPC wants to flush TLB when the page size changes.
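(For reference, the generic helper looks roughly like this in include/asm-generic/tlb.h when CONFIG_MMU_GATHER_PAGE_SIZE is selected — paraphrased from memory, so minor details may differ:)

static inline void tlb_change_page_size(struct mmu_gather *tlb,
					unsigned int page_size)
{
#ifdef CONFIG_MMU_GATHER_PAGE_SIZE
	if (tlb->page_size && tlb->page_size != page_size) {
		/* Mixing page sizes in one batch: flush what we gathered so far. */
		if (!tlb->fullmm && !tlb->need_flush_all)
			tlb_flush_mmu(tlb);
	}
	tlb->page_size = page_size;
#endif
}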
tlb_start_vma()->tlb_update_vma_flags() will set tlb->vma_huge for ... some very good reason I am sure.
:)
So something like the following might do the trick:
[...snip...]
But now I'm staring at it and wonder whether we should just defer the TLB flushing changes to a later point and only focus on the IPI flushes.
You mean defer TLB flushing to which point? For unmapping or changing permissions of VMAs, flushing at the VMA boundary already makes sense?
Or if you meant batching TLB flushes in try_to_{migrate,unmap}_one()...
/me starts wondering...
"Hmm... for RMAP, we already have TLB flush batching via struct tlbflush_unmap_batch. Why not use this framework when unmapping shared hugetlb pages as well?"
Doing only that with mmu_gather looks *really* weird, and I don't want to introduce some other mechanism just for that batching purpose.
Hm ...
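(Background for that idea: the rmap-side batching lives in task_struct and is drained via try_to_unmap_flush() from the reclaim path; roughly, from include/linux/mm_types_task.h — field names from memory, details may vary by kernel version:)

struct tlbflush_unmap_batch {
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
	/* Arch-specific batch state, e.g. the set of CPUs to flush on x86. */
	struct arch_tlbflush_unmap_batch arch;
	/* Does anything in the batch actually require a TLB flush? */
	bool flush_required;
	/* Was any batched PTE writable? Affects when dirty data must be flushed. */
	bool writable;
#endif
};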
-- Cheers
David
[...snip...]
But now I'm staring at it and wonder whether we should just defer the TLB flushing changes to a later point and only focus on the IPI flushes.
You mean defer TLB flushing to which point? For unmapping or changing permissions of VMAs, flushing at the VMA boundary already makes sense?
Defer converting to mmu_gather to a later patch set :)
I gave it a try yesterday, but it's also a bit ugly.
In the code above, primarily the rmap change is nasty.
Or if you meant batching TLB flushes in try_to_{migrate,unmap}_one()...
/me starts wondering...
"Hmm... for RMAP, we already have TLB flush batching via struct tlbflush_unmap_batch. Why not use this framework when unmapping shared hugetlb pages as well?"
Hm, also not what we really want in most cases. I don't think we should be using that outside of rmap.c (and I have the gut feeling that we should maybe make use of mmu_gather in there instead at some point).
Let me try a bit to see if I can clean up the code here, or whether I should just add a temporary custom batching data structure.
Thanks for bringing this up!