On Fri, Dec 12, 2025 at 08:10:19AM +0100, David Hildenbrand (Red Hat) wrote:
> As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix
> huge_pmd_unshare() vs GUP-fast race") we can end up in situations
> where we perform so many IPI broadcasts when unsharing hugetlb PMD
> page tables that some workloads regress severely.
>
> In particular, when we fork()+exit(), or when we munmap() a large
> area backed by many shared PMD tables, we perform one IPI broadcast
> per unshared PMD table.
[...snip...]
> Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
> Reported-by: "Uschakow, Stanislav" <suschako@amazon.de>
> Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/
> Tested-by: Laurence Oberman <loberman@redhat.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
> ---
>  include/asm-generic/tlb.h |  74 ++++++++++++++++++++++-
>  include/linux/hugetlb.h   |  19 +++---
>  mm/hugetlb.c              | 121 ++++++++++++++++++++++----------------
>  mm/mmu_gather.c           |   7 +++
>  mm/mprotect.c             |   2 +-
>  mm/rmap.c                 |  25 +++++---
>  6 files changed, 179 insertions(+), 69 deletions(-)
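If I follow the approach correctly, the idea is that instead of issuing
an IPI broadcast (e.g. via tlb_remove_table_sync_one()) for every
single unshared PMD table, the unsharing is merely recorded in the
mmu_gather, and one sync at the end of the operation covers all of
them. A minimal sketch of that idea, for anyone else reading along
(the field and helper names below are my guesses for illustration,
not necessarily what the patch uses):

/*
 * Sketch only: "unshared_hugetlb_tables" is a hypothetical flag, not
 * necessarily the state the patch actually keeps in struct mmu_gather.
 */
static inline void huge_pmd_unshare_record(struct mmu_gather *tlb)
{
	/* Record that at least one PMD table was unshared; no IPI yet. */
	tlb->unshared_hugetlb_tables = 1;
}

static inline void huge_pmd_unshare_flush_sketch(struct mmu_gather *tlb)
{
	if (!tlb->unshared_hugetlb_tables)
		return;
	/*
	 * A single IPI broadcast serializes against GUP-fast for all
	 * PMD tables unshared above, instead of one broadcast per table.
	 */
	tlb_remove_table_sync_one();
	tlb->unshared_hugetlb_tables = 0;
}

That turns N broadcasts for N unshared tables into a single one, which
is exactly what the fork()+exit() and munmap() cases described above
need.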
> @@ -6522,22 +6511,16 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  				pte = huge_pte_clear_uffd_wp(pte);
>  			huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
>  			pages++;
>  		}
> +		tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
> +next:
>  		spin_unlock(ptl);
>  		cond_resched();
>  	}
> -	/*
> -	 * There is nothing protecting a previously-shared page table that we
> -	 * unshared through huge_pmd_unshare() from getting freed after we
> -	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
> -	 * succeeded, flush the range corresponding to the pud.
> -	 */
> -	if (shared_pmd)
> -		flush_hugetlb_tlb_range(vma, range.start, range.end);
> -	else
> -		flush_hugetlb_tlb_range(vma, start, end);
> +	tlb_flush_mmu_tlbonly(tlb);
> +	huge_pmd_unshare_flush(tlb, vma);
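For my own understanding: tlb_remove_huge_tlb_entry() in the loop above
records each modified huge PTE in the mmu_gather, so that the single
tlb_flush_mmu_tlbonly() here covers the whole range. Simplified, the
recording boils down to something like this (a sketch, not the actual
asm-generic implementation):

/*
 * Sketch: widen the pending flush range tracked by the mmu_gather so
 * one flush at the end covers every entry touched in the loop.
 */
static inline void record_tlb_range_sketch(struct mmu_gather *tlb,
					   unsigned long address,
					   unsigned long sz)
{
	tlb->start = min(tlb->start, address);
	tlb->end   = max(tlb->end, address + sz);
}

Which brings me to the flush that consumes this range: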
Shouldn't we teach mmu_gather that it has to call flush_hugetlb_tlb_range()
instead of the ordinary TLB flush routine? Otherwise this will break
architectures that have "special requirements" for evicting TLB entries
backing hugetlb mappings.
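IIRC arm64, for example, overrides flush_hugetlb_tlb_range() to handle
contiguous-hint entries, so a plain flush_tlb_range() on the gathered
range may not be enough there. Something along these lines is what I
have in mind (a sketch only; the vma plumbing and naming are my
assumptions, not a concrete proposal):

/*
 * Sketch: route the deferred flush through the hugetlb-aware helper
 * when the mmu_gather covers a hugetlb VMA. flush_hugetlb_tlb_range()
 * falls back to flush_tlb_range() on architectures that do not
 * override it, so this would be a no-op change for everyone else.
 */
static inline void tlb_flush_mmu_tlbonly_sketch(struct mmu_gather *tlb,
						struct vm_area_struct *vma)
{
	if (!tlb->end)
		return;

	if (is_vm_hugetlb_page(vma))
		flush_hugetlb_tlb_range(vma, tlb->start, tlb->end);
	else
		flush_tlb_range(vma, tlb->start, tlb->end);
	__tlb_reset_range(tlb);
}

Alternatively, the page size already recorded in the gather could be
used to pick the right stride.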
>  	/*
>  	 * No need to call mmu_notifier_arch_invalidate_secondary_tlbs() we are
>  	 * downgrading page table protection not changing it to point to a new