Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race

16 Oct 2025

On Thu, Oct 16, 2025 at 9:10 PM David Hildenbrand <david@redhat.com> wrote:
...
...
...
I'm currently looking at the fix and what sticks out is "Fix it with an
explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal,
non-hugetlb". I could only see how it is used for the remaining user for
hugetlb stuff, but that's a different question)
If I remember correctly:
When a hugetlb shared page table drops to refcount 1, it turns into a
normal page table. If you then afterwards split the hugetlb VMA, unmap
one half of it, and place a new unrelated VMA in its place, the same
page table will be reused for PTEs of this new unrelated VMA.
That makes sense.
...
So the scenario would be:

1. Initially, we have a hugetlb shared page table covering 1G of
address space which maps hugetlb 2M pages, which is used by two
hugetlb VMAs in different processes (processes P1 and P2).
2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and
walks down through the PUD entry that points to the shared page table,
then when it reaches the loop in gup_fast_pmd_range() gets interrupted
for a while by an NMI or preempted by the hypervisor or something.
3. P2 removes its VMA, and the hugetlb shared page table effectively
becomes a normal page table in P1.
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
example an anonymous private VMA.
6. P1 populates VMA3 with page table entries.
7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now
uses the new PMD/PTE entries created for VMA3.
Yeah, sounds possible. And nasty.
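The seven-step scenario above can be replayed as a small deterministic toy model. This is only a sketch: plain Python objects stand in for page tables, every class and function name is hypothetical, and the `wait_for_walkers` flag is a stand-in for the idea of waiting for in-flight lockless walks, not for any kernel API.

```python
# Toy model only: nothing here is kernel code, all names are hypothetical.

class PageTable:
    """A PMD-level table; each slot records what kind of mapping owns it."""
    def __init__(self, nslots):
        self.entries = ["hugetlb"] * nslots
        self.sharers = 2              # step 1: shared by P1 and P2


class Walker:
    """Stands in for a gup_fast() walk that can stall inside its loop."""
    def __init__(self, table):
        self.table = table            # pointer obtained via the PUD entry
        self.pos = 0
        self.seen = []

    def finish(self):
        # Resume the loop and read every remaining entry.
        while self.pos < len(self.table.entries):
            self.seen.append(self.table.entries[self.pos])
            self.pos += 1


def run_scenario(wait_for_walkers):
    table = PageTable(4)              # step 1
    walker = Walker(table)            # step 2: P2 walker stalls in the loop
    table.sharers -= 1                # step 3: P2 drops its VMA (unshare)
    if wait_for_walkers:
        walker.finish()               # unshare waits for in-flight walks
    # steps 4-6: P1 splits the VMA, unmaps VMA1, populates VMA3
    table.entries[0] = "anon"
    table.entries[1] = "anon"
    walker.finish()                   # step 7: the stalled walk resumes
    return walker.seen
```

Without the wait, `run_scenario(False)` returns `["anon", "anon", "hugetlb", "hugetlb"]`: the stalled walk picks up entries that belong to the unrelated VMA3. With the wait, `run_scenario(True)` returns only `"hugetlb"` entries, because the walk has completed before the table is repurposed.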
...
...
How does the fix work when an architecture does not issue IPIs for TLB
shootdown? To handle gup-fast on these architectures, we use RCU.
gup-fast disables interrupts, which synchronizes against both RCU and IPI.
Right, but RCU is only used to prevent walking a page table that has
been freed+reused in the meantime (prevent us from de-referencing
garbage entries).
It does not prevent walking the now-unshared page table that has been
modified by the other process.
Hm, I'm a bit lost... which page table walk implementation are you
worried about that accesses page tables purely with RCU? I believe all
page table walks should be happening either with interrupts off (in
gup_fast()) or under the protection of higher-level locks; in
particular, hugetlb page walks take an extra hugetlb specific lock
(for hugetlb VMAs that are eligible for page table sharing, that is
the rw_sema in hugetlb_vma_lock).
Regarding gup_fast():
In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is defined, the fix
commit 1013af4f585f uses a synchronous IPI with
tlb_remove_table_sync_one() to wait for any concurrent GUP-fast
software page table walks, and some time after the call to
huge_pmd_unshare() we will do a TLB flush that synchronizes against
hardware page table walks.
In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is not defined, I
believe the expectation is that the TLB flush implicitly does an IPI
which synchronizes against both software and hardware page table
walks.
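Why a synchronous broadcast IPI waits for a walker that runs with interrupts off can be illustrated with another toy model. This is a sketch under simplifying assumptions: each CPU is an object with an interrupts-off flag, an IPI that arrives while interrupts are off is only handled once they are re-enabled, and all names are hypothetical rather than kernel API.

```python
# Toy model only: hypothetical names, not kernel API.

class Cpu:
    def __init__(self, name):
        self.name = name
        self.irqs_off = False
        self.pending = False          # IPI arrived but not yet handled
        self.acked = False            # IPI handled

    def irq_disable(self):
        self.irqs_off = True

    def irq_enable(self):
        self.irqs_off = False
        if self.pending:              # deferred IPI is handled now
            self.pending = False
            self.acked = True

    def receive_ipi(self):
        if self.irqs_off:
            self.pending = True       # cannot be handled yet
        else:
            self.acked = True


def send_broadcast_ipi(cpus):
    for cpu in cpus:
        cpu.acked = False
        cpu.receive_ipi()


def ipi_complete(cpus):
    """A synchronous sender may continue only once every CPU has acked."""
    return all(cpu.acked for cpu in cpus)


def demo():
    walker_cpu, other_cpu = Cpu("walker"), Cpu("other")
    cpus = [walker_cpu, other_cpu]
    walker_cpu.irq_disable()              # lockless walk begins
    send_broadcast_ipi(cpus)              # unsharing side asks all CPUs to ack
    blocked = not ipi_complete(cpus)      # sender still has to wait
    walker_cpu.irq_enable()               # walk ends, IPI is handled
    return blocked, ipi_complete(cpus)
```

`demo()` returns `(True, True)`: the sender is blocked while the walker has interrupts disabled, and can proceed once the walk has finished. Hardware page table walks are not modelled here; per the discussion above, those are covered by the later TLB flush.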
