Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race

16 Oct 2025


      ...
...
I'm currently looking at the fix and what sticks out is "Fix it with an
explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal,
non-hugetlb". I could only see how it is used for the remaining user for
hugetlb stuff, but that's different question)
If I remember correctly:
When a hugetlb shared page table drops to refcount 1, it turns into a
normal page table. If you then afterwards split the hugetlb VMA, unmap
one half of it, and place a new unrelated VMA in its place, the same
page table will be reused for PTEs of this new unrelated VMA.
That makes sense.
...
So the scenario would be:

Initially, we have a hugetlb shared page table covering 1G of

address space which maps hugetlb 2M pages, which is used by two
hugetlb VMAs in different processes (processes P1 and P2).
2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and
walks down through the PUD entry that points to the shared page table,
then when it reaches the loop in gup_fast_pmd_range() gets interrupted
for a while by an NMI or preempted by the hypervisor or something.
3. P2 removes its VMA, and the hugetlb shared page table effectively
becomes a normal page table in P1.
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
example an anonymous private VMA.
6. P1 populates VMA3 with page table entries.
7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now
uses the new PMD/PTE entries created for VMA3.
Yeah, sounds possible. And nasty.
...
...
How does the fix work when an architecture does not issue IPIs for TLB
shootdown? To handle gup-fast on these architectures, we use RCU.
gup-fast disables interrupts, which synchronizes against both RCU and IPI.
Right, but RCU is only used for prevent walking a page table that has 
been freed+reused in the meantime (prevent us from de-referencing 
garbage entries).
It does not prevent walking the now-unshared page table that has been 
modified by the other process.
For that, we need the back-off described below. IIRC we implemented that 
in the PMD case for khugepaged.
Or is there somewhere a guaranteed RCU sync before the shared page table 
gets reused?
...
...
So I'm wondering whether we use RCU somehow.
But note that in gup_fast_pte_range(), we are validating whether the PMD
changed:
if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
      unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
         gup_put_folio(folio, 1, flags);
         goto pte_unmap;
}
So in case the page table got reused in the meantime, we should just
back off and be fine, right?
The shared page table is mapped with a PUD entry, and we don't check
whether the PUD entry changed here.
Yes, see my follow-up mail, that's what we'd have to add.
On an arch without IPI, page tables will be freed with RCU and it just 
works. We walk the wrong page table, realize that the PUD changed and 
back off.
On an arch with IPI it's tricky: if we don't issue the IPI you added, we 
might still back off once we check the PUD entry didn't changee, but I'm 
afraid nothing would stop us from walking the previous page table that 
was freed in the meantime, containing garbage.
Easy fix would be never reusing a page table once shared once?
-- 
Cheers

David / dhildenb

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race