Jann,
Please bear with my questions below, I want to get a good mental model of this. :)
Thanks!
On Thu, Oct 16, 2025 at 08:44:57PM +0200, Jann Horn wrote:
On Thu, Oct 9, 2025 at 9:40 AM David Hildenbrand <david@redhat.com> wrote:
On 01.09.25 12:58, Jann Horn wrote:
Hi!
On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav <suschako@amazon.de> wrote:
We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix, which leads to commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, mmapping 1.2TB of shared memory and forking dozens or hundreds of times, we see execution times increase by a factor of 4. The reproducer is at the end of the email.
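(The original reproducer is elided here; the following is a minimal sketch of the described pattern - shared hugetlb-backed memory plus repeated fork/touch/exit - with illustrative sizes and counts, not the reported 1.2TB/196-core setup:)

	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	#define MAP_SIZE	(8UL << 30)	/* illustrative; the report used 1.2TB */
	#define NCHILD		64		/* illustrative child count */
	#define NTOUCH		100
	#define HPAGE_SHIFT	21		/* 2M hugetlb pages */

	int main(void)
	{
		/* Shared hugetlb mapping, PUD-aligned and >= 1G, so hugetlb
		 * PMD sharing kicks in across fork(). */
		char *mem = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
				 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (mem == MAP_FAILED)
			return 1;
		memset(mem, 1, MAP_SIZE);	/* fault in all huge pages */

		for (int i = 0; i < NCHILD; i++) {
			if (fork() == 0) {
				/* Each child touches some random pages and
				 * exits; exit teardown then unshares the
				 * shared PMD tables via huge_pmd_unshare(). */
				for (int j = 0; j < NTOUCH; j++)
					mem[((unsigned long)rand() %
					     (MAP_SIZE >> HPAGE_SHIFT))
					    << HPAGE_SHIFT] = 1;
				_exit(0);
			}
		}
		while (wait(NULL) > 0)
			;
		return 0;
	}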
Yeah, every 1G virtual address range you unshare on unmap will do an extra synchronous IPI broadcast to all CPU cores, so it's not very surprising that doing this would be a bit slow on a machine with 196 cores.
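(Back of the envelope: a 1.2TB shared mapping spans ~1200 1G PUD ranges, so a single teardown does on the order of a thousand broadcast IPIs, each waiting for all 196 cores to respond; across dozens or hundreds of exiting children that multiplies into hundreds of thousands of synchronous broadcasts.)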
My observation/assumption is:
- each child touches 100 random pages and despawns
- on each despawn, `huge_pmd_unshare()` is called
- each call to `huge_pmd_unshare()` synchronizes all threads using
  `tlb_remove_table_sync_one()`, leading to the regression
Yeah, makes sense that that'd be slow.
There are probably several ways this could be optimized - like maybe changing tlb_remove_table_sync_one() to rely on the MM's cpumask (though that would require thinking about whether this interacts with remote MM access somehow), or batching the refcount drops for hugetlb shared page tables through something like struct mmu_gather, or doing something special for the unmap path, or changing the semantics of hugetlb page tables such that they can never turn into normal page tables again. However, I'm not planning to work on optimizing this.
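(For reference, the primitive in question is an unconditional broadcast that waits for every other CPU to take the interrupt; roughly, from mm/mmu_gather.c:)

	static void tlb_remove_table_smp_sync(void *arg)
	{
		/* Simply deliver the interrupt */
	}

	void tlb_remove_table_sync_one(void)
	{
		/*
		 * This isn't an RCU grace period and hence the page-tables
		 * cannot be assumed to be actually RCU-freed.
		 *
		 * It is however sufficient for software page-table walkers
		 * that rely on IRQ disabling.
		 */
		smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
	}

(which is why restricting the broadcast to mm_cpumask(mm), as suggested above, could help on large machines.)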
I'm currently looking at the fix and what sticks out is "Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal, non-hugetlb". I could only see how it is used for the remaining hugetlb user, but that's a different question)
If I remember correctly: When a hugetlb shared page table drops to refcount 1, it turns into a normal page table. If you then afterwards split the hugetlb VMA, unmap one half of it, and place a new unrelated VMA in its place, the same page table will be reused for PTEs of this new unrelated VMA.
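(Roughly, abridged from mm/hugetlb.c after the fix - the page table page's refcount counts the MMs sharing it, and unsharing clears this MM's PUD entry and drops one reference:)

	int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
			     unsigned long addr, pte_t *ptep)
	{
		pgd_t *pgd = pgd_offset(mm, addr);
		p4d_t *p4d = p4d_offset(pgd, addr);
		pud_t *pud = pud_offset(p4d, addr);

		if (page_count(virt_to_page(ptep)) == 1)
			return 0;	/* refcount 1: just a normal page table now */

		pud_clear(pud);
		/* The fix: wait for pending gup_fast() walkers before this
		 * page table can be reused as a normal, non-hugetlb one. */
		tlb_remove_table_sync_one();
		put_page(virt_to_page(ptep));
		mm_dec_nr_pmds(mm);
		return 1;
	}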
So the scenario would be:
1. Initially, we have a hugetlb shared page table covering 1G of
   address space which maps hugetlb 2M pages, which is used by two
   hugetlb VMAs in different processes (processes P1 and P2).
2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and
   walks down through the PUD entry that points to the shared page
   table, then when it reaches the loop in gup_fast_pmd_range() gets
   interrupted for a while by an NMI or preempted by the hypervisor or
   something.
3. P2 removes its VMA, and the hugetlb shared page table effectively
   becomes a normal page table in P1.
This is a bit confusing: are we talking about 2 threads in P2 on different CPUs? That is, P2/T1 on CPU A is doing the gup_fast() walk while P2/T2 on CPU B is simultaneously 'removing' this VMA?
Because surely interrupts being disabled on CPU A means that ordinary preemption won't happen, right?
By 'remove', what do you mean? Unmap? But won't this result in an IPI-synced TLB flush that is stalled by P2's CPU having interrupts disabled?
Or is it removed in the sense of hugetlb? As in something that invokes huge_pmd_unshare()?
But I guess this doesn't matter as the page table teardown will succeed, just the final tlb_finish_mmu() will stall.
And I guess GUP fast is trying to protect against the clear down by checking pmd != *pmdp.
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
   leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
   example an anonymous private VMA.
Hmm, can it though?
P1's mmap write lock will be held, and the VMA lock will be held too for VMA1.
In vms_complete_munmap_vmas(), vms_clear_ptes() will stall on tlb_finish_mmu() for IPI-synced architectures, and in that case the unmap won't finish and the mmap write lock won't be released, so nobody can map a new VMA yet, can they?
6. P1 populates VMA3 with page table entries.
ofc this requires the mmap/vma write lock above to be released first.
7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now
   uses the new PMD/PTE entries created for VMA3.
How does the fix work when an architecture does not issue IPIs for TLB shootdown? To handle gup-fast on these architectures, we use RCU.
gup-fast disables interrupts, which synchronizes against both RCU and IPI.
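(The interrupt-disabled window in question is, roughly, from mm/gup.c:)

	/*
	 * Disable interrupts. The nested form is used, in order to allow
	 * full, general purpose use of this routine.
	 *
	 * With interrupts disabled, we block page table pages from being
	 * freed from under us. See struct mmu_table_batch comments in
	 * include/asm-generic/tlb.h for more details.
	 *
	 * We do not adopt an rcu_read_lock() here as we also want to block
	 * IPIs that come from THPs splitting.
	 */
	local_irq_save(flags);
	gup_fast_pgd_range(start, end, gup_flags, pages, &nr_pinned);
	local_irq_restore(flags);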
So I'm wondering whether we use RCU somehow.
But note that in gup_fast_pte_range(), we are validating whether the PMD changed:
	if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
	    unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}
So in case the page table got reused in the meantime, we should just back off and be fine, right?
The shared page table is mapped with a PUD entry, and we don't check whether the PUD entry changed here.
Could we sensibly just put a PUD check in there?
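Something along these lines (a hypothetical, untested sketch - gup_fast_pte_range() would need pudp/pud passed down from gup_fast_pud_range()):

	/* Hypothetical: also bail out if the PUD entry we walked through
	 * has changed under us, e.g. because a shared hugetlb page table
	 * was unshared and the PUD entry cleared. */
	if (unlikely(pud_val(pud) != pud_val(READ_ONCE(*pudp))) ||
	    unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
	    unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}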
Cheers,
Lorenzo