On Fri, Oct 24, 2025 at 2:25 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Mon, Oct 20, 2025 at 05:33:22PM +0200, Jann Horn wrote:
On Mon, Oct 20, 2025 at 5:01 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Thu, Oct 16, 2025 at 08:44:57PM +0200, Jann Horn wrote:
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
   leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
   example an anonymous private VMA.
Hmm, can it though?
P1's mmap write lock will be held, and the VMA lock will be held too for VMA1.
In vms_complete_munmap_vmas(), vms_clear_ptes() will stall on tlb_finish_mmu() on IPI-synced architectures, and in that case the unmap won't finish and the mmap write lock won't be released, so nobody can map a new VMA yet, can they?
Yeah, I think it can't happen on configurations that always use IPI for TLB synchronization. My patch also doesn't change anything on those architectures - tlb_remove_table_sync_one() is a no-op on architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE.
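For reference, this is roughly what that looks like (paraphrasing include/asm-generic/tlb.h and mm/mmu_gather.c, details differ between kernel versions):

#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
/* mm/mmu_gather.c: broadcast an IPI and wait for every CPU to take it */
static void tlb_remove_table_smp_sync(void *arg)
{
	/* Simply deliver the interrupt */
}

void tlb_remove_table_sync_one(void)
{
	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
}
#else
/* include/asm-generic/tlb.h: no-op when table freeing doesn't use RCU */
static inline void tlb_remove_table_sync_one(void) { }
#endif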
Hmm but in that case wouldn't:
tlb_finish_mmu() -> tlb_flush_mmu() -> tlb_flush_mmu_free() -> tlb_table_flush()
And then from there we call tlb_remove_table_free(), which does a call_rcu() to tlb_remove_table_rcu(), which will asynchronously run later and do __tlb_remove_table_free(), which does __tlb_remove_table()?
-> tlb_remove_table()
I don't see any way we end up in tlb_remove_table() from here. tlb_remove_table() is a much higher-level function; we end up there from something like pte_free_tlb(). I think you mixed up tlb_remove_table_free() and tlb_remove_table().
-> __tlb_remove_table_one()
Heh, I think you made the same mistake Linus made years ago when he was looking at tlb_remove_table(). In that function, the call to tlb_remove_table_one() leading to __tlb_remove_table_one() *is a slowpath only taken when memory allocation fails* - it's a fallback from the normal path that queues up batch items in (*batch)->tables[] (and occasionally calls tlb_table_flush() when it runs out of space in there).
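For context, tlb_remove_table() looks roughly like this (paraphrased from mm/mmu_gather.c; exact code varies between kernel versions):

void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	struct mmu_table_batch **batch = &tlb->batch;

	if (*batch == NULL) {
		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
		if (*batch == NULL) {
			/* allocation-failure slowpath: sync and free immediately */
			tlb_table_invalidate(tlb);
			tlb_remove_table_one(table);
			return;
		}
		(*batch)->nr = 0;
	}

	/* normal path: just queue the page table in the batch */
	(*batch)->tables[(*batch)->nr++] = table;
	if ((*batch)->nr == MAX_TABLE_BATCH)
		tlb_table_flush(tlb);
}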
-> tlb_remove_table_sync_one()
prevent the unmapping on non-IPI architectures, thereby mitigating the issue?
Also, doesn't CONFIG_MMU_GATHER_RCU_TABLE_FREE imply that RCU is being used for page table teardown, whose grace period can't elapse until gup_fast() finishes, and wouldn't that also mitigate the issue?
I'm not sure I understand your point. CONFIG_MMU_GATHER_RCU_TABLE_FREE implies that "Semi RCU" is used to protect page table *freeing*, but page table freeing is irrelevant to this bug, and there is no RCU delay involved in dropping a reference on a shared hugetlb page table. "Semi RCU" is not used to protect against page table *reuse* at a different address by THP.

Also, as explained in the big comment block in mm/mmu_gather.c, "Semi RCU" doesn't mean RCU is definitely used - when memory allocations fail, the __tlb_remove_table_one() fallback path, when used on !PT_RECLAIM, falls back to an IPI broadcast followed by directly freeing the page table. RCU is just used as the more polite way to do something equivalent to an IPI broadcast (RCU will wait for other cores to go through regions where they _could_ receive an IPI as part of RCU-sched).
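That fallback is roughly (paraphrasing the !CONFIG_PT_RECLAIM variant in mm/mmu_gather.c):

static void __tlb_remove_table_one(void *table)
{
	/* wait, via IPI broadcast, for IRQ-disabled walkers like gup_fast() */
	tlb_remove_table_sync_one();
	/* then free the page table directly, with no RCU grace period */
	__tlb_remove_table(table);
}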
But also: at which point would you expect any page table to actually be freed, triggering any of this logic? When unmapping VMA1 in step 5, I think there might not be any page tables that exist and are fully covered by VMA1 (or by VMA1 plus its adjacent free space, if there is any), which is what would make them eligible to be freed.
Why is a tlb_remove_table_sync_one() needed in huge_pmd_unshare()?
Because nothing else on that path is guaranteed to send any IPIs before the page table becomes reusable in another process.
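In other words, the idea is roughly this (a sketch of the relevant part of huge_pmd_unshare() in mm/hugetlb.c with the synchronization added, not the exact diff):

	pud_clear(pud);
	/*
	 * Wait (via IPI broadcast) for software page table walkers that rely
	 * on disabled IRQs, like gup_fast(), before the now-unshared PMD
	 * table can be reused at a different address by the other process.
	 */
	tlb_remove_table_sync_one();
	/* ... then drop our reference on the shared PMD table ... */
	mm_dec_nr_pmds(mm);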