So what I am currently looking into is simply reducing (batching) the number of IPIs.
As in the IPIs we are now generating in tlb_remove_table_sync_one()?
Or something else?
Yes, for now. I'm essentially reducing the number of tlb_remove_table_sync_one() calls.
As this bug is only an issue when we don't use IPIs for pgtable freeing, right (i.e., CONFIG_MMU_GATHER_RCU_TABLE_FREE is set), as otherwise tlb_remove_table_sync_one() is a no-op?
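For reference, the two variants look roughly like this (paraphrased from mm/mmu_gather.c and include/asm-generic/tlb.h; check your tree for the exact details):

#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
static void tlb_remove_table_smp_sync(void *arg)
{
	/* The IPI itself is the synchronization; nothing to do here. */
}

void tlb_remove_table_sync_one(void)
{
	/*
	 * IPI all other CPUs and wait for them to handle it. GUP-fast
	 * runs with IRQs disabled, so once every CPU has taken the
	 * interrupt, no GUP-fast walk that started earlier can still
	 * be looking at the page table.
	 */
	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
}
#else
/* Page table freeing already synchronizes via TLB-flush IPIs. */
static inline void tlb_remove_table_sync_one(void) { }
#endif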
Right. But it's still confusing: I think for page table unsharing we always need an IPI one way or the other, to make sure any concurrent GUP-fast has finished.
At least for preventing anybody from reusing the page table in the meantime.
That is either:
(a) The TLB shootdown implied an IPI
(b) We manually send one
But that's where it gets confusing: nowadays x86 also selects MMU_GATHER_RCU_TABLE_FREE, meaning we would get a double IPI?
This is so complicated, so I might be missing something.
But it's the same behavior we have in collapse_huge_page(), where we first flush the TLB (which may imply an IPI) and then call tlb_remove_table_sync_one() to send another one.
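Simplified, the pattern there is (paraphrased from mm/khugepaged.c; locking and mmu notifier details omitted):

	/*
	 * Clear the PMD and flush the TLB; on x86 the flush itself
	 * IPIs all other CPUs that might have the mapping cached.
	 */
	_pmd = pmdp_collapse_flush(vma, address, pmd);
	...
	/*
	 * With CONFIG_MMU_GATHER_RCU_TABLE_FREE the flush above is not
	 * guaranteed to imply an IPI, so explicitly send one to kick
	 * concurrent GUP-fast -- on x86 that can mean a second IPI.
	 */
	tlb_remove_table_sync_one();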
In essence, we only have to send a single IPI when unsharing multiple page tables, and we only have to send it when we are the last one sharing the page table (before it can get reused).
Right, hopefully that significantly cuts down on the number of IPIs generated.
I'd assume the problem with the current approach is that when we fork a child and it quits, we call __unmap_hugepage_range(). If the range is large enough to cover many PMD tables (multiple gigabytes?), we essentially send one IPI per PMD table we unshare, when we really only have to send one.
That's the theory ...
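To make that concrete, a hypothetical sketch of the batching (not actual kernel code; the helper name and the simplified walk are made up, locking is omitted, and huge_pmd_unshare()'s exact signature varies between kernel versions):

static void unshare_hugepage_range(struct mm_struct *mm,
				   struct vm_area_struct *vma,
				   unsigned long start, unsigned long end)
{
	unsigned long addr;
	bool unshared = false;

	/* Shared hugetlb PMD tables each cover a PUD_SIZE-aligned range. */
	for (addr = start; addr < end; addr += PUD_SIZE) {
		pte_t *ptep = huge_pte_offset(mm, addr, PUD_SIZE);

		/* Defer the IPI instead of sending one per unshared table. */
		if (ptep && huge_pmd_unshare(mm, vma, addr, ptep))
			unshared = true;
	}

	/* One IPI for the whole range, and only if we unshared anything. */
	if (unshared)
		tlb_remove_table_sync_one();
}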