On Nov 20, 2025, at 7:47 AM, David Hildenbrand (Red Hat) david@kernel.org wrote:
On 11/19/25 17:31, David Hildenbrand (Red Hat) wrote:
On 19.11.25 17:29, David Hildenbrand (Red Hat) wrote:
So what I am currently looking into is simply reducing (batching) the number of IPIs.
As in the IPIs we are now generating in tlb_remove_table_sync_one()?
Or something else?
Yes, for now. I'm essentially reducing the number of tlb_remove_table_sync_one() calls.
As this bug is only an issue when we don't use IPIs for pgtable freeing, right (i.e., when CONFIG_MMU_GATHER_RCU_TABLE_FREE is set), as otherwise tlb_remove_table_sync_one() is a no-op?
Right. But it's still confusing: I think for page table unsharing we always need an IPI one way or the other to make sure any concurrent GUP-fast has finished.
At least to prevent anybody from reusing the page table in the meantime.
That is either:
(a) The TLB shootdown implied an IPI
(b) We manually send one
But that's where it gets confusing: nowadays x86 also selects MMU_GATHER_RCU_TABLE_FREE, meaning we would get a double IPI?
This is so complicated, so I might be missing something.
But it's the same behavior we have in collapse_huge_page(), where we first flush and then call tlb_remove_table_sync_one().
Okay, I pushed something to
https://github.com/davidhildenbrand/linux.git hugetlb_unshare
For testing, I had to backport the fix to v5.15, using the top 8 commits from the above tree; the v5.15 kernel does not have ptdesc or hugetlb VMA locking.
With that change, our DB team has verified that it fixes the regression.
Will you push this fix to LTS trees after it is reviewed and merged?
Thanks, -Prakash
I did a quick test and my house did not burn down. But I don't have a beefy machine to really stress+benchmark PMD table unsharing.
Could one of the original reporters (Stanislav? Prakash?) try it out to see if that would help fix the regression or if it would be a dead end?
--
Cheers,
David