On 12/3/25 18:22, Prakash Sangappa wrote:
On Nov 20, 2025, at 7:47 AM, David Hildenbrand (Red Hat) <david@kernel.org> wrote:
On 11/19/25 17:31, David Hildenbrand (Red Hat) wrote:
On 19.11.25 17:29, David Hildenbrand (Red Hat) wrote:
So what I am currently looking into is simply reducing (batching) the number of IPIs.
As in the IPIs we are now generating in tlb_remove_table_sync_one()?
Or something else?
Yes, for now. I'm essentially reducing the number of tlb_remove_table_sync_one() calls.
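To illustrate what I mean by batching (a hypothetical sketch only, not the actual patch; hugetlb_unshare_one_pmd() is a made-up helper standing in for the existing per-PMD unshare logic), the idea is one sync IPI per unshared range rather than one per PMD table:

/*
 * Hypothetical sketch of the batching idea, kernel-style pseudocode,
 * not the real patch.
 */
static void hugetlb_unshare_range(struct vm_area_struct *vma,
				  unsigned long start, unsigned long end)
{
	unsigned long addr;
	bool unshared = false;

	for (addr = start; addr < end; addr += PUD_SIZE)
		unshared |= hugetlb_unshare_one_pmd(vma, addr);

	/* One IPI broadcast for the whole batch instead of one per entry. */
	if (unshared)
		tlb_remove_table_sync_one();
}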
This bug is only an issue when we don't use IPIs for pgtable freeing (i.e., when CONFIG_MMU_GATHER_RCU_TABLE_FREE is set), right? Otherwise tlb_remove_table_sync_one() is a no-op?
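For reference, this is roughly how it is wired up (paraphrasing include/asm-generic/tlb.h and mm/mmu_gather.c from memory, not verbatim):

#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
static void tlb_remove_table_smp_sync(void *arg)
{
	/* Simply deliver the interrupt. */
}

void tlb_remove_table_sync_one(void)
{
	/*
	 * Broadcast an IPI and wait for it: GUP-fast runs with IRQs
	 * disabled, so once this returns, any concurrent GUP-fast
	 * walker that could have seen the old page table is done.
	 */
	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
}
#else
/* TLB flushing already uses IPIs here, no extra synchronization needed. */
static inline void tlb_remove_table_sync_one(void) { }
#endif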
Right. But it's still confusing: I think for page table unsharing we always need an IPI one way or the other, to make sure any concurrent GUP-fast walker has finished.
At least to prevent anybody from reusing the page table in the meantime.
That is, either:
(a) the TLB shootdown already implied an IPI, or
(b) we send one manually.
But that's where it gets confusing: nowadays x86 also selects MMU_GATHER_RCU_TABLE_FREE, meaning we would get a double IPI?
This is so complicated that I might be missing something.
But it's the same behavior we have in collapse_huge_page(), where we first flush and then call tlb_remove_table_sync_one().
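The sequence there looks roughly like this (paraphrased from collapse_huge_page() in mm/khugepaged.c, surrounding details trimmed):

	/* First the flush (which may or may not imply an IPI) ... */
	_pmd = pmdp_collapse_flush(vma, address, pmd);
	spin_unlock(pmd_ptl);
	mmu_notifier_invalidate_range_end(&range);
	/* ... then the explicit IPI broadcast to sync against GUP-fast. */
	tlb_remove_table_sync_one();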
Okay, I pushed something to
https://github.com/davidhildenbrand/linux.git hugetlb_unshare
For testing, I had to backport the fix to v5.15, using the top 8 commits from the above tree. The v5.15 kernel does not have ptdesc or hugetlb VMA locking.
With that change, our DB team has verified that it fixes the regression.
Great, thanks for testing!
Will you push this fix to LTS trees after it is reviewed and merged?
I can further clean this up and send it out. There is something about the mmu_gather integration that I don't enjoy, but I haven't found a better solution so far.
I can try backporting it; I would likely have to try to minimize the prereq cleanups. Let me see to what degree this can be done in a sensible way!