So what I am currently looking into is simply reducing (batching) the number of IPIs.
As in the IPIs we are now generating in tlb_remove_table_sync_one()?
Or something else?
Yes, for now. I'm essentially reducing the number of tlb_remove_table_sync_one() calls.
As this bug is only an issue when we don't use IPIs for pgtable freeing, right (i.e., CONFIG_MMU_GATHER_RCU_TABLE_FREE is set), as otherwise tlb_remove_table_sync_one() is a no-op?
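For reference, the two variants look roughly like this (paraphrased from mm/mmu_gather.c and include/asm-generic/tlb.h; check your tree for the exact details):

#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
static void tlb_remove_table_smp_sync(void *arg)
{
	/* The IPI itself is the synchronization; nothing to do here. */
}

void tlb_remove_table_sync_one(void)
{
	/*
	 * IPI all other CPUs and wait for them to handle it. GUP-fast
	 * runs with IRQs disabled, so once every CPU has taken the
	 * interrupt, no GUP-fast walk that started earlier can still
	 * be looking at the page table.
	 */
	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
}
#else
/* Page table freeing already synchronizes via TLB-flush IPIs. */
static inline void tlb_remove_table_sync_one(void) { }
#endif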
Right. But it's still confusing: I think for page table unsharing we always need an IPI one way or the other, to make sure any concurrent GUP-fast has finished.
At least for preventing anybody from reusing the page table in the meantime.
That is either:
(a) The TLB shootdown implied an IPI
(b) We manually send one
But that's where it gets confusing: nowadays x86 also selects MMU_GATHER_RCU_TABLE_FREE, meaning we would get a double IPI?
This is so complicated, so I might be missing something.
But it's the same behavior we have in collapse_huge_page(), where we first flush the TLB (which may imply an IPI) and then call tlb_remove_table_sync_one() to send another one.
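Simplified, the pattern there is (paraphrased from mm/khugepaged.c; locking and mmu notifier details omitted):

	/*
	 * Clear the PMD and flush the TLB; on x86 the flush itself
	 * IPIs all other CPUs that might have the mapping cached.
	 */
	_pmd = pmdp_collapse_flush(vma, address, pmd);
	...
	/*
	 * With CONFIG_MMU_GATHER_RCU_TABLE_FREE the flush above is not
	 * guaranteed to imply an IPI, so explicitly send one to kick
	 * concurrent GUP-fast -- on x86 that can mean a second IPI.
	 */
	tlb_remove_table_sync_one();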
In essence, we only have to send a single IPI when unsharing multiple page tables, and we only have to send it when we are the last one sharing the page table (before it can get reused).
Right, hopefully that significantly cuts down on the number of IPIs generated.
I'd assume the problem with the current approach is that when we fork a child and it quits, we call __unmap_hugepage_range(). If the range is large enough to cover many PMD tables (multiple gigabytes?), we essentially send one IPI per PMD table we unshare, when we really only have to send one.
That's the theory ...
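To make that concrete, a hypothetical sketch of the batching (not actual kernel code; the helper name and the simplified walk are made up, locking is omitted, and huge_pmd_unshare()'s exact signature varies between kernel versions):

static void unshare_hugepage_range(struct mm_struct *mm,
				   struct vm_area_struct *vma,
				   unsigned long start, unsigned long end)
{
	unsigned long addr;
	bool unshared = false;

	/* Shared hugetlb PMD tables each cover a PUD_SIZE-aligned range. */
	for (addr = start; addr < end; addr += PUD_SIZE) {
		pte_t *ptep = huge_pte_offset(mm, addr, PUD_SIZE);

		/* Defer the IPI instead of sending one per unshared table. */
		if (ptep && huge_pmd_unshare(mm, vma, addr, ptep))
			unshared = true;
	}

	/* One IPI for the whole range, and only if we unshared anything. */
	if (unshared)
		tlb_remove_table_sync_one();
}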