On Thu, Oct 16, 2025 at 9:10 PM David Hildenbrand <david@redhat.com> wrote:
I'm currently looking at the fix and what sticks out is "Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal, non-hugetlb" mappings. I could only see how it is used by the remaining user for hugetlb stuff, but that's a different question)
If I remember correctly: When a hugetlb shared page table drops to refcount 1, it turns into a normal page table. If you then afterwards split the hugetlb VMA, unmap one half of it, and place a new unrelated VMA in its place, the same page table will be reused for PTEs of this new unrelated VMA.
That makes sense.
So the scenario would be:
1. Initially, we have a hugetlb shared page table covering 1G of address space which maps hugetlb 2M pages, which is used by two hugetlb VMAs in different processes (processes P1 and P2).
2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and walks down through the PUD entry that points to the shared page table, then when it reaches the loop in gup_fast_pmd_range() gets interrupted for a while by an NMI or preempted by the hypervisor or something.
3. P2 removes its VMA, and the hugetlb shared page table effectively becomes a normal page table in P1.
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary), leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for example an anonymous private VMA.
6. P1 populates VMA3 with page table entries.
7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now uses the new PMD/PTE entries created for VMA3.
Yeah, sounds possible. And nasty.
How does the fix work when an architecture does not issue IPIs for TLB shootdown? To handle gup-fast on these architectures, we use RCU.
gup-fast disables interrupts, which synchronizes against both RCU and IPI.
Right, but RCU is only used to prevent walking a page table that has been freed+reused in the meantime (to prevent us from dereferencing garbage entries).
It does not prevent walking the now-unshared page table that has been modified by the other process.
Hm, I'm a bit lost... which page table walk implementation are you worried about that accesses page tables purely with RCU? I believe all page table walks should be happening either with interrupts off (in gup_fast()) or under the protection of higher-level locks; in particular, hugetlb page walks take an extra hugetlb specific lock (for hugetlb VMAs that are eligible for page table sharing, that is the rw_sema in hugetlb_vma_lock).
Regarding gup_fast():
In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is defined, the fix commit 1013af4f585f uses a synchronous IPI with tlb_remove_table_sync_one() to wait for any concurrent GUP-fast software page table walks, and some time after the call to huge_pmd_unshare() we will do a TLB flush that synchronizes against hardware page table walks.
In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is not defined, I believe the expectation is that the TLB flush implicitly does an IPI which synchronizes against both software and hardware page table walks.
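To make the ordering argument concrete, here is a deliberately tiny userspace model of the seven-step scenario above. It is only a sketch: every type and function in it (toy_state, toy_unshare(), toy_race() and so on) is made up for illustration and does not exist in the kernel, and the "wait" is modelled as a simple refusal to proceed rather than a real IPI. The sync_walkers flag stands in for the effect of tlb_remove_table_sync_one(): the unshare cannot complete while a lockless walker still has interrupts disabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel. */
enum toy_owner { TOY_HUGETLB, TOY_ANON };

struct toy_state {
	int sharers;            /* number of mms referencing the PMD table */
	enum toy_owner owner;   /* kind of VMA whose entries the table holds */
	bool walker_active;     /* a lockless walker is inside the table */
};

/* Step 2: the walker has followed the PUD entry and then stalls. */
void toy_walk_begin(struct toy_state *s)
{
	s->walker_active = true;
}

/*
 * Step 3: drop one sharer. With sync_walkers set, this models the
 * broadcast IPI: it cannot complete while a walker is still active,
 * so it reports failure and the caller has to retry later.
 */
bool toy_unshare(struct toy_state *s, bool sync_walkers)
{
	if (sync_walkers && s->walker_active)
		return false;
	assert(s->sharers > 1);
	s->sharers--;
	return true;
}

/* Steps 4-6: split, unmap, and repopulate the table for a new VMA. */
void toy_reuse_for_anon(struct toy_state *s)
{
	assert(s->sharers == 1);
	s->owner = TOY_ANON;
}

/* Step 7: the walker resumes and reports whose entries it sees. */
enum toy_owner toy_walk_finish(struct toy_state *s)
{
	s->walker_active = false;
	return s->owner;
}

/* Returns true if the walker ended up reading the unrelated VMA's entries. */
bool toy_race(bool sync_walkers)
{
	struct toy_state s = { .sharers = 2, .owner = TOY_HUGETLB,
			       .walker_active = false };
	enum toy_owner seen;

	toy_walk_begin(&s);
	if (!toy_unshare(&s, sync_walkers)) {
		/* The unshare had to wait, so the walker finishes first. */
		seen = toy_walk_finish(&s);
		(void)toy_unshare(&s, sync_walkers); /* succeeds: no walker left */
		toy_reuse_for_anon(&s);
		return seen != TOY_HUGETLB;
	}
	toy_reuse_for_anon(&s);
	seen = toy_walk_finish(&s);
	return seen != TOY_HUGETLB;
}
```

In this model, toy_race(false) reports that the walker saw the anonymous VMA's entries, while toy_race(true) does not, because the reuse is ordered after the walker has left the table. It says nothing about hardware walks or the later TLB flush; it only illustrates why waiting for in-flight software walkers closes this particular window.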