I'm currently looking at the fix and what sticks out is "Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal, non-hugetlb". I could only see how it is used for the remaining user for hugetlb stuff, but that's different question)
If I remember correctly: When a hugetlb shared page table drops to refcount 1, it turns into a normal page table. If you then afterwards split the hugetlb VMA, unmap one half of it, and place a new unrelated VMA in its place, the same page table will be reused for PTEs of this new unrelated VMA.
That makes sense.
So the scenario would be:
- Initially, we have a hugetlb shared page table covering 1G of
address space which maps hugetlb 2M pages, which is used by two hugetlb VMAs in different processes (processes P1 and P2). 2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and walks down through the PUD entry that points to the shared page table, then when it reaches the loop in gup_fast_pmd_range() gets interrupted for a while by an NMI or preempted by the hypervisor or something. 3. P2 removes its VMA, and the hugetlb shared page table effectively becomes a normal page table in P1. 4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary), leaving two VMAs VMA1 and VMA2. 5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for example an anonymous private VMA. 6. P1 populates VMA3 with page table entries. 7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now uses the new PMD/PTE entries created for VMA3.
Yeah, sounds possible. And nasty.
How does the fix work when an architecture does not issue IPIs for TLB shootdown? To handle gup-fast on these architectures, we use RCU.
gup-fast disables interrupts, which synchronizes against both RCU and IPI.
Right, but RCU is only used for prevent walking a page table that has been freed+reused in the meantime (prevent us from de-referencing garbage entries).
It does not prevent walking the now-unshared page table that has been modified by the other process.
For that, we need the back-off described below. IIRC we implemented that in the PMD case for khugepaged.
Or is there somewhere a guaranteed RCU sync before the shared page table gets reused?
So I'm wondering whether we use RCU somehow.
But note that in gup_fast_pte_range(), we are validating whether the PMD changed:
if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) { gup_put_folio(folio, 1, flags); goto pte_unmap; }
So in case the page table got reused in the meantime, we should just back off and be fine, right?
The shared page table is mapped with a PUD entry, and we don't check whether the PUD entry changed here.
Yes, see my follow-up mail, that's what we'd have to add.
On an arch without IPI, page tables will be freed with RCU and it just works. We walk the wrong page table, realize that the PUD changed and back off.
On an arch with IPI it's tricky: if we don't issue the IPI you added, we might still back off once we check the PUD entry didn't changee, but I'm afraid nothing would stop us from walking the previous page table that was freed in the meantime, containing garbage.
Easy fix would be never reusing a page table once shared once?