On Sat, May 31, 2025 at 6:36 PM Kairui Song <ryncsn@gmail.com> wrote:
On Sat, May 31, 2025 at 2:10 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
On Fri, May 30, 2025 at 9:42 PM Barry Song <21cnbao@gmail.com> wrote:
On Sat, May 31, 2025 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
On Sat, May 31, 2025 at 8:17 AM Kairui Song <ryncsn@gmail.com> wrote:
From: Kairui Song <kasong@tencent.com>
On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache lookup, and tries to move the found folio to the faulting vma. Currently, it relies on the PTE value check to ensure the moved folio still belongs to the src swap entry, which turns out to be unreliable.
While working on and reviewing the swap table series with Barry, the following existing race was observed and reproduced [1]:
(move_pages_pte is moving src_pte to dst_pte, where src_pte is a swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
CPU1                               CPU2
userfaultfd_move
  move_pages_pte()
    entry = pte_to_swp_entry(orig_src_pte);
    // Here it got entry = S1
    ... < Somehow interrupted> ...
                                   <swapin src_pte, alloc and use folio A>
                                   // folio A is a newly allocated folio
                                   // and gets installed into src_pte
                                   <frees swap entry S1>
                                   // src_pte now points to folio A, S1
                                   // has swap count == 0, it can be freed
                                   // by folio_free_swap or the swap
                                   // allocator's reclaim.
                                   <try to swap out another folio B>
                                   // folio B is a folio in another VMA.
                                   <put folio B to swap cache using S1>
                                   // S1 is freed, folio B can use it
                                   // for swap out with no problem.
                                   ...
    folio = filemap_get_folio(S1)
    // Got folio B here !!!
                                   ... < Somehow interrupted again> ...
                                   <swapin folio B and free S1>
                                   // Now S1 is free to be used again.
                                   <swapout src_pte & folio A using S1>
                                   // Now src_pte is a swap entry pte
                                   // holding S1 again.
    folio_trylock(folio)
    move_swap_pte
      double_pt_lock
      is_pte_pages_stable
      // Check passed because src_pte == S1
      folio_move_anon_rmap(...)
      // Moved invalid folio B here !!!
The race window is very short and requires collisions of multiple rare events, so it's very unlikely to happen, but with a deliberately constructed reproducer and an increased time window, it can be reproduced [1].
It's also possible that folio (A) is swapped in, and swapped out again after the filemap_get_folio lookup; in such a case, folio (A) may stay in the swap cache, so it needs to be moved too. Here we should also retry, so the kernel won't miss a folio move.
Fix this by checking whether the folio is still the valid swap cache folio after acquiring the folio lock, and by checking the swap cache again after acquiring the src_pte lock.
The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so far we don't need to worry about it, since folios only get exposed to the swap cache in the swap out path, and that path is covered by this patch too, by checking the swap cache again after acquiring the src_pte lock.
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt... [1]
Signed-off-by: Kairui Song <kasong@tencent.com>

 mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..a1564d205dfb 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -15,6 +15,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/hugetlb.h>
 #include <linux/shmem_fs.h>
+#include <linux/delay.h>
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include "internal.h"
@@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
 			 struct folio *src_folio)
 {
+	swp_entry_t entry;
+
 	double_pt_lock(dst_ptl, src_ptl);
 	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 	if (src_folio) {
 		folio_move_anon_rmap(src_folio, dst_vma);
 		src_folio->index = linear_page_index(dst_vma, dst_addr);
+	} else {
+		/*
+		 * Check again after acquiring the src_pte lock. Or we might
+		 * miss a new loaded swap cache folio.
+		 */
+		entry = pte_to_swp_entry(orig_src_pte);
+		src_folio = filemap_get_folio(swap_address_space(entry),
+					      swap_cache_index(entry));
+		if (!IS_ERR_OR_NULL(src_folio)) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			folio_put(src_folio);
+			return -EAGAIN;
+		}
 	}
step 1: src pte points to a swap entry without swapcache
step 2: we call move_swap_pte()
step 3: someone swaps in src_pte via swapin_readahead() and makes src_pte's swap entry have swapcache again - for a non-sync/non-zRAM swap device;
step 4: move_swap_pte() gets the ptl, moves src_pte to dst_pte and *clears* src_pte;
step 5: do_swap_page() for src_pte holds the ptl and finds the pte has been cleared in step 4; pte_same() returns false;
step 6: do_swap_page() won't map src_pte to the new swapcache folio obtained in step 3; if the swapcache folio is dropped, it seems everything is fine.
So the real issue is that do_swap_page() doesn’t drop the new swapcache even when pte_same() returns false? That means the dst_pte swap-in can still hit the swap cache entry brought in by the src_pte's swap-in?
It also seems possible for the sync zRAM device.
step 1: src pte points to a swap entry S without swapcache
step 2: we call move_swap_pte()
step 3: someone swaps in src_pte via the sync path, no swapcache; swap slot S is freed. -- for zRAM;
step 4: someone swaps out src_pte and gets exactly the same swap slot S as in step 1, adding the folio to the swapcache due to the swapout;
step 5: move_swap_pte() gets the ptl and finds the page tables are stable, since the swap-out happens to use the same swap slot as in step 1;
step 6: we clear src_pte and move src_pte to dst_pte, but miss moving the folio.
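To spell out why the check in step 5 passes (my own illustration, not code from the patch; as far as I can tell is_pte_pages_stable() boils down to pte_same() comparisons of the raw PTE values):

	/*
	 * Illustration only: the same swap slot S always encodes to the
	 * same swap PTE bits, so a raw PTE comparison cannot tell
	 * "src_pte still holds the original S" apart from "S was freed
	 * and reused for a different folio in between".
	 */
	swp_entry_t s_step1 = pte_to_swp_entry(orig_src_pte);      /* slot S at step 1 */
	swp_entry_t s_step5 = pte_to_swp_entry(ptep_get(src_pte)); /* slot S again at step 5 */

	/*
	 * Both carry the same type/offset, so the stability check passes
	 * even though the folio owning S (and its swap cache state) changed.
	 */
	VM_WARN_ON_ONCE(s_step1.val != s_step5.val);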
Yep, we really need to re-check the pte for swapcache after holding the PTL.
Any idea what the overhead of filemap_get_folio() is? In particular, when no folio exists for the given entry, how costly is it?
Given how rare this is, unless filemap_get_folio() is cheap in the 'no folio' case, and if there is no way to avoid calling it after holding the PTL, then we should do it only once at that point. If a folio is returned, then, as in the pte_present() case, we attempt folio_trylock() with the PTL held; otherwise we do the retry dance.
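For what it's worth, a rough sketch of that ordering (my own reading of the suggestion, untested, reusing the names from the patch above; the caller is assumed to retry on -EAGAIN):

	/* Sketch only: do the swap cache lookup once, with the PTL already held. */
	entry = pte_to_swp_entry(orig_src_pte);
	src_folio = filemap_get_folio(swap_address_space(entry),
				      swap_cache_index(entry));
	if (!IS_ERR_OR_NULL(src_folio)) {
		/*
		 * Sleeping is not allowed under the PTL, so only trylock here;
		 * on contention, drop the locks and let the caller retry,
		 * taking the folio lock outside the PTL.
		 */
		if (!folio_trylock(src_folio)) {
			double_pt_unlock(dst_ptl, src_ptl);
			folio_put(src_folio);
			return -EAGAIN;
		}
		/* ... proceed to move the locked swap cache folio ... */
	}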
Yeah, I think filemap_get_folio is cheap: each swap cache space covers at most 64M, so it just walks at most three xa_nodes and returns, without involving any synchronization or writes.
The swap cache lookup will become even cheaper in the future, as it will just check one plain array element.
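(For reference, my own back-of-the-envelope on the "at most three xa_nodes" figure, assuming 4K pages and the current 64M-per-space split:

	64M / 4K pages   = 16384 slots per swap cache space
	xa_node fanout   = 64 entries (XA_CHUNK_SIZE)
	tree depth       = ceil(log64(16384)) = ceil(14 / 6) = 3 levels

so even a lookup miss walks at most three nodes.)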
I can try to fix this with the folio_trylock-inside-the-PTL approach; maybe the code will be cleaner that way.
Using trylock in an atomic context and deferring to a full lock in a non-atomic context might be considered a workaround — it feels a bit hackish to me :-)
To follow that, the reader might need to understand the entire workflow from kernel to userspace to grasp what's happening with the folio lock, whereas the current code handles everything within the kernel, making it more natural.
Thanks
Barry