Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache

31 May 2025

On Sat, May 31, 2025 at 7:42 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
...
On Fri, May 30, 2025 at 1:17 PM Kairui Song <ryncsn@gmail.com> wrote:
...
From: Kairui Song <kasong@tencent.com>
On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
lookup and tries to move the found folio to the faulting vma.
Currently, it relies on the PTE value check to ensure the moved folio
still belongs to the src swap entry, which turns out to be unreliable.
While working on and reviewing the swap table series with Barry, the
following existing race was observed and reproduced [1]:
( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
 swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
CPU1                               CPU2
userfaultfd_move
  move_pages_pte()
    entry = pte_to_swp_entry(orig_src_pte);
    // Here it got entry = S1
    ... < Somehow interrupted> ...
                                   <swapin src_pte, alloc and use folio A>
                                   // folio A is just a newly allocated folio
                                   // and get installed into src_pte
                                   <frees swap entry S1>
                                   // src_pte now points to folio A, S1
                                   // has swap count == 0, it can be freed
                                   // by folio_free_swap or swap
                                   // allocator's reclaim.
                                   <try to swap out another folio B>
                                   // folio B is a folio in another VMA.
                                   <put folio B to swap cache using S1 >
                                   // S1 is freed, folio B could use it
                                   // for swap out with no problem.
                                   ...
    folio = filemap_get_folio(S1)
    // Got folio B here !!!
    ... < Somehow interrupted again> ...
                                   <swapin folio B and free S1>
                                   // Now S1 is free to be used again.
                                   <swapout src_pte & folio A using S1>
                                   // Now src_pte is a swap entry pte
                                   // holding S1 again.
    folio_trylock(folio)
    move_swap_pte
      double_pt_lock
      is_pte_pages_stable
      // Check passed because src_pte == S1
      folio_move_anon_rmap(...)
      // Moved invalid folio B here !!!
The race window is very short and requires multiple collisions of
multiple rare events, so it's very unlikely to happen, but with a
deliberately constructed reproducer and increased time window, it can be
reproduced [1].
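
The interleaving above is an ABA problem on the swap entry value. A
minimal userspace sketch, assuming a toy one-slot "swap cache" and a
single source PTE, replays the same steps sequentially. Every name in it
(toy_folio, toy_state, replay_race, ...) is hypothetical and only
illustrates why comparing the PTE value alone accepts folio B, while
comparing the folio's own swap entry does not:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, not kernel code. All names are hypothetical. */
struct toy_folio {
	unsigned long swap_val;	/* swap entry this folio backs, 0 if none */
	bool in_swapcache;
};

struct toy_state {
	struct toy_folio *cache_s1;	/* what a lookup of S1 returns */
	unsigned long src_pte;		/* swap entry in src_pte, 0 = page present */
};

#define S1 0x1234UL

/* Value-only check, similar in spirit to comparing src_pte to orig_src_pte. */
static bool pte_value_unchanged(const struct toy_state *st, unsigned long orig)
{
	return st->src_pte == orig;
}

/* Identity check: does the looked-up folio still back this swap entry? */
static bool folio_matches_entry(const struct toy_folio *f, unsigned long entry)
{
	return f && f->in_swapcache && f->swap_val == entry;
}

/*
 * Replays the CPU1/CPU2 interleaving one step at a time.
 * Returns true if the value-only check would accept the stale folio B;
 * *identity_check_passes reports what the identity check says.
 */
static bool replay_race(bool *identity_check_passes)
{
	struct toy_folio a = { 0, false }, b = { 0, false };
	struct toy_state st = { NULL, S1 };
	unsigned long orig = st.src_pte;	/* CPU1: entry = S1 */
	struct toy_folio *found;

	st.src_pte = 0;				/* CPU2: swapin to folio A, S1 freed */
	b.swap_val = S1;			/* CPU2: folio B swapped out with S1 */
	b.in_swapcache = true;
	st.cache_s1 = &b;

	found = st.cache_s1;			/* CPU1: lockless lookup gets folio B */

	b.swap_val = 0;				/* CPU2: folio B swapped in, S1 freed */
	b.in_swapcache = false;
	a.swap_val = S1;			/* CPU2: folio A swapped out with S1 */
	a.in_swapcache = true;
	st.cache_s1 = &a;
	st.src_pte = S1;			/* src_pte holds S1 again */

	assert(found == &b);
	*identity_check_passes = folio_matches_entry(found, orig);
	return pte_value_unchanged(&st, orig);
}
```

In this model replay_race() returns true (the PTE value check passes)
while the identity check on the found folio fails, which is the
mismatch the patch closes.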
Thanks for catching and fixing this. Just to clarify a few things
about your reproducer:

Is it necessary for the 'race' mapping to be MAP_SHARED, or will
MAP_PRIVATE work as well?
Hi, I used MAP_SHARED just to prevent vma merging, so that folio B and
folio A belong to different vmas and folio_move_anon_rmap causes a
real problem. The race itself is not related to the map flag.
...

You mentioned that the 'current dir is on a block device'. Are you
indicating that if we are using zram for swap then it doesn't
reproduce?
ZRAM may also trigger the race, with a slightly different race window,
but the race window is even shorter so I can't reproduce with that.
...
...
It's also possible that folio (A) is swapped in, and swapped out again
after the filemap_get_folio lookup, in which case folio (A) may stay in
the swap cache, so it needs to be moved too. In this case we should also
try again so the kernel won't miss a folio move.
Fix this by checking if the folio is the valid swap cache folio after
acquiring the folio lock, and checking the swap cache again after
acquiring the src_pte lock.
The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so far
we don't need to worry about that since folios might only get exposed to
the swap cache in the swap out path, and it's covered in this patch too by
checking the swap cache again after acquiring the src_pte lock.
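
The two checks described above can be modelled in userspace as follows.
This is only a sketch under stated assumptions: toy_folio, toy_cache and
both helpers are hypothetical stand-ins, the "locked" flag replaces the
real folio lock, and the PTE lock is implied rather than taken. It shows
the shape of the logic (revalidate after locking, look again when no
folio was found earlier), not the kernel implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model. All names below are hypothetical. */
struct toy_folio {
	unsigned long swap_val;	/* swap entry the folio currently backs */
	bool in_swapcache;
	bool locked;		/* stands in for the folio lock */
};

/* One-entry stand-in for the swap cache. */
struct toy_cache {
	unsigned long entry;
	struct toy_folio *folio;
};

static struct toy_folio *toy_lookup(const struct toy_cache *c,
				    unsigned long entry)
{
	return (c->folio && c->entry == entry) ? c->folio : NULL;
}

/*
 * Check 1: with the folio locked, confirm it is still the swap cache
 * folio for this entry. A mismatch means the lockless lookup raced.
 */
static int toy_check_locked_folio(const struct toy_folio *f,
				  unsigned long entry)
{
	assert(f != NULL);
	if (!f->locked)
		return -EINVAL;
	if (!f->in_swapcache || f->swap_val != entry)
		return -EAGAIN;
	return 0;
}

/*
 * Check 2: conceptually under the PTE lock. If the earlier lookup found
 * nothing, look once more; a folio that appeared meanwhile means the
 * caller has to start over so that folio gets moved as well.
 */
static int toy_recheck_under_pte_lock(const struct toy_cache *c,
				      unsigned long entry,
				      const struct toy_folio *src_folio)
{
	if (!src_folio && toy_lookup(c, entry))
		return -EAGAIN;
	return 0;
}
```

Both helpers return -EAGAIN rather than trying to repair the state in
place, which mirrors the "try again" wording of the commit message.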
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt... [1]
Signed-off-by: Kairui Song <kasong@tencent.com>

mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..a1564d205dfb 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -15,6 +15,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/hugetlb.h>
 #include <linux/shmem_fs.h>
+#include <linux/delay.h>
I guess you mistakenly left it from your reproducer code :)
Ah, yes, I'll drop this :)
...
...
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include "internal.h"
@@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
                         spinlock_t *dst_ptl, spinlock_t *src_ptl,
                         struct folio *src_folio)
 {
+       swp_entry_t entry;
+
        double_pt_lock(dst_ptl, src_ptl);

        if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
        if (src_folio) {
                folio_move_anon_rmap(src_folio, dst_vma);
                src_folio->index = linear_page_index(dst_vma, dst_addr);
+       } else {
+               /*
+                * Check again after acquiring the src_pte lock. Or we might
+                * miss a new loaded swap cache folio.
+                */
+               entry = pte_to_swp_entry(orig_src_pte);
+               src_folio = filemap_get_folio(swap_address_space(entry),
+                                             swap_cache_index(entry));
Given the non-trivial overhead of filemap_get_folio(), do you think it
will work if filemap_get_folio() was called only once, after locking
src_ptl? Please correct me if my assumption about the overhead is wrong.
No, we can't do the filemap_get_folio after locking src_ptl. Moving the
folio requires locking it, and locking a folio inside a PTE lock looks
like a bad idea... So I just added a lockless lookup and fall back to
trying again if one is found.
The overhead should be low I think. It's a lockless xarray lookup, and
in the profiling I did for many SWAP performance optimizations, the
xarray lookup was not a heavy burden.
...
...

+               if (!IS_ERR_OR_NULL(src_folio)) {
+                       double_pt_unlock(dst_ptl, src_ptl);
+                       folio_put(src_folio);
+                       return -EAGAIN;
+               }
        }

        orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
@@ -1409,6 +1425,16 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
                                folio_lock(src_folio);
                                goto retry;
                        }
+                       /*
+                        * Check if the folio still belongs to the target swap entry after
+                        * acquiring the lock. Folio can be freed in the swap cache while
+                        * not locked.
+                        */
+                       if (unlikely(!folio_test_swapcache(folio) ||
+                                    entry.val != folio->swap.val)) {
+                               err = -EAGAIN;
+                               goto out;
+                       }
To avoid further increasing move_pages_pte() size, I recommend moving
the entire 'pte not present' case into move_swap_pte(), and maybe
returning some positive integer (or something more appropriate) to
handle the retry case. And then in move_swap_pte(), as suggested
above, you can do filemap_get_folio only once after locking ptl.
I think this will fix the bug as well as improve the code's organization.
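
To make the suggested return convention concrete, here is a rough
userspace sketch, assuming a positive value means "retry" and zero or a
negative errno is final. Nothing here exists in the kernel: the enum,
toy_move_with_retry() and the toy_flaky_move() callback are hypothetical
names used only to illustrate how a caller could tell the retry case
apart from real errors:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical return convention; these names are illustrative only. */
enum toy_move_result {
	TOY_MOVE_DONE = 0,	/* move completed */
	TOY_MOVE_RETRY = 1,	/* positive: state changed, caller retries */
};

typedef int (*toy_move_fn)(void *ctx);

/*
 * Calls fn until it stops asking for a retry or max_tries is used up.
 * Returns 0 on success, a negative errno from fn on failure, or
 * -EAGAIN if every attempt asked for a retry.
 */
static int toy_move_with_retry(toy_move_fn fn, void *ctx, int max_tries)
{
	assert(fn != NULL);
	for (int i = 0; i < max_tries; i++) {
		int ret = fn(ctx);

		if (ret != TOY_MOVE_RETRY)
			return ret;
	}
	return -EAGAIN;
}

/* Example callback: asks for a retry until *remaining reaches zero. */
static int toy_flaky_move(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return TOY_MOVE_RETRY;
	}
	return TOY_MOVE_DONE;
}
```

Keeping the retry signal positive leaves the negative range for genuine
errors, so the caller needs only one comparison to decide whether to
loop.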
...
            }
            err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
                            orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,

--
2.49.0

    
