Hi David,
On Thu, Feb 15, 2024 at 09:03:28PM +0100, David Hildenbrand wrote:
< snip >
>>> We would detect later that the PTE changed, but we would temporarily mess with that swap slot that we might no longer "own".
>>> I was thinking about alternatives; it's tricky because of the concurrent MADV_DONTNEED possibility. Something with another fake-swap entry type (similar to migration entries) might work, but would require more changes.
>> Yeah, in the long term I also think more work is needed for the swap subsystem.
>> In my opinion, for this particular issue, or more generally for cache-bypassed swapin, a new swap map value similar to SWAP_MAP_BAD/SWAP_MAP_SHMEM might be needed. That might even help to simplify the swap count release routine for cache-bypassed swapin and improve performance.
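(Side note, mostly for other readers: the special swap_map values referred to here live in include/linux/swap.h and, from memory of roughly a v6.8 tree, look like the below; the idea would be one more such value marking "swapin in progress, swapcache bypassed". SWAP_MAP_SYNC_IO is only a made-up name, and a real patch would have to pick a free encoding rather than the arbitrary value used here.)

#define SWAP_HAS_CACHE	0x40	/* folio is (also) in the swapcache */
#define COUNT_CONTINUED	0x80	/* swap_map continuation for full count */

/* Special values in the first swap_map */
#define SWAP_MAP_MAX	0x3e	/* max count */
#define SWAP_MAP_BAD	0x3f	/* slot is bad */
#define SWAP_MAP_SHMEM	0xbf	/* owned by shmem/tmpfs */

/* Hypothetical addition, name and encoding made up for illustration only: */
#define SWAP_MAP_SYNC_IO	0xbe	/* slot is being read by a swapcache-bypassing swapin */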
> The question is whether we really want to track that in the swapcache rather than in the page table.
> Imagine the following:
> (1) allocate the folio and lock it (we do that already)
> (2) take the page table lock. If the PTE is still the same, insert a new "swapin_in_process" fake swp entry that references the locked folio.
> (3) read the folio from swap. This will unlock the folio IIUC. (we do that already)
> (4) relock the folio. (we do that already; here we might not want it to fail)
> (5) take the PTE lock. If the PTE did not change, turn it into a present PTE entry. Otherwise, clean up.
> Any concurrent swap-in users would spot the new "swapin_in_process" fake swp entry and wait for the page lock (just like we do with migration entries).
> Zap code would mostly only clear the "swapin_in_process" fake swp entry and leave the cleanup to (5) above. Fortunately, concurrent fork() is impossible as that cannot race with page faults.
> There might be one minor thing to optimize with the folio lock above. But in essence, it would work just like migration entries, just that they are installed only while we actually read the content from disk etc.
That's a great idea. I was thinking of having the synchronization in the page table, but I couldn't come up with the non_swap_entry idea.
My only concern with the approach is that it would be harder to get the fix into the stable tree. If there isn't a strong objection, I'd prefer to take Kairui's original solution first (with some scheduler tweaking, if necessary) and then pursue your idea on the latest tree.
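Just to double-check that I read the flow correctly, below is a rough, untested sketch of the fault side. It is written against roughly a v6.8 mm tree from memory, so the helper signatures may be off; make_swapin_in_process_entry() is a made-up name for the new fake entry, and rmap/memcg/LRU/swap-count handling and most error paths are omitted:

#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/swapops.h>

/* Untested sketch of the swapcache-bypassing swapin flow described above. */
static vm_fault_t swapin_bypass_sketch(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct folio *folio;
	spinlock_t *ptl;
	pte_t *pte;
	pte_t fake_pte;

	/* (1) allocate the folio and lock it */
	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
	if (!folio)
		return VM_FAULT_OOM;
	__folio_set_locked(folio);

	/* (2) with the PTE lock held, install the fake "swapin in process" entry */
	pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &ptl);
	if (!pte || !pte_same(ptep_get(pte), vmf->orig_pte)) {
		if (pte)
			pte_unmap_unlock(pte, ptl);
		folio_unlock(folio);
		folio_put(folio);
		return 0;	/* PTE already changed, let the fault be retried */
	}
	fake_pte = swp_entry_to_pte(make_swapin_in_process_entry(folio));	/* made up */
	set_pte_at(vma->vm_mm, vmf->address, pte, fake_pte);
	pte_unmap_unlock(pte, ptl);

	/*
	 * (3) read the folio contents from swap; as today, the read path
	 *     unlocks the folio on completion.
	 * (4) relock the folio.
	 */
	folio_lock(folio);

	/* (5) if our fake entry is still in place, turn it into a present PTE */
	pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &ptl);
	if (pte && pte_same(ptep_get(pte), fake_pte)) {
		set_pte_at(vma->vm_mm, vmf->address, pte,
			   mk_pte(&folio->page, vma->vm_page_prot));
		pte_unmap_unlock(pte, ptl);
		folio_unlock(folio);
	} else {
		/* zap (e.g. MADV_DONTNEED) already cleared the fake entry */
		if (pte)
			pte_unmap_unlock(pte, ptl);
		folio_unlock(folio);
		folio_put(folio);
	}
	return 0;
}

If that matches your intent, the nice property is that zap only has to clear the fake entry, and step (5) then notices the PTE no longer matches and drops the folio.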