Hi David,
On Thu, Feb 15, 2024 at 09:03:28PM +0100, David Hildenbrand wrote:
< snip >
>>> We would detect later that the PTE changed, but we would temporarily mess with that swap slot that we might no longer "own".
>>> I was thinking about alternatives; it's tricky because of the concurrent MADV_DONTNEED possibility. Something with another fake-swap entry type (similar to migration entries) might work, but would require more changes.
>> Yeah, in the long term I also think more work is needed for the swap subsystem.
>> In my opinion, for this particular issue, or more generally for cache-bypassed swapin, a new swap map value similar to SWAP_MAP_BAD/SWAP_MAP_SHMEM might be needed. That might even help to simplify the swap count release routine for cache-bypassed swapin and improve performance.
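(Side note, mostly for other readers: the special swap_map values referred to here live in include/linux/swap.h and, from memory of roughly a v6.8 tree, look like the below; the idea would be one more such value marking "swapin in progress, swapcache bypassed". SWAP_MAP_SYNC_IO is only a made-up name, and a real patch would have to pick a free encoding rather than the arbitrary value used here.)

#define SWAP_HAS_CACHE	0x40	/* folio is (also) in the swapcache */
#define COUNT_CONTINUED	0x80	/* swap_map continuation for full count */

/* Special values in the first swap_map */
#define SWAP_MAP_MAX	0x3e	/* max count */
#define SWAP_MAP_BAD	0x3f	/* slot is bad */
#define SWAP_MAP_SHMEM	0xbf	/* owned by shmem/tmpfs */

/* Hypothetical addition, name and encoding made up for illustration only: */
#define SWAP_MAP_SYNC_IO	0xbe	/* slot is being read by a swapcache-bypassing swapin */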
> The question is whether we really want to track that in the swapcache rather than in the page table.
> Imagine the following:
> (1) allocate the folio and lock it (we do that already)
> (2) take the page table lock. If the PTE is still the same, insert a new "swapin_in_process" fake swp entry that references the locked folio.
> (3) read the folio from swap. This will unlock the folio IIUC. (we do that already)
> (4) relock the folio. (we do that already; here we might not want it to fail)
> (5) take the PTE lock. If the PTE did not change, turn it into a present PTE entry. Otherwise, clean up.
> Any concurrent swap-in users would spot the new "swapin_in_process" fake swp entry and wait for the page lock (just like we do with migration entries).
> Zap code would mostly only clear the "swapin_in_process" fake swp entry and leave the cleanup to (5) above. Fortunately, concurrent fork() is impossible as that cannot race with page faults.
> There might be one minor thing to optimize with the folio lock above. But in essence, it would work just like migration entries, just that they are installed only while we actually read the content from disk etc.
That's a great idea. I was thinking of having the synchronization in the page table, but I couldn't come up with the non_swap_entry idea.
My only concern with the approach is that it would be harder to get the fix into the stable tree. If there isn't a strong objection, I'd prefer to take Kairui's original solution first (with some scheduler tweaking, if necessary) and then pursue your idea on the latest tree.
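Just to double-check that I read the flow correctly, below is a rough, untested sketch of the fault side. It is written against roughly a v6.8 mm tree from memory, so the helper signatures may be off; make_swapin_in_process_entry() is a made-up name for the new fake entry, and rmap/memcg/LRU/swap-count handling and most error paths are omitted:

#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/swapops.h>

/* Untested sketch of the swapcache-bypassing swapin flow described above. */
static vm_fault_t swapin_bypass_sketch(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct folio *folio;
	spinlock_t *ptl;
	pte_t *pte;
	pte_t fake_pte;

	/* (1) allocate the folio and lock it */
	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
	if (!folio)
		return VM_FAULT_OOM;
	__folio_set_locked(folio);

	/* (2) with the PTE lock held, install the fake "swapin in process" entry */
	pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &ptl);
	if (!pte || !pte_same(ptep_get(pte), vmf->orig_pte)) {
		if (pte)
			pte_unmap_unlock(pte, ptl);
		folio_unlock(folio);
		folio_put(folio);
		return 0;	/* PTE already changed, let the fault be retried */
	}
	fake_pte = swp_entry_to_pte(make_swapin_in_process_entry(folio));	/* made up */
	set_pte_at(vma->vm_mm, vmf->address, pte, fake_pte);
	pte_unmap_unlock(pte, ptl);

	/*
	 * (3) read the folio contents from swap; as today, the read path
	 *     unlocks the folio on completion.
	 * (4) relock the folio.
	 */
	folio_lock(folio);

	/* (5) if our fake entry is still in place, turn it into a present PTE */
	pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &ptl);
	if (pte && pte_same(ptep_get(pte), fake_pte)) {
		set_pte_at(vma->vm_mm, vmf->address, pte,
			   mk_pte(&folio->page, vma->vm_page_prot));
		pte_unmap_unlock(pte, ptl);
		folio_unlock(folio);
	} else {
		/* zap (e.g. MADV_DONTNEED) already cleared the fake entry */
		if (pte)
			pte_unmap_unlock(pte, ptl);
		folio_unlock(folio);
		folio_put(folio);
	}
	return 0;
}

If that matches your intent, the nice property is that zap only has to clear the fake entry, and step (5) then notices the PTE no longer matches and drops the folio.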