(4) relock the folio. (we do that already, might not want to fail)
(5) take the PTE lock. If the PTE did not change, turn it into a present PTE entry. Otherwise, clean up.
Very interesting idea!
I'm just not sure what actual benefit it brings. The only concern about reusing swapcache_prepare so far is repeated page faults that may hurt performance or statistics, and that issue is basically gone after adding a schedule().
I think you know that slapping in random schedule() calls is not a proper way to wait for an event to happen :) It's pretty much unpredictable how long the schedule() will take and if there even is anything to schedule to!
With what I propose, just like with page migration, you really do wait for the event (page read and unlocked, only the PTE has to be updated) to happen before you try anything else.
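As a rough userspace analogy only (pthreads, not kernel code; every name below is hypothetical), waiting for the event itself looks like this: the waiter sleeps until the reader signals completion, rather than yielding the CPU and hoping the work got done in the meantime.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for "page read and unlocked". */
struct toy_event {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

/* Called by whoever finished the read: record it and wake all waiters. */
static void toy_event_complete(struct toy_event *ev)
{
	pthread_mutex_lock(&ev->lock);
	ev->done = true;
	pthread_cond_broadcast(&ev->cond);
	pthread_mutex_unlock(&ev->lock);
}

/* Sleep until the event has really happened; no polling, no blind yield. */
static void toy_event_wait(struct toy_event *ev)
{
	pthread_mutex_lock(&ev->lock);
	while (!ev->done)
		pthread_cond_wait(&ev->cond, &ev->lock);
	pthread_mutex_unlock(&ev->lock);
}

static void *toy_reader(void *arg)
{
	toy_event_complete(arg);
	return NULL;
}

/* Returns true once the waiter has observed the completed event. */
static bool toy_wait_demo(void)
{
	struct toy_event ev;
	pthread_t reader;
	bool seen;

	pthread_mutex_init(&ev.lock, NULL);
	pthread_cond_init(&ev.cond, NULL);
	ev.done = false;

	if (pthread_create(&reader, NULL, toy_reader, &ev) != 0)
		return false;
	toy_event_wait(&ev);
	pthread_join(reader, NULL);
	seen = ev.done;

	pthread_cond_destroy(&ev.cond);
	pthread_mutex_destroy(&ev.lock);
	return seen;
}
```

The waiter returns only after `done` is set, however long the reader takes. A yield loop gives no such guarantee, and may spin if there is nothing else to run.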
Now, the difference is most likely that the case here happens much less frequently than page migration. Still, you could have all threads faulting on the same page and all would do the same dance here.
We can't drop all the operations around swap cache and map anyway. It doesn't know if it should skip the swapcache until the swapcache lookup and swap count lookup are both done. So I think it can be done more naturally here with a special value, making things simpler, more robust, and improving performance a bit more.
The issue will always be that someone can zap the PTE concurrently, which would free up the swap cache. With what I propose, that cannot happen in the sync swapin case we are discussing here.
If someone were to zap the PTE in the meantime, it would only remove the special non-swap entry, indicating to swapin code that concurrent zapping happened. The swapin code would handle the swapcache freeing instead, after finishing reading the content.
So the swapcache would not get freed concurrently anymore if I am not wrong.
At least that's the theory; I haven't prototyped it yet.
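To make that theory concrete, here is a single-threaded toy model in plain C. It is only a sketch: the states, the struct and all the `toy_*` names are made up for illustration and are not kernel definitions. The "PTE lock" is implied by each function running to completion, and the assumption that the swapin path always releases the swap slot itself (the swap-count-of-one sync swapin case) is mine.

```c
#include <stdbool.h>

/* Hypothetical PTE states for this toy model. */
enum toy_pte {
	TOY_PTE_NONE,          /* empty, e.g. after a zap */
	TOY_PTE_SWAP,          /* ordinary swap entry */
	TOY_PTE_SWAPIN_MARKER, /* special non-swap entry: swapin in progress */
	TOY_PTE_PRESENT,       /* page mapped */
};

struct toy_state {
	enum toy_pte pte;
	bool swap_in_use;          /* stands in for the swap slot / swapcache */
	bool swap_freed_by_zap;    /* zap released it behind swapin's back */
	bool swap_freed_by_swapin; /* swapin released it after the read */
};

/* Under the PTE lock: replace the swap entry with the marker. */
static bool toy_swapin_begin(struct toy_state *s)
{
	if (s->pte != TOY_PTE_SWAP)
		return false;
	s->pte = TOY_PTE_SWAPIN_MARKER;
	return true;
}

/* Concurrent zap: frees swap only if it still sees a real swap entry. */
static void toy_zap(struct toy_state *s)
{
	if (s->pte == TOY_PTE_SWAP && s->swap_in_use) {
		s->swap_in_use = false;
		s->swap_freed_by_zap = true;
	}
	/* A marker is simply removed; swapin will notice it is gone. */
	s->pte = TOY_PTE_NONE;
}

/* Retake the PTE lock after the read: map if unchanged, else clean up. */
static bool toy_swapin_finish(struct toy_state *s)
{
	bool unchanged = (s->pte == TOY_PTE_SWAPIN_MARKER);

	if (unchanged)
		s->pte = TOY_PTE_PRESENT;
	/* Assumption: either way, swapin (not zap) releases the swap side. */
	s->swap_in_use = false;
	s->swap_freed_by_swapin = true;
	return unchanged;
}

/* Run one swapin, optionally with a zap racing against the read. */
static struct toy_state toy_run(bool zap_during_read)
{
	struct toy_state s = { .pte = TOY_PTE_SWAP, .swap_in_use = true };

	if (!toy_swapin_begin(&s))
		return s;
	if (zap_during_read)
		toy_zap(&s);
	toy_swapin_finish(&s);
	return s;
}
```

In both runs the swap side is released by the swapin path only. With a racing zap the PTE ends up empty instead of present, but nothing was freed while the read was still in flight.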
And in another series [1] I'm working on making shmem make use of cache-bypassed swapin as well; following this approach, I'll have to implement another shmem-map-based synchronization too.
I'd have to look further into that, to see if something like that could similarly apply to shmem. But yes, it's not using PTEs, so a PTE-based sync mechanism definitely does not apply.
After all it's only a rare race, I think a simpler solution might be better.
I'm not sure that simpler means better here. Simpler might be better for a backport, though.
The "use schedule() to wait" looks odd; maybe it's a common primitive that I simply didn't stumble over yet. (I doubt it, but it could be possible.)