On Tue, Feb 20, 2024 at 12:01 PM Barry Song <21cnbao@gmail.com> wrote:
On Tue, Feb 20, 2024 at 4:42 PM Kairui Song <ryncsn@gmail.com> wrote:
On Tue, Feb 20, 2024 at 9:31 AM Andrew Morton <akpm@linux-foundation.org> wrote:
On Mon, 19 Feb 2024 16:20:40 +0800 Kairui Song <ryncsn@gmail.com> wrote:
From: Kairui Song <kasong@tencent.com>
When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads swap in the same entry at the same time, they get different pages (A, B). Before one thread (T0) finishes the swapin and installs page (A) into the PTE, another thread (T1) could finish swapin of page (B), swap_free the entry, and then swap out the possibly modified page, reusing the same entry. This defeats the pte_same check in (T0) because the PTE value is unchanged, causing an ABA problem. Thread (T0) will then install a stale page (A) into the PTE and cause data corruption.
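For illustration only, the interleaving described above can be replayed sequentially in a user-space toy model. This is a sketch, not kernel code: `struct toy_mm`, `toy_pte_same()`, `toy_replay_aba()` and the `TOY_*` values are all hypothetical names made up for the example, and the "threads" are just steps executed in the racy order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy user-space model of one PTE and one swap slot; not kernel code. */
#define TOY_ENTRY   0x2aULL   /* hypothetical swap entry value kept in the PTE */
#define TOY_PRESENT 0x1000ULL /* hypothetical "page is mapped" PTE value */

struct toy_mm {
	uint64_t pte;     /* current PTE value */
	int slot_data;    /* data stored in the swap slot */
	int mapped_data;  /* data visible through the mapping */
};

/* Stand-in for pte_same(): compares raw PTE values only. */
static bool toy_pte_same(uint64_t a, uint64_t b)
{
	return a == b;
}

/*
 * Replays the racy interleaving step by step. Returns the data that T0
 * leaves mapped; *check_passed reports whether T0's check succeeded.
 */
static int toy_replay_aba(bool *check_passed)
{
	struct toy_mm mm = { .pte = TOY_ENTRY, .slot_data = 1, .mapped_data = 0 };

	/* T0: fault, remember the PTE, read the slot into page A. */
	uint64_t t0_orig_pte = mm.pte;
	int page_a = mm.slot_data;

	/* T1: fault on the same entry, read the slot into page B. */
	uint64_t t1_orig_pte = mm.pte;
	int page_b = mm.slot_data;

	/* T1: its check passes, so it installs B and frees the entry. */
	assert(toy_pte_same(mm.pte, t1_orig_pte));
	mm.pte = TOY_PRESENT;
	mm.mapped_data = page_b;

	/* The page is modified, then swapped out reusing the same entry. */
	mm.mapped_data = 2;
	mm.slot_data = mm.mapped_data;
	mm.pte = TOY_ENTRY;

	/* T0: resumes; the PTE value matches again, so stale page A goes in. */
	*check_passed = toy_pte_same(mm.pte, t0_orig_pte);
	if (*check_passed) {
		mm.pte = TOY_PRESENT;
		mm.mapped_data = page_a;
	}
	return mm.mapped_data;
}
```

In this model T0's check passes even though the slot was freed and reused in between, and the mapping ends up with the old value 1 instead of the newer value 2 that was swapped out.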
@@ -3867,6 +3868,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) {
+			/*
+			 * Prevent parallel swapin from proceeding with
+			 * the cache flag. Otherwise, another thread may
+			 * finish swapin first, free the entry, and swapout
+			 * reusing the same entry. It's undetectable as
+			 * pte_same() returns true due to entry reuse.
+			 */
+			if (swapcache_prepare(entry)) {
+				/* Relax a bit to prevent rapid repeated page faults */
+				schedule_timeout_uninterruptible(1);
Well this is unpleasant. How often can we expect this to occur?
The chance is very low. Using the current mainline kernel and ZRAM, even with threads set to race on purpose by the reproducer I provided, it occurred 1528 times in 647132 page faults (~0.2%).
If I run MySQL and sysbench with 128 threads and a 16G buffer pool, with a 6G cgroup limit and 32G ZRAM, it occurred 1372 times in 40 minutes, out of 109930201 page faults in total (~0.001%).
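As a quick check of the two percentages, the rate is simply the number of back-off hits divided by the total page faults. The helper below is a minimal sketch; `backoff_rate_percent()` is a hypothetical name used only for this example.

```c
#include <assert.h>

/* Percentage of page faults that hit the back-off path. */
static double backoff_rate_percent(unsigned long long hits,
				   unsigned long long faults)
{
	assert(faults != 0);
	return 100.0 * (double)hits / (double)faults;
}
```

Calling `backoff_rate_percent(1528, 647132)` gives roughly 0.24, and `backoff_rate_percent(1372, 109930201)` gives roughly 0.0012, consistent with the rounded figures quoted above.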
Hi Barry,
It might not be a problem for throughput, but for real-time and tail latency, this hurts. For example, it might increase the number of dropped UI frames, which is an important metric for evaluating performance :-)
That's a real issue. As Chris mentioned before, I think we need to come up with some clever data structure to solve this more naturally in the future; a similar issue exists for cached swapin as well, and it has been there for a while. On the other hand, I think applications that are extremely latency sensitive should maybe try to avoid swap on fault? A swapin could cause other issues such as reclaim, throttling, or contention with many other things, and these seem to have a higher chance of occurring than this race.
BTW, I wonder if Ying's previous proposal, moving swapcache_prepare() after swap_read_folio(), would further help decrease the number?
We can move swapcache_prepare() after the folio allocation or the cgroup charge, but I didn't see an observable change in the statistics; for some workloads the reading is even worse. I think that's mostly due to noise, or to a higher swap-out rate, since all raced threads will now allocate an extra folio. Applications that have many pages swapped out due to a memory limit are already on the edge of triggering another reclaim, so a dozen more folio allocations could just trigger that...
And we can't move it after swap_read_folio()... That's exactly what we want to protect.
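To make the ordering argument concrete, here is a minimal user-space sketch of "claim first, then read", using C11 atomics. It is an analogy under stated assumptions, not the kernel implementation: `struct toy_entry`, `toy_claim()`, `toy_release()` and `toy_swapin_begin()` are hypothetical names, and `toy_claim()` returns true on success, whereas the patch treats a nonzero return from swapcache_prepare() as failure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy per-entry state; has_cache stands in for the swap map's cache flag. */
struct toy_entry {
	atomic_bool has_cache;  /* set while one thread owns the swapin */
	int slot_data;          /* data stored in the swap slot */
};

/* Hypothetical analogue of swapcache_prepare(): true if we won the claim. */
static bool toy_claim(struct toy_entry *e)
{
	bool expected = false;
	return atomic_compare_exchange_strong(&e->has_cache, &expected, true);
}

/* Hypothetical analogue of clearing the flag once the PTE is installed. */
static void toy_release(struct toy_entry *e)
{
	atomic_store(&e->has_cache, false);
}

/*
 * One swapin attempt. Returns true and stores the data in *out when this
 * caller owns the entry; returns false when it must back off. The claim
 * comes before the read, so no other caller can proceed with the same
 * entry between the read and the release.
 */
static bool toy_swapin_begin(struct toy_entry *e, int *out)
{
	if (!toy_claim(e))
		return false;   /* the patch calls schedule_timeout_uninterruptible(1) here */
	*out = e->slot_data;    /* the read happens only while holding the claim */
	return true;
}
```

If the claim were taken only after the read, two callers could both read the slot before either one is excluded, which is the window the patch is meant to close.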