On Thu, Feb 15, 2024 at 8:44 AM Minchan Kim minchan@kernel.org wrote:
Hi Kairui,
Hi, Minchan,
I also played the swap refcount stuff ideas and encoutered a lot problems(e.g., kernel crash and/or data missing).
Now I realize your original solution would be best to fix under such a small change.
Folks, please chime in if you have another idea.
Instead if we add a schedule or schedule_timeout_uninterruptible(1),
How much your workload is going If we use schedule instead of schedule_timeout_uninterruptible(1)? If that doesn't increase the statistics a lot, I prefer the schedule. (TBH, I didn't care much about the stat since it would be very rare).
I've tested both schedule() and schedule_timeout_uninterruptible(1) locally using the reproducer, and some other cases, and looked all good.
The reproducer I provided demonstrates an extreme case, so I think schedule() is good enough already.
I currently don't have any more benchmarks or tests, as the change is very small for most workloads. I'll use schedule() here in V3 if no one else has other suggestions.
I remember Barry's series about large folio swapin may change the ZRAM read time depending on folio size, and cause strange races that's different from the reproducer. Not sure if I can test using some known race cases. That could be helpful to verify this fix and if schedule() or schedule_timeout_uninterruptible(1) is better here.
it seems quite enough to avoid repeated page faults. I did following test with the reproducer I provided in the commit message:
Using ZRAM instead of brd for more realistic workload: $ perf stat -e minor-faults,major-faults -I 30000 ./a.out
Unpatched kernel: # time counts unit events 30.000096504 33,089 minor-faults:u 30.000096504 958,745 major-faults:u 60.000375308 22,150 minor-faults:u 60.000375308 1,221,665 major-faults:u 90.001062176 24,840 minor-faults:u 90.001062176 1,005,427 major-faults:u 120.004233268 23,634 minor-faults:u 120.004233268 1,045,596 major-faults:u 150.004530091 22,358 minor-faults:u 150.004530091 1,097,871 major-faults:u
Patched kernel: # time counts unit events 30.000069753 280,094 minor-faults:u 30.000069753 1,198,871 major-faults:u 60.000415915 436,862 minor-faults:u 60.000415915 919,860 major-faults:u 90.000651670 359,176 minor-faults:u 90.000651670 955,340 major-faults:u 120.001088135 289,137 minor-faults:u 120.001088135 992,317 major-faults:u
Indeed there is a huge increase of minor page faults, which indicate the raced path returned without handling the fault. This reproducer tries to maximize the race chance so the reading is more terrifying than usual.
But after adding a schedule/schedule_timeout_uninterruptible(1) after swapcache_prepare failed, still using the same test:
Patched kernel (adding a schedule() before goto out): # time counts unit events 30.000335901 43,066 minor-faults:u 30.000335901 1,163,232 major-faults:u 60.034629687 35,652 minor-faults:u 60.034629687 844,188 major-faults:u 90.034941472 34,957 minor-faults:u 90.034941472 1,218,174 major-faults:u 120.035255386 36,241 minor-faults:u 120.035255386 1,073,395 major-faults:u 150.035535786 33,057 minor-faults:u 150.035535786 1,171,684 major-faults:u
Patched kernel (adding a schedule_timeout_uninterruptible(1) before goto out): # time counts unit events 30.000044452 26,748 minor-faults:u 30.000044452 1,202,064 major-faults:u 60.000317379 19,883 minor-faults:u 60.000317379 1,186,152 major-faults:u 90.000568779 18,333 minor-faults:u 90.000568779 1,151,015 major-faults:u 120.000787253 22,844 minor-faults:u 120.000787253 887,531 major-faults:u 150.001309065 18,900 minor-faults:u 150.001309065 1,211,894 major-faults:u
The minor faults are basically the same as before, or even lower since other races are also reduced, with no observable performance regression so far. If the vague margin of this schedule call is still concerning, I think another approach is just a new swap map value for parallel cache bypassed swapping to loop on.
Long term, the swap map vaule would be good but for now, I prefer your original solution with some delay with schedule.
Yes, that's also what I have in mind.
With a swap map value, actually I think the cache bypassed path may even go faster as it can simplify some swap map value change process, but that requires quite some extra work and change, could be introduced later as an optimization.
Thanks, Kairui!
Thanks for the comments!