On Mon, Aug 28, 2023 at 10:14:20AM +0000, Maximilian Heyne wrote:
On Sun, Aug 27, 2023 at 10:54:03AM +0200, Greg KH wrote:
On Wed, Aug 23, 2023 at 06:16:42AM +0000, Maximilian Heyne wrote:
From: Linus Torvalds torvalds@linux-foundation.org
[ upstream commit 5ef64cc8987a9211d3f3667331ba3411a94ddc79 ]
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made the page locking entirely fair, in that if a waiter came in while the lock was held, the lock would be transferred to the lockers strictly in order.
That was intended to finally get rid of the long-reported watchdog failures that involved the page lock under extreme load, where a process could end up waiting essentially forever, as other page lockers stole the lock from under it.
It also improved some benchmarks, but it ended up causing huge performance regressions on others, simply because fair lock behavior doesn't end up giving out the lock as aggressively, causing better worst-case latency, but potentially much worse average latencies and throughput.
Instead of reverting that change entirely, this introduces a controlled amount of unfairness, with a sysctl knob to tune it if somebody needs to. But the default value should hopefully be good for any normal load, allowing a few rounds of lock stealing, but enforcing the strict ordering before the lock has been stolen too many times.
There is also a hint from Matthieu Baerts that the fair page coloring may end up exposing an ABBA deadlock that is hidden by the usual optimistic lock stealing, and while the unfairness doesn't fix the fundamental issue (and I'm still looking at that), it avoids it in practice.
The amount of unfairness can be modified by writing a new value to the 'sysctl_page_lock_unfairness' variable (default value of 5, exposed through /proc/sys/vm/page_lock_unfairness), but that is hopefully something we'd use mainly for debugging rather than being necessary for any deep system tuning.
This whole issue has exposed just how critical the page lock can be, and how contended it gets under certain locks. And the main contention doesn't really seem to be anything related to IO (which was the origin of this lock), but for things like just verifying that the page file mapping is stable while faulting in the page into a page table.
Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@M... Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=... Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@t... Reported-and-tested-by: Michael Larabel Michael@michaellarabel.com Tested-by: Matthieu Baerts matthieu.baerts@tessares.net Cc: Dave Chinner david@fromorbit.com Cc: Matthew Wilcox willy@infradead.org Cc: Chris Mason clm@fb.com Cc: Jan Kara jack@suse.cz Cc: Amir Goldstein amir73il@gmail.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org CC: stable@vger.kernel.org # 5.4 [ mheyne: fixed contextual conflict in mm/filemap.c due to missing commit c7510ab2cf5c ("mm: abstract out wake_page_match() from wake_page_function()"). Added WQ_FLAG_CUSTOM due to missing commit 7f26482a872c ("locking/percpu-rwsem: Remove the embedded rwsem") ] Signed-off-by: Maximilian Heyne mheyne@amazon.de
include/linux/mm.h | 2 + include/linux/wait.h | 2 + kernel/sysctl.c | 8 +++ mm/filemap.c | 160 ++++++++++++++++++++++++++++++++++--------- 4 files changed, 141 insertions(+), 31 deletions(-)
This was also backported here: https://lore.kernel.org/r/20230821222547.483583-1-saeed.mirzamohammadi@oracl... before yours.
I took that one, can you verify that it is identical to yours and works properly as well?
Yes it's identical and fixes the performance regression seen. Therefore,
Tested-by: Maximilian Heyne mheyne@amazon.de
for the other patch.
Thanks, I've added this to the patch now.
greg k-h