Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

1 Aug 2023


On Tue, Aug 01, 2023 at 09:08:52PM +0200, Peter Zijlstra wrote:
...
On Tue, Aug 01, 2023 at 10:32:45AM -0700, Guenter Roeck wrote:
...
On 7/31/23 14:15, Peter Zijlstra wrote:
...
On Mon, Jul 31, 2023 at 09:34:29AM -0700, Guenter Roeck wrote:
...
...
Ha!, I was poking around the same thing. My hack below seems to (so far,
<20 boots) help things.
So, dumb question:
How come this bisects to "sched/fair: Remove sched_feat(START_DEBIT)"?
That commit changes the timings of things; dumb luck otherwise.
Kind of scary. So I only experienced the problem because the START_DEBIT patch
happened to be queued roughly at the same time, and it might otherwise have
found its way unnoticed into the upstream kernel. That makes me wonder if this
or other similar patches may uncover similar problems elsewhere in the kernel
(i.e., either hide new or existing race conditions or expose existing ones).
This in turn makes me wonder if it would be possible to define a test which
would uncover such problems without the START_DEBIT patch. Any idea ?
IIRC some of the thread sanitizers use breakpoints to inject random
sleeps, specifically to tickle races.
I have heard of some of these, arguably including KCSAN, but they
would have a tough time on this one.
They would have to inject many milliseconds between the check of
->kthread_ptr in synchronize_rcu_tasks_generic() and that mutex_lock()
in rcu_tasks_one_gp().  Plus this window only occurs during boot shortly
before init is spawned.
On the other hand, randomly injecting delay just before acquiring each
lock would cover this case.  But such a sanitizer would still only get
one shot per boot of the kernel for this particular bug.
Thanx, Paul
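
For illustration only, the idea of randomly injecting delay just before
each lock acquisition can be sketched in userspace Python. This is not
kernel code and not how KCSAN or any existing sanitizer works. The names
DelayInjectingLock, run_once, and window_s are hypothetical, and the
"ready" flag merely stands in for a value that is checked outside a lock
and then changes before the lock is taken. Each call to run_once() plays
the role of one boot, so it gets exactly one shot at the window.

```python
import random
import threading
import time


class DelayInjectingLock:
    """Hypothetical lock wrapper: sleeps a random interval before each acquire."""

    def __init__(self, min_delay_s, max_delay_s, seed=None):
        self._lock = threading.Lock()
        self._min = min_delay_s
        self._max = max_delay_s
        self._rng = random.Random(seed)
        self.injected = []  # delays actually injected, kept for inspection

    def __enter__(self):
        delay = self._rng.uniform(self._min, self._max)
        self.injected.append(delay)
        time.sleep(delay)      # widen the window just before taking the lock
        self._lock.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


def run_once(min_delay_s, max_delay_s, window_s=0.002):
    """One 'boot': return True if the unlocked check went stale before the lock."""
    lock = DelayInjectingLock(min_delay_s, max_delay_s)
    state = {"ready": False, "stale": False}
    checked = threading.Event()

    def checker():
        seen = state["ready"]  # check made outside the lock
        checked.set()
        with lock:             # injected delay lands between check and lock
            if seen != state["ready"]:
                state["stale"] = True

    def spawner():
        checked.wait()
        time.sleep(window_s)   # state changes shortly after the check
        state["ready"] = True

    threads = [threading.Thread(target=checker), threading.Thread(target=spawner)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["stale"]


if __name__ == "__main__":
    print("no injected delay:", run_once(0.0, 0.0))
    print("100 ms injected delay:", run_once(0.1, 0.1))
```

With a fixed 100 ms delay the stale check is observed, because the
injected sleep is far longer than the 2 ms window. With a random delay
drawn from a wider range, a given run may or may not land in the window,
which mirrors the one-shot-per-boot limitation described above.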
