On Tue, Aug 01, 2023 at 09:08:52PM +0200, Peter Zijlstra wrote:
On Tue, Aug 01, 2023 at 10:32:45AM -0700, Guenter Roeck wrote:
On 7/31/23 14:15, Peter Zijlstra wrote:
On Mon, Jul 31, 2023 at 09:34:29AM -0700, Guenter Roeck wrote:
Ha!, I was poking around the same thing. My hack below seems to (so far, <20 boots) help things.
So, dumb question: How come this bisects to "sched/fair: Remove sched_feat(START_DEBIT)" ?
That commit changes the timings of things; dumb luck otherwise.
Kind of scary. So I only experienced the problem because the START_DEBIT patch happened to be queued roughly at the same time, and it might otherwise have found its way unnoticed into the upstream kernel. That makes me wonder if this or other similar patches may uncover similar problems elsewhere in the kernel (i.e., either hide new or existing race conditions or expose existing ones).
This in turn makes me wonder if it would be possible to define a test which would uncover such problems without the START_DEBIT patch. Any idea ?
IIRC some of the thread sanitizers use breakpoints to inject random sleeps, specifically to tickle races.
I have heard of some of these, arguably including KCSAN, but they would have a tough time on this one.
They would have to inject many milliseconds between the check of ->kthread_ptr in synchronize_rcu_tasks_generic() and that mutex_lock() in rcu_tasks_one_gp(). Plus this window only occurs during boot shortly before init is spawned.
On the other hand, randomly injecting delay just before acquiring each lock would cover this case. But such a sanitizer would still only get one shot per boot of the kernel for this particular bug.
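For illustration only, the delay-before-lock idea might look something like the userspace sketch below. This is not kernel code and not how any existing sanitizer works; the names max_delay_ms, inject_random_delay() and delayed_mutex_lock() are made up for this example, and it uses POSIX mutexes rather than the kernel's mutex_lock().

```c
/* Userspace sketch only; all names below are hypothetical. */
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical knob: upper bound on the injected delay, in milliseconds. */
static unsigned int max_delay_ms = 50;

/*
 * Sleep for a pseudo-random 0..max_delay_ms milliseconds.
 * Returns the delay that was chosen, so callers can log it.
 */
static unsigned int inject_random_delay(unsigned int *seed)
{
	unsigned int ms = (unsigned int)rand_r(seed) % (max_delay_ms + 1);
	struct timespec ts = {
		.tv_sec = ms / 1000,
		.tv_nsec = (long)(ms % 1000) * 1000000L,
	};

	nanosleep(&ts, NULL);
	return ms;
}

/*
 * Hypothetical wrapper: stretch the window just before each lock
 * acquisition, then take the lock as usual.
 */
static int delayed_mutex_lock(pthread_mutex_t *m, unsigned int *seed)
{
	inject_random_delay(seed);
	return pthread_mutex_lock(m);
}
```

Keeping the seed per caller makes a given run reproducible, which matters when the interesting window opens only once per boot.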
Thanx, Paul