On Wed, Aug 02, 2023 at 08:45:06AM -0700, Guenter Roeck wrote:
On 8/2/23 08:05, Paul E. McKenney wrote:
On Wed, Aug 02, 2023 at 02:57:56PM +0100, Roy Hopkins wrote:
On Tue, 2023-08-01 at 12:11 -0700, Paul E. McKenney wrote:
On Tue, Aug 01, 2023 at 10:32:45AM -0700, Guenter Roeck wrote:
Please see below for my preferred fix. Does this work for you guys?
Back to figuring out why recent kernels occasionally blow up all rcutorture guest OSes...
Thanx, Paul
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 7294be62727b..2d5b8385c357 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -570,10 +570,12 @@ static void rcu_tasks_one_gp(struct rcu_tasks *rtp, bool midboot)
 	if (unlikely(midboot)) {
 		needgpcb = 0x2;
 	} else {
+		mutex_unlock(&rtp->tasks_gp_mutex);
 		set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
 		rcuwait_wait_event(&rtp->cbs_wait,
 				   (needgpcb = rcu_tasks_need_gpcb(rtp)),
 				   TASK_IDLE);
+		mutex_lock(&rtp->tasks_gp_mutex);
 	}
 
 	if (needgpcb & 0x2) {
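For readers following along outside the kernel tree, the shape of the change above (drop the grace-period mutex before the blocking wait, retake it afterwards) can be sketched in userspace with pthreads. This is only an illustration under stated assumptions: struct gp_state, wait_for_work() and post_work() are hypothetical names, a condition variable stands in for rcuwait_wait_event(), and none of this is the kernel code itself.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct rcu_tasks; not kernel code. */
struct gp_state {
	pthread_mutex_t gp_mutex;	/* plays the role of tasks_gp_mutex */
	pthread_mutex_t wait_lock;	/* protects need_gp */
	pthread_cond_t wait_cv;		/* plays the role of cbs_wait */
	int need_gp;			/* pending-work flags, 0 if none */
};

#define GP_STATE_INIT { PTHREAD_MUTEX_INITIALIZER, \
			PTHREAD_MUTEX_INITIALIZER, \
			PTHREAD_COND_INITIALIZER, 0 }

/*
 * Caller holds gp_mutex on entry and holds it again on return.
 * The mutex is released only around the blocking wait, so other
 * threads that need gp_mutex are not stuck behind a sleeper.
 */
static int wait_for_work(struct gp_state *s, bool midboot)
{
	int need;

	if (midboot)
		return 0x2;	/* no sleeping wait in the mid-boot case */

	pthread_mutex_unlock(&s->gp_mutex);

	pthread_mutex_lock(&s->wait_lock);
	while (!s->need_gp)
		pthread_cond_wait(&s->wait_cv, &s->wait_lock);
	need = s->need_gp;
	s->need_gp = 0;
	pthread_mutex_unlock(&s->wait_lock);

	pthread_mutex_lock(&s->gp_mutex);
	return need;
}

/* Post work flags and wake any waiter. */
static void post_work(struct gp_state *s, int flags)
{
	pthread_mutex_lock(&s->wait_lock);
	s->need_gp |= flags;
	pthread_cond_signal(&s->wait_cv);
	pthread_mutex_unlock(&s->wait_lock);
}
```

A caller would take gp_mutex, call wait_for_work(), act on the returned flags, and then release gp_mutex, while another thread calls post_work() to queue work.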
Your preferred fix looks good to me.
With the original code I can quite easily reproduce the problem on my system every 10 reboots or so. With your fix in place the problem no longer occurs.
Very good, thank you! May I add your Tested-by?
FWIW, I am still working on it. So far I get
[ 8.191589] KTAP version 1
[ 8.191769] # Subtest: kunit_executor_test
[ 8.191972] # module: kunit
[ 8.192012] 1..8
[ 8.197643] ok 1 parse_filter_test
[ 8.201851] ok 2 filter_suites_test
[ 8.206713] ok 3 filter_suites_test_glob_test
[ 8.211806] ok 4 filter_suites_to_empty_test
[ 8.214077] kunit executor: filter operation not found: speed>slow, module!=example
[ 8.217933] # parse_filter_attr_test: ASSERTION FAILED at lib/kunit/executor_test.c:126
[ 8.217933] Expected err == 0, but
[ 8.217933] err == -22 (0xffffffffffffffea)
[ 8.217933]
[ 8.217933] failed to parse filter '(efault)'
[ 8.221266] not ok 5 parse_filter_attr_test
[ 8.224224] kunit executor: filter operation not found: speed>slow
[ 8.225837] # filter_attr_test: ASSERTION FAILED at lib/kunit/executor_test.c:165
[ 8.225837] Expected err == 0, but
[ 8.225837] err == -22 (0xffffffffffffffea)
[ 8.228850] not ok 6 filter_attr_test
[ 8.230942] kunit executor: filter operation not found: module!=dummy
[ 8.232167] # filter_attr_empty_test: ASSERTION FAILED at lib/kunit/executor_test.c:190
[ 8.232167] Expected err == 0, but
[ 8.232167] err == -22 (0xffffffffffffffea)
[ 8.235317] not ok 7 filter_attr_empty_test
[ 8.237065] kunit executor: filter operation not found: speed>slow
[ 8.238796] # filter_attr_skip_test: ASSERTION FAILED at lib/kunit/executor_test.c:209
[ 8.238796] Expected err == 0, but
[ 8.238796] err == -22 (0xffffffffffffffea)
[ 8.241897] not ok 8 filter_attr_skip_test
[ 8.241947] # kunit_executor_test: pass:4 fail:4 skip:0 total:8
[ 8.242144] # Totals: pass:4 fail:4 skip:0 total:8
and it looks like the console no longer works. Most likely this is some other problem that was introduced while tests were broken. It will take me some time to track that down.
No rush.
Given that this bug is a year old, that it happens only when debug options are enabled, and that it has only been seen in current -next, my plan is to submit it into the next merge window.
So this one stays mutable for about another 10 days.
On the strength of Roy's Tested-by, however, I will push this patch into -next soon, so that should make things a bit easier. Or so I hope.
And again, thank you all for tracking this down!
Thanx, Paul