From: Matt Fleming mfleming@cloudflare.com
This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.
If we dequeue a task (task B) that was sched delayed then that task is definitely no longer on the rq and not tracked in the rbtree. Unfortunately, task_on_rq_queued(B) will still return true because dequeue_task() doesn't update p->on_rq.
This inconsistency can lead to tasks (task A) spinning indefinitely in wait_task_inactive(), e.g. when delivering a fatal signal to a thread group, because it thinks the task B is still queued (it's not) and waits forever for it to unschedule.
Task A Task B
arch_do_signal_or_restart() get_signal() do_coredump() coredump_wait() zap_threads() arch_do_signal_or_restart() wait_task_inactive() <-- SPIN get_signal() do_group_exit() do_exit() coredump_task_exit() schedule() <--- never comes back
Not only will task A spin forever in wait_task_inactive(), but task B will also trigger RCU stalls:
INFO: rcu_tasks detected stalls on tasks: 00000000a973a4d8: .. nvcsw: 2/2 holdout: 1 idle_cpu: -1/79 task:ffmpeg state:I stack:0 pid:665601 tgid:665155 ppid:668691 task_flags:0x400448 flags:0x00004006 Call Trace: <TASK> __schedule+0x4fb/0xbf0 ? srso_return_thunk+0x5/0x5f schedule+0x27/0xf0 do_exit+0xdd/0xaa0 ? __pfx_futex_wake_mark+0x10/0x10 do_group_exit+0x30/0x80 get_signal+0x81e/0x860 ? srso_return_thunk+0x5/0x5f ? futex_wake+0x177/0x1a0 arch_do_signal_or_restart+0x2e/0x1f0 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? __x64_sys_futex+0x10c/0x1d0 syscall_exit_to_user_mode+0xa5/0x130 do_syscall_64+0x57/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f22d05b0f16 RSP: 002b:00007f2265761cf0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f22d05b0f16 RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00005629e320d97c RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff R10: 0000000000000000 R11: 0000000000000246 R12: 00005629e320d928 R13: 0000000000000000 R14: 0000000000000001 R15: 00005629e320d97c </TASK>
Fixes: b7ca5743a260 ("sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks") Cc: Peter Zijlstra peterz@infradead.org Cc: Oleg Nesterov oleg@redhat.com Cc: John Stultz jstultz@google.com Cc: Chris Arges carges@cloudflare.com Cc: stable@vger.kernel.org # v6.12 Signed-off-by: Matt Fleming mfleming@cloudflare.com --- kernel/sched/core.c | 6 ------ 1 file changed, 6 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ccba6fc3c3fe..2dfc3977920d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2293,12 +2293,6 @@ unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state * just go back and repeat. */ rq = task_rq_lock(p, &rf); - /* - * If task is sched_delayed, force dequeue it, to avoid always - * hitting the tick timeout in the queued case - */ - if (p->se.sched_delayed) - dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED); trace_sched_wait_task(p); running = task_on_cpu(rq, p); queued = task_on_rq_queued(p);