From: Matt Fleming mfleming@cloudflare.com
This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.
If we dequeue a task (task B) that was sched delayed then that task is definitely no longer on the rq and not tracked in the rbtree. Unfortunately, task_on_rq_queued(B) will still return true because dequeue_task() doesn't update p->on_rq.
This inconsistency can lead to tasks (task A) spinning indefinitely in wait_task_inactive(), e.g. when delivering a fatal signal to a thread group, because it thinks the task B is still queued (it's not) and waits forever for it to unschedule.
Task A                                      Task B

arch_do_signal_or_restart()
  get_signal()
    do_coredump()
      coredump_wait()
        zap_threads()
                                            arch_do_signal_or_restart()
          wait_task_inactive() <-- SPIN
                                              get_signal()
                                                do_group_exit()
                                                  do_exit()
                                                    coredump_task_exit()
                                                      schedule() <--- never comes back
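For reference, the retry loop in wait_task_inactive() is shaped roughly like the sketch below (trimmed and paraphrased, not a verbatim copy of kernel/sched/core.c). With p->on_rq never cleared, task A keeps taking the "queued" branch and re-sleeping for another tick:

        for (;;) {
                rq = task_rq_lock(p, &rf);
                trace_sched_wait_task(p);
                running = task_on_cpu(rq, p);
                queued = task_on_rq_queued(p);  /* reads the stale p->on_rq -> still "queued" */
                ...
                task_rq_unlock(rq, p, &rf);

                if (unlikely(running)) {
                        cpu_relax();
                        continue;
                }

                if (unlikely(queued)) {
                        ktime_t to = NSEC_PER_SEC / HZ;

                        set_current_state(TASK_UNINTERRUPTIBLE);
                        schedule_hrtimeout(&to, HRTIMER_MODE_REL);
                        continue;       /* task A never gets past this */
                }
                break;
        }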
Not only will task A spin forever in wait_task_inactive(), but task B will also trigger RCU stalls:
INFO: rcu_tasks detected stalls on tasks:
00000000a973a4d8: .. nvcsw: 2/2 holdout: 1 idle_cpu: -1/79
task:ffmpeg state:I stack:0 pid:665601 tgid:665155 ppid:668691 task_flags:0x400448 flags:0x00004006
Call Trace:
 <TASK>
 __schedule+0x4fb/0xbf0
 ? srso_return_thunk+0x5/0x5f
 schedule+0x27/0xf0
 do_exit+0xdd/0xaa0
 ? __pfx_futex_wake_mark+0x10/0x10
 do_group_exit+0x30/0x80
 get_signal+0x81e/0x860
 ? srso_return_thunk+0x5/0x5f
 ? futex_wake+0x177/0x1a0
 arch_do_signal_or_restart+0x2e/0x1f0
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? __x64_sys_futex+0x10c/0x1d0
 syscall_exit_to_user_mode+0xa5/0x130
 do_syscall_64+0x57/0x110
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f22d05b0f16
RSP: 002b:00007f2265761cf0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f22d05b0f16
RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00005629e320d97c
RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 00005629e320d928
R13: 0000000000000000 R14: 0000000000000001 R15: 00005629e320d97c
 </TASK>
Fixes: b7ca5743a260 ("sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks")
Cc: Peter Zijlstra peterz@infradead.org
Cc: Oleg Nesterov oleg@redhat.com
Cc: John Stultz jstultz@google.com
Cc: Chris Arges carges@cloudflare.com
Cc: stable@vger.kernel.org # v6.12
Signed-off-by: Matt Fleming mfleming@cloudflare.com
---
 kernel/sched/core.c | 6 ------
 1 file changed, 6 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ccba6fc3c3fe..2dfc3977920d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2293,12 +2293,6 @@ unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state
         * just go back and repeat.
         */
        rq = task_rq_lock(p, &rf);
-       /*
-        * If task is sched_delayed, force dequeue it, to avoid always
-        * hitting the tick timeout in the queued case
-        */
-       if (p->se.sched_delayed)
-               dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
        trace_sched_wait_task(p);
        running = task_on_cpu(rq, p);
        queued = task_on_rq_queued(p);
On Thu, Sep 25, 2025 at 6:33 AM Matt Fleming matt@readmodwrite.com wrote:
From: Matt Fleming mfleming@cloudflare.com
This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.
If we dequeue a task (task B) that was sched delayed then that task is definitely no longer on the rq and not tracked in the rbtree. Unfortunately, task_on_rq_queued(B) will still return true because dequeue_task() doesn't update p->on_rq.
Hey! Sorry again my patch has been causing you trouble. Thanks for your persistence in chasing this down!
It's confusing, as this patch uses similar logic to what pick_next_entity() uses when a sched_delayed task is picked to run, as well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd fret that similar
And my impression was that dequeue_task() on a sched_delayed task would update p->on_rq via calling __block_task() at the end of dequeue_entities().
However, there are two spots where we might exit dequeue_entities() early when cfs_rq_throttled(rq), so maybe that's what's catching us here?
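Roughly, what I mean is the shape below (a heavily trimmed sketch of dequeue_entities(), paraphrased rather than quoted): the throttled bailout returns before the tail of the function where __block_task() would have cleared p->on_rq:

        static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
        {
                ...
                for_each_sched_entity(se) {
                        cfs_rq = cfs_rq_of(se);
                        ...
                        /* end evaluation on encountering a throttled cfs_rq */
                        if (cfs_rq_throttled(cfs_rq))
                                return 0;       /* <-- bails out here, skipping __block_task() below */
                        ...
                }
                ...
                if (p && task_delayed) {
                        ...
                        __block_task(rq, p);    /* this is what finally clears p->on_rq */
                }

                return 1;
        }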
Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a little odd, as the desired dequeue didn't really complete, but dequeue_task_fair() will still return true indicating success - not that too many places are checking the dequeue_task return. Is that right?
thanks -john
On Thu, Sep 25, 2025 at 5:05 PM John Stultz jstultz@google.com wrote:
It's confusing, as this patch uses similar logic to what pick_next_entity() uses when a sched_delayed task is picked to run, as well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd fret that similar
Didn't finish my sentence there. "...similar problems as the one you are seeing could still occur."
thanks -john
Hello John, Matt,
On 9/26/2025 5:35 AM, John Stultz wrote:
On Thu, Sep 25, 2025 at 6:33 AM Matt Fleming matt@readmodwrite.com wrote:
From: Matt Fleming mfleming@cloudflare.com
This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.
If we dequeue a task (task B) that was sched delayed then that task is definitely no longer on the rq and not tracked in the rbtree. Unfortunately, task_on_rq_queued(B) will still return true because dequeue_task() doesn't update p->on_rq.
Hey! Sorry again my patch has been causing you trouble. Thanks for your persistence in chasing this down!
It's confusing, as this patch uses similar logic to what pick_next_entity() uses when a sched_delayed task is picked to run, as well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd fret that similar
And my impression was that dequeue_task() on a sched_delayed task would update p->on_rq via calling __block_task() at the end of dequeue_entities().
However, there are two spots where we might exit dequeue_entities() early when cfs_rq_throttled(rq), so maybe that's what's catching us here?
That could very likely be it.
Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a little odd, as the desired dequeue didn't really complete, but dequeue_task_fair() will still return true indicating success - not that too many places are checking the dequeue_task return. Is that right?
I think for the most part until now it was harmless, as we couldn't pick from a throttled hierarchy, and other calls to dequeue_task(DEQUEUE_DELAYED) would later do a:

        queued = task_on_rq_queued(p);
        ...
        if (queued)
                enqueue_task(p)

which would either spuriously run a blocked task (which would then block again), or a wakeup would properly wake the queued task via ttwu_runnable(). wait_task_inactive() is interesting, though, as it expects the dequeue to result in a block, which never happens with throttled hierarchies. I'm impressed the double dequeue doesn't result in any major splats!
Matt, if possible can you try the patch attached below to check if the bailout for throttled hierarchy is indeed the root cause. Thanks in advance.
P.S. the per-task throttle in tip:sched/core would get rid of all this but it would be good to have a fix via tip:sched/urgent to get it backported to v6.12 LTS and the newer stable kernels.
---
Thanks and Regards,
Prateek
(Prepared on top of tip:sched/urgent but should apply fine on v6.17-rc7)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ce56a8d507f..f0a4d9d7424d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
        int h_nr_runnable = 0;
        struct cfs_rq *cfs_rq;
        u64 slice = 0;
+       int ret = 0; /* XXX: Do we care if ret is 0 vs 1 since we only check ret < 0? */
 
        if (entity_is_task(se)) {
                p = task_of(se);
@@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
                /* end evaluation on encountering a throttled cfs_rq */
                if (cfs_rq_throttled(cfs_rq))
-                       return 0;
+                       goto out;
 
                /* Don't dequeue parent if it has other entities besides us */
                if (cfs_rq->load.weight) {
@@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
                /* end evaluation on encountering a throttled cfs_rq */
                if (cfs_rq_throttled(cfs_rq))
-                       return 0;
+                       goto out;
        }
 
        sub_nr_running(rq, h_nr_queued);
@@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
        if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
                rq->next_balance = jiffies;
 
+       ret = 1;
+out:
        if (p && task_delayed) {
                WARN_ON_ONCE(!task_sleep);
                WARN_ON_ONCE(p->on_rq != 1);
@@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
                __block_task(rq, p);
        }
 
-       return 1;
+       return ret;
 }
 
 /*
On Fri, Sep 26, 2025 at 3:43 AM K Prateek Nayak kprateek.nayak@amd.com wrote:
Hello John, Matt,
On 9/26/2025 5:35 AM, John Stultz wrote:
However, there are two spots where we might exit dequeue_entities() early when cfs_rq_throttled(rq), so maybe that's what's catching us here?
That could very likely be it.
That tracks -- we're heavy users of cgroups and this particular issue only appeared on our kubernetes nodes.
Matt, if possible can you try the patch attached below to check if the bailout for throttled hierarchy is indeed the root cause. Thanks in advance.
I've been running our reproducer with this patch for the last few hours without any issues, so the fix looks good to me.
Tested-by: Matt Fleming mfleming@cloudflare.com
Hello Matt,
On 9/26/2025 9:04 PM, Matt Fleming wrote:
On Fri, Sep 26, 2025 at 3:43 AM K Prateek Nayak kprateek.nayak@amd.com wrote:
Hello John, Matt,
On 9/26/2025 5:35 AM, John Stultz wrote:
However, there are two spots where we might exit dequeue_entities() early when cfs_rq_throttled(rq), so maybe that's what's catching us here?
That could very likely be it.
That tracks -- we're heavy users of cgroups and this particular issue only appeared on our kubernetes nodes.
Matt, if possible can you try the patch attached below to check if the bailout for throttled hierarchy is indeed the root cause. Thanks in advance.
I've been running our reproducer with this patch for the last few hours without any issues, so the fix looks good to me.
Tested-by: Matt Fleming mfleming@cloudflare.com
Thank you for testing the diff. I see Ingo has already posted the scheduler pull for v6.18 which indirectly solves this by removing those early returns.
Once that lands, I'll attach a formal commit log and send out a patch targeting the stable kernels >= 6.12.
On Fri, Sep 26, 2025 at 08:13:09AM +0530, K Prateek Nayak wrote:
Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a little odd, as the desired dequeue didn't really complete, but dequeue_task_fair() will still return true indicating success - not that too many places are checking the dequeue_task return. Is that right?
Bah, i'm forever confused on the throttle cases there :/
I think for the most part until now it was harmless, as we couldn't pick from a throttled hierarchy, and other calls to dequeue_task(DEQUEUE_DELAYED) would later do a:

        queued = task_on_rq_queued(p);
        ...
        if (queued)
                enqueue_task(p)

which would either spuriously run a blocked task (which would then block again), or a wakeup would properly wake the queued task via ttwu_runnable(). wait_task_inactive() is interesting, though, as it expects the dequeue to result in a block, which never happens with throttled hierarchies. I'm impressed the double dequeue doesn't result in any major splats!
Matt, if possible can you try the patch attached below to check if the bailout for throttled hierarchy is indeed the root cause. Thanks in advance.
P.S. the per-task throttle in tip:sched/core would get rid of all this but it would be good to have a fix via tip:sched/urgent to get it backported to v6.12 LTS and the newer stable kernels.
Yes, good riddance to that code :-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ce56a8d507f..f0a4d9d7424d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
        int h_nr_runnable = 0;
        struct cfs_rq *cfs_rq;
        u64 slice = 0;
+       int ret = 0; /* XXX: Do we care if ret is 0 vs 1 since we only check ret < 0? */
Right, we don't appear to really need that.
        if (entity_is_task(se)) {
                p = task_of(se);
@@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
                /* end evaluation on encountering a throttled cfs_rq */
                if (cfs_rq_throttled(cfs_rq))
-                       return 0;
+                       goto out;
 
                /* Don't dequeue parent if it has other entities besides us */
                if (cfs_rq->load.weight) {
@@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
                /* end evaluation on encountering a throttled cfs_rq */
                if (cfs_rq_throttled(cfs_rq))
-                       return 0;
+                       goto out;
        }
 
        sub_nr_running(rq, h_nr_queued);
@@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
        if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
                rq->next_balance = jiffies;
 
+       ret = 1;
+out:
        if (p && task_delayed) {
                WARN_ON_ONCE(!task_sleep);
                WARN_ON_ONCE(p->on_rq != 1);
@@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
                __block_task(rq, p);
        }
 
-       return 1;
+       return ret;
 }
So the difference is that we also do __block_task() when we get throttled somewhere in the hierarchy. IIRC when I was looking at this, I thought it wouldn't matter since it won't get picked anyway, on account of the cfs_rq being blocked/detached, so who cares.
But yeah, this makes sense.
Patch logistics are going to be a pain -- .17 is closed and the merge window is open, which means Linus will have the per-task throttle and /urgent doesn't work anymore.
At this point the best we can do is a patch to stable with a note that upstream is no longer affected due to the rework, or something.