Hello John, Matt,
On 9/26/2025 5:35 AM, John Stultz wrote:
On Thu, Sep 25, 2025 at 6:33 AM Matt Fleming <matt@readmodwrite.com> wrote:
From: Matt Fleming <mfleming@cloudflare.com>
This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.
If we dequeue a task (task B) that was sched_delayed, then that task is definitely no longer on the rq and is no longer tracked in the rbtree. Unfortunately, task_on_rq_queued(B) will still return true because dequeue_task() doesn't update p->on_rq.
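(For reference, task_on_rq_queued() is just a comparison against p->on_rq, so any dequeue path that skips the final p->on_rq update leaves the task looking queued. Roughly, from kernel/sched/sched.h, modulo minor version differences:

	/* p->on_rq is 0, TASK_ON_RQ_QUEUED (1) or TASK_ON_RQ_MIGRATING (2) */
	static inline int task_on_rq_queued(struct task_struct *p)
	{
		return READ_ONCE(p->on_rq) == TASK_ON_RQ_QUEUED;
	}
)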
Hey! Sorry again that my patch has been causing you trouble. Thanks for your persistence in chasing this down!
It's confusing, as this patch used logic similar to what pick_next_entity() does when a sched_delayed task is picked to run, as well as what's done elsewhere in __sched_setscheduler() and in sched_ext, so I'd fret that similar problems may be lurking in those paths as well.
And my impression was that dequeue_task() on a sched_delayed task would update p->on_rq by calling __block_task() at the end of dequeue_entities().
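That tail (visible in the last hunks of the diff below) is the only place this path clears p->on_rq; roughly:

	if (p && task_delayed) {
		WARN_ON_ONCE(!task_sleep);
		WARN_ON_ONCE(p->on_rq != 1);
		...
		__block_task(rq, p);	/* this is what finally sets p->on_rq back to 0 */
	}

so bailing out before reaching it leaves p->on_rq == 1.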
However, there are two spots where we might exit dequeue_entities() early when cfs_rq_throttled(cfs_rq) is true, so maybe that's what's catching us here?
That could very likely be it.
Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a little odd, as the desired dequeue didn't really complete, but dequeue_task_fair() will still return true, indicating success (not that many places check the dequeue_task() return value). Is that right?
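(For context: dequeue_task_fair() only treats a negative return from dequeue_entities() as failure, so the throttled bailout's "return 0" is reported as success just like a completed dequeue's "return 1". Simplified sketch from kernel/sched/fair.c:

	static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
	{
		...
		if (dequeue_entities(rq, &p->se, flags) < 0)
			return false;
		...
		return true;
	}
)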
I think for the most part it has been harmless until now, as we couldn't pick from a throttled hierarchy, and other callers of dequeue_task(DEQUEUE_DELAYED) would later do a:
	queued = task_on_rq_queued(p);
	...
	if (queued)
		enqueue_task(p);
which would either spuriously run a blocked task (which would promptly block again), or a wakeup would properly wake the queued task via ttwu_runnable(). wait_task_inactive() is the interesting one, though, as it expects the dequeue to result in a block, which never happens with throttled hierarchies. I'm impressed the double dequeue doesn't result in any major splats!
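For context, wait_task_inactive() retries for as long as the task still looks queued, so a p->on_rq that never drops back to 0 keeps it spinning; heavily paraphrased from kernel/sched/core.c:

	for (;;) {
		...
		queued = task_on_rq_queued(p);
		...
		if (unlikely(queued)) {
			/* still looks on the rq: sleep briefly and retry */
			...
			continue;
		}
		break;	/* dequeue really completed: p->on_rq reached 0 */
	}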
Matt, if possible, can you try the patch attached below to check whether the bailout for the throttled hierarchy is indeed the root cause? Thanks in advance.
P.S. The per-task throttle in tip:sched/core would get rid of all this, but it would be good to have a fix via tip:sched/urgent so it can be backported to v6.12 LTS and the newer stable kernels.
---
Thanks and Regards,
Prateek
(Prepared on top of tip:sched/urgent but should apply fine on v6.17-rc7)
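The gist of the diff: instead of returning straight away on the two throttled bailouts, jump to the tail of dequeue_entities() so the task_delayed fixup still runs and __block_task() gets to clear p->on_rq.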
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ce56a8d507f..f0a4d9d7424d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	int h_nr_runnable = 0;
 	struct cfs_rq *cfs_rq;
 	u64 slice = 0;
+	int ret = 0; /* XXX: Do we care if ret is 0 vs 1 since we only check ret < 0? */
 
 	if (entity_is_task(se)) {
 		p = task_of(se);
@@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			return 0;
+			goto out;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			return 0;
+			goto out;
 	}
 
 	sub_nr_running(rq, h_nr_queued);
@@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
 		rq->next_balance = jiffies;
 
+	ret = 1;
+out:
 	if (p && task_delayed) {
 		WARN_ON_ONCE(!task_sleep);
 		WARN_ON_ONCE(p->on_rq != 1);
@@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		__block_task(rq, p);
 	}
 
-	return 1;
+	return ret;
 }
 
 /*