On Fri, Sep 26, 2025 at 08:13:09AM +0530, K Prateek Nayak wrote:
> Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a little odd, as the desired dequeue didn't really complete, but dequeue_task_fair() will still return true indicating success - not that many places check the dequeue_task() return value. Is that right?
Bah, I'm forever confused on the throttle cases there :/
> I think for the most part it has been harmless until now: we cannot pick from a throttled hierarchy, and the other callers of dequeue_task(DEQUEUE_DELAYED) later do a
>
> 	queued = task_on_rq_queued(p);
> 	...
> 	if (queued)
> 		enqueue_task(p);
>
> which either leads to spuriously running a blocked task (it simply blocks again), or to a wakeup properly waking the queued task via ttwu_runnable(). wait_task_inactive() is the interesting one, though: it expects the dequeue to result in a block, which never happens with throttled hierarchies. I'm impressed the double dequeue doesn't result in any major splats!
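To make that wait_task_inactive() expectation concrete, here is a heavily simplified sketch (not the real implementation, which also rechecks task state and retries under the rq lock): the waiter effectively polls until the target drops off the runqueue, i.e. until something clears p->on_rq.

	/*
	 * Roughly what wait_task_inactive() waits for (simplified sketch):
	 * p->on_rq is only cleared once __block_task() runs.
	 */
	while (task_on_rq_queued(p))	/* p->on_rq == TASK_ON_RQ_QUEUED */
		cpu_relax();

If dequeue_entities() bails out on a throttled cfs_rq before __block_task() is reached, a delayed task keeps p->on_rq == 1 even though it can never be picked, so a loop like the above never terminates.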
> Matt, if possible can you try the patch attached below to check whether the bailout for throttled hierarchies is indeed the root cause. Thanks in advance.
>
> P.S. The per-task throttle in tip:sched/core would get rid of all this, but it would be good to have a fix via tip:sched/urgent so it can be backported to the v6.12 LTS and the newer stable kernels.
Yes, good riddance to that code :-)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8ce56a8d507f..f0a4d9d7424d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  	int h_nr_runnable = 0;
>  	struct cfs_rq *cfs_rq;
>  	u64 slice = 0;
> +	int ret = 0; /* XXX: Do we care if ret is 0 vs 1 since we only check ret < 0? */
Right, we don't appear to really need that.
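For context, the only consumer of that return value is dequeue_task_fair(), which treats everything except ret < 0 ("task stays queued as delayed") the same -- paraphrased from memory below, not quoted verbatim:

	static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
	{
		/* ... util_est accounting ... */
		if (dequeue_entities(rq, &p->se, flags) < 0)
			return false;	/* delayed dequeue: task remains queued */

		/* ... */
		return true;		/* 0 and 1 both end up here */
	}

So 0 vs 1 only matters for readability, not behaviour.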
>  
>  	if (entity_is_task(se)) {
>  		p = task_of(se);
> @@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  
>  		/* end evaluation on encountering a throttled cfs_rq */
>  		if (cfs_rq_throttled(cfs_rq))
> -			return 0;
> +			goto out;
>  
>  		/* Don't dequeue parent if it has other entities besides us */
>  		if (cfs_rq->load.weight) {
> @@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  
>  		/* end evaluation on encountering a throttled cfs_rq */
>  		if (cfs_rq_throttled(cfs_rq))
> -			return 0;
> +			goto out;
>  	}
>  
>  	sub_nr_running(rq, h_nr_queued);
> @@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
>  		rq->next_balance = jiffies;
>  
> +	ret = 1;
> +out:
>  	if (p && task_delayed) {
>  		WARN_ON_ONCE(!task_sleep);
>  		WARN_ON_ONCE(p->on_rq != 1);
> @@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  		__block_task(rq, p);
>  	}
>  
> -	return 1;
> +	return ret;
>  }
So the difference is that we also do __block_task() when we get throttled somewhere in the hierarchy. IIRC when I was looking at this, I thought it wouldn't matter since it won't get picked anyway, on account of the cfs_rq being blocked/detached, so who cares.
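For reference, the relevant effect of __block_task(), sketched rather than quoted verbatim (the real helper lives in kernel/sched/sched.h and also does nr_uninterruptible/iowait accounting):

	static inline void __block_task(struct rq *rq, struct task_struct *p)
	{
		/* ... load accounting elided ... */

		/*
		 * The release store that actually marks the task blocked;
		 * after this, pollers of p->on_rq (wait_task_inactive() et
		 * al.) can make progress, and @p may be migrated away, so
		 * this must be the last access to @p.
		 */
		smp_store_release(&p->on_rq, 0);
	}

With the patch, this store now also happens when the dequeue bails out on a throttled hierarchy, which is exactly what the early returns skipped.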
But yeah, this makes sense.
Patch logistics are going to be a pain -- .17 is closed and the merge window is open, which means Linus will have the per-task throttle and /urgent doesn't work anymore.
At this point the best we can do is a patch to stable with a note that upstream is no longer affected due to the rework.