[PATCH] Revert "sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks"

List overview All Threads
Download

newer

older

[PATCH RESEND] drm/tegra: dc: fix...

[PATCH 6.1.y v2] Revert...

Matt Fleming

25 Sep 2025 25 Sep '25

1:33 p.m.

From: Matt Fleming mfleming@cloudflare.com

This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.

If we dequeue a task (task B) that was sched delayed then that task is definitely no longer on the rq and not tracked in the rbtree. Unfortunately, task_on_rq_queued(B) will still return true because dequeue_task() doesn't update p->on_rq.

This inconsistency can lead to tasks (task A) spinning indefinitely in wait_task_inactive(), e.g. when delivering a fatal signal to a thread group, because it thinks the task B is still queued (it's not) and waits forever for it to unschedule.

Task A Task B

arch_do_signal_or_restart() get_signal() do_coredump() coredump_wait() zap_threads() arch_do_signal_or_restart() wait_task_inactive() <-- SPIN get_signal() do_group_exit() do_exit() coredump_task_exit() schedule() <--- never comes back

Not only will task A spin forever in wait_task_inactive(), but task B will also trigger RCU stalls:

INFO: rcu_tasks detected stalls on tasks: 00000000a973a4d8: .. nvcsw: 2/2 holdout: 1 idle_cpu: -1/79 task:ffmpeg state:I stack:0 pid:665601 tgid:665155 ppid:668691 task_flags:0x400448 flags:0x00004006 Call Trace: <TASK> __schedule+0x4fb/0xbf0 ? srso_return_thunk+0x5/0x5f schedule+0x27/0xf0 do_exit+0xdd/0xaa0 ? __pfx_futex_wake_mark+0x10/0x10 do_group_exit+0x30/0x80 get_signal+0x81e/0x860 ? srso_return_thunk+0x5/0x5f ? futex_wake+0x177/0x1a0 arch_do_signal_or_restart+0x2e/0x1f0 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? __x64_sys_futex+0x10c/0x1d0 syscall_exit_to_user_mode+0xa5/0x130 do_syscall_64+0x57/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f22d05b0f16 RSP: 002b:00007f2265761cf0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f22d05b0f16 RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00005629e320d97c RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff R10: 0000000000000000 R11: 0000000000000246 R12: 00005629e320d928 R13: 0000000000000000 R14: 0000000000000001 R15: 00005629e320d97c </TASK>

Fixes: b7ca5743a260 ("sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks") Cc: Peter Zijlstra peterz@infradead.org Cc: Oleg Nesterov oleg@redhat.com Cc: John Stultz jstultz@google.com Cc: Chris Arges carges@cloudflare.com Cc: stable@vger.kernel.org # v6.12 Signed-off-by: Matt Fleming mfleming@cloudflare.com --- kernel/sched/core.c | 6 ------ 1 file changed, 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ccba6fc3c3fe..2dfc3977920d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2293,12 +2293,6 @@ unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state * just go back and repeat. */ rq = task_rq_lock(p, &rf); - /* - * If task is sched_delayed, force dequeue it, to avoid always - * hitting the tick timeout in the queued case - */ - if (p->se.sched_delayed) - dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED); trace_sched_wait_task(p); running = task_on_cpu(rq, p); queued = task_on_rq_queued(p);

-- 2.34.1

Show replies by date

John Stultz

26 Sep 26 Sep

12:05 a.m.

On Thu, Sep 25, 2025 at 6:33 AM Matt Fleming matt@readmodwrite.com wrote:

...

From: Matt Fleming mfleming@cloudflare.com

This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.

If we dequeue a task (task B) that was sched delayed then that task is definitely no longer on the rq and not tracked in the rbtree. Unfortunately, task_on_rq_queued(B) will still return true because dequeue_task() doesn't update p->on_rq.

Hey! Sorry again my patch has been causing you trouble. Thanks for your persistence in chasing this down!

It's confusing as this patch uses the similar logic as logic pick_next_entity() uses when a sched_delayed task is picked to run, as well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd fret that similar

And my impression was that dequeue_task() on a sched_delayed task would update p->on_rq via calling __block_task() at the end of dequeue_entities().

However, there are two spots where we might exit dequeue_entities() early when cfs_rq_throttled(rq), so maybe that's what's catching us here?

Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a little odd, as the desired dequeue didn't really complete, but dequeue_task_fair() will still return true indicating success - not that too many places are checking the dequeue_task return. Is that right?

thanks -john

John Stultz

12:06 a.m.

On Thu, Sep 25, 2025 at 5:05 PM John Stultz jstultz@google.com wrote:

...

It's confusing as this patch uses the similar logic as logic pick_next_entity() uses when a sched_delayed task is picked to run, as well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd fret that similar

Didn't finish my sentence there. "...similar problems as the one you are seeing could still occur."

thanks -john

K Prateek Nayak

2:43 a.m.

Hello John, Matt,

On 9/26/2025 5:35 AM, John Stultz wrote:

...

On Thu, Sep 25, 2025 at 6:33 AM Matt Fleming matt@readmodwrite.com wrote:

...
From: Matt Fleming mfleming@cloudflare.com

This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.

If we dequeue a task (task B) that was sched delayed then that task is definitely no longer on the rq and not tracked in the rbtree. Unfortunately, task_on_rq_queued(B) will still return true because dequeue_task() doesn't update p->on_rq.

Hey! Sorry again my patch has been causing you trouble. Thanks for your persistence in chasing this down!

It's confusing as this patch uses the similar logic as logic pick_next_entity() uses when a sched_delayed task is picked to run, as well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd fret that similar

And my impression was that dequeue_task() on a sched_delayed task would update p->on_rq via calling __block_task() at the end of dequeue_entities().

However, there are two spots where we might exit dequeue_entities() early when cfs_rq_throttled(rq), so maybe that's what's catching us here?

That could very likely be it.

...

Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a little odd, as the desired dequeue didn't really complete, but dequeue_task_fair() will still return true indicating success - not that too many places are checking the dequeue_task return. Is that right?

I think for most part until now it was harmless as we couldn't pick on a throttled hierarchy and other calls to dequeue_task(DEQUEUE_DELAYED) would later do a:

queued = task_on_rq_queued(p); ... if (queued) enqueue_task(p)

which would either lead to spuriously running a blocked task and it would block back again, or a wakeup would properly wakeup the queued task via ttwu_runnable() but wait_task_inactive() is interesting as it expects the dequeue will result in a block which never happens with throttled hierarchies. I'm impressed double dequeue doesn't result in any major splats!

Matt, if possible can you try the patch attached below to check if the bailout for throttled hierarchy is indeed the root cause. Thanks in advance.

P.S. the per-task throttle in tip:sched/core would get rid of all this but it would be good to have a fix via tip:sched/urgent to get it backported to v6.12 LTS and the newer stable kernels.

--- Thanks and Regards, Prateek

(Prepared on top of tip:sched/urgent but should apply fine on v6.17-rc7)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8ce56a8d507f..f0a4d9d7424d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) int h_nr_runnable = 0; struct cfs_rq *cfs_rq; u64 slice = 0; + int ret = 0; /* XXX: Do we care if ret is 0 vs 1 since we only check ret < 0? */

if (entity_is_task(se)) { p = task_of(se); @@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)

/* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) - return 0; + goto out;

/* Don't dequeue parent if it has other entities besides us */ if (cfs_rq->load.weight) { @@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)

/* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) - return 0; + goto out; }

sub_nr_running(rq, h_nr_queued); @@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) if (unlikely(!was_sched_idle && sched_idle_rq(rq))) rq->next_balance = jiffies;

+ ret = 1; +out: if (p && task_delayed) { WARN_ON_ONCE(!task_sleep); WARN_ON_ONCE(p->on_rq != 1); @@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) __block_task(rq, p); }

- return 1; + return ret; }

Matt Fleming

3:34 p.m.

On Fri, Sep 26, 2025 at 3:43 AM K Prateek Nayak kprateek.nayak@amd.com wrote:

...

Hello John, Matt,

On 9/26/2025 5:35 AM, John Stultz wrote:

...
However, there are two spots where we might exit dequeue_entities() early when cfs_rq_throttled(rq), so maybe that's what's catching us here?

That could very likely be it.

That tracks -- we're heavy users of cgroups and this particular issue only appeared on our kubernetes nodes.

...

Matt, if possible can you try the patch attached below to check if the bailout for throttled hierarchy is indeed the root cause. Thanks in advance.

I've been running our reproducer with this patch for the last few hours without any issues, so the fix looks good to me.

Tested-by: Matt Fleming mfleming@cloudflare.com

K Prateek Nayak

29 Sep 29 Sep

3:53 a.m.

Hello Matt,

On 9/26/2025 9:04 PM, Matt Fleming wrote:

...

On Fri, Sep 26, 2025 at 3:43 AM K Prateek Nayak kprateek.nayak@amd.com wrote:

...
Hello John, Matt,

On 9/26/2025 5:35 AM, John Stultz wrote:

...
However, there are two spots where we might exit dequeue_entities() early when cfs_rq_throttled(rq), so maybe that's what's catching us here?

That could very likely be it.

That tracks -- we're heavy users of cgroups and this particular issue only appeared on our kubernetes nodes.

...
Matt, if possible can you try the patch attached below to check if the bailout for throttled hierarchy is indeed the root cause. Thanks in advance.

I've been running our reproducer with this patch for the last few hours without any issues, so the fix looks good to me.

Tested-by: Matt Fleming mfleming@cloudflare.com

Thank you for testing the diff. I see Ingo has already posted the scheduler pull for v6.18 which indirectly solves this by removing those early returns.

Once that lands, I'll attach a formal commit log and send out a patch targeting the stable kernels >= 6.12.

-- Thanks and Regards, Prateek

Matt Fleming

14 Oct 14 Oct

10:15 a.m.

On Mon, Sep 29, 2025 at 09:23:06AM +0530, K Prateek Nayak wrote:

...

Thank you for testing the diff. I see Ingo has already posted the scheduler pull for v6.18 which indirectly solves this by removing those early returns.

Once that lands, I'll attach a formal commit log and send out a patch targeting the stable kernels >= 6.12.

Hey Prateek, I see the merge window is closed. Do you have plans to submit a patch for stable now?

K Prateek Nayak

15 Oct 15 Oct

6:03 a.m.

New subject: [PATCH v6.12] sched/fair: Block delayed tasks on throttled hierarchy during dequeue

Dequeuing a fair task on a throttled hierarchy returns early on encountering a throttled cfs_rq since the throttle path has already dequeued the hierarchy above and has adjusted the h_nr_* accounting till the root cfs_rq.

dequeue_entities() crucially misses calling __block_task() for delayed tasks being dequeued on the throttled hierarchies, but this was mostly harmless until commit b7ca5743a260 ("sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks") since all existing cases would re-enqueue the task if task_on_rq_queued() returned true and the task would eventually be blocked at pick after the hierarchy was unthrottled.

wait_task_inactive() is special as it expects the delayed task on throttled hierarchy to reach the blocked state on dequeue but since __block_task() is never called, task_on_rq_queued() continues to return true. Furthermore, since the task is now off the hierarchy, the pick never reaches it to fully block the task even after unthrottle leading to wait_task_inactive() looping endlessly.

Remedy this by calling __block_task() if a delayed task is being dequeued on a throttled hierarchy.

This fix is only required for stabled kernels implementing delay dequeue (>= v6.12) before v6.18 since upstream commit e1fad12dcb66 ("sched/fair: Switch to task based throttle model") indirectly fixes this by removing the early return conditions in dequeue_entities() as part of the per-task throttle feature.

Cc: stable@vger.kernel.org Reported-by: Matt Fleming matt@readmodwrite.com Closes: https://lore.kernel.org/all/20250925133310.1843863-1-matt@readmodwrite.com/ Fixes: b7ca5743a260 ("sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks") Tested-by: Matt Fleming mfleming@cloudflare.com Signed-off-by: K Prateek Nayak kprateek.nayak@amd.com --- Greg, Sasha,

This fix cleanly applies on top of v6.16.y and v6.17.y stable kernels too when cherry-picked from v6.12.y branch (or with 'git am -3'). Let me know if you would like me to send a seperate patch for each.

As mentioned above, the upstream fixes this as a part of larger feature and we would only like these bits backported. If there are any future conflicts in this area during backporting, I would be more than happy to help out resolve them. --- kernel/sched/fair.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index af61769b1d50..b3d9826e25b0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7187,6 +7187,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) int h_nr_delayed = 0; struct cfs_rq *cfs_rq; u64 slice = 0; + int ret = 0;

if (entity_is_task(se)) { p = task_of(se); @@ -7218,7 +7219,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)

/* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) - return 0; + goto out;

/* Don't dequeue parent if it has other entities besides us */ if (cfs_rq->load.weight) { @@ -7261,7 +7262,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)

/* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) - return 0; + goto out; }

sub_nr_running(rq, h_nr_queued); @@ -7273,6 +7274,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) if (unlikely(!was_sched_idle && sched_idle_rq(rq))) rq->next_balance = jiffies;

+ ret = 1; +out: if (p && task_delayed) { SCHED_WARN_ON(!task_sleep); SCHED_WARN_ON(p->on_rq != 1); @@ -7288,7 +7291,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) __block_task(rq, p); }

- return 1; + return ret; }

-- 2.34.1

Greg Kroah-Hartman

6:14 a.m.

New subject: [PATCH v6.12] sched/fair: Block delayed tasks on throttled hierarchy during dequeue

On Wed, Oct 15, 2025 at 06:03:59AM +0000, K Prateek Nayak wrote:

...

Dequeuing a fair task on a throttled hierarchy returns early on encountering a throttled cfs_rq since the throttle path has already dequeued the hierarchy above and has adjusted the h_nr_* accounting till the root cfs_rq.

dequeue_entities() crucially misses calling __block_task() for delayed tasks being dequeued on the throttled hierarchies, but this was mostly harmless until commit b7ca5743a260 ("sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks") since all existing cases would re-enqueue the task if task_on_rq_queued() returned true and the task would eventually be blocked at pick after the hierarchy was unthrottled.

wait_task_inactive() is special as it expects the delayed task on throttled hierarchy to reach the blocked state on dequeue but since __block_task() is never called, task_on_rq_queued() continues to return true. Furthermore, since the task is now off the hierarchy, the pick never reaches it to fully block the task even after unthrottle leading to wait_task_inactive() looping endlessly.

Remedy this by calling __block_task() if a delayed task is being dequeued on a throttled hierarchy.

This fix is only required for stabled kernels implementing delay dequeue (>= v6.12) before v6.18 since upstream commit e1fad12dcb66 ("sched/fair: Switch to task based throttle model") indirectly fixes this by removing the early return conditions in dequeue_entities() as part of the per-task throttle feature.

Cc: stable@vger.kernel.org Reported-by: Matt Fleming matt@readmodwrite.com Closes: https://lore.kernel.org/all/20250925133310.1843863-1-matt@readmodwrite.com/ Fixes: b7ca5743a260 ("sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks") Tested-by: Matt Fleming mfleming@cloudflare.com Signed-off-by: K Prateek Nayak kprateek.nayak@amd.com

Greg, Sasha,

This fix cleanly applies on top of v6.16.y and v6.17.y stable kernels too when cherry-picked from v6.12.y branch (or with 'git am -3'). Let me know if you would like me to send a seperate patch for each.

As mentioned above, the upstream fixes this as a part of larger feature and we would only like these bits backported. If there are any future conflicts in this area during backporting, I would be more than happy to help out resolve them.

Why not just backport all of the mainline changes instead? As I say a lot, whenever we do these "one off" changes, it's almost always wrong and causes problems over the years going forward as other changes around the same area can not be backported either.

So please, try to just backport the original commits.

thanks,

greg k-h

K Prateek Nayak

6:27 a.m.

New subject: [PATCH v6.12] sched/fair: Block delayed tasks on throttled hierarchy during dequeue

Hello Greg,

On 10/15/2025 11:44 AM, Greg Kroah-Hartman wrote:

...

...
Greg, Sasha,

This fix cleanly applies on top of v6.16.y and v6.17.y stable kernels too when cherry-picked from v6.12.y branch (or with 'git am -3'). Let me know if you would like me to send a seperate patch for each.

As mentioned above, the upstream fixes this as a part of larger feature and we would only like these bits backported. If there are any future conflicts in this area during backporting, I would be more than happy to help out resolve them.

Why not just backport all of the mainline changes instead? As I say a lot, whenever we do these "one off" changes, it's almost always wrong and causes problems over the years going forward as other changes around the same area can not be backported either.

So please, try to just backport the original commits.

Peter was in favor of backporting just the necessary bits in https://lore.kernel.org/all/20250929103836.GK3419281@noisy.programming.kicks...

Backporting the whole of per-task throttle feature is lot more heavy handed with the core changes adding:

And a few more fixes that will add to the above before v6.18. I'll defer to Peter to decide the best course of action.

-- Thanks and Regards, Prateek

Greg Kroah-Hartman

7:27 a.m.

New subject: [PATCH v6.12] sched/fair: Block delayed tasks on throttled hierarchy during dequeue

On Wed, Oct 15, 2025 at 11:57:19AM +0530, K Prateek Nayak wrote:

...

Hello Greg,

On 10/15/2025 11:44 AM, Greg Kroah-Hartman wrote:

...
...
Greg, Sasha,

This fix cleanly applies on top of v6.16.y and v6.17.y stable kernels too when cherry-picked from v6.12.y branch (or with 'git am -3'). Let me know if you would like me to send a seperate patch for each.

As mentioned above, the upstream fixes this as a part of larger feature and we would only like these bits backported. If there are any future conflicts in this area during backporting, I would be more than happy to help out resolve them.

Why not just backport all of the mainline changes instead? As I say a lot, whenever we do these "one off" changes, it's almost always wrong and causes problems over the years going forward as other changes around the same area can not be backported either.

So please, try to just backport the original commits.

Peter was in favor of backporting just the necessary bits in https://lore.kernel.org/all/20250929103836.GK3419281@noisy.programming.kicks...

Backporting the whole of per-task throttle feature is lot more heavy handed with the core changes adding:

include/linux/sched.h | 5 + kernel/sched/core.c | 3 + kernel/sched/fair.c | 451 ++++++++++++++++++++++++------------------ kernel/sched/pelt.h | 4 +- kernel/sched/sched.h | 7 +- 5 files changed, 274 insertions(+), 196 deletions(-)

That's very tiny overall in the scheme of what we take for the stable trees.

...

And a few more fixes that will add to the above before v6.18. I'll defer to Peter to decide the best course of action.

We'll defer to the maintainers of the subsystem as to what they want here. If they say take this smaller patch, we'll be glad to do so.

thanks,

greg k-h

Peter Zijlstra

8:22 a.m.

New subject: [PATCH v6.12] sched/fair: Block delayed tasks on throttled hierarchy during dequeue

On Wed, Oct 15, 2025 at 09:27:53AM +0200, Greg Kroah-Hartman wrote:

...

On Wed, Oct 15, 2025 at 11:57:19AM +0530, K Prateek Nayak wrote:

...
Hello Greg,

On 10/15/2025 11:44 AM, Greg Kroah-Hartman wrote:

...
...
Greg, Sasha,

This fix cleanly applies on top of v6.16.y and v6.17.y stable kernels too when cherry-picked from v6.12.y branch (or with 'git am -3'). Let me know if you would like me to send a seperate patch for each.

As mentioned above, the upstream fixes this as a part of larger feature and we would only like these bits backported. If there are any future conflicts in this area during backporting, I would be more than happy to help out resolve them.

Why not just backport all of the mainline changes instead? As I say a lot, whenever we do these "one off" changes, it's almost always wrong and causes problems over the years going forward as other changes around the same area can not be backported either.

So please, try to just backport the original commits.

Peter was in favor of backporting just the necessary bits in https://lore.kernel.org/all/20250929103836.GK3419281@noisy.programming.kicks...

Backporting the whole of per-task throttle feature is lot more heavy handed with the core changes adding:

include/linux/sched.h | 5 + kernel/sched/core.c | 3 + kernel/sched/fair.c | 451 ++++++++++++++++++++++++------------------ kernel/sched/pelt.h | 4 +- kernel/sched/sched.h | 7 +- 5 files changed, 274 insertions(+), 196 deletions(-)

That's very tiny overall in the scheme of what we take for the stable trees.

...
And a few more fixes that will add to the above before v6.18. I'll defer to Peter to decide the best course of action.

We'll defer to the maintainers of the subsystem as to what they want here. If they say take this smaller patch, we'll be glad to do so.

So if the timing of all this would've been slightly different, I'd have taken the smaller patch and routed it through sched/urgent in time for the 6.17 release. And we'd not have had this discussion, but alas.

Also, while we're fairly confident in the task throttling rework, it is somewhat invasive and hasn't seen widespread testing -- it is conceivable there are performance issues.

So at this point I would really rather backport the smaller patch that fixes the immediate bug.

Thanks!

K Prateek Nayak

23 Oct 23 Oct

4:03 a.m.

New subject: [PATCH 6.17] sched/fair: Block delayed tasks on throttled hierarchy during dequeue

Remedy this by calling __block_task() if a delayed task is being dequeued on a throttled hierarchy.

Please consider the same fix for the v6.17 stable kernel too since there is a report of a similar issue on v6.17.1 based RT kernel at https://lore.kernel.org/lkml/aPN7XBJbGhdWJDb2@uudg.org/ and Luis confirmed that this fix solves the issue for him in https://lore.kernel.org/lkml/aPgm6KvDx5Os2oJS@uudg.org/ --- kernel/sched/fair.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)

if (entity_is_task(se)) { p = task_of(se); @@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)

/* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) - return 0; + goto out;

/* Don't dequeue parent if it has other entities besides us */ if (cfs_rq->load.weight) { @@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)

/* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) - return 0; + goto out; }

- return 1; + return ret; }

base-commit: 6c7871823908a4330e145d635371582f76ce1407

-- 2.34.1

Peter Zijlstra

29 Sep 29 Sep

10:38 a.m.

On Fri, Sep 26, 2025 at 08:13:09AM +0530, K Prateek Nayak wrote:

...

...
Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a little odd, as the desired dequeue didn't really complete, but dequeue_task_fair() will still return true indicating success - not that too many places are checking the dequeue_task return. Is that right?

Bah, i'm forever confused on the throttle cases there :/

...

I think for most part until now it was harmless as we couldn't pick on a throttled hierarchy and other calls to dequeue_task(DEQUEUE_DELAYED) would later do a:
queued = task_on_rq_queued(p);
...
if (queued)
    enqueue_task(p)
which would either lead to spuriously running a blocked task and it would block back again, or a wakeup would properly wakeup the queued task via ttwu_runnable() but wait_task_inactive() is interesting as it expects the dequeue will result in a block which never happens with throttled hierarchies. I'm impressed double dequeue doesn't result in any major splats!

Matt, if possible can you try the patch attached below to check if the bailout for throttled hierarchy is indeed the root cause. Thanks in advance.

P.S. the per-task throttle in tip:sched/core would get rid of all this but it would be good to have a fix via tip:sched/urgent to get it backported to v6.12 LTS and the newer stable kernels.

Yes, good riddance to that code :-)

...

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8ce56a8d507f..f0a4d9d7424d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) int h_nr_runnable = 0; struct cfs_rq *cfs_rq; u64 slice = 0;

int ret = 0; /* XXX: Do we care if ret is 0 vs 1 since we only check ret < 0? */

Right, we don't appear to really need that.

...

if (entity_is_task(se)) { p = task_of(se); @@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq))
	return 0;
	goto out;
/* Don't dequeue parent if it has other entities besides us */ if (cfs_rq->load.weight) { @@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq))
	return 0;
	goto out;
}
sub_nr_running(rq, h_nr_queued); @@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) if (unlikely(!was_sched_idle && sched_idle_rq(rq))) rq->next_balance = jiffies;

ret = 1;

+out: if (p && task_delayed) { WARN_ON_ONCE(!task_sleep); WARN_ON_ONCE(p->on_rq != 1); @@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) __block_task(rq, p); }

return 1;

return ret;

}

So the difference is that we also do __block_task() when we get throttled somewhere in the hierarchy. IIRC when I was looking at this, I thought it wouldn't matter since it won't get picked anyway, on account of the cfs_rq being blocked/detached, so who cares.

But yeah, this makes sense.

Patch logistics are going to be a pain -- .17 is closed and merge window is open, which means Linus will have per-task throttle and /urgent don't work no more.

At this point best we can do is a patch to stable with a note that upstream is no longer affected due to rework or something.

days inactive

103

days old

linux-stable-mirror@lists.linaro.org

13 comments

participants

tags (0)

participants (5)

Greg Kroah-Hartman
John Stultz
K Prateek Nayak
Matt Fleming
Peter Zijlstra