This series backports the following patches for v6.12.
Link: https://lore.kernel.org/lkml/20251107160645.929564468@infradead.org/
Peter Zijlstra (4):
  sched/fair: Revert max_newidle_lb_cost bump
  sched/fair: Small cleanup to sched_balance_newidle()
  sched/fair: Small cleanup to update_newidle_cost()
  sched/fair: Proportional newidle balance
 include/linux/sched/topology.h |  3 ++
 kernel/sched/core.c            |  3 ++
 kernel/sched/fair.c            | 74 +++++++++++++++++++++++-----------
 kernel/sched/features.h        |  5 +++
 kernel/sched/sched.h           |  7 ++++
 kernel/sched/topology.c        |  6 +++
 6 files changed, 75 insertions(+), 23 deletions(-)
From: Peter Zijlstra <peterz@infradead.org>
commit d206fbad9328ddb68ebabd7cf7413392acd38081 upstream.
Many people reported regressions on their database workloads due to:
155213a2aed4 ("sched/fair: Bump sd->max_newidle_lb_cost when newidle balance fails")
For instance Adam Li reported a 6% regression on SpecJBB.
Conversely, this will regress schbench again; on my machine it drops from 2.22 Mrps/s down to 2.04 Mrps/s.
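As a backport-review aid, here is a minimal user-space model of the behavioural difference; it is illustrative only, not kernel code, and it assumes the default sysctl_sched_migration_cost of 500000 ns. With the bump in place, every newidle balance that fails to pull a task inflates the recorded cost by 50% until it saturates near the migration cost, while the reverted code only tracks the largest cost actually measured. Since sched_balance_newidle() skips balancing whenever this_rq->avg_idle is below sd->max_newidle_lb_cost, the inflated value makes idle CPUs stop pulling runnable tasks.

/*
 * Illustrative user-space model of the two cost-tracking policies;
 * not kernel code. Assumes sysctl_sched_migration_cost = 500000 ns.
 */
#include <stdio.h>

#define MIGRATION_COST 500000ULL

/* behaviour added by 155213a2aed4 (reverted here): failures bump the cost */
static unsigned long long bumped_update(unsigned long long max,
					unsigned long long cost, int pulled_task)
{
	if (!pulled_task)
		cost = (3 * max) / 2;
	if (cost > max)
		max = cost < MIGRATION_COST + 200 ? cost : MIGRATION_COST + 200;
	return max;
}

/* reverted behaviour: plain max tracking of the measured cost */
static unsigned long long plain_update(unsigned long long max,
				       unsigned long long cost)
{
	return cost > max ? cost : max;
}

int main(void)
{
	unsigned long long bumped = 10000, plain = 10000;	/* ~10 us balance cost */

	/* 20 newidle balances in a row that pull nothing */
	for (int i = 0; i < 20; i++) {
		bumped = bumped_update(bumped, 10000, 0);
		plain = plain_update(plain, 10000);
	}
	printf("bumped: %llu ns\nplain:  %llu ns\n", bumped, plain);
	return 0;
}

In this sketch the bumped value saturates at 500200 ns after a run of failures while plain max tracking stays at the measured 10000 ns; whenever avg_idle is below that inflated figure, newidle balancing is skipped, which matches the regressions reported above.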
Reported-by: Joseph Salisbury <joseph.salisbury@oracle.com>
Reported-by: Adam Li <adamli@os.amperecomputing.com>
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://lkml.kernel.org/r/20250626144017.1510594-2-clm@fb.com
Link: https://lkml.kernel.org/r/006c9df2-b691-47f1-82e6-e233c3f91faf@oracle.com
Link: https://patch.msgid.link/20251107161739.406147760@infradead.org
[ Ajay: Modified to apply on v6.12 ]
Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com>
---
 kernel/sched/fair.c | 19 +++----------------
 1 file changed, 3 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8bdcb5df0..7ba5dd10e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12223,14 +12223,8 @@ static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
 		/*
 		 * Track max cost of a domain to make sure to not delay the
 		 * next wakeup on the CPU.
-		 *
-		 * sched_balance_newidle() bumps the cost whenever newidle
-		 * balance fails, and we don't want things to grow out of
-		 * control. Use the sysctl_sched_migration_cost as the upper
-		 * limit, plus a litle extra to avoid off by ones.
 		 */
-		sd->max_newidle_lb_cost =
-			min(cost, sysctl_sched_migration_cost + 200);
+		sd->max_newidle_lb_cost = cost;
 		sd->last_decay_max_lb_cost = jiffies;
 	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
 		/*
@@ -12935,17 +12929,10 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 
 			t1 = sched_clock_cpu(this_cpu);
 			domain_cost = t1 - t0;
+			update_newidle_cost(sd, domain_cost);
+
 			curr_cost += domain_cost;
 			t0 = t1;
-
-			/*
-			 * Failing newidle means it is not effective;
-			 * bump the cost so we end up doing less of it.
-			 */
-			if (!pulled_task)
-				domain_cost = (3 * sd->max_newidle_lb_cost) / 2;
-
-			update_newidle_cost(sd, domain_cost);
 		}
 
 		/*
From: Peter Zijlstra <peterz@infradead.org>
commit e78e70dbf603c1425f15f32b455ca148c932f6c1 upstream.
Pull out the !sd check to simplify code.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://patch.msgid.link/20251107161739.525916173@infradead.org
[ Ajay: Modified to apply on v6.12 ]
Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com>
---
 kernel/sched/fair.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ba5dd10e..b6637954e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12895,14 +12895,16 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 
 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
+	if (!sd) {
+		rcu_read_unlock();
+		goto out;
+	}
 
 	if (!get_rd_overloaded(this_rq->rd) ||
-	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
+	    this_rq->avg_idle < sd->max_newidle_lb_cost) {
 
-		if (sd)
-			update_next_balance(sd, &next_balance);
+		update_next_balance(sd, &next_balance);
 		rcu_read_unlock();
-
 		goto out;
 	}
 	rcu_read_unlock();
From: Peter Zijlstra <peterz@infradead.org>
commit 08d473dd8718e4a4d698b1113a14a40ad64a909b upstream.
Simplify code by adding a few variables.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://patch.msgid.link/20251107161739.655208666@infradead.org
[ Ajay: Modified to apply on v6.12 ]
Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com>
---
 kernel/sched/fair.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b6637954e..ae5da8f34 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12219,22 +12219,25 @@ void update_max_interval(void)
 
 static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
 {
+	unsigned long next_decay = sd->last_decay_max_lb_cost + HZ;
+	unsigned long now = jiffies;
+
 	if (cost > sd->max_newidle_lb_cost) {
 		/*
 		 * Track max cost of a domain to make sure to not delay the
 		 * next wakeup on the CPU.
 		 */
 		sd->max_newidle_lb_cost = cost;
-		sd->last_decay_max_lb_cost = jiffies;
-	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
+		sd->last_decay_max_lb_cost = now;
+
+	} else if (time_after(now, next_decay)) {
 		/*
 		 * Decay the newidle max times by ~1% per second to ensure that
 		 * it is not outdated and the current max cost is actually
 		 * shorter.
		 */
 		sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
-		sd->last_decay_max_lb_cost = jiffies;
-
+		sd->last_decay_max_lb_cost = now;
 		return true;
 	}
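For reviewers, a quick sanity check of the "~1% per second" figure in the decay comment above: each decay step multiplies the stored max by 253/256, i.e. removes 3/256, roughly 1.17%, so a stale value halves after about a minute of back-to-back decays. The user-space check below is illustrative only and is not part of the patch; in the kernel the decay fires at most once per second and only when update_newidle_cost() is actually reached.

#include <math.h>
#include <stdio.h>

int main(void)
{
	double factor = 253.0 / 256.0;	/* per-step decay multiplier */

	printf("per-step decay: %.2f%%\n", (1.0 - factor) * 100.0);
	printf("half-life: ~%.0f steps (seconds, if decayed every second)\n",
	       log(0.5) / log(factor));
	return 0;
}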
From: Peter Zijlstra (Intel) <peterz@infradead.org>
commit 33cf66d88306663d16e4759e9d24766b0aaa2e17 upstream.
Add a randomized algorithm that runs newidle balancing proportional to its success rate.
This improves schbench significantly:
  6.18-rc4:               2.22 Mrps/s
  6.18-rc4+revert:        2.04 Mrps/s
  6.18-rc4+revert+random: 2.18 Mrps/s
Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%:
  6.17:               -6%
  6.17+revert:         0%
  6.17+revert+random: -1%
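A minimal user-space model of the gating this patch introduces may help review. It is illustrative only, not kernel code: rand() stands in for sched_rng(), and true_rate is a made-up probability that a newidle balance actually pulls a task. Each call rolls a 1024-sided dice against newidle_ratio, so balancing runs roughly in proportion to its recent success rate, and successful runs are weighted by the inverse of the run probability so the ratio still estimates successes per call.

/*
 * Illustrative user-space model of the NI_RANDOM gating; not kernel code.
 */
#include <stdio.h>
#include <stdlib.h>

struct sd_stats {
	unsigned int newidle_call;
	unsigned int newidle_success;
	unsigned int newidle_ratio;	/* successes per 1024 calls */
};

static void update_newidle_stats(struct sd_stats *sd, unsigned int success)
{
	sd->newidle_call++;
	sd->newidle_success += success;

	if (sd->newidle_call >= 1024) {
		sd->newidle_ratio = sd->newidle_success;
		sd->newidle_call /= 2;
		sd->newidle_success /= 2;
	}
}

int main(void)
{
	/* seeded like sd_init(): start from an assumed 50% success rate */
	struct sd_stats sd = { .newidle_call = 512, .newidle_success = 256,
			       .newidle_ratio = 512 };
	double true_rate = 0.10;	/* pretend only 10% of balances pull a task */
	unsigned int ran = 0, calls = 100000;

	srand(1);
	for (unsigned int i = 0; i < calls; i++) {
		unsigned int weight = 1 + sd.newidle_ratio;
		unsigned int d1k = rand() % 1024;	/* the 1k sided dice */

		if (d1k > weight) {			/* skip this round */
			update_newidle_stats(&sd, 0);
			continue;
		}
		ran++;
		/* inverse of the run probability, so skipped rounds are accounted for */
		weight = (1024 + weight / 2) / weight;

		int pulled = ((double)rand() / RAND_MAX) < true_rate;
		update_newidle_stats(&sd, weight * !!pulled);
	}

	printf("ran %.1f%% of calls, newidle_ratio=%u (~%.1f%% estimated success)\n",
	       100.0 * ran / calls, sd.newidle_ratio,
	       100.0 * sd.newidle_ratio / 1024);
	return 0;
}

With the 10% example rate, the model ends up running newidle balancing on roughly 10% of calls and newidle_ratio settles near 100 (about 10% of 1024), which is the intent of the weight compensation. The kernel seeds the counters in sd_init() at a 50% rate (newidle_call = 512, newidle_success = 256, newidle_ratio = 512), so a newly built domain starts out balancing about half the time and then converges on its measured success rate.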
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com
Link: https://patch.msgid.link/20251107161739.770122091@infradead.org
[ Ajay: Modified to apply on v6.12 ]
Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com>
---
 include/linux/sched/topology.h |  3 +++
 kernel/sched/core.c            |  3 +++
 kernel/sched/fair.c            | 44 ++++++++++++++++++++++++++++++----
 kernel/sched/features.h        |  5 ++++
 kernel/sched/sched.h           |  7 ++++++
 kernel/sched/topology.c        |  6 +++++
 6 files changed, 64 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 4237daa5a..3cf27591f 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -106,6 +106,9 @@ struct sched_domain {
 	unsigned int nr_balance_failed; /* initialise to 0 */
 
 	/* idle_balance() stats */
+	unsigned int newidle_call;
+	unsigned int newidle_success;
+	unsigned int newidle_ratio;
 	u64 max_newidle_lb_cost;
 	unsigned long last_decay_max_lb_cost;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4b1953b6c..b1895b330 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -118,6 +118,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU(struct rnd_state, sched_rnd_state);
 
 #ifdef CONFIG_SCHED_DEBUG
 /*
@@ -8335,6 +8336,8 @@ void __init sched_init_smp(void)
 {
 	sched_init_numa(NUMA_NO_NODE);
 
+	prandom_init_once(&sched_rnd_state);
+
 	/*
 	 * There's no userspace yet to cause hotplug operations; hence all the
 	 * CPU masks are stable and all blatant races in the below code cannot
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae5da8f34..189681ab8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12217,11 +12217,27 @@ void update_max_interval(void)
 	max_load_balance_interval = HZ*num_online_cpus()/10;
 }
 
-static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
+static inline void update_newidle_stats(struct sched_domain *sd, unsigned int success)
+{
+	sd->newidle_call++;
+	sd->newidle_success += success;
+
+	if (sd->newidle_call >= 1024) {
+		sd->newidle_ratio = sd->newidle_success;
+		sd->newidle_call /= 2;
+		sd->newidle_success /= 2;
+	}
+}
+
+static inline bool
+update_newidle_cost(struct sched_domain *sd, u64 cost, unsigned int success)
 {
 	unsigned long next_decay = sd->last_decay_max_lb_cost + HZ;
 	unsigned long now = jiffies;
 
+	if (cost)
+		update_newidle_stats(sd, success);
+
 	if (cost > sd->max_newidle_lb_cost) {
 		/*
 		 * Track max cost of a domain to make sure to not delay the
@@ -12269,7 +12285,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 		 * Decay the newidle max times here because this is a regular
 		 * visit to all the domains.
 		 */
-		need_decay = update_newidle_cost(sd, 0);
+		need_decay = update_newidle_cost(sd, 0, 0);
 		max_cost += sd->max_newidle_lb_cost;
 
 		/*
@@ -12927,6 +12943,22 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 			break;
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
+			unsigned int weight = 1;
+
+			if (sched_feat(NI_RANDOM)) {
+				/*
+				 * Throw a 1k sided dice; and only run
+				 * newidle_balance according to the success
+				 * rate.
+				 */
+				u32 d1k = sched_rng() % 1024;
+				weight = 1 + sd->newidle_ratio;
+				if (d1k > weight) {
+					update_newidle_stats(sd, 0);
+					continue;
+				}
+				weight = (1024 + weight/2) / weight;
+			}
 
 			pulled_task = sched_balance_rq(this_cpu, this_rq,
 						       sd, CPU_NEWLY_IDLE,
@@ -12934,10 +12966,14 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 
 			t1 = sched_clock_cpu(this_cpu);
 			domain_cost = t1 - t0;
-			update_newidle_cost(sd, domain_cost);
-
 			curr_cost += domain_cost;
 			t0 = t1;
+
+			/*
+			 * Track max cost of a domain to make sure to not delay the
+			 * next wakeup on the CPU.
+			 */
+			update_newidle_cost(sd, domain_cost, weight * !!pulled_task);
 		}
 
 		/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 050d75030..da8ec0c23 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -122,3 +122,8 @@ SCHED_FEAT(WA_BIAS, true)
 SCHED_FEAT(UTIL_EST, true)
 
 SCHED_FEAT(LATENCY_WARN, false)
+
+/*
+ * Do newidle balancing proportional to its success rate using randomization.
+ */
+SCHED_FEAT(NI_RANDOM, true)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cf541c450..78b40c540 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -5,6 +5,7 @@
 #ifndef _KERNEL_SCHED_SCHED_H
 #define _KERNEL_SCHED_SCHED_H
 
+#include <linux/prandom.h>
 #include <linux/sched/affinity.h>
 #include <linux/sched/autogroup.h>
 #include <linux/sched/cpufreq.h>
@@ -1348,6 +1349,12 @@ static inline bool is_migration_disabled(struct task_struct *p)
 }
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU(struct rnd_state, sched_rnd_state);
+
+static inline u32 sched_rng(void)
+{
+	return prandom_u32_state(this_cpu_ptr(&sched_rnd_state));
+}
 
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		this_cpu_ptr(&runqueues)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 4bd825c24..bd8b2b301 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1632,6 +1632,12 @@ sd_init(struct sched_domain_topology_level *tl,
 
 		.last_balance		= jiffies,
 		.balance_interval	= sd_weight,
+
+		/* 50% success rate */
+		.newidle_call		= 512,
+		.newidle_success	= 256,
+		.newidle_ratio		= 512,
+
 		.max_newidle_lb_cost	= 0,
 		.last_decay_max_lb_cost	= jiffies,
 		.child			= child,
Greg, the following upstream patches apply directly to v6.17, so I am not
sending a separate series for v6.17:
https://github.com/torvalds/linux/commit/d206fbad9328ddb68ebabd7cf7413392acd...
https://github.com/torvalds/linux/commit/e78e70dbf603c1425f15f32b455ca148c93...
https://github.com/torvalds/linux/commit/08d473dd8718e4a4d698b1113a14a40ad64...
https://github.com/torvalds/linux/commit/33cf66d88306663d16e4759e9d24766b0aa...
-Ajay
On Wed, Dec 3, 2025 at 5:08 PM Ajay Kaher <ajay.kaher@broadcom.com> wrote:
> This series backports the following patches for v6.12.
> Link: https://lore.kernel.org/lkml/20251107160645.929564468@infradead.org/
>
> Peter Zijlstra (4):
>   sched/fair: Revert max_newidle_lb_cost bump
>   sched/fair: Small cleanup to sched_balance_newidle()
>   sched/fair: Small cleanup to update_newidle_cost()
>   sched/fair: Proportional newidle balance
>
>  include/linux/sched/topology.h |  3 ++
>  kernel/sched/core.c            |  3 ++
>  kernel/sched/fair.c            | 74 +++++++++++++++++++++++-----------
>  kernel/sched/features.h        |  5 +++
>  kernel/sched/sched.h           |  7 ++++
>  kernel/sched/topology.c        |  6 +++
>  6 files changed, 75 insertions(+), 23 deletions(-)
>
> --
> 2.40.4
On Wed, Dec 03, 2025 at 05:23:05PM +0530, Ajay Kaher wrote:
> Greg, the following upstream patches apply directly to v6.17, so I am not
> sending a separate series for v6.17:
>
> https://github.com/torvalds/linux/commit/d206fbad9328ddb68ebabd7cf7413392acd...
> https://github.com/torvalds/linux/commit/e78e70dbf603c1425f15f32b455ca148c93...
> https://github.com/torvalds/linux/commit/08d473dd8718e4a4d698b1113a14a40ad64...
> https://github.com/torvalds/linux/commit/33cf66d88306663d16e4759e9d24766b0aa...
Please don't use github for kernel stuff....
Anyway, these are not in a -rc kernel yet, so I really shouldn't be taking them unless the author/maintainer agrees they should go in "right now". And given that these weren't even marked as cc: stable in the first place, why the rush?
Also, you forgot about 6.18.y, right?
thanks,
greg k-h
On Wed, Dec 3, 2025 at 6:46 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> On Wed, Dec 03, 2025 at 05:23:05PM +0530, Ajay Kaher wrote:
> > Greg, the following upstream patches apply directly to v6.17, so I am not
> > sending a separate series for v6.17:
> >
> > https://github.com/torvalds/linux/commit/d206fbad9328ddb68ebabd7cf7413392acd...
> > https://github.com/torvalds/linux/commit/e78e70dbf603c1425f15f32b455ca148c93...
> > https://github.com/torvalds/linux/commit/08d473dd8718e4a4d698b1113a14a40ad64...
> > https://github.com/torvalds/linux/commit/33cf66d88306663d16e4759e9d24766b0aa...
> Please don't use github for kernel stuff....
ok.
> Anyway, these are not in a -rc kernel yet, so I really shouldn't be taking
> them unless the author/maintainer agrees they should go in "right now". And
> given that these weren't even marked as cc: stable in the first place, why
> the rush?
Agree. No rush.
> Also, you forgot about 6.18.y, right?
Yes. However, the upstream patches will apply directly down to v6.17.
-Ajay