Hi,

We (Ateme, a video encoding company) may have found unwanted behavior in the scheduler since 5.10 (commit 16b0a7a1a0af), then 5.16 (commit c5b0a7eefc70), then 5.19 (commit not found yet), and maybe some other commits from 5.19 to 6.12, with the consequence of an IPC decrease. The problem still appears on the latest 6.12, 6.13 and 6.14.
We have reverted both commits 16b0a7a1a0af and c5b0a7eefc70, which reduce our performance (see fair.patch attached, applicable on 6.12.17). Performance increases but still doesn't reach our reference on 5.4.152.
Instead of trying to find every single commit from 5.18 to 6.12 that could decrease our performance, I chose to bench 5.4.152 versus 6.12.17 with and without fair.patch.
The problem appeared clear: a lot of CPU migrations go out of the CCX, leading to L3 misses, and in turn an IPC decrease.
Context of our bench: a video decoder that works at a regulated speed; 1 process, 21 main threads, each of which creates 10 threads, 8 of them with fine granularity, meaning they go to sleep quite often, giving the scheduler a lot of opportunities to act. Hardware is an AMD EPYC 7702P, 64 cores / 128 threads, grouped by shared LLC into CCXs of 4 cores + 4 hyperthreaded siblings. NUMA topology is set by the BIOS to 1 node per socket. Every pthread is created with default attributes. I use AMDuProf (-C -A system -a -m ipc,l1,l2,l3,memory) for CPU Utilization (%), CPU effective freq, IPC, L2 access (pti), L2 miss (pti), L3 miss (absolute) and Mem (GB/s), and perf (stat -d -d -d -a) for context switches, CPU migrations and real time (s).
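To make the thread structure concrete, here is a toy sketch of the shape of the workload (illustrative only, not our decoder: the function names, the amount of work per iteration and the sleep interval are made up; only the thread counts and the use of default pthread attributes reflect our setup):

#include <pthread.h>
#include <unistd.h>

#define MAIN_THREADS     21
#define WORKERS_PER_MAIN 10

/* Fine-granularity worker: a small slice of work, then a short sleep,
 * which gives the scheduler frequent newidle opportunities. */
static void *worker(void *arg)
{
	for (int i = 0; i < 100000; i++) {
		/* ... decode a small slice ... */
		usleep(200);	/* illustrative sleep interval */
	}
	return NULL;
}

/* Main thread: spawns its workers with default pthread attributes
 * (no affinity, default scheduling policy) and waits for them. */
static void *main_thread(void *arg)
{
	pthread_t w[WORKERS_PER_MAIN];

	for (int i = 0; i < WORKERS_PER_MAIN; i++)
		pthread_create(&w[i], NULL, worker, NULL);
	for (int i = 0; i < WORKERS_PER_MAIN; i++)
		pthread_join(w[i], NULL);
	return NULL;
}

int main(void)
{
	pthread_t m[MAIN_THREADS];

	for (int i = 0; i < MAIN_THREADS; i++)
		pthread_create(&m[i], NULL, main_thread, NULL);
	for (int i = 0; i < MAIN_THREADS; i++)
		pthread_join(m[i], NULL);
	return 0;
}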
Upgrading from 5.4.152 to 6.12.17, without any special preempt configuration, we noted:
* a two-fold increase in CPU migrations
* a 30% memory bandwidth increase
* a 20% increase in L3 cache misses
* a 10% IPC decrease
With the attached fair.patch applied to 6.12.17 (reminder: this patch reverts one commit that appeared in 5.10 and another in 5.16) we managed to reduce CPU migrations and increase IPC, but not as much as we had on 5.4.152. Our goal is to keep the kernel "clean" without any patch (we don't want to apply and maintain fair.patch), so for the rest of this email we will consider stock kernel 6.12.17.
I've reduced the "sub thread count" to stay below 128 threads: still 21 main threads, but instead of 10 workers per main thread I set 5 workers (4 of them with fine granularity), giving 105 pthreads -> everything goes fine on 6.12.17, no extra CPU migrations, no extra memory bandwidth...
But as soon as we increase the worker thread count (10 instead of 5), the problem appears.
We know our decoder may have too many threads, but that is out of our scope: it was designed like that some years ago, and moving from "lots of small threads" to "a few big threads" is not possible for now.
We have a workaround: we group threads using pthread affinities. Every main thread (and, by inheritance of affinities, every worker thread) is placed on a single CCX, so we reduce their L3 misses, which decreases memory bandwidth and finally increases IPC.
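For illustration, the grouping looks roughly like the sketch below. It is hypothetical code, not our production implementation: the pin_to_ccx() helper and the CCX-to-CPU numbering (physical CPUs ccx*4..ccx*4+3 with their SMT siblings at +64) are assumptions for this sketch; in practice the mapping should be read from /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/*
 * Pin the calling main thread to a single CCX. Worker threads created
 * afterwards inherit this affinity mask.
 *
 * Assumption for this sketch: CCX `ccx` covers physical CPUs
 * [ccx*4 .. ccx*4+3] and their SMT siblings [64+ccx*4 .. 64+ccx*4+3].
 */
static int pin_to_ccx(int ccx)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	for (int i = 0; i < 4; i++) {
		CPU_SET(ccx * 4 + i, &set);      /* physical core */
		CPU_SET(64 + ccx * 4 + i, &set); /* SMT sibling   */
	}
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

Each of the 21 main threads calls pin_to_ccx() with its own CCX index before spawning its workers, so a main thread and its workers always share an L3.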
With that solution we go above our original performance, on both kernels, and they perform at the same level. However, it is impractical to productize as such.
I've tried many kernel build configurations (CONFIG_PREEMPT_*, CONFIG_SCHEDULER_*, tuning of fair.c:sysctl_sched_migration_cost) on 6.12.17, 6.12.21 (longterm), 6.13.9 (mainline), and 6.14.0. Nothing changes.
Q: Is there any way to tune the kernel so we can get our performance back without using the pthread affinity workaround?
Feel free to ask for an archive containing binaries and payload.
I first posted on https://bugzilla.kernel.org/show_bug.cgi?id=220000 but was told the best way to get answers was these mailing lists.
Regards,
Jean-Baptiste Roquefere, Ateme
Attached bench.tar.gz:
* bench/fair.patch
* bench/bench.ods with 2 sheets:
  - regulated : decoder speed is regulated to keep real time constant
  - no regul : decoder speed is not regulated and uses from 1 to 76 main threads with 10 workers per main thread
* bench/regulated.csv : bench.ods:regulated exported in csv format
* bench/not-regulated : bench.ods:no regul exported in csv format
Hello Jean,
On 4/18/2025 2:38 AM, Jean-Baptiste Roquefere wrote:
Hi,

We (Ateme, a video encoding company) may have found unwanted behavior in the scheduler since 5.10 (commit 16b0a7a1a0af), then 5.16 (commit c5b0a7eefc70), then 5.19 (commit not found yet), and maybe some other commits from 5.19 to 6.12, with the consequence of an IPC decrease. The problem still appears on the latest 6.12, 6.13 and 6.14.
Looking at the commit logs, it looks like these commits do solve other problems around load balancing and might not be trivial to revert without evaluating the damages.
We have reverted both commits 16b0a7a1a0af and c5b0a7eefc70, which reduce our performance (see fair.patch attached, applicable on 6.12.17). Performance increases but still doesn't reach our reference on 5.4.152.

Instead of trying to find every single commit from 5.18 to 6.12 that could decrease our performance, I chose to bench 5.4.152 versus 6.12.17 with and without fair.patch.

The problem appeared clear: a lot of CPU migrations go out of the CCX, leading to L3 misses, and in turn an IPC decrease.

Context of our bench: a video decoder that works at a regulated speed; 1 process, 21 main threads, each of which creates 10 threads, 8 of them with fine granularity, meaning they go to sleep quite often, giving the scheduler a lot of opportunities to act. Hardware is an AMD EPYC 7702P, 64 cores / 128 threads, grouped by shared LLC into CCXs of 4 cores + 4 hyperthreaded siblings. NUMA topology is set by the BIOS to 1 node per socket. Every pthread is created with default attributes. I use AMDuProf (-C -A system -a -m ipc,l1,l2,l3,memory) for CPU Utilization (%), CPU effective freq, IPC, L2 access (pti), L2 miss (pti), L3 miss (absolute) and Mem (GB/s), and perf (stat -d -d -d -a) for context switches, CPU migrations and real time (s).

Upgrading from 5.4.152 to 6.12.17, without any special preempt configuration, we noted:
* a two-fold increase in CPU migrations
* a 30% memory bandwidth increase
* a 20% increase in L3 cache misses
* a 10% IPC decrease

With the attached fair.patch applied to 6.12.17 (reminder: this patch reverts one commit that appeared in 5.10 and another in 5.16) we managed to reduce CPU migrations and increase IPC, but not as much as we had on 5.4.152. Our goal is to keep the kernel "clean" without any patch (we don't want to apply and maintain fair.patch), so for the rest of this email we will consider stock kernel 6.12.17.

I've reduced the "sub thread count" to stay below 128 threads: still 21 main threads, but instead of 10 workers per main thread I set 5 workers (4 of them with fine granularity), giving 105 pthreads -> everything goes fine on 6.12.17, no extra CPU migrations, no extra memory bandwidth...
The processor you are running on, the AMD EPYC 7702P based on the Zen 2 architecture, contains 4 cores / 8 threads per CCX (LLC domain), which is perhaps why reducing the thread count to below this limit is helping your workload.
What we suspect is that when running the workload, the threads that regularly sleep trigger newidle balancing, which causes them to move to another CCX, leading to a higher number of L3 misses.
To confirm this, would it be possible to run the workload with the not-yet-upstream perf sched stats [1] tool and share the result from perf sched stats diff for the data from v6.12.17 and v6.12.17 + patch to rule out any other second order effect.
[1] https://lore.kernel.org/all/20250311120230.61774-1-swapnil.sapkal@amd.com/
But as soon as we increase the worker thread count (10 instead of 5), the problem appears.

We know our decoder may have too many threads, but that is out of our scope: it was designed like that some years ago, and moving from "lots of small threads" to "a few big threads" is not possible for now.

We have a workaround: we group threads using pthread affinities. Every main thread (and, by inheritance of affinities, every worker thread) is placed on a single CCX, so we reduce their L3 misses, which decreases memory bandwidth and finally increases IPC.

With that solution we go above our original performance, on both kernels, and they perform at the same level. However, it is impractical to productize as such.

I've tried many kernel build configurations (CONFIG_PREEMPT_*, CONFIG_SCHEDULER_*, tuning of fair.c:sysctl_sched_migration_cost) on 6.12.17, 6.12.21 (longterm), 6.13.9 (mainline), and 6.14.0. Nothing changes.

Q: Is there any way to tune the kernel so we can get our performance back without using the pthread affinity workaround?
Assuming you control these deployments, would it be possible to run the workload on a kernel booted with "relax_domain_level=2" on the kernel cmdline, which restricts newidle balancing to only within the CCX? As a side effect, it also limits task wakeups to the same LLC domain, but I would still like to know if this makes a difference to the workload you are running.
Note: This is a system-wide knob that will affect all workloads running on the system, and it is better used for debug purposes.
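For context, my understanding is that "relax_domain_level" is consumed by set_domain_attribute() in kernel/sched/topology.c, which strips the newidle/wake balancing flags from the higher scheduling domains. The sketch below is paraphrased from memory rather than copied from the source, so treat the exact boundary check as an approximation:

/* Rough sketch of kernel/sched/topology.c:set_domain_attribute() */
static void set_domain_attribute(struct sched_domain *sd,
				 struct sched_domain_attr *attr)
{
	int request;

	/* No per-cpuset request: fall back to the boot/runtime default. */
	if (!attr || attr->relax_domain_level < 0) {
		if (default_relax_domain_level < 0)
			return;
		request = default_relax_domain_level;
	} else {
		request = attr->relax_domain_level;
	}

	/*
	 * Domains above the requested level lose newidle and wake
	 * balancing, so idle CPUs stop pulling tasks across them.
	 * (Boundary check paraphrased from memory.)
	 */
	if (request < sd->level)
		sd->flags &= ~(SD_BALANCE_WAKE | SD_BALANCE_NEWIDLE);
}

The level is compared against the domain's position in the topology before degeneration, which is why the right value depends on the machine (and why picking it at boot time is awkward, as discussed further down the thread).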
Hello Prateek,
Thanks for your response.
Looking at the commit logs, it looks like these commits do solve other problems around load balancing and might not be trivial to revert without evaluating the damages.
it's definitely not a productizable workaround !
The processor you are running on, the AMD EPYC 7702P based on the Zen 2 architecture, contains 4 cores / 8 threads per CCX (LLC domain), which is perhaps why reducing the thread count to below this limit is helping your workload.

What we suspect is that when running the workload, the threads that regularly sleep trigger newidle balancing, which causes them to move to another CCX, leading to a higher number of L3 misses.
To confirm this, would it be possible to run the workload with the not-yet-upstream perf sched stats [1] tool and share the result from perf sched stats diff for the data from v6.12.17 and v6.12.17 + patch to rule out any other second order effect.
[1] https://lore.kernel.org/all/20250311120230.61774-1-swapnil.sapkal@amd.com/
I had to patch open_file_read(struct perf_data *data) in tools/perf/util/session.c due to "failed to open perf.data: File exists" (it looked more like a compiler issue than a tools/perf issue).
$ ./perf sched stats diff perf.data.6.12.17 perf.data.6.12.17patched > perf.diff (see perf.diff attached)
Assuming you control these deployments, would it be possible to run the workload on a kernel booted with "relax_domain_level=2" on the kernel cmdline, which restricts newidle balancing to only within the CCX? As a side effect, it also limits task wakeups to the same LLC domain, but I would still like to know if this makes a difference to the workload you are running.
On vanilla 6.12.17 it gives the IPC we expected:
+--------------------+--------------------------+-----------------------+
|                    | relax_domain_level unset | relax_domain_level=2  |
+--------------------+--------------------------+-----------------------+
| Threads            | 210                      | 210                   |
| Utilization (%)    | 65.86                    | 52.01                 |
| CPU effective freq | 1 622.93                 | 1 294.12              |
| IPC                | 1.14                     | 1.42                  |
| L2 access (pti)    | 34.36                    | 38.18                 |
| L2 miss (pti)      | 7.34                     | 7.78                  |
| L3 miss (abs)      | 39 711 971 741           | 33 929 609 924        |
| Mem (GB/s)         | 70.68                    | 49.10                 |
| Context switches   | 109 281 524              | 107 896 729           |
+--------------------+--------------------------+-----------------------+
Kind regards,
JB
(+ more scheduler folks)
tl;dr
JB has a workload that hates aggressive migration on the 2nd Generation EPYC platform that has a small LLC domain (4C/8T) and very noticeable C2C latency.
Based on JB's observations so far, reverting commit 16b0a7a1a0af ("sched/fair: Ensure tasks spreading in LLC during LB") and commit c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost condition") helps the workload. Both those commits allow more aggressive migrations for work conservation, except this also increases cache misses, which slows the workload quite a bit.
"relax_domain_level" helps but cannot be set at runtime and I couldn't think of any stable / debug interfaces that JB hasn't tried out already that can help this workload.
There is a patch towards the end to set "relax_domain_level" at runtime but, given cpusets did away with this when transitioning to cgroup-v2, I don't know what the sentiments are around its usage. Any input / feedback is greatly appreciated.
On 4/28/2025 1:13 PM, Jean-Baptiste Roquefere wrote:
Hello Prateek,
Thanks for your response.
Looking at the commit logs, it looks like these commits do solve other problems around load balancing and might not be trivial to revert without evaluating the damages.
it's definitely not a productizable workaround !
The processor you are running on, the AMD EPYC 7702P based on the Zen 2 architecture, contains 4 cores / 8 threads per CCX (LLC domain), which is perhaps why reducing the thread count to below this limit is helping your workload.

What we suspect is that when running the workload, the threads that regularly sleep trigger newidle balancing, which causes them to move to another CCX, leading to a higher number of L3 misses.
To confirm this, would it be possible to run the workload with the not-yet-upstream perf sched stats [1] tool and share the result from perf sched stats diff for the data from v6.12.17 and v6.12.17 + patch to rule out any other second order effect.
[1] https://lore.kernel.org/all/20250311120230.61774-1-swapnil.sapkal@amd.com/
I had to patch open_file_read(struct perf_data *data) in tools/perf/util/session.c due to "failed to open perf.data: File exists" (it looked more like a compiler issue than a tools/perf issue).
$ ./perf sched stats diff perf.data.6.12.17 perf.data.6.12.17patched > perf.diff (see perf.diff attached)
Thank you for all the information Jean. I'll highlight the interesting bits (at least the bits that stood out to me)
(left is mainline, right is mainline with the two commits mentioned by JB reverted)
total runtime by tasks on this processor (in jiffies)  : 123927676874, 108531911002  |  -12.42% |
total waittime by tasks on this processor (in jiffies) :  34729211241,  27076295778  |  -22.04% |  ( 28.02%, 24.95% )
total timeslices run on this cpu                       :       501606,       489799  |   -2.35% |
Since "total runtime" is lower on the right, it means the CPUs were not as well utilized with the commits reverted; however, the reduction in "total waittime" suggests things are running faster, and on average there are 0.28 waiting tasks on mainline compared to 0.24 with the commits reverted (waittime / runtime: 34729211241 / 123927676874 ~= 28%, 27076295778 / 108531911002 ~= 25%).
---------------------------------------- <Category newidle - SMT> ----------------------------------------
load_balance() count on cpu newly idle                       :  331664,  31153  | -90.61% |  $ 0.15,  1.55 $
load_balance() failed to find busier group on cpu newly idle :  300234,  28470  | -90.52% |  $ 0.16,  1.70 $
*load_balance() success count on cpu newly idle              :   28386,   1544  | -94.56% |
*avg task pulled per successful lb attempt (cpu newly idle)  :    1.00,   1.01  |   0.46% |
---------------------------------------- <Category newidle - MC > ----------------------------------------
load_balance() count on cpu newly idle                       :  258017,  29345  | -88.63% |  $ 0.19,  1.65 $
load_balance() failed to find busier group on cpu newly idle :  131096,  16081  | -87.73% |  $ 0.37,  3.01 $
*load_balance() success count on cpu newly idle              :   23286,   2181  | -90.63% |
*avg task pulled per successful lb attempt (cpu newly idle)  :    1.03,   1.01  |  -1.23% |
---------------------------------------- <Category newidle - PKG> ----------------------------------------
load_balance() count on cpu newly idle                       :  124013,  27086  | -78.16% |  $ 0.39,  1.78 $
load_balance() failed to find busier group on cpu newly idle :   11812,   3063  | -74.07% |  $ 4.09, 15.78 $
*load_balance() success count on cpu newly idle              :   13892,   4739  | -65.89% |
*avg task pulled per successful lb attempt (cpu newly idle)  :    1.07,   1.10  |   3.32% |
-----------------------------------------------------------------------------------------------------------
Most migrations are from newidle balancing, which seems to move tasks across cores (> 50% of the time) and across the LLC too (~8% of the time).
Assuming you control these deployments, would it be possible to run the workload on a kernel booted with "relax_domain_level=2" on the kernel cmdline, which restricts newidle balancing to only within the CCX? As a side effect, it also limits task wakeups to the same LLC domain, but I would still like to know if this makes a difference to the workload you are running.
On vanilla 6.12.17 it gives the IPC we expected:
Thank you JB for trying out this experiment. I'm not very sure what the views are on "relax_domain_level" and I'm hoping the other scheduler folks will chime in here - Is it a debug knob? Can it be used in production?
I know it had additional uses with cpuset in cgroup-v1 but was not adopted in v2 - are there any nasty historic reasons for this?
+--------------------+--------------------------+-----------------------+
|                    | relax_domain_level unset | relax_domain_level=2  |
+--------------------+--------------------------+-----------------------+
| Threads            | 210                      | 210                   |
| Utilization (%)    | 65.86                    | 52.01                 |
| CPU effective freq | 1 622.93                 | 1 294.12              |
| IPC                | 1.14                     | 1.42                  |
| L2 access (pti)    | 34.36                    | 38.18                 |
| L2 miss (pti)      | 7.34                     | 7.78                  |
| L3 miss (abs)      | 39 711 971 741           | 33 929 609 924        |
| Mem (GB/s)         | 70.68                    | 49.10                 |
| Context switches   | 109 281 524              | 107 896 729           |
+--------------------+--------------------------+-----------------------+
Kind regards,
JB
JB asked if there is any way to toggle "relax_domain_level" at runtime on mainline and I couldn't find any easy way other than using cpusets with cgroup-v1 which is probably harder to deploy at scale than the pinning strategy that JB mentioned originally.
I cannot currently think of any stable interface that allows sticky behavior and mitigates aggressive migration for work conservation - JB has already tried almost everything available, as summarized in his original report.
Could something like the below be a stop-gap band-aid to remedy the case of workloads that don't mind a temporary imbalance in favor of cache hotness?
---
From: K Prateek Nayak <kprateek.nayak@amd.com>
Subject: [RFC PATCH] sched/debug: Allow overriding "relax_domain_level" at runtime
Jean-Baptiste noted that Ateme's workload experiences poor IPC on a 2nd Generation EPYC system and narrowed down the major culprits to commit 16b0a7a1a0af ("sched/fair: Ensure tasks spreading in LLC during LB") and commit c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost condition") both of which enable more aggressive migrations in favor of work conservation.
The larger C2C latency on the platform, coupled with a smaller LLC domain of 4C/8T, makes the downside of aggressive balancing very prominent. Looking at the perf sched stats report from JB [1], when the two commits are reverted, despite the "total runtime" seeing a dip of 11% (showing a better load distribution on mainline), the "total waittime" dips by 22%, showing that despite the imbalance the workload runs faster, and this improvement can be correlated with the higher IPC and the reduced L3 misses in the data shared by JB. Most of the migrations during load balancing can be attributed to newidle balancing.
JB confirmed that using "relax_domain_level=2" on the kernel cmdline helps this particular workload by restricting the scope of wakeups and migrations during newidle balancing; however, "relax_domain_level" works on topology levels before degeneration, and setting the level before inspecting the topology might not be trivial at boot time.
Furthermore, a runtime knob that can help quickly narrow down any changes in workload behavior to aggressive migrations during load balancing can be helpful during debugs.
Introduce "relax_domain_level" in sched debugfs and allow overriding the knob at runtime.
# cat /sys/kernel/debug/sched/relax_domain_level
-1
# echo Y > /sys/kernel/debug/sched/verbose
# cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
To restrict newidle balance to only within the LLC, "relax_domain_level" can be set to level 3 (SMT, CLUSTER, *MC* , PKG, NUMA)
# echo 3 > /sys/kernel/debug/sched/relax_domain_level
# cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_PREFER_SIBLING
SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
"relax_domain_level" forgives short term imbalances. Longer term imbalances will be eventually caught by the periodic load balancer and the system will reach a state of balance, only slightly later.
Link: https://lore.kernel.org/all/996ca8cb-3ac8-4f1b-93f1-415f43922d7a@ateme.com/ [1]
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 include/linux/sched/topology.h |  6 ++--
 kernel/sched/debug.c           | 52 ++++++++++++++++++++++++++++++++++
 kernel/sched/topology.c        |  2 +-
 3 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 198bb5cc1774..5f59bdc1d5b1 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,10 @@ struct sched_domain_attr {
 	int relax_domain_level;
 };
 
-#define SD_ATTR_INIT	(struct sched_domain_attr) {	\
-	.relax_domain_level = -1,			\
+extern int default_relax_domain_level;
+
+#define SD_ATTR_INIT	(struct sched_domain_attr) {	\
+	.relax_domain_level = default_relax_domain_level,	\
 }
 
 extern int sched_domain_level_max;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 557246880a7e..cc6944b35535 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -214,6 +214,57 @@ static const struct file_operations sched_scaling_fops = {
 	.release	= single_release,
 };
 
+DEFINE_MUTEX(relax_domain_mutex);
+
+static ssize_t sched_relax_domain_write(struct file *filp,
+					const char __user *ubuf,
+					size_t cnt, loff_t *ppos)
+{
+	int relax_domain_level;
+	char buf[16];
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+	buf[cnt] = '\0';
+
+	if (kstrtoint(buf, 10, &relax_domain_level))
+		return -EINVAL;
+
+	if (relax_domain_level < -1 || relax_domain_level > sched_domain_level_max + 1)
+		return -EINVAL;
+
+	guard(mutex)(&relax_domain_mutex);
+
+	if (relax_domain_level != default_relax_domain_level) {
+		default_relax_domain_level = relax_domain_level;
+		rebuild_sched_domains();
+	}
+
+	*ppos += cnt;
+	return cnt;
+}
+static int sched_relax_domain_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%d\n", default_relax_domain_level);
+	return 0;
+}
+
+static int sched_relax_domain_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_relax_domain_show, NULL);
+}
+
+static const struct file_operations sched_relax_domain_fops = {
+	.open		= sched_relax_domain_open,
+	.write		= sched_relax_domain_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 #endif /* SMP */
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
@@ -516,6 +567,7 @@ static __init int sched_init_debug(void)
 	debugfs_create_file("tunable_scaling", 0644, debugfs_sched, NULL, &sched_scaling_fops);
 	debugfs_create_u32("migration_cost_ns", 0644, debugfs_sched, &sysctl_sched_migration_cost);
 	debugfs_create_u32("nr_migrate", 0644, debugfs_sched, &sysctl_sched_nr_migrate);
+	debugfs_create_file("relax_domain_level", 0644, debugfs_sched, NULL, &sched_relax_domain_fops);
 
 	sched_domains_mutex_lock();
 	update_sched_domain_debugfs();
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a2a38e1b6f18..eb5c8a9cd904 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1513,7 +1513,7 @@ static void asym_cpu_capacity_scan(void)
  * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
  */
 
-static int default_relax_domain_level = -1;
+int default_relax_domain_level = -1;
 int sched_domain_level_max;
 
 static int __init setup_relax_domain_level(char *str)
On Wed, Apr 30, 2025 at 02:43:00PM +0530, K Prateek Nayak wrote:
(+ more scheduler folks)
tl;dr
JB has a workload that hates aggressive migration on the 2nd Generation EPYC platform that has a small LLC domain (4C/8T) and very noticeable C2C latency.
Seems like the kind of chip the cache aware scheduling crud should be good for. Of course, it's still early days on that, so it might not be in good enough shape to help yet.
But long term, that should definitely be the goal, rather than finding ways to make relax_domain hacks available again.
On 4/30/25 02:13, K Prateek Nayak wrote:
(+ more scheduler folks)
tl;dr
JB has a workload that hates aggressive migration on the 2nd Generation EPYC platform that has a small LLC domain (4C/8T) and very noticeable C2C latency.
Based on JB's observations so far, reverting commit 16b0a7a1a0af ("sched/fair: Ensure tasks spreading in LLC during LB") and commit c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost condition") helps the workload. Both those commits allow more aggressive migrations for work conservation, except this also increases cache misses, which slows the workload quite a bit.
"relax_domain_level" helps but cannot be set at runtime and I couldn't think of any stable / debug interfaces that JB hasn't tried out already that can help this workload.
There is a patch towards the end to set "relax_domain_level" at runtime but, given cpusets did away with this when transitioning to cgroup-v2, I don't know what the sentiments are around its usage. Any input / feedback is greatly appreciated.
Hi Prateek,
Oh no, not "relax_domain_level" again - this can lead to load imbalance in a variety of ways. We were so glad this one went away with cgroupv2; it tends to be abused by users as an "easy" fix for some urgent perf issues instead of addressing their root causes.
Thanks, Libo
Hello Libo,
On 4/30/2025 4:11 PM, Libo Chen wrote:
On 4/30/25 02:13, K Prateek Nayak wrote:
(+ more scheduler folks)
tl;dr
JB has a workload that hates aggressive migration on the 2nd Generation EPYC platform that has a small LLC domain (4C/8T) and very noticeable C2C latency.
Based on JB's observations so far, reverting commit 16b0a7a1a0af ("sched/fair: Ensure tasks spreading in LLC during LB") and commit c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost condition") helps the workload. Both those commits allow more aggressive migrations for work conservation, except this also increases cache misses, which slows the workload quite a bit.
"relax_domain_level" helps but cannot be set at runtime and I couldn't think of any stable / debug interfaces that JB hasn't tried out already that can help this workload.
There is a patch towards the end to set "relax_domain_level" at runtime but, given cpusets did away with this when transitioning to cgroup-v2, I don't know what the sentiments are around its usage. Any input / feedback is greatly appreciated.
Hi Prateek,
Oh no, not "relax_domain_level" again - this can lead to load imbalance in a variety of ways. We were so glad this one went away with cgroupv2,
I agree it is not pretty. JB also tried strategic pinning and they did report that things are better overall but unfortunately, it is very hard to deploy across multiple architectures and would also require some redesign + testing from their application side.
it tends to be abused by users as an "easy" fix for some urgent perf issues instead of addressing their root causes.
Was there ever a report of a similar issue where migrations for the right reasons led to performance degradation as a result of the platform architecture? I doubt there is a straightforward way to solve this using the current interfaces - at least I haven't found one yet.
Perhaps cache-aware scheduling is the way forward to solve this set of issues, as Peter highlighted.
Thanks, Libo
Hi Prateek,
On 4/30/25 04:29, K Prateek Nayak wrote:
Hello Libo,
On 4/30/2025 4:11 PM, Libo Chen wrote:
On 4/30/25 02:13, K Prateek Nayak wrote:
(+ more scheduler folks)
tl;dr
JB has a workload that hates aggressive migration on the 2nd Generation EPYC platform that has a small LLC domain (4C/8T) and very noticeable C2C latency.
Based on JB's observations so far, reverting commit 16b0a7a1a0af ("sched/fair: Ensure tasks spreading in LLC during LB") and commit c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost condition") helps the workload. Both those commits allow more aggressive migrations for work conservation, except this also increases cache misses, which slows the workload quite a bit.
"relax_domain_level" helps but cannot be set at runtime and I couldn't think of any stable / debug interfaces that JB hasn't tried out already that can help this workload.
There is a patch towards the end to set "relax_domain_level" at runtime but, given cpusets did away with this when transitioning to cgroup-v2, I don't know what the sentiments are around its usage. Any input / feedback is greatly appreciated.
Hi Prateek,
Oh no, not "relax_domain_level" again - this can lead to load imbalance in a variety of ways. We were so glad this one went away with cgroupv2,
I agree it is not pretty. JB also tried strategic pinning and they did report that things are better overall but unfortunately, it is very hard to deploy across multiple architectures and would also require some redesign + testing from their application side.
I was more broadly stressing how badly setting "relax_domain_level" could go wrong if a user doesn't know that it essentially disables newidle balancing at the higher levels, so the ability to balance load across CCXes or NUMA nodes becomes a lot weaker. A subset of CCXes may consistently get much more load for a whole bunch of reasons. Sometimes this is hard to spot in testing, but it does show up in real-world scenarios, esp. when users have other weird hacks.
it tends to be abused by users as an "easy" fix for some urgent perf issues instead of addressing their root causes.
Was there ever a report of a similar issue where migrations for the right reasons led to performance degradation as a result of the platform architecture? I doubt there is a straightforward way to solve this using the current interfaces - at least I haven't found one yet.
It wasn't due to the platform architecture for us but more of an "exotic" NUMA topology (like a cube: a node is one hop away from 3 neighbors and two hops away from the other 4) in combination with certain user-level settings that cause more wakeups in a subset of domains. If relax_domain_level is left untouched, you get no load imbalance but perf is bad. But once you set relax_domain_level to restrict newidle balancing to lower domain levels, you actually see better performance numbers in testing even though CPU loads are not well balanced. Until one day, you find out the imbalance is so bad that it slows down everything. Luckily it wasn't too hard to fix from the application side.
I get that it may not be easy to fix from their application side in this case, but I still think this is too hacky; one may end up regretting it.
I certainly want to hear what others think about relax_domain_level!
Perhaps cache-aware scheduling is the way forward to solve this set of issues, as Peter highlighted.
Hope so! We will start testing that series and provide feedback.
Thanks, Libo