Hi Prateek,
On 4/30/25 04:29, K Prateek Nayak wrote:
Hello Libo,
On 4/30/2025 4:11 PM, Libo Chen wrote:
On 4/30/25 02:13, K Prateek Nayak wrote:
(+ more scheduler folks)
tl;dr
JB has a workload that hates aggressive migration on a 2nd Generation EPYC platform, which has a small LLC domain (4C/8T) and very noticeable core-to-core (C2C) latency.
Based on JB's observations so far, reverting commit 16b0a7a1a0af ("sched/fair: Ensure tasks spreading in LLC during LB") and commit c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost condition") helps the workload. Both commits allow aggressive migrations for the sake of work conservation, but they also increase cache misses, which slows the workload down quite a bit.
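For anyone skimming: below is a hedged paraphrase (from memory, not a verbatim diff) of the bail-out that commit c5b0a7eefc70 removed from the top of newidle_balance(), which used to keep a briefly-idle CPU, and its cache contents, where they were:

	/*
	 * Pre-c5b0a7eefc70 (paraphrased): skip the newidle pass entirely
	 * if the CPU's idle spells are too short to amortize a migration.
	 */
	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
	    !READ_ONCE(this_rq->rd->overload))
		goto out;

After the commit, only the measured sd->max_newidle_lb_cost throttles the pass, so short-idle CPUs pull tasks much more eagerly: good for work conservation, costly on a 4C/8T LLC with high C2C latency.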
"relax_domain_level" helps but cannot be set at runtime and I couldn't think of any stable / debug interfaces that JB hasn't tried out already that can help this workload.
There is a patch towards the end to set "relax_domain_level" at runtime, but given that cpusets did away with this when transitioning to cgroup v2, I don't know what the sentiments are around its usage. Any input / feedback is greatly appreciated.
Hi Prateek,
Oh no, not "relax_domain_level" again; this can lead to load imbalance in a variety of ways. We were so glad this one went away with cgroup v2,
I agree it is not pretty. JB also tried strategic pinning, and they did report that things are better overall, but unfortunately it is very hard to deploy across multiple architectures and would also require some redesign and testing on their application side.
I was stressing, more broadly, how badly setting "relax_domain_level" can go wrong if a user doesn't know that it essentially disables newidle balancing at the higher levels, so the ability to balance load across CCXes or NUMA nodes becomes a lot weaker. A subset of CCXes may consistently get much more load for a whole bunch of reasons. Sometimes this is hard to spot in testing, but it does show up in real-world scenarios, especially when users have other weird hacks in place.
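To spell out the mechanism for the archives: this is driven by set_domain_attribute() in kernel/sched/topology.c, which (roughly as below in recent kernels; quoted from memory, so check the source) strips the newidle and wake balancing flags from every domain at or above the requested level:

static void set_domain_attribute(struct sched_domain *sd,
				 struct sched_domain_attr *attr)
{
	int request;

	if (!attr || attr->relax_domain_level < 0) {
		if (default_relax_domain_level < 0)
			return;
		request = default_relax_domain_level;
	} else {
		request = attr->relax_domain_level;
	}

	/* Turn off idle balance on this domain: */
	if (sd->level >= request)
		sd->flags &= ~(SD_BALANCE_WAKE | SD_BALANCE_NEWIDLE);
}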
it tends to be abused by users as an "easy" fix for some urgent perf issues instead of addressing their root causes.
Was there ever a report of a similar issue where migrations made for the right reasons led to performance degradation as a result of the platform architecture? I doubt there is a straightforward way to solve this using the current interfaces - at least I haven't found one yet.
It wasn't due to the platform architecture for us, but rather an "exotic" NUMA topology (like a cube: a node is one hop away from 3 neighbors and two hops away from the other 4) in combination with certain user-level settings that caused more wakeups in a subset of domains. If relax_domain_level is left untouched, you get no load imbalance but performance is bad. But once you set relax_domain_level to restrict newidle balancing to the lower domain levels, you actually see better performance numbers in testing even though CPU loads are not well balanced. Until one day, you find out the imbalance is so bad that it slows down everything. Luckily it wasn't too hard to fix from the application side.
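(To make the topology concrete: if you model the 8 nodes as the vertices of a cube, with node IDs 0..7 encoding the three coordinates bit by bit, the hop count is just the Hamming distance between the IDs. A toy, purely illustrative sketch follows; note that pure Hamming distance puts the opposite corner at three hops, so the "two hops away from the other 4" above presumably reflects how our platform routed/reported that far corner.)

#include <stdio.h>

/* Hops between two cube vertices = number of differing axis bits. */
static int hops(unsigned a, unsigned b)
{
	return __builtin_popcount(a ^ b);
}

int main(void)
{
	/* Print the 8x8 hop-distance table for node IDs 0..7. */
	for (unsigned a = 0; a < 8; a++) {
		for (unsigned b = 0; b < 8; b++)
			printf("%d ", hops(a, b));
		printf("\n");
	}
	return 0;
}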
I get that it may not be easy to fix from their application side in this case, but I still think this is too hacky; one may end up regretting it.
I certainly want to hear what others think about relax_domain_level!
Perhaps cache-aware scheduling is the way forward to solve this set of issues, as Peter highlighted.
Hope so! We will start testing that series and provide feedback.
Thanks,
Libo