Hello Peter,
On 2/5/2025 3:46 PM, Peter Zijlstra wrote:
On Wed, Feb 05, 2025 at 03:43:54PM +0530, K Prateek Nayak wrote:
Hello Peter,
Thank you for the background!
On 2/5/2025 3:25 PM, Peter Zijlstra wrote:
On Wed, Feb 05, 2025 at 03:18:24PM +0530, K Prateek Nayak wrote:
Have there been any reports on an x86 system / VM where topology_span_sane() was tripped?
At the very least Intel SNC 'feature' tripped it at some point. They figured it made sense to have the LLC span two nodes.
I'm 99% sure this was the topology_sane() check on the x86 side and not the topology_span_sane() check in kernel/sched/topology.c.
I believe one of the original changes that did the plumbing for SNC was commit 2c88d45edbb8 ("x86, sched: Treat Intel SNC topology as default, COD as exception") from Alison, where they mention seeing the following splat when running with SNC:
sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
This comes from the topology_sane() check in arch/x86/kernel/smpboot.c, and match_llc() on the x86 side was modified to work around it.
But I think there were some really dodgy VMs too.
For VMs too, it is easy to trip the topology_sane() check on the x86 side. With QEMU, I can run:
  qemu-system-x86_64 -enable-kvm -cpu host \
          -smp cpus=32,sockets=2,cores=8,threads=2 \
          ...
          -numa node,cpus=0-7,cpus=16-23,memdev=m0,nodeid=0 \
          -numa node,cpus=8-15,cpus=24-31,memdev=m1,nodeid=1 \
          ...
and I get:
sched: CPU #8's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
This is because consecutive CPUs (0-1,2-3,...) are SMT siblings and CPUs 0-15 are on the same socket as a result of how QEMU presents the MADT to the guest, but then I go ahead and mess things up by saying CPUs 0-7,16-23 are on one NUMA node and the rest are on the other.
I still haven't managed to trip topology_span_sane() tho.
But yeah, it's not been often. But basically dodgy BIOS/VM data can mess things up badly enough for it to trip.
Has it ever happened without tripping the topology_sane() check first on the x86 side?
What topology_span_sane() does is iterate over all the CPUs at a given topology level and make sure that the cpumask of a CPU at that level is the same as the cpumask of every other CPU set in that mask.
If two CPUs are set on a mask, they should have the same mask. If CPUs are not set on each other's mask, the masks should be disjoint.
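Put differently, here is that invariant as a standalone sketch (a userspace model, not the kernel code, with a cpumask reduced to a 64-bit bitmap of CPU ids for brevity):

  /*
   * Standalone model of the invariant checked by topology_span_sane()
   * as described above; tl_mask[i] is the mask CPU i reports for the
   * topology level under inspection.
   */
  #include <stdbool.h>
  #include <stdint.h>

  static bool span_sane(const uint64_t *tl_mask, int nr_cpus)
  {
          for (int i = 0; i < nr_cpus; i++) {
                  for (int j = i + 1; j < nr_cpus; j++) {
                          bool linked = tl_mask[i] & (1ULL << j);

                          /* CPUs set in each other's mask must report identical masks ... */
                          if (linked && tl_mask[i] != tl_mask[j])
                                  return false;

                          /* ... and CPUs that are not must have disjoint masks. */
                          if (!linked && (tl_mask[i] & tl_mask[j]))
                                  return false;
                  }
          }
          return true;
  }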
On x86, the way set_cpu_sibling_map() works, CPUs are set on each other's shared masks iff match_*() returns true:
o For SMT, this means:
  - If X86_FEATURE_TOPOEXT is set:
    - pkg_id must match.
    - die_id must match.
    - amd_node_id must match.
    - llc_id must match.
    - Either core_id or cu_id must match. (*)
    - NUMA nodes must match.

  - If !X86_FEATURE_TOPOEXT:
    - pkg_id must match.
    - die_id must match.
    - core_id must match.
    - NUMA nodes must match.
o For CLUSTER this means:
  - If l2c_id is not populated (== BAD_APICID):
    - Same conditions as SMT.

  - If l2c_id is populated (!= BAD_APICID):
    - l2c_id must match.
    - NUMA nodes must match.
o For MC it means:
  - llc_id must be populated (!= BAD_APICID) and must match.
  - If INTEL_SNC: pkg_id must match.
  - If !INTEL_SNC: NUMA nodes must match.
o For PKG domain:
  - Inserted only if !x86_has_numa_in_package.
  - CPUs should be in the same NUMA node.
All in all, other than the one (*) decision point, everything else has to strictly match for CPUs to be set in each other's CPU mask. And if they match with one CPU, they should match with all other CPUs in the mask, and if they mismatch with one, they should mismatch with all, leading to link_mask() never being called.
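As a toy illustration of why such strict matching cannot produce overlapping masks, here is a userspace sketch along the lines of the !X86_FEATURE_TOPOEXT SMT conditions above (hypothetical id fields, not the kernel structures):

  /*
   * Toy model of the linking described above.  Because match() is a
   * strict equality test, it is an equivalence relation: every CPU in a
   * group computes the identical mask, CPUs in different groups end up
   * with disjoint masks, and the span_sane() invariant from earlier
   * holds by construction.
   */
  #include <stdbool.h>
  #include <stdint.h>

  struct cpu_ids { int pkg_id, die_id, core_id, node; };

  static bool match(const struct cpu_ids *a, const struct cpu_ids *b)
  {
          return a->pkg_id  == b->pkg_id  && a->die_id == b->die_id &&
                 a->core_id == b->core_id && a->node   == b->node;
  }

  static void build_masks(const struct cpu_ids *c, uint64_t *mask, int nr_cpus)
  {
          for (int i = 0; i < nr_cpus; i++)
                  for (int j = 0; j < nr_cpus; j++)
                          if (match(&c[i], &c[j]))        /* link both ways */
                                  mask[i] |= 1ULL << j;
  }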
This is why I think the topology_span_sane() check is redundant when the x86 bits have already ensured the masks cannot overlap in all cases except, potentially, the (*) case.
So circling back to my original question around "SDTL_ARCH_VERIFIED", would folks be okay with an early bailout from topology_span_sane() on:
  if (!sched_debug() && (tl->flags & SDTL_ARCH_VERIFIED))
          return;
and more importantly, do folks care enough about topology_span_sane() to have it run on other architectures, and not just have it guarded behind "sched_debug()", which starts off as false by default?
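For completeness, the arch side of such a flag could look roughly like the below (purely an illustrative sketch against the x86 build_sched_topology() pattern, with SDTL_ARCH_VERIFIED being the new flag proposed here):

  /*
   * Illustrative sketch only: an architecture whose sibling-map
   * construction already guarantees non-overlapping masks (as the x86
   * match_*() logic above does) would tag its levels, and
   * topology_span_sane() would then only run for them under sched_debug().
   */
  x86_topology[i++] = (struct sched_domain_topology_level){
          cpu_smt_mask, cpu_smt_flags, .flags = SDTL_ARCH_VERIFIED, SD_INIT_NAME(SMT)
  };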
(Sorry for the long answer explaining my thought process.)
That I can't remember, sorry :/