From: Saurabh Sengar <ssengar@linux.microsoft.com>
On an x86 system under test with 1780 CPUs, topology_span_sane() takes around 8 seconds cumulatively across all of its iterations. It is an expensive operation which performs sanity checks on the non-NUMA topology masks.
CPU topology is not something which changes very frequently, hence make this check optional for systems where the topology is trusted and a faster bootup is needed.
Restrict this check to the sched_verbose kernel cmdline option so that the penalty can be avoided on systems that want to avoid it.
Cc: stable@vger.kernel.org
Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap")
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Co-developed-by: Naman Jain <namjain@linux.microsoft.com>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
---
Changes since v2: https://lore.kernel.org/all/1731922777-7121-1-git-send-email-ssengar@linux.m...
- Use sched_debug() instead of using the sched_debug_verbose variable directly (addressing Prateek's comment)
Changes since v1: https://lore.kernel.org/all/1729619853-2597-1-git-send-email-ssengar@linux.m...
- Use a kernel cmdline param instead of a compile-time flag.
Adding a link to the other patch which is under review:
https://lore.kernel.org/lkml/20241031200431.182443-1-steve.wahl@hpe.com/
The above patch optimizes the topology sanity check, whereas this patch makes it optional. We believe both patches can coexist, as even with the optimization there will still be some performance overhead for this check.
---
 kernel/sched/topology.c | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index c49aea8c1025..b030c1a2121f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2359,6 +2359,13 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
 {
 	int i = cpu + 1;
 
+	/* Skip the topology sanity check for non-debug, as it is a time-consuming operatin */
+	if (!sched_debug()) {
+		pr_info_once("%s: Skipping topology span sanity check. Use `sched_verbose` boot parameter to enable it.\n",
+			     __func__);
+		return true;
+	}
+
 	/* NUMA levels are allowed to overlap */
 	if (tl->flags & SDTL_OVERLAP)
 		return true;
base-commit: 00f3246adeeacbda0bd0b303604e46eb59c32e6e
Hello Naman,
On 2/3/2025 5:17 PM, Naman Jain wrote:
[..snip..]
@@ -2359,6 +2359,13 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
 {
 	int i = cpu + 1;
+	/* Skip the topology sanity check for non-debug, as it is a time-consuming operatin */
s/operatin/operation/
+	if (!sched_debug()) {
+		pr_info_once("%s: Skipping topology span sanity check. Use `sched_verbose` boot parameter to enable it.\n",
This could be broken down as follows:
pr_info_once("%s: Skipping topology span sanity check." " Use `sched_verbose` boot parameter to enable it.\n", __func__);
Running:
grep -r -A 5 "pr_info(.*[^;,]$" kernel/
gives similar usage across kernel/*. Apart from those nits, feel free to add:
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> # x86
if the future version does not change much.
On 2/5/2025 12:50 PM, K Prateek Nayak wrote:
Hello Naman,
[..snip..]
Hello Prateek,

Thanks for reviewing and testing this. I'll make changes based on your feedback in the next version.
Regards,
Naman
On 2/5/2025 12:53 PM, Naman Jain wrote:
[..snip..]
Hi Prateek,

After breaking down the print message based on your suggestion, checkpatch gives a warning. There are no warnings reported with the current version of the change. Even the fix suggested by checkpatch is aligned with what we have right now, so I'll keep it as is, not push further changes as of now, and wait for the maintainers to pick up the patch.
WARNING: quoted string split across lines
#57: FILE: kernel/sched/topology.c:2365:
+		pr_info_once("%s: Skipping topology span sanity check."
+			     " Use `sched_verbose` boot parameter to enable it.\n",
total: 0 errors, 1 warnings, 14 lines checked
Regards,
Naman
Hello all,
On 2/3/2025 5:17 PM, Naman Jain wrote:
[..snip..]
Adding a link to the other patch which is under review:
https://lore.kernel.org/lkml/20241031200431.182443-1-steve.wahl@hpe.com/
The above patch optimizes the topology sanity check, whereas this patch makes it optional. We believe both patches can coexist, as even with the optimization there will still be some performance overhead for this check.
I would like to discuss this in parallel here. Going back to the original problem highlighted in [1], topology_span_sane() came to be as a result of how drivers/base/arch_topology.c computed the cpu_coregroup_mask().
[1] https://lore.kernel.org/all/1577088979-8545-1-git-send-email-prime.zeng@hisi...
The originally described problematic topology is as follows:
**************************
NUMA:          0-2, 3-7
core_siblings: 0-3, 4-7
**************************
with the problematic bit in the handling being:
const struct cpumask *cpu_coregroup_mask(int cpu)
{
	const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));

	...

	if (last_level_cache_is_valid(cpu)) {
		/* If the llc_sibling is subset of node return llc_sibling */
		if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
			core_mask = &cpu_topology[cpu].llc_sibling;

		/* else the core_mask remains cpumask_of_node() */
	}

	...

	return core_mask;
}
For CPU3, the llc_sibling 0-3 is not a subset of the node mask 3-7, and the fallback is to use 3-7. For CPUs 4-7, the llc_sibling 4-7 is a subset of the node mask 3-7, and the core_mask returned is 4-7.
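To see the failure concretely: CPU3's mask (3-7) and CPU4's mask (4-7) intersect without being equal. Below is a minimal standalone sketch (plain userspace C, not kernel code; the mask values are just the example above encoded as bits) of the equal-or-disjoint property that non-NUMA levels must satisfy:

	#include <stdio.h>

	int main(void)
	{
		unsigned int cpu3_mask = 0xf8;	/* CPU3 falls back to node mask 3-7 */
		unsigned int cpu4_mask = 0xf0;	/* CPUs 4-7 get llc_sibling 4-7     */

		int equal    = (cpu3_mask == cpu4_mask);
		int disjoint = !(cpu3_mask & cpu4_mask);

		/* Neither equal nor disjoint: the partial overlap the check rejects */
		printf("equal=%d disjoint=%d -> %s\n", equal, disjoint,
		       (equal || disjoint) ? "sane" : "partial overlap");
		return 0;
	}

This prints "equal=0 disjoint=0 -> partial overlap", which is exactly the inconsistency topology_span_sane() was added to catch.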
In the case of x86 (and perhaps other archs too), the arch/x86 bits ensure that this inconsistency never happens for !NUMA domains, using the topology IDs. If a set of IDs match between two CPUs, the CPUs are set in each other's per-CPU topology mask (see link_mask() usage and the match_*() functions in arch/x86/kernel/smpboot.c).
If the set of IDs matches with one CPU, it should match with all other CPUs set in the cpumask for a given topology level. If it doesn't match with one, it will not match with any other CPU in the cpumask either. The cpumasks of two CPUs can either be equal or disjoint at any given level. Steve's optimization reverses this to check whether the cpumasks of sets of CPUs match.
Have there been any reports on an x86 system / VM where topology_span_sane() was tripped? Looking at the implementation it does not seem possible (at least to my eyes), with the one exception of AMD Fam 0x15 processors, which set "cu_id"; match_smt() will look at cu_id if the core_id doesn't match between 2 CPUs. It may so happen that core IDs match with one set of CPUs and cu_id matches with another set of CPUs if the information from CPUID is faulty.
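To illustrate how that case could go wrong, here is a small standalone sketch (the IDs are made up, and smt_match() is a simplified stand-in for the real match_smt(), not its actual code) showing that an either-or match on core_id or cu_id is not transitive:

	#include <stdbool.h>
	#include <stdio.h>

	struct cpu_ids { int core_id; int cu_id; };

	/* Simplified: either ID matching links the two CPUs as SMT siblings */
	static bool smt_match(struct cpu_ids a, struct cpu_ids b)
	{
		return a.core_id == b.core_id || a.cu_id == b.cu_id;
	}

	int main(void)
	{
		/* Hypothetical faulty CPUID data for three CPUs */
		struct cpu_ids A = { .core_id = 0, .cu_id = 1 };
		struct cpu_ids B = { .core_id = 0, .cu_id = 2 };
		struct cpu_ids C = { .core_id = 3, .cu_id = 2 };

		/* A~B and B~C hold, but A~C does not: partially overlapping masks */
		printf("A~B=%d B~C=%d A~C=%d\n",
		       smt_match(A, B), smt_match(B, C), smt_match(A, C));
		return 0;
	}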
What I'm getting to is that the arch specific topology parsing code can set a "SDTL_ARCH_VERIFIED" flag indicating that the arch specific bits have verified that the cpumasks are either equal or disjoint, and since sched_debug() is "false" by default, topology_span_sane() can bail out if:
	if (!sched_debug() && (tl->flags & SDTL_ARCH_VERIFIED))
		return;
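For illustration, a rough sketch of how that bailout could look. Note that SDTL_ARCH_VERIFIED does not exist in the tree (it is only proposed later in this thread) and the flag value below is made up, modeled on the existing SDTL_OVERLAP handling; since topology_span_sane() returns bool, the bailout would be `return true` rather than a bare `return`:

	/* Hypothetical flag: arch guarantees equal-or-disjoint masks at this level */
	#define SDTL_ARCH_VERIFIED	0x02

	static bool topology_span_sane(struct sched_domain_topology_level *tl,
				       const struct cpumask *cpu_map, int cpu)
	{
		/* Arch already verified the masks; only re-check when debugging */
		if (!sched_debug() && (tl->flags & SDTL_ARCH_VERIFIED))
			return true;

		/* NUMA levels are allowed to overlap */
		if (tl->flags & SDTL_OVERLAP)
			return true;

		/* ... existing pairwise equal-or-disjoint check ... */
		return true;
	}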
In case the arch specific parsing was wrong, "sched_verbose" can always be used to double-check the topology, and for the archs that require this sanity check, Steve's optimized version of topology_span_sane() can be run to be sure of the sanity.
All this justification is in case folks want to keep topology_span_sane() around but if no one cares, Naman and Saurabh's approach works as intended.
On Wed, Feb 05, 2025 at 03:18:24PM +0530, K Prateek Nayak wrote:
Have there been any reports on an x86 system / VM where topology_span_sane() was tripped?
At the very least Intel SNC 'feature' tripped it at some point. They figured it made sense to have the LLC span two nodes.
But I think there were some really dodgy VMs too.
But yeah, its not been often. But basically dodgy BIOS/VM data can mess up things badly enough for it to trip.
Hello Peter,
Thank you for the background!
On 2/5/2025 3:25 PM, Peter Zijlstra wrote:
On Wed, Feb 05, 2025 at 03:18:24PM +0530, K Prateek Nayak wrote:
Have there been any reports on an x86 system / VM where topology_span_sane() was tripped?
At the very least Intel SNC 'feature' tripped it at some point. They figured it made sense to have the LLC span two nodes.
But I think there were some really dodgy VMs too.
But yeah, its not been often. But basically dodgy BIOS/VM data can mess up things badly enough for it to trip.
Has it ever happened without tripping the topology_sane() check first on the x86 side?
On Wed, Feb 05, 2025 at 03:43:54PM +0530, K Prateek Nayak wrote:
Hello Peter,
Thank you for the background!
On 2/5/2025 3:25 PM, Peter Zijlstra wrote:
On Wed, Feb 05, 2025 at 03:18:24PM +0530, K Prateek Nayak wrote:
Have there been any reports on an x86 system / VM where topology_span_sane() was tripped?
At the very least Intel SNC 'feature' tripped it at some point. They figured it made sense to have the LLC span two nodes.
But I think there were some really dodgy VMs too.
But yeah, its not been often. But basically dodgy BIOS/VM data can mess up things badly enough for it to trip.
Has it ever happened without tripping the topology_sane() check first on the x86 side?
That I can't remember, sorry :/
Hello Peter,
On 2/5/2025 3:46 PM, Peter Zijlstra wrote:
On Wed, Feb 05, 2025 at 03:43:54PM +0530, K Prateek Nayak wrote:
Hello Peter,
Thank you for the background!
On 2/5/2025 3:25 PM, Peter Zijlstra wrote:
On Wed, Feb 05, 2025 at 03:18:24PM +0530, K Prateek Nayak wrote:
Have there been any reports on an x86 system / VM where topology_span_sane() was tripped?
At the very least Intel SNC 'feature' tripped it at some point. They figured it made sense to have the LLC span two nodes.
I'm 99% sure that this might have been the topology_sane() check on the x86 side and not the topology_span_sane() check in kernel/sched/topology.c
I believe one of the original changes that did the plumbing for SNC was commit 2c88d45edbb8 ("x86, sched: Treat Intel SNC topology as default, COD as exception") from Alison, where they mention that they saw the following splat when running with SNC:
sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
This comes from the topology_sane() check in arch/x86/kernel/smpboot.c, and match_llc() on the x86 side was modified to work around that.
But I think there were some really dodgy VMs too.
For VMs too, it is easy to trip topology_sane() check on x86 side. With QEMU, I can run:
qemu-system-x86_64 -enable-kvm -cpu host \
	-smp cpus=32,sockets=2,cores=8,threads=2 \
	... \
	-numa node,cpus=0-7,cpus=16-23,memdev=m0,nodeid=0 \
	-numa node,cpus=8-15,cpus=24-31,memdev=m1,nodeid=1 \
	...
and I get:
sched: CPU #8's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
This is because consecutive CPUs (0-1,2-3,...) are SMT siblings and CPUs 0-15 are on the same socket as a result of how QEMU presents the MADT to the guest, but then I go ahead and mess things up by saying CPUs 0-7,16-23 are on one NUMA node, and the rest are on the other.
I still haven't managed to trip topology_span_sane() tho.
But yeah, its not been often. But basically dodgy BIOS/VM data can mess up things badly enough for it to trip.
Has it ever happened without tripping the topology_sane() check first on the x86 side?
What topology_span_sane() does is iterate over all the CPUs at a given topology level and make sure that the cpumask for a CPU at that domain is the same as the cpumask of every other CPU set on that mask for that topology level.
If two CPUs are set on a mask, they should have the same mask. If CPUs are not set on each other's mask, the masks should be disjoint.
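Roughly, the check looks like the following simplified sketch (the exact code in kernel/sched/topology.c may differ across versions):

	int i = cpu + 1;

	for_each_cpu_from(i, cpu_map) {
		/*
		 * Masks at a non-NUMA level must be either identical or
		 * completely disjoint; a partial overlap means the
		 * topology is broken.
		 */
		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
			return false;
	}
	return true;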
On x86, the way set_cpu_sibling_map() works, CPUs are set on each other's shared masks iff match_*() returns true:
o For SMT, this means:
  - If X86_FEATURE_TOPOEXT is set:
    - pkg_id must match.
    - die_id must match.
    - amd_node_id must match.
    - llc_id must match.
    - Either core_id or cu_id must match. (*)
    - NUMA nodes must match.

  - If !X86_FEATURE_TOPOEXT:
    - pkg_id must match.
    - die_id must match.
    - core_id must match.
    - NUMA nodes must match.
o For CLUSTER this means:
  - If l2c_id is not populated (== BAD_APICID):
    - Same conditions as SMT.

  - If l2c_id is populated (!= BAD_APICID):
    - l2c_id must match.
    - NUMA nodes must match.
o For MC it means:
  - llc_id must be populated (!= BAD_APICID) and must match.
  - If INTEL_SNC: pkg_id must match.
  - If !INTEL_SNC: NUMA nodes must match.
o For PKG domain:
  - Inserted only if !x86_has_numa_in_package.
  - CPUs should be in the same NUMA node.
All in all, other than the one (*) decision point, everything else has to strictly match for CPUs to be set in each other's CPU mask. And if they match with one CPU, they should match with all other CPUs in the mask, and if they mismatch with one, they should mismatch with all, leading to link_mask() never being called.
This is why I think that the topology_span_sane() check is redundant when the x86 bits have already ensured masks cannot overlap in all cases except for potentially in the (*) case.
So circling back to my original question around "SDTL_ARCH_VERIFIED", would folks be okay with an early bailout from topology_span_sane() on:
	if (!sched_debug() && (tl->flags & SDTL_ARCH_VERIFIED))
		return;
and more importantly, do folks care enough about topology_span_sane() to have it run on other architectures, and not just have it guarded behind "sched_debug()" which starts off as false by default?
(Sorry for the long answer explaining my thought process.)
On 2/6/2025 2:40 PM, K Prateek Nayak wrote:
[..snip..]
So circling back to my original question around "SDTL_ARCH_VERIFIED", would folks be okay with an early bailout from topology_span_sane() on:
	if (!sched_debug() && (tl->flags & SDTL_ARCH_VERIFIED))
		return;
and more importantly, do folks care enough about topology_span_sane() to have it run on other architectures, and not just have it guarded behind "sched_debug()" which starts off as false by default?
(Sorry for the long answer explaining my thought process.)
Thanks for sharing your valuable insights. I am sorry, I could not find SDTL_ARCH_VERIFIED in linux-next tip. Am I missing something?
Regards,
Naman
Hello Naman,
On 2/6/2025 3:17 PM, Naman Jain wrote:
[..snip..]
Thanks for sharing your valuable insights. I am sorry, I could not find SDTL_ARCH_VERIFIED in linux-next tip. Am I missing something?
It does not exist yet. I was proposing to define this new flag "SDTL_ARCH_VERIFIED" which a particular arch can set if the topology parsing code has taken care of making sure that the cpumasks cannot overlap. The original motivation for topology_span_sane() discussed in [1] came from an ARM processor where the function that returns the cpumask is not based on ID checks and can possibly allow overlapping masks.
With the exception of AMD Fam 0x15 processors which populate cu_id (and even that is a theoretical case), I believe all x86 processors can set this new flag "SDTL_ARCH_VERIFIED" and can safely skip topology_span_sane(), since it checks for a condition that cannot possibly be false as a result of how these masks are built on x86.
[1] https://lore.kernel.org/lkml/f6bf04e8-3007-4a44-86d8-2cc671c85247@amd.com/
On 2/6/2025 3:49 PM, K Prateek Nayak wrote:
Hello Naman,
On 2/6/2025 3:17 PM, Naman Jain wrote:
[..snip..]
Thanks for sharing your valuable insights. I am sorry, I could not find SDTL_ARCH_VERIFIED in linux-next tip. Am I missing something?
It does not exist yet. I was proposing to define this new flag "SDTL_ARCH_VERIFIED" which a particular arch can set if the topology parsing code has taken care of making sure that the cpumasks cannot overlap. The original motivation for topology_span_sane() discussed in [1] came from an ARM processor where the function that returns the cpumask is not based on ID checks and can possibly allow overlapping masks.

With the exception of AMD Fam 0x15 processors which populate cu_id (and even that is a theoretical case), I believe all x86 processors can set this new flag "SDTL_ARCH_VERIFIED" and can safely skip topology_span_sane(), since it checks for a condition that cannot possibly be false as a result of how these masks are built on x86.
[1] https://lore.kernel.org/lkml/f6bf04e8-3007-4a44-86d8-2cc671c85247@amd.com/
I think the check for sched_debug() should suffice here, without making it more complicated. This way, we give control to the user to have it or not. I'll wait for a few more days for any additional feedback and post v4 with your initial review comments addressed.
Regards,
Naman
On 06/02/25 14:40, K Prateek Nayak wrote:
[..snip..]
Nice summary, thanks for that - I'm not that familiar with the x86 topology faff.
This is why I think that the topology_span_sane() check is redundant when the x86 bits have already ensured masks cannot overlap in all cases except for potentially in the (*) case.
So circling back to my original question around "SDTL_ARCH_VERIFIED", would folks be okay with an early bailout from topology_span_sane() on:
	if (!sched_debug() && (tl->flags & SDTL_ARCH_VERIFIED))
		return;
and more importantly, do folks care enough about topology_span_sane() to have it run on other architectures, and not just have it guarded behind "sched_debug()" which starts off as false by default?
If/when possible I prefer to have sanity checks run unconditionally, as long as they don't noticeably impact runtime. Unfortunately this does show up in the boot time, though Steve had a promising improvement for that.
Anyway, if someone gets one of those hangs on a
do { } while (group != sd->groups)
they'll quickly turn on sched_verbose (or be told to) and the sanity check will holler at them, so I'm not entirely against it.
On Thu, Feb 06, 2025 at 04:24:17PM +0100, Valentin Schneider wrote:
If/when possible I prefer to have sanity checks run unconditionally, as long as they don't noticeably impact runtime. Unfortunately this does show up in the boot time, though Steve had a promising improvement for that.
Just to let you know, I had other tasks interrupt me, but I'm back to looking at this and you should see a new version from me within the next few days.
Thanks,
--> Steve
On 2/6/2025 8:54 PM, Valentin Schneider wrote:
On 06/02/25 14:40, K Prateek Nayak wrote:
[..snip..]
If/when possible I prefer to have sanity checks run unconditionally, as long as they don't noticeably impact runtime. Unfortunately this does show up in the boot time, though Steve had a promising improvement for that.
Anyway, if someone gets one of those hangs on a
do { } while (group != sd->groups)
they'll quickly turn on sched_verbose (or be told to) and the sanity check will holler at them, so I'm not entirely against it.
Thanks for the feedback :)
Regards,
Naman
Hello Valentin,
On 2/6/2025 8:54 PM, Valentin Schneider wrote:
[..snip..]
So circling back to my original question around "SDTL_ARCH_VERIFIED", would folks be okay with an early bailout from topology_span_sane() on:
	if (!sched_debug() && (tl->flags & SDTL_ARCH_VERIFIED))
		return;
and more importantly, do folks care enough about topology_span_sane() to have it run on other architectures, and not just have it guarded behind "sched_debug()" which starts off as false by default?
If/when possible I prefer to have sanity checks run unconditionally, as long as they don't noticeably impact runtime. Unfortunately this does show up in the boot time, though Steve had a promising improvement for that.
Anyway, if someone gets one of those hangs on a
do { } while (group != sd->groups)
they'll quickly turn on sched_verbose (or be told to) and the sanity check will holler at them, so I'm not entirely against it.
If you're game, I'm too!
I just put it out there in case folks had any strong feelings against this on other archs, but that doesn't seem to be the case, and we all love a simple solution :)
--
Thanks and Regards,
Prateek
On 2/5/25 15:18, K Prateek Nayak wrote:
Hello all,
[..snip..]
What I'm getting to is that the arch specific topology parsing code can set a "SDTL_ARCH_VERIFIED" flag indicating that the arch specific bits have verified that the cpumasks are either equal or disjoint, and since sched_debug() is "false" by default, topology_span_sane() can bail out if:
	if (!sched_debug() && (tl->flags & SDTL_ARCH_VERIFIED))
		return;
It would be simpler to use sched_debug(), no?

Since it can be enabled at runtime via "echo Y > verbose" if one needs to enable it even after boot, wouldn't that suffice to run topology_span_sane() by doing a hotplug?
On 2/11/2025 11:22 AM, Shrikanth Hegde wrote:
On 2/5/25 15:18, K Prateek Nayak wrote:
[..snip..]
It would be simpler to use sched_debug(), no?

Since it can be enabled at runtime via "echo Y > verbose" if one needs to enable it even after boot, wouldn't that suffice to run topology_span_sane() by doing a hotplug?
I agree with your point. We are keeping it the same. Thanks.
Regards,
Naman