Hi Maciej,
On 7/4/24 12:23 AM, Maciej Wieczor-Retman wrote:
On 2024-07-03 at 13:51:03 -0700, Reinette Chatre wrote:
On 7/3/24 12:43 AM, Maciej Wieczór-Retman wrote:
On 3.07.2024 00:21, Reinette Chatre wrote:
On 7/1/24 7:18 AM, Maciej Wieczor-Retman wrote:
...
SNC might not be enabled at all so there would be no reason to send the user to check their BIOS - instead they can learn they have offline CPUs and they can work on fixing that. In my opinion it could be beneficial to have more specialized messages in the selftests to help users diagnose problems quicker.
My goal is indeed to have specialized messages. There cannot be a specialized message if snc_reliable == 1. In this case it needs to be generic since SNC may or may not be enabled and it is up to the user to investigate further.
How about this in cmt_run_test() for example:
if (snc_unreliable) ksft_print_msg("Intel CMT may be inaccurate or inefficient when Sub-NUMA Clustering is enabled and not properly detected.\n");
It is ok with me if you want to keep the message even if the test succeeds. Without seeing the new implementation it is unclear to me why the SNC check below is guarded by an ARCH_INTEL check while the one above is not. Ideally this should be consistent to not confuse how the architectures need to be treated here.
The message does sound a bit vague to me about being able to detect SNC. How about something like: Sub-NUMA Clustering could not be detected properly (see earlier messages for details). Intel CMT may be inaccurate.
else if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support()) ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
The "Check BIOS configuration" guidance is not clear to me. If the kernel does not support SNC then the user could also be guided to upgrade their kernel? Perhaps that snippet can just be dropped? To be more specific that SNC enabling is not a kernel configuration but a system configuration this can read (please feel free to improve): Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.
This way there is a generic message when snc_unreliable == 1.
And as you mentioned at the end of this email, the user can be expected to backtrack to the beginning of the test if there are any problems so they can discover the exact source of the issue - offline cpus.
Having only this one message wihtout the "if snc unreliable" messages would mean nothing would get printed at the end on success with unreliable SNC detection.
Having a pass/fail is what user will focus on. If the test passes then SNC detection should not matter. The messages are just there to help user root cause where a failure may be.
My train of thought was that if test passes with broken SNC detection it means SNC was used inefficiently right? (either the portion of L3 used was bigger or smaller than that allocated for one cluster)
It's not exactly a failure but I thought it deserves a warning at the very end to alert the user.
If you don't think the warning should be printed on success I guess the condition can be: if (ret && snc_unreliable) and the user can look at the start of the test if they care about more information. And the message can lose the "inefficient" word since it would only happen on error.
Reinette