Hi Maciej,
On 7/3/24 12:43 AM, Maciej Wieczór-Retman wrote:
On 3.07.2024 00:21, Reinette Chatre wrote:
On 7/1/24 7:18 AM, Maciej Wieczor-Retman wrote:
diff --git a/tools/testing/selftests/resctrl/cache.c b/tools/testing/selftests/resctrl/cache.c index 1ff1104e6575..9885d64b8a21 100644 --- a/tools/testing/selftests/resctrl/cache.c +++ b/tools/testing/selftests/resctrl/cache.c @@ -186,4 +186,7 @@ void show_cache_info(int no_of_bits, __u64 avg_llc_val, size_t cache_span, bool ksft_print_msg("Average LLC val: %llu\n", avg_llc_val); ksft_print_msg("Cache span (%s): %zu\n", lines ? "lines" : "bytes", cache_span); + if (snc_unreliable) + ksft_print_msg("SNC detection unreliable due to offline CPUs!\n");
The message abour SNC detection being unreliable is already printed at beginning of every test so I do not think it is necessary to print it again at this point.
The "SNC detection was unreliable" only gets printed on the first execution of run_single_test().
There is more about this later, but this can be printed at start of each test.
That's what the global snc_mode was for mostly, it starts initialized to 0, and then on the first run of run_single_test() it is set so other tests don't call get_snc_mode(). And then the local static variable inside get_snc_mode() prevents the detection from running more than once when other places call get_snc_mode() (like when the cache size is adjusted).
The shadowing of variables can get confusing. I think the global snc_mode is not necessary, having the local static variable within snc_nodes_per_l3_cache() should be sufficient and run_single_test() can just do a:
int snc_mode; /* new name welcome */
snc_mode = snc_nodes_per_l3_cache(); if (snc_mode > 1) ksft_print_msg("SNC-%d mode discovered\n", snc_mode); else if (snc_unreliable) ksft_print_msg("SNC detection unreliable due to offline CPUs. Test results may not be accurate if SNC enabled.\n");
And as we discussed last time it's beneficial to put error messages at the end of the test in case the user misses the initial warning at the very beginning.
Right. What I found unexpected was that it is done "at the end" but from two places, from the show*info() as well as from run*test(). I expect "the end" to be a single place.
} diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c index 0c045080d808..588543ae2654 100644 --- a/tools/testing/selftests/resctrl/cmt_test.c +++ b/tools/testing/selftests/resctrl/cmt_test.c @@ -175,8 +175,8 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param goto out; ret = check_results(¶m, span, n); - if (ret && (get_vendor() == ARCH_INTEL)) - ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
This message does seem to still be applicable if snc_unreliable == 1.
I was going for this one error message to specifically catch the kernel not having snc support for resctrl while snc is enabled. While the above message could be true when snc_unreliable == 1, it doesn't have to.
If a test fails when snc_unreliable == 1 then nothing is certain and some generic message is needed.
SNC might not be enabled at all so there would be no reason to send the user to check their BIOS - instead they can learn they have offline CPUs and they can work on fixing that. In my opinion it could be beneficial to have more specialized messages in the selftests to help users diagnose problems quicker.
My goal is indeed to have specialized messages. There cannot be a specialized message if snc_reliable == 1. In this case it needs to be generic since SNC may or may not be enabled and it is up to the user to investigate further.
Having only this one message wihtout the "if snc unreliable" messages would mean nothing would get printed at the end on success with unreliable SNC detection.
Having a pass/fail is what user will focus on. If the test passes then SNC detection should not matter. The messages are just there to help user root cause where a failure may be.
...
volatile int *value_sink = &sink_target; static struct resctrl_test *resctrl_tests[] = { @@ -123,6 +124,12 @@ static void run_single_test(const struct resctrl_test *test, const struct user_p if (test->disabled) return; + if (!snc_mode) { + snc_mode = get_snc_mode(); + if (snc_mode > 1)
From what I can tell this is the only place the global is used and this can just be: if (get_snc_mode() > 1)
I wanted to print the message below only on the first call to run_single_test() and then print relevant warnings at the very end of each test. I thought that was your intention when we discussed what messages are supposed to be printed and when in v2 of this series.
Do you think it would be better to just print this message at the start of each test?
Yes. If there is a problem with a test the user could be expected to start tracing back messages printed from beginning of failing test.
Or should I make "snc_mode" into local static inside run_single_test()? Or maybe add a second local static variable into get_snc_mode() that would control whether or not the message should be printed?
I do not see where more local static variables may be needed.
Reinette