Hi all,
I have noticed strange messages in kernel version 6.9, apparently from CPU topology detection, which were not present in 6.8.y and earlier kernels.
This is coming from an older server machine: 2-socket Ivy Bridge Xeon E5-2697 v2 (24C/48T) in an Asus Z9PE-D16/2L motherboard (Intel C-602A chipset); BIOS patched to the latest available from Asus. All memory slots occupied, so 256 GB RAM in total.
From a "good boot", e.g. kernel 6.8.11, dmesg output looks like this:
[ 1.823797] smpboot: x86: Booting SMP configuration:
[ 1.823799] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11
[ 1.827514] .... node #1, CPUs: #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23
[ 0.011462] smpboot: CPU 12 Converting physical 0 to logical die 1
[ 1.875532] .... node #0, CPUs: #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35
[ 1.882453] .... node #1, CPUs: #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47
[ 1.887532] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 1.933640] smp: Brought up 2 nodes, 48 CPUs
[ 1.933640] smpboot: Max logical packages: 2
[ 1.933640] smpboot: Total of 48 processors activated (259199.61 BogoMIPS)
From a "bad" boot, e.g. kernel 6.9.2, dmesg output has these messages in it:
[ 1.785937] smpboot: x86: Booting SMP configuration:
[ 1.785939] .... node #0, CPUs: #4
[ 1.786215] .... node #1, CPUs: #12 #16
[ 1.793547] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 1.797547] .... node #0, CPUs: #1 #2 #3 #5 #6 #7 #8 #9 #10 #11
[ 1.801858] .... node #1, CPUs: #13 #14 #15 #17 #18 #19 #20 #21 #22 #23
[ 1.804687] .... node #0, CPUs: #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35
[ 1.810728] .... node #1, CPUs: #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47
[ 1.901547] smp: Brought up 2 nodes, 48 CPUs
[ 1.901547] smpboot: Total of 48 processors activated (259207.87 BogoMIPS)
[ 1.903803] BUG: arch topology borken
[ 1.903879] the SMT domain not a subset of the CLS domain
[ 1.903970] BUG: arch topology borken
[ 1.904040] the SMT domain not a subset of the CLS domain
[ 1.904128] BUG: arch topology borken
[ 1.904198] the SMT domain not a subset of the CLS domain
... and this "BUG" line and the one following it repeat 48 times, which is the number of logical CPUs this machine has. Also, there is a funny typo in the message, but that might be intended, I guess?! Moreover, I noticed that the detection message for CPU #12 on node #1 is missing, so the counting may be wrong?!
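One way to check the counting is to tally the CPU numbers announced on the "node #N, CPUs:" lines and to count the BUG lines mechanically. The following is only a minimal sketch: the function names and the embedded excerpt are illustrative, and it assumes the dmesg line layout shown in this mail.

```python
import re

# Matches e.g. ".... node #1, CPUs: #12 #16" and captures the node number
# and the run of "#<cpu>" tokens that follows it.
NODE_LINE = re.compile(r"\.\.\.\. node\s+#(\d+), CPUs:((?:\s+#\d+)+)")
BUG_LINE = "BUG: arch topology borken"


def summarize_bringup(dmesg_text):
    """Return (cpus_per_node, bug_count) for a dmesg dump given as a string."""
    cpus_per_node = {}
    for node, cpus in NODE_LINE.findall(dmesg_text):
        ids = [int(n) for n in re.findall(r"#(\d+)", cpus)]
        cpus_per_node.setdefault(int(node), []).extend(ids)
    return cpus_per_node, dmesg_text.count(BUG_LINE)


def missing_cpus(cpus_per_node, total_cpus):
    """List CPU numbers in 1..total_cpus-1 that no node line announced.

    CPU 0 (the boot CPU) does not appear on these lines, so it is skipped.
    """
    announced = {cpu for ids in cpus_per_node.values() for cpu in ids}
    return sorted(set(range(1, total_cpus)) - announced)


# Excerpt from the 6.9.2 boot quoted above (only the first BUG pair kept).
SAMPLE = """\
[ 1.785939] .... node #0, CPUs: #4
[ 1.786215] .... node #1, CPUs: #12 #16
[ 1.797547] .... node #0, CPUs: #1 #2 #3 #5 #6 #7 #8 #9 #10 #11
[ 1.801858] .... node #1, CPUs: #13 #14 #15 #17 #18 #19 #20 #21 #22 #23
[ 1.804687] .... node #0, CPUs: #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35
[ 1.810728] .... node #1, CPUs: #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47
[ 1.903803] BUG: arch topology borken
[ 1.903879] the SMT domain not a subset of the CLS domain
"""

cpus, bugs = summarize_bringup(SAMPLE)
print("CPUs per node:", {node: len(ids) for node, ids in cpus.items()})
print("missing CPUs:", missing_cpus(cpus, 48))
print("BUG lines:", bugs)
```

To run it against a full log, pass the contents of a saved dmesg file to summarize_bringup() instead of SAMPLE.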
However, the machine boots, and apart from these strange messages, I cannot detect any other abnormal behaviour. It is running ~15 QEMU/KVM virtual machines just fine. Because these messages look unusual and a bit scary, though, I have bisected the issue to be able to report it here. The first bad commit I found is this one:
22d63660c35eb751c63a709bf901a64c1726592a is the first bad commit
commit 22d63660c35eb751c63a709bf901a64c1726592a
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Tue Feb 13 22:04:08 2024 +0100

    x86/cpu: Use common topology code for Intel

    Intel CPUs use either topology leaf 0xb/0x1f evaluation or the legacy
    SMP/HT evaluation based on CPUID leaf 0x1/0x4.

    Move it over to the consolidated topology code and remove the random
    topology hacks which are sprinkled into the Intel and the common code.

    No functional change intended.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Tested-by: Juergen Gross <jgross@suse.com>
    Tested-by: Sohil Mehta <sohil.mehta@intel.com>
    Tested-by: Michael Kelley <mhklinux@outlook.com>
    Tested-by: Zhang Rui <rui.zhang@intel.com>
    Tested-by: Wang Wendy <wendy.wang@intel.com>
    Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Link: https://lore.kernel.org/r/20240212153624.893644349@linutronix.de

 arch/x86/kernel/cpu/common.c          | 65 -----------------------------------
 arch/x86/kernel/cpu/cpu.h             |  4 ---
 arch/x86/kernel/cpu/intel.c           | 25 --------------
 arch/x86/kernel/cpu/topology.c        | 22 ------------
 arch/x86/kernel/cpu/topology_common.c |  5 ++-
 5 files changed, 4 insertions(+), 117 deletions(-)
root@linus:/usr/src/linux#
I am attaching my bisect log and the full dmesg output from a good and from a bad kernel version.
Moreover, the last 3 bad kernels from my bisect session did not boot at all, including the one built at the first bad commit above. These kernels also had the series of "BUG" messages scrolling by on the console, followed by a kernel panic, seemingly caused by a divide exception in function init_intel_microcode:
<5>[ 5.968685] Key type dns_resolver registered
<4>[ 5.974402] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
<4>[ 5.977017] divide error: 0000 [#1] PREEMPT SMP PTI
<4>[ 5.977116] CPU: 9 PID: 1 Comm: swapper/0 Not tainted 6.8.0-rc4+ #1
<4>[ 5.977213] Hardware name: ASUSTeK COMPUTER INC. Z9PE-D16 Series/Z9PE-D16 Series, BIOS 5601 06/11/2015
<4>[ 5.977337] RIP: 0010:init_intel_microcode+0x3c/0x80
<4>[ 5.977436] Code: ff 75 44 40 80 fe 05 76 3e 48 8b 05 b6 45 f7 ff a9 00 00 00 40 75 30 8b 05 85 46 f7 ff 0f b7 0d aa 46 f7 ff 31 d2 48 c1 e0 0a <48> f7 f1 89 05 9b f9 46 ff 48 c7 c0 c0 98 e4 a8 31 d2 31 c9 31 f6
<4>[ 5.977602] RSP: 0000:ffffb79b8008fd80 EFLAGS: 00010206
<4>[ 5.977697] RAX: 0000000001e00000 RBX: 0000000000000000 RCX: 0000000000000000
<4>[ 5.977795] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000000
<4>[ 5.977894] RBP: ffffb79b8008fdf8 R08: 0000000000000000 R09: 0000000000000000
<4>[ 5.977992] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[ 5.978090] R13: 000000000000019a R14: ffffb79b8008fe08 R15: ffff96ad4026cf00
<4>[ 5.978187] FS: 0000000000000000(0000) GS:ffff96cc3fa40000(0000) knlGS:0000000000000000
<4>[ 5.978308] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 5.978402] CR2: 0000000000000000 CR3: 0000000e6d236001 CR4: 00000000001706f0
<4>[ 5.978500] Call Trace:
<4>[ 5.978588] <TASK>
<4>[ 5.978675] ? show_regs+0x6d/0x80
<4>[ 5.978767] ? die+0x37/0xa0
<4>[ 5.978857] ? do_trap+0xd4/0xf0
<4>[ 5.978948] ? do_error_trap+0x71/0xb0
<4>[ 5.979040] ? init_intel_microcode+0x3c/0x80
<4>[ 5.979131] ? exc_divide_error+0x3a/0x70
<4>[ 5.979226] ? init_intel_microcode+0x3c/0x80
<4>[ 5.979317] ? asm_exc_divide_error+0x1b/0x20
<4>[ 5.979427] ? init_intel_microcode+0x3c/0x80
<4>[ 5.979520] ? microcode_init+0x196/0x260
<4>[ 5.979612] ? __pfx_microcode_init+0x10/0x10
<4>[ 5.979718] do_one_initcall+0x5e/0x340
<4>[ 5.979813] kernel_init_freeable+0x322/0x490
<4>[ 5.979906] ? __pfx_kernel_init+0x10/0x10
<4>[ 5.979998] kernel_init+0x1b/0x200
<4>[ 5.980089] ret_from_fork+0x47/0x70
<4>[ 5.980180] ? __pfx_kernel_init+0x10/0x10
<4>[ 5.980272] ret_from_fork_asm+0x1b/0x30
<4>[ 5.980364] </TASK>
<4>[ 5.980450] Modules linked in:
<4>[ 5.980544] ---[ end trace 0000000000000000 ]---
<4>[ 6.959943] RIP: 0010:init_intel_microcode+0x3c/0x80
<4>[ 6.960041] Code: ff 75 44 40 80 fe 05 76 3e 48 8b 05 b6 45 f7 ff a9 00 00 00 40 75 30 8b 05 85 46 f7 ff 0f b7 0d aa 46 f7 ff 31 d2 48 c1 e0 0a <48> f7 f1 89 05 9b f9 46 ff 48 c7 c0 c0 98 e4 a8 31 d2 31 c9 31 f6
<4>[ 6.960207] RSP: 0000:ffffb79b8008fd80 EFLAGS: 00010206
<4>[ 6.960316] RAX: 0000000001e00000 RBX: 0000000000000000 RCX: 0000000000000000
<4>[ 6.960414] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000000
<4>[ 6.960512] RBP: ffffb79b8008fdf8 R08: 0000000000000000 R09: 0000000000000000
<4>[ 6.960610] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[ 6.960708] R13: 000000000000019a R14: ffffb79b8008fe08 R15: ffff96ad4026cf00
<4>[ 6.960806] FS: 0000000000000000(0000) GS:ffff96cc3fa40000(0000) knlGS:0000000000000000
<4>[ 6.960927] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 6.961021] CR2: 0000000000000000 CR3: 0000000e6d236001 CR4: 00000000001706f0
<0>[ 6.961120] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
<0>[ 6.961312] Kernel Offset: 0x25c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
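As a rough reading of the dump above (an interpretation of the bytes only, not checked against the kernel source): the faulting opcode marked in the Code line, "<48> f7 f1", appears to decode as a 64-bit "div %rcx", preceded by "48 c1 e0 0a" (shift RAX left by 10) and "31 d2" (clear EDX). RCX is zero in the register dump, which would explain the divide error. The small sketch below, with an illustrative helper name, pulls the registers out of the oops text and redoes that arithmetic; it does not say which kernel variable was zero.

```python
import re


def parse_registers(oops_text):
    """Map general-purpose register names (RAX, RCX, R08, ...) to integers.

    Only "Rxx: <16 hex digits>" pairs are picked up, so RIP and RSP
    (printed with a segment prefix) and CR0..CR4 are skipped.
    """
    pairs = re.findall(r"\b(R[A-Z0-9]{2}): ([0-9a-f]{16})\b", oops_text)
    return {name: int(value, 16) for name, value in pairs}


# Register lines copied from the oops above.
OOPS_REGS = """\
<4>[ 5.977697] RAX: 0000000001e00000 RBX: 0000000000000000 RCX: 0000000000000000
<4>[ 5.977795] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000000
<4>[ 5.978402] CR2: 0000000000000000 CR3: 0000000e6d236001 CR4: 00000000001706f0
"""

regs = parse_registers(OOPS_REGS)
dividend = regs["RAX"]   # value after the "shl $10" in the decoded bytes
divisor = regs["RCX"]    # operand of the suspected "div %rcx"

print("RAX before the shift:", dividend >> 10)
print("divisor (RCX):", divisor)
if divisor == 0:
    print("division by zero -> divide error, consistent with the trace")
```

With the values from the dump, RAX >> 10 is 30720 and RCX is 0.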
I have also attached the full dmesg log file "dmesg-erst-7373208397568540677" of this panic, which I found in /var/lib/systemd/pstore.
Best regards,
Peter Schneider