On Fri, Jul 28, 2023 at 05:17:59PM -0400, Joel Fernandes wrote:
On Jul 27, 2023, at 7:18 PM, Joel Fernandes joel@joelfernandes.org wrote:
On Jul 27, 2023, at 4:33 PM, Paul E. McKenney paulmck@kernel.org wrote:
On Thu, Jul 27, 2023 at 10:39:17AM -0700, Guenter Roeck wrote:
On 7/27/23 09:07, Paul E. McKenney wrote:
...]
No. However, (unrelated) in linux-next, rcu tests sometimes result in apparent hangs
or long runtime.
[ 0.778841] Mount-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
[ 0.779011] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
[ 0.797998] Running RCU synchronous self tests
[ 0.798209] Running RCU synchronous self tests
[ 0.912368] smpboot: CPU0: AMD Opteron 63xx class CPU (family: 0x15, model: 0x2, stepping: 0x0)
[ 0.923398] RCU Tasks: Setting shift to 2 and lim to 1 rcu_task_cb_adjust=1.
[ 0.925419] Running RCU-tasks wait API self tests
(hangs until aborted). This is primarily with Opteron CPUs, but also with others such as Haswell,
Icelake-Server, and pentium3. It is all but impossible to bisect because it doesn't happen
all the time. All I was able to figure out was that it has to do with rcu changes in linux-next.
I'd be much more concerned about that.
First I have heard of this, so thank you for letting me know.
About what fraction of the time does this happen?
Here is a sample test log from yesterday's -next. This is with x86_64.
Today's -next always crashes, so no data.
Building x86_64:q35:Broadwell-noTSX:defconfig:smp:net,e1000:mem256:ata:hd ... running ....... passed
Building x86_64:q35:Cascadelake-Server:defconfig:smp:net,e1000e:mem256:ata:cd ... running .................R....... passed
Building x86_64:q35:IvyBridge:defconfig:smp2:net,i82801:efi:mem512:nvme:hd ... running ...... passed
Building x86_64:q35:SandyBridge:defconfig:smp4:net,ne2k_pci:efi32:mem1G:usb:h d ... running ......... passed
Building x86_64:q35:SandyBridge:defconfig:smp8:net,ne2k_pci:mem1G:usb-hub:hd ... running ....... passed
Building x86_64:q35:Haswell:defconfig:smp:tpm-tis:net,pcnet:mem2G:usb-uas:hd ... running .................R.... passed
Building x86_64:q35:Skylake-Client:defconfig:smp2:tpm-tis:net,rtl8139:efi:mem 4G:sdhci:mmc:hd ... running ....... passed
Building x86_64:q35:Conroe:defconfig:smp4:net,tulip:efi32:mem256:scsi[DC395]: hd ... running ....... passed
Building x86_64:q35:Denverton:defconfig:smp2:net,tulip:efi:mem256:scsi[DC395] :hd ... running ....... passed
Building x86_64:q35:EPYC-Milan:defconfig:smp:tpm-crb:net,tulip:mem256:scsi[DC 395]:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net:mem512:scsi[AM53C974 ]:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net-old:mem512:scsi[AM53 C974]:hd ... running ....... passed
Building x86_64:q35:Westmere-IBRS:defconfig:smp2:tpm-crb:net,usb-ohci:efi:mem 1G:scsi[53C810]:cd ... running .................R........... passed
Building x86_64:q35:Skylake-Server:defconfig:smp4:tpm-tis:net,e1000-82544gc:e fi32:mem2G:scsi[53C895A]:hd ... running ............. passed
Building x86_64:pc:EPYC:defconfig:smp:pci-bridge:net,usb-uhci:mem4G:scsi[FUSI ON]:hd ... running ..................R.......... passed
Building x86_64:q35:EPYC-IBPB:defconfig:smp2:net,e1000-82545em:efi:mem8G:scsi [MEGASAS]:hd ... running ....... passed
Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:efi32:mem256:scsi[M EGASAS2]:hd ... running ...... passed
Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS 2]:hd ... running .................R.............. failed (silent)
Building x86_64:pc:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2 ]:hd ... running .......... passed
Building x86_64:pc:phenom:defconfig:smp:net,i82559er:mem512:initrd ... running ........ passed
Building x86_64:q35:Opteron_G1:defconfig:smp2:net,i82562:efi:mem1G:initrd ... running ...... passed
Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-p ci]:hd ... running .................R................. passed
Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-p ci-old]:hd ... running ................... passed
Building x86_64:q35:core2duo:defconfig:smp2:net,i82559a:mem4G:virtio-pci:hd ... running ......... passed
Building x86_64:q35:Broadwell:defconfig:smp4:net,i82558b:efi:mem8G:virtio:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp2:net,i82558a:efi32:mem1G:virtio:hd ... running .................R... passed
Building x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:me m2G:virtio:cd ... running ......... passed
Building x86_64:q35:Icelake-Server:defconfig:preempt:smp8:net,i82557a:mem4G:n vme:hd ... running ...... passed
Building x86_64:q35:Skylake-Client-IBRS:defconfig:preempt:smp2:net,i82558b:ef i32:mem1G:sdhci:mmc:hd ... running ...... passed
Building x86_64:q35:KnightsMill:defconfig:preempt:smp6:net,i82550:mem512:init rd ... running ...... passed
Building x86_64:q35:Cooperlake:defconfig:smp2:net,usb-ohci:efi:mem1G:scsi[53C 810]:hd ... running ....... passed
Building x86_64:q35:EPYC-Rome:defconfig:smp4:net,igb:mem2G:scsi[53C895A]:hd ... running ......... passed
Building x86_64:pc:Opteron_G3:defconfig:nosmp:net,e1000:mem1G:usb:hd ... running ....................R................. failed (silent)
Building x86_64:q35:Opteron_G4:defconfig:nosmp:net,ne2k_pci:efi:mem512:ata:hd ... running .....................R....... passed
Building x86_64:q35:Haswell-noTSX-IBRS:defconfig:nosmp:net,pcnet:efi32:mem2G: ata:hd ... running .................R.............. failed (silent)
An earlier test run:
Building x86_64:q35:Broadwell-noTSX:defconfig:smp:net,e1000:mem256:ata:hd ... running ....... passed
Building x86_64:q35:Cascadelake-Server:defconfig:smp:net,e1000e:mem256:ata:cd ... running .................R....... passed
Building x86_64:q35:IvyBridge:defconfig:smp2:net,i82801:efi:mem512:nvme:hd ... running ........ passed
Building x86_64:q35:SandyBridge:defconfig:smp4:net,ne2k_pci:efi32:mem1G:usb:h d ... running .......... passed
Building x86_64:q35:SandyBridge:defconfig:smp8:net,ne2k_pci:mem1G:usb-hub:hd ... running ....... passed
Building x86_64:q35:Haswell:defconfig:smp:tpm-tis:net,pcnet:mem2G:usb-uas:hd ... running .................R.... passed
Building x86_64:q35:Skylake-Client:defconfig:smp2:tpm-tis:net,rtl8139:efi:mem 4G:sdhci:mmc:hd ... running ....... passed
Building x86_64:q35:Conroe:defconfig:smp4:net,tulip:efi32:mem256:scsi[DC395]: hd ... running ......... passed
Building x86_64:q35:Denverton:defconfig:smp2:net,tulip:efi:mem256:scsi[DC395] :hd ... running ....... passed
Building x86_64:q35:EPYC-Milan:defconfig:smp:tpm-crb:net,tulip:mem256:scsi[DC 395]:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net:mem512:scsi[AM53C974 ]:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net-old:mem512:scsi[AM53 C974]:hd ... running ........ passed
Building x86_64:q35:Westmere-IBRS:defconfig:smp2:tpm-crb:net,usb-ohci:efi:mem 1G:scsi[53C810]:cd ... running .......... passed
Building x86_64:q35:Skylake-Server:defconfig:smp4:tpm-tis:net,e1000-82544gc:e fi32:mem2G:scsi[53C895A]:hd ... running .................R..... passed
Building x86_64:pc:EPYC:defconfig:smp:pci-bridge:net,usb-uhci:mem4G:scsi[FUSI ON]:hd ... running .................R.............. failed (silent)
Building x86_64:q35:EPYC-IBPB:defconfig:smp2:net,e1000-82545em:efi:mem8G:scsi [MEGASAS]:hd ... running ....... passed
Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:efi32:mem256:scsi[M EGASAS2]:hd ... running ....... passed
Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS 2]:hd ... running ....... passed
Building x86_64:pc:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2 ]:hd ... running .......... passed
Building x86_64:pc:phenom:defconfig:smp:net,i82559er:mem512:initrd ... running ........ passed
Building x86_64:q35:Opteron_G1:defconfig:smp2:net,i82562:efi:mem1G:initrd ... running ...... passed
Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-p ci]:hd ... running .......... passed
Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-p ci-old]:hd ... running .......... passed
Building x86_64:q35:core2duo:defconfig:smp2:net,i82559a:mem4G:virtio-pci:hd ... running ...... passed
Building x86_64:q35:Broadwell:defconfig:smp4:net,i82558b:efi:mem8G:virtio:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp2:net,i82558a:efi32:mem1G:virtio:hd ... running ...... passed
Building x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:me m2G:virtio:cd ... running ......... passed
Building x86_64:q35:Icelake-Server:defconfig:preempt:smp8:net,i82557a:mem4G:n vme:hd ... running ....... passed
Building x86_64:q35:Skylake-Client-IBRS:defconfig:preempt:smp2:net,i82558b:ef i32:mem1G:sdhci:mmc:hd ... running ....... passed
Building x86_64:q35:KnightsMill:defconfig:preempt:smp6:net,i82550:mem512:init rd ... running ....... passed
Building x86_64:q35:Cooperlake:defconfig:smp2:net,usb-ohci:efi:mem1G:scsi[53C 810]:hd ... running ........ passed
Building x86_64:q35:EPYC-Rome:defconfig:smp4:net,igb:mem2G:scsi[53C895A]:hd ... running ......... passed
Building x86_64:pc:Opteron_G3:defconfig:nosmp:net,e1000:mem1G:usb:hd ... running ....................R................. failed (silent)
Building x86_64:q35:Opteron_G4:defconfig:nosmp:net,ne2k_pci:efi:mem512:ata:hd ... running ....... passed
Building x86_64:q35:Haswell-noTSX-IBRS:defconfig:nosmp:net,pcnet:efi32:mem2G: ata:hd ... running ....... passed
"R" means retry, and the dots reflect time expired. It looks like it happens most of the time,
but not always, on affected CPUs. I don't have specific data for non-Intel CPUs. I don't think
I see the problem there, but there is too much interference from other problems to be sure.
For comparison, here is the result from the latest mainline:
Building x86_64:q35:Broadwell-noTSX:defconfig:smp:net,e1000:mem256:ata:hd ... running ....... passed
Building x86_64:q35:Cascadelake-Server:defconfig:smp:net,e1000e:mem256:ata:cd ... running .......... passed
Building x86_64:q35:IvyBridge:defconfig:smp2:net,i82801:efi:mem512:nvme:hd ... running ...... passed
Building x86_64:q35:SandyBridge:defconfig:smp4:net,ne2k_pci:efi32:mem1G:usb:h d ... running ......... passed
Building x86_64:q35:SandyBridge:defconfig:smp8:net,ne2k_pci:mem1G:usb-hub:hd ... running ........... passed
Building x86_64:q35:Haswell:defconfig:smp:tpm-tis:net,pcnet:mem2G:usb-uas:hd ... running ........ passed
Building x86_64:q35:Skylake-Client:defconfig:smp2:tpm-tis:net,rtl8139:efi:mem 4G:sdhci:mmc:hd ... running ....... passed
Building x86_64:q35:Conroe:defconfig:smp4:net,tulip:efi32:mem256:scsi[DC395]: hd ... running ....... passed
Building x86_64:q35:Denverton:defconfig:smp2:net,tulip:efi:mem256:scsi[DC395] :hd ... running ....... passed
Building x86_64:q35:EPYC-Milan:defconfig:smp:tpm-crb:net,tulip:mem256:scsi[DC 395]:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net:mem512:scsi[AM53C974 ]:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net-old:mem512:scsi[AM53 C974]:hd ... running ....... passed
Building x86_64:q35:Westmere-IBRS:defconfig:smp2:tpm-crb:net,usb-ohci:efi:mem 1G:scsi[53C810]:cd ... running .......... passed
Building x86_64:q35:Skylake-Server:defconfig:smp4:tpm-tis:net,e1000-82544gc:e fi32:mem2G:scsi[53C895A]:hd ... running ....... passed
Building x86_64:pc:EPYC:defconfig:smp:pci-bridge:net,usb-uhci:mem4G:scsi[FUSI ON]:hd ... running ............. passed
Building x86_64:q35:EPYC-IBPB:defconfig:smp2:net,e1000-82545em:efi:mem8G:scsi [MEGASAS]:hd ... running ....... passed
Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:efi32:mem256:scsi[M EGASAS2]:hd ... running ....... passed
Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS 2]:hd ... running ...... passed
Building x86_64:pc:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2 ]:hd ... running ......... passed
Building x86_64:pc:phenom:defconfig:smp:net,i82559er:mem512:initrd ... running ......... passed
Building x86_64:q35:Opteron_G1:defconfig:smp2:net,i82562:efi:mem1G:initrd ... running ......... passed
Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-p ci]:hd ... running ......... passed
Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-p ci-old]:hd ... running ......... passed
Building x86_64:q35:core2duo:defconfig:smp2:net,i82559a:mem4G:virtio-pci:hd ... running ...... passed
Building x86_64:q35:Broadwell:defconfig:smp4:net,i82558b:efi:mem8G:virtio:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp2:net,i82558a:efi32:mem1G:virtio:hd ... running ...... passed
Building x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:me m2G:virtio:cd ... running ............ passed
Building x86_64:q35:Icelake-Server:defconfig:preempt:smp8:net,i82557a:mem4G:n vme:hd ... running ....... passed
Building x86_64:q35:Skylake-Client-IBRS:defconfig:preempt:smp2:net,i82558b:ef i32:mem1G:sdhci:mmc:hd ... running ...... passed
Building x86_64:q35:KnightsMill:defconfig:preempt:smp6:net,i82550:mem512:init rd ... running ...... passed
Building x86_64:q35:Cooperlake:defconfig:smp2:net,usb-ohci:efi:mem1G:scsi[53C 810]:hd ... running ....... passed
Building x86_64:q35:EPYC-Rome:defconfig:smp4:net,igb:mem2G:scsi[53C895A]:hd ... running .......... passed
Building x86_64:pc:Opteron_G3:defconfig:nosmp:net,e1000:mem1G:usb:hd ... running .......... passed
Building x86_64:q35:Opteron_G4:defconfig:nosmp:net,ne2k_pci:efi:mem512:ata:hd ... running ...... passed
Building x86_64:q35:Haswell-noTSX-IBRS:defconfig:nosmp:net,pcnet:efi32:mem2G: ata:hd ... running ...... passed
I freely confess that I am having a hard time imagining what would
be CPU dependent in that code. Timing, maybe? Whatever the reason,
I am not seeing these failures in my testing.
So which of the following Kconfig options is defined in your .config?
CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
If you have more than one of them, could you please apply this patch
and show me the corresponding console output from the resulting hang?
FWIW, I am not able to repro this issue either. If a .config can be shared of the problem system, I can try it out to see if it can be reproduced on my side.
I do see this now on 5.15 stable:
TASKS03 ------- 3089 GPs (0.858056/s) QEMU killed TASKS03 no success message, 64 successful version messages !!! PID 3309783 hung at 3781 vs. 3600 seconds
I have not looked too closely yet. The full test artifacts are here:
[1]Artifacts of linux-5.15.y 5.15.123 : /tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44 [Jenkins] [2]box.joelfernandes.org [3]apple-touch-icon.png
Thanks,
- Joel
(Apologies if the email is html, I am sending from phone).
Heh. I have a script that runs lynx. Which isn't perfect, but usually makes things at least somewhat legible.
This looks like the prototypical hard hang with interrupts disabled, which could be anywhere in the kernel, including RCU. I am not seeing this. but the usual cause when I have seen it in the past was deadlock of irq-disabled locks. In one spectacular case, it was a timekeeping failure that messed up a CPU-hotplug operation.
If this is reproducible, one trick would be to have a script look at the console.log file, and have it do something (NMI? sysrq? something else?) to qemu if output ceased for too long.
One way to do this without messing with the rcutorture scripting is to grab the qemu-cmd file from this run, and then invoke that file from your own script, possibly with suitable modifications to qemu's parameters.
Thoughts?
Thanx, Paul
Cheers,
Joel
Thanx, Paul
commit 709a917710dc01798e01750ea628ece4bfc42b7b
Author: Paul E. McKenney paulmck@kernel.org
Date: Thu Jul 27 13:13:46 2023 -0700
rcu-tasks: Add printk()s to localize boot-time self-test hang
Currently, rcu_tasks_initiate_self_tests() prints a message and then
initiates self tests on up to three different RCU Tasks flavors. If one
of the flavors has a grace-period hang, it is not easy to work out which
of the three hung. This commit therefore prints a message prior to each
individual test.
Reported-by: Guenter Roeck linux@roeck-us.net
Signed-off-by: Paul E. McKenney paulmck@kernel.org
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 56c470a489c8..427433c90935 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1981,20 +1981,22 @@ static void test_rcu_tasks_callback(struct rcu_head *rhp)
static void rcu_tasks_initiate_self_tests(void)
{
- pr_info("Running RCU-tasks wait API self tests\n");
#ifdef CONFIG_TASKS_RCU
pr_info("Running RCU Tasks wait API self tests\n");
tests[0].runstart = jiffies;
synchronize_rcu_tasks();
call_rcu_tasks(&tests[0].rh, test_rcu_tasks_callback);
#endif
#ifdef CONFIG_TASKS_RUDE_RCU
pr_info("Running RCU Tasks Rude wait API self tests\n");
tests[1].runstart = jiffies;
synchronize_rcu_tasks_rude();
call_rcu_tasks_rude(&tests[1].rh, test_rcu_tasks_callback);
#endif
#ifdef CONFIG_TASKS_TRACE_RCU
pr_info("Running RCU Tasks Trace wait API self tests\n");
tests[2].runstart = jiffies;
synchronize_rcu_tasks_trace();
call_rcu_tasks_trace(&tests[2].rh, test_rcu_tasks_callback);
References
Visible links:
- http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/las...
- http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/las...
- http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/las...
Hidden links: 5. http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/las...
On Fri, Jul 28, 2023 at 6:58 PM Paul E. McKenney paulmck@kernel.org wrote:
On Fri, Jul 28, 2023 at 05:17:59PM -0400, Joel Fernandes wrote:
On Jul 27, 2023, at 7:18 PM, Joel Fernandes joel@joelfernandes.org wrote:
On Jul 27, 2023, at 4:33 PM, Paul E. McKenney paulmck@kernel.org wrote:
On Thu, Jul 27, 2023 at 10:39:17AM -0700, Guenter Roeck wrote:
On 7/27/23 09:07, Paul E. McKenney wrote:
...]
No. However, (unrelated) in linux-next, rcu tests sometimes result in apparent hangs
or long runtime.
[ 0.778841] Mount-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
[ 0.779011] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
[ 0.797998] Running RCU synchronous self tests
[ 0.798209] Running RCU synchronous self tests
[ 0.912368] smpboot: CPU0: AMD Opteron 63xx class CPU (family: 0x15, model: 0x2, stepping: 0x0)
[ 0.923398] RCU Tasks: Setting shift to 2 and lim to 1 rcu_task_cb_adjust=1.
[ 0.925419] Running RCU-tasks wait API self tests
(hangs until aborted). This is primarily with Opteron CPUs, but also with others such as Haswell,
[...]
Building x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:me m2G:virtio:cd ... running ......... passed
[...]
I freely confess that I am having a hard time imagining what would
be CPU dependent in that code. Timing, maybe? Whatever the reason,
I am not seeing these failures in my testing.
So which of the following Kconfig options is defined in your .config?
CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
If you have more than one of them, could you please apply this patch
and show me the corresponding console output from the resulting hang?
FWIW, I am not able to repro this issue either. If a .config can be shared of the problem system, I can try it out to see if it can be reproduced on my side.
I do see this now on 5.15 stable:
TASKS03 ------- 3089 GPs (0.858056/s) QEMU killed TASKS03 no success message, 64 successful version messages !!! PID 3309783 hung at 3781 vs. 3600 seconds
I have not looked too closely yet. The full test artifacts are here:
[1]Artifacts of linux-5.15.y 5.15.123 : /tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44 [Jenkins] [2]box.joelfernandes.org [3]apple-touch-icon.png
Thanks,
- Joel
(Apologies if the email is html, I am sending from phone).
Heh. I have a script that runs lynx. Which isn't perfect, but usually makes things at least somewhat legible.
Sorry I was too optimistic about the iPhone's capabilities when it came to mailing list emails. Here's what I said: -------------- I do see this now on 5.15 stable:
TASKS03 ------- 3089 GPs (0.858056/s) QEMU killed TASKS03 no success message, 64 successful version messages !!! PID 3309783 hung at 3781 vs. 3600 seconds
Link to full logs/artifacts: http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/las... ----------------
This looks like the prototypical hard hang with interrupts disabled, which could be anywhere in the kernel, including RCU. I am not seeing this. but the usual cause when I have seen it in the past was deadlock of irq-disabled locks. In one spectacular case, it was a timekeeping failure that messed up a CPU-hotplug operation.
If this is reproducible, one trick would be to have a script look at the console.log file, and have it do something (NMI? sysrq? something else?) to qemu if output ceased for too long.
One way to do this without messing with the rcutorture scripting is to grab the qemu-cmd file from this run, and then invoke that file from your own script, possibly with suitable modifications to qemu's parameters.
Would it be better to have such monitoring as part of rcutorture testing itself? Alternatively there is the NMI hardlockup detector which I believe should also detect such cases and dump stacks.
thanks,
- Joel
Thoughts?
Thanx, Paul
Cheers,
Joel
Thanx, Paul
commit 709a917710dc01798e01750ea628ece4bfc42b7b
Author: Paul E. McKenney paulmck@kernel.org
Date: Thu Jul 27 13:13:46 2023 -0700
rcu-tasks: Add printk()s to localize boot-time self-test hang
Currently, rcu_tasks_initiate_self_tests() prints a message and then
initiates self tests on up to three different RCU Tasks flavors. If one
of the flavors has a grace-period hang, it is not easy to work out which
of the three hung. This commit therefore prints a message prior to each
individual test.
Reported-by: Guenter Roeck linux@roeck-us.net
Signed-off-by: Paul E. McKenney paulmck@kernel.org
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 56c470a489c8..427433c90935 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1981,20 +1981,22 @@ static void test_rcu_tasks_callback(struct rcu_head *rhp)
static void rcu_tasks_initiate_self_tests(void)
{
- pr_info("Running RCU-tasks wait API self tests\n");
#ifdef CONFIG_TASKS_RCU
pr_info("Running RCU Tasks wait API self tests\n");
tests[0].runstart = jiffies;
synchronize_rcu_tasks();
call_rcu_tasks(&tests[0].rh, test_rcu_tasks_callback);
#endif
#ifdef CONFIG_TASKS_RUDE_RCU
pr_info("Running RCU Tasks Rude wait API self tests\n");
tests[1].runstart = jiffies;
synchronize_rcu_tasks_rude();
call_rcu_tasks_rude(&tests[1].rh, test_rcu_tasks_callback);
#endif
#ifdef CONFIG_TASKS_TRACE_RCU
pr_info("Running RCU Tasks Trace wait API self tests\n");
tests[2].runstart = jiffies;
synchronize_rcu_tasks_trace();
call_rcu_tasks_trace(&tests[2].rh, test_rcu_tasks_callback);
References
Visible links:
- http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/las...
- http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/las...
- http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/las...
Hidden links: 5. http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/las...
On Fri, Jul 28, 2023 at 09:25:35PM -0400, Joel Fernandes wrote:
On Fri, Jul 28, 2023 at 6:58 PM Paul E. McKenney paulmck@kernel.org wrote:
On Fri, Jul 28, 2023 at 05:17:59PM -0400, Joel Fernandes wrote:
On Jul 27, 2023, at 7:18 PM, Joel Fernandes joel@joelfernandes.org wrote:
On Jul 27, 2023, at 4:33 PM, Paul E. McKenney paulmck@kernel.org wrote:
On Thu, Jul 27, 2023 at 10:39:17AM -0700, Guenter Roeck wrote:
On 7/27/23 09:07, Paul E. McKenney wrote:
...]
No. However, (unrelated) in linux-next, rcu tests sometimes result in apparent hangs
or long runtime.
[ 0.778841] Mount-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
[ 0.779011] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
[ 0.797998] Running RCU synchronous self tests
[ 0.798209] Running RCU synchronous self tests
[ 0.912368] smpboot: CPU0: AMD Opteron 63xx class CPU (family: 0x15, model: 0x2, stepping: 0x0)
[ 0.923398] RCU Tasks: Setting shift to 2 and lim to 1 rcu_task_cb_adjust=1.
[ 0.925419] Running RCU-tasks wait API self tests
(hangs until aborted). This is primarily with Opteron CPUs, but also with others such as Haswell,
[...]
Building x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:me m2G:virtio:cd ... running ......... passed
[...]
I freely confess that I am having a hard time imagining what would
be CPU dependent in that code. Timing, maybe? Whatever the reason,
I am not seeing these failures in my testing.
So which of the following Kconfig options is defined in your .config?
CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
If you have more than one of them, could you please apply this patch
and show me the corresponding console output from the resulting hang?
FWIW, I am not able to repro this issue either. If a .config can be shared of the problem system, I can try it out to see if it can be reproduced on my side.
I do see this now on 5.15 stable:
TASKS03 ------- 3089 GPs (0.858056/s) QEMU killed TASKS03 no success message, 64 successful version messages !!! PID 3309783 hung at 3781 vs. 3600 seconds
I have not looked too closely yet. The full test artifacts are here:
[1]Artifacts of linux-5.15.y 5.15.123 : /tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44 [Jenkins] [2]box.joelfernandes.org [3]apple-touch-icon.png
Thanks,
- Joel
(Apologies if the email is html, I am sending from phone).
Heh. I have a script that runs lynx. Which isn't perfect, but usually makes things at least somewhat legible.
Sorry I was too optimistic about the iPhone's capabilities when it came to mailing list emails. Here's what I said:
I do see this now on 5.15 stable:
TASKS03 ------- 3089 GPs (0.858056/s) QEMU killed TASKS03 no success message, 64 successful version messages !!! PID 3309783 hung at 3781 vs. 3600 seconds
Link to full logs/artifacts: http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/las...
This looks like the prototypical hard hang with interrupts disabled, which could be anywhere in the kernel, including RCU. I am not seeing this. but the usual cause when I have seen it in the past was deadlock of irq-disabled locks. In one spectacular case, it was a timekeeping failure that messed up a CPU-hotplug operation.
If this is reproducible, one trick would be to have a script look at the console.log file, and have it do something (NMI? sysrq? something else?) to qemu if output ceased for too long.
One way to do this without messing with the rcutorture scripting is to grab the qemu-cmd file from this run, and then invoke that file from your own script, possibly with suitable modifications to qemu's parameters.
Would it be better to have such monitoring as part of rcutorture testing itself? Alternatively there is the NMI hardlockup detector which I believe should also detect such cases and dump stacks.
Quite possibly. But special-casing the prototype is probably going to be a lot faster and easier. If it works, then it might make a lot of sense to upgrade the scripting. If it doesn't work, then quite a bit less time is wasted than would be by messing with the scripting from the get-go.
Also, you have the option of making qemu be interactive and manually triggering things, for example by checking up on the run near the end. Or having something handing commands to qemu.
Either way allows much more interaction with qemu, and better experiementation, than could be done reasonably with the scripts.
Thanx, Paul
thanks,
- Joel
Thoughts?
Thanx, Paul
Cheers,
Joel
Thanx, Paul
commit 709a917710dc01798e01750ea628ece4bfc42b7b
Author: Paul E. McKenney paulmck@kernel.org
Date: Thu Jul 27 13:13:46 2023 -0700
rcu-tasks: Add printk()s to localize boot-time self-test hang
Currently, rcu_tasks_initiate_self_tests() prints a message and then
initiates self tests on up to three different RCU Tasks flavors. If one
of the flavors has a grace-period hang, it is not easy to work out which
of the three hung. This commit therefore prints a message prior to each
individual test.
Reported-by: Guenter Roeck linux@roeck-us.net
Signed-off-by: Paul E. McKenney paulmck@kernel.org
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 56c470a489c8..427433c90935 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1981,20 +1981,22 @@ static void test_rcu_tasks_callback(struct rcu_head *rhp)
static void rcu_tasks_initiate_self_tests(void)
{
- pr_info("Running RCU-tasks wait API self tests\n");
#ifdef CONFIG_TASKS_RCU
pr_info("Running RCU Tasks wait API self tests\n");
tests[0].runstart = jiffies;
synchronize_rcu_tasks();
call_rcu_tasks(&tests[0].rh, test_rcu_tasks_callback);
#endif
#ifdef CONFIG_TASKS_RUDE_RCU
pr_info("Running RCU Tasks Rude wait API self tests\n");
tests[1].runstart = jiffies;
synchronize_rcu_tasks_rude();
call_rcu_tasks_rude(&tests[1].rh, test_rcu_tasks_callback);
#endif
#ifdef CONFIG_TASKS_TRACE_RCU
pr_info("Running RCU Tasks Trace wait API self tests\n");
tests[2].runstart = jiffies;
synchronize_rcu_tasks_trace();
call_rcu_tasks_trace(&tests[2].rh, test_rcu_tasks_callback);
References
Visible links:
- http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/las...
- http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/las...
- http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/las...
Hidden links: 5. http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/las...
linux-stable-mirror@lists.linaro.org