Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards

26 Jun 2020


On Fri, Jun 26, 2020 at 7:52 PM Steve McIntyre <steve@einval.com> wrote:
...
On Fri, Jun 26, 2020 at 05:50:00PM +0100, Steve McIntyre wrote:
...
On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote:
...
On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@linuxfoundation.org> wrote:
...
On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
...
Considering I'm running strace build tests to provoke this bug,
finding the failure in a commit talking about ptrace changes does look
very suspicious...!
Annoyingly, I can't reproduce this on my disparate other machines
here, suggesting it's maybe(?) timing related.
Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure
prints a warning to dmesg? If so, can you share that warning?
I mean the machine locks hard - X stops updating, the mouse/keyboard
stop responding. No pings, etc. When I reboot, there's nothing in the
logs.
...
If you don't have any way to see console output, and you don't have a
working serial console setup or such, you may want to try re-running
those tests while the kernel is booted with netconsole enabled to log
to a different machine over UDP (see
https://www.kernel.org/doc/Documentation/networking/netconsole.txt).
ACK, will try that now for you.
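On the receiving machine, anything that listens on a UDP port can capture netconsole output. As a rough sketch only, assuming the target port is 6666 (the default given in the netconsole documentation) and using helper names (open_listener, receive_once, log_forever) that are made up for this illustration:

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for incoming netconsole datagrams.

    Port 6666 is assumed here; it must match the target port given
    in the netconsole= parameter on the machine under test.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_once(sock, bufsize=65535):
    """Block until one datagram arrives; return (sender_ip, text)."""
    data, (sender_ip, _sender_port) = sock.recvfrom(bufsize)
    return sender_ip, data.decode("utf-8", errors="replace")


def log_forever(sock):
    """Print every received kernel message, prefixed with the sender."""
    while True:
        sender_ip, text = receive_once(sock)
        print(f"[{sender_ip}] {text}", end="", flush=True)
```

Calling log_forever(open_listener()) on the logging machine before starting the tests should leave the last kernel messages on screen even if the test machine stops responding.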
...
You may want to try setting the sysctl kernel.sysrq=1, then when the
system has locked up, press ALT+PRINT+L (to generate stack traces for
all active CPUs from NMI context), and maybe also ALT+PRINT+T and
ALT+PRINT+W (to collect more information about active tasks).
Nod.
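The same SysRq actions can be exercised from a root shell or script before the lockup, which is a cheap way to check that the dumps actually reach the log destination. A minimal sketch, assuming the standard procfs paths /proc/sys/kernel/sysrq and /proc/sysrq-trigger; the function names are hypothetical, and once the machine has hung only the keyboard route remains:

```python
SYSRQ_ENABLE_PATH = "/proc/sys/kernel/sysrq"
SYSRQ_TRIGGER_PATH = "/proc/sysrq-trigger"


def enable_sysrq(path=SYSRQ_ENABLE_PATH):
    """Equivalent of `sysctl kernel.sysrq=1`; needs root on a real system."""
    with open(path, "w") as f:
        f.write("1\n")


def trigger_sysrq(key, path=SYSRQ_TRIGGER_PATH):
    """Write a single SysRq command character to the trigger file.

    'l' dumps backtraces of all active CPUs, 't' dumps task state,
    'w' dumps blocked (uninterruptible) tasks.
    """
    if len(key) != 1:
        raise ValueError("SysRq command must be a single character")
    with open(path, "w") as f:
        f.write(key)
```

For example, enable_sysrq() followed by trigger_sysrq("l") should produce the per-CPU backtraces in dmesg on a healthy system.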
...
(If you share stack traces from these things with us, it would be
helpful if you could run them through scripts/decode_stacktrace.sh
from the kernel tree first, to add line number information.)
ACK.
Output passed through scripts/decode_stacktrace.sh attached.
Just about to try John's suggestion next.
Okay, so this is some sort of deadlock...
Looking at the NMI backtraces, all the CPUs are blocked on spinlocks:
CPU 3 is blocked on current->sighand->siglock, in tty_open_proc_set_tty()
CPU 1 is blocked on... I'm not sure which lock, somewhere in do_wait()
CPU 2 is blocked on something, somewhere in ptrace_stop()
CPU 0 is stuck on a lock in do_exit()
So I think it's probably something like a classic deadlock, or a
sleeping-in-atomic issue, or a lock-balancing issue (or memory
corruption, that can cause all kinds of weird errors)?
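To make "classic deadlock" concrete: the usual case is two code paths taking the same pair of locks in opposite orders. The toy checker below records which locks were held when another was taken and complains when a new acquisition would close a cycle. It is only an illustration of the idea of lock-order checking, not how the kernel's lockdep is implemented, and the lock names lock_a and lock_b are placeholders rather than locks from the backtraces:

```python
from collections import defaultdict


class LockOrderChecker:
    """Toy lock-order validator (illustrative only).

    Records "A was held while B was taken" edges and warns when a new
    acquisition would create a cycle in that ordering graph.
    """

    def __init__(self):
        self.after = defaultdict(set)  # lock -> locks taken while it was held
        self.held = defaultdict(list)  # context -> locks currently held

    def _reachable(self, src, dst):
        """True if dst can be reached from src via recorded orderings."""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.after[node])
        return False

    def acquire(self, ctx, lock):
        """Record that ctx takes lock; return a warning string or None."""
        warning = None
        for held_lock in self.held[ctx]:
            # If lock already (transitively) precedes held_lock, taking it
            # now while held_lock is held would close a cycle.
            if self._reachable(lock, held_lock):
                warning = (f"possible deadlock: taking {lock} while holding "
                           f"{held_lock} conflicts with an earlier order")
            self.after[held_lock].add(lock)
        self.held[ctx].append(lock)
        return warning

    def release(self, ctx, lock):
        """Record that ctx drops lock."""
        self.held[ctx].remove(lock)
```

With this, one path taking lock_a then lock_b is accepted, and a later path taking lock_b then lock_a gets a warning even though the two paths never actually ran concurrently.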
If it really is a classic deadlock, CONFIG_PROVE_LOCKING=y should be
able to pinpoint the issue.
If it is a sleeping-in-atomic issue, CONFIG_DEBUG_ATOMIC_SLEEP=y should help.
If it is memory corruption, CONFIG_KASAN=y should discover it... but
that might majorly mess up the timing, so if this really is a race,
that might not work.
Maybe flip all of those on, and if it doesn't reproduce anymore, turn
off CONFIG_KASAN and try again?
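Before re-running the tests it may be worth confirming that the rebuilt kernel really has those options set. A small sketch that scans the text of a kernel config file for the three options named above; where that file lives (for example a .config in the build tree) is an assumption left to the caller, and enabled_options is a hypothetical helper name:

```python
import re

DEBUG_OPTIONS = (
    "CONFIG_PROVE_LOCKING",
    "CONFIG_DEBUG_ATOMIC_SLEEP",
    "CONFIG_KASAN",
)


def enabled_options(config_text, wanted=DEBUG_OPTIONS):
    """Return {option: bool} saying whether each option is set to 'y'."""
    # Lines such as "# CONFIG_KASAN is not set" do not match this pattern.
    set_to_y = set(
        re.findall(r"^(CONFIG_\w+)=y$", config_text, flags=re.MULTILINE)
    )
    return {name: name in set_to_y for name in wanted}
```

Passing in the contents of the config file used for the build gives a quick yes/no per option, which also makes it easy to confirm that CONFIG_KASAN is really off for the second attempt.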
