On Fri, Jun 26, 2020 at 7:52 PM Steve McIntyre <steve@einval.com> wrote:
On Fri, Jun 26, 2020 at 05:50:00PM +0100, Steve McIntyre wrote:
On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote:
On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@linuxfoundation.org> wrote:
On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
...
Considering I'm running strace build tests to provoke this bug, finding the failure in a commit talking about ptrace changes does look very suspicious...!
Annoyingly, I can't reproduce this on my disparate other machines here, suggesting it's maybe(?) timing related.
Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure prints a warning to dmesg? If so, can you share that warning?
I mean the machine locks hard - X stops updating, the mouse/keyboard stop responding. No pings, etc. When I reboot, there's nothing in the logs.
If you don't have any way to see console output, and you don't have a working serial console setup or such, you may want to try re-running those tests while the kernel is booted with netconsole enabled to log to a different machine over UDP (see https://www.kernel.org/doc/Documentation/networking/netconsole.txt).
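For reference, a rough Python sketch of how the netconsole= boot parameter is put together, following the format described in the netconsole documentation linked above. The helper names (netconsole_param, listen) are made up for illustration, and the addresses, MAC and interface name in the usage note are placeholders, not values from this thread. The listener is only a stand-in for running netcat or similar on the receiving machine.

```python
import socket


def netconsole_param(target_ip, target_mac="", target_port=6666,
                     source_ip="", source_port=6665, device="eth0",
                     extended=False):
    """Build a netconsole= kernel parameter string.

    Format (from the netconsole documentation):
      netconsole=[+][src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
    6665, 6666 and eth0 are the documented defaults; an empty MAC means
    the kernel falls back to the broadcast address.
    """
    prefix = "+" if extended else ""
    return (f"netconsole={prefix}{source_port}@{source_ip}/{device},"
            f"{target_port}@{target_ip}/{target_mac}")


def listen(port=6666, max_packets=None):
    """Print UDP datagrams arriving on `port` (run on the logging machine)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", port))
        received = 0
        while max_packets is None or received < max_packets:
            data, _addr = sock.recvfrom(65535)
            print(data.decode("utf-8", errors="replace"), end="")
            received += 1
```

With placeholder values, netconsole_param("10.0.0.2", target_mac="12:34:56:78:9a:bc", target_port=9353, source_ip="10.0.0.1", source_port=4444, device="eth1") gives the string to append to the kernel command line of the machine under test.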
ACK, will try that now for you.
You may want to try setting the sysctl kernel.sysrq=1, then when the system has locked up, press ALT+PRINT+L (to generate stack traces for all active CPUs from NMI context), and maybe also ALT+PRINT+T and ALT+PRINT+W (to collect more information about active tasks).
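A small Python sketch of the same setup step, assuming the usual /proc/sys/kernel/sysrq path and root privileges; it is equivalent to running "sysctl kernel.sysrq=1". The function name and the SYSRQ_KEYS table are just for illustration. The key combinations themselves still have to be pressed on the keyboard once the machine has locked up.

```python
from pathlib import Path

# What each of the suggested SysRq keys asks the kernel to dump.
SYSRQ_KEYS = {
    "l": "stack backtrace for all active CPUs",
    "t": "list of current tasks and their state",
    "w": "tasks in uninterruptible (blocked) state",
}


def enable_sysrq(path="/proc/sys/kernel/sysrq"):
    """Enable all SysRq functions by writing 1 to the sysctl file.

    Needs root when pointed at the real /proc path; the path is a
    parameter only so the function can be tried against a scratch file.
    """
    Path(path).write_text("1\n")


if __name__ == "__main__":
    enable_sysrq()
    for key, meaning in SYSRQ_KEYS.items():
        print(f"ALT+PRINT+{key.upper()}: {meaning}")
```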
Nod.
(If you share stack traces from these things with us, it would be helpful if you could run them through scripts/decode_stacktrace.sh from the kernel tree first, to add line number information.)
ACK.
Output passed through scripts/decode_stacktrace.sh attached.
Just about to try John's suggestion next.
Okay, so this is some sort of deadlock...
Looking at the NMI backtraces, all the CPUs are blocked on spinlocks:
 - CPU 3 is blocked on current->sighand->siglock, in tty_open_proc_set_tty()
 - CPU 1 is blocked on... I'm not sure which lock, somewhere in do_wait()
 - CPU 2 is blocked on something, somewhere in ptrace_stop()
 - CPU 0 is stuck on a lock in do_exit()
So I think it's probably something like a classic deadlock, or a sleeping-in-atomic issue, or a lock-balancing issue (or memory corruption, that can cause all kinds of weird errors)?
If it really is a classic deadlock, CONFIG_PROVE_LOCKING=y should be able to pinpoint the issue. If it is a sleeping-in-atomic issue, CONFIG_DEBUG_ATOMIC_SLEEP=y should help. If it is memory corruption, CONFIG_KASAN=y should discover it... but that might majorly mess up the timing, so if this really is a race, that might not work.
Maybe flip all of those on, and if it doesn't reproduce anymore, turn off CONFIG_KASAN and try again?
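Before rebuilding, it may be worth double-checking which of those options actually ended up enabled. A minimal Python sketch, assuming the standard .config text format ("CONFIG_FOO=y" for enabled, "# CONFIG_FOO is not set" for disabled); the function name and the sample text are hypothetical.

```python
DEBUG_OPTIONS = (
    "CONFIG_PROVE_LOCKING",
    "CONFIG_DEBUG_ATOMIC_SLEEP",
    "CONFIG_KASAN",
)


def enabled_options(config_text, wanted=DEBUG_OPTIONS):
    """Return {option: True/False} for each wanted option in a .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blanks and comments such as "# CONFIG_KASAN is not set".
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if name in wanted and value == "y":
            enabled.add(name)
    return {name: name in enabled for name in wanted}


if __name__ == "__main__":
    # Hypothetical sample; in practice read the .config of the build tree.
    sample = (
        "CONFIG_PROVE_LOCKING=y\n"
        "CONFIG_DEBUG_ATOMIC_SLEEP=y\n"
        "# CONFIG_KASAN is not set\n"
    )
    for name, is_on in enabled_options(sample).items():
        print(f"{name}: {'on' if is_on else 'off'}")
```

The sample mirrors the second round suggested above, with KASAN turned off and the two locking checks left on.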