On Wed, Sep 5, 2018 at 8:34 AM Guenter Roeck <linux@roeck-us.net> wrote:
On 09/05/2018 02:01 AM, Greg Kroah-Hartman wrote:
[ 9990.754641] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/5:1:155]
[ 9990.762601] RIP: 0010:smp_call_function_many+0x208/0x270
[ 9990.762601] Code: e8 0d d1 77 00 3b 05 cb f0 24 01 0f 83 86 fe ff ff 48 63 d0 49 8b 0c 24 48 03 0c d5 00 f7 11 a7 8b 51 18 83 e2 01 74 0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c7 0f b6 4d d0 4c 89 f2 4c 89 ee 44 89
It's stuck in this loop:
  loop:
        pause
        mov    0x18(%rcx),%edx
        and    $0x1,%edx
        jne    loop
which is csd_lock_wait().
Judging by the offset in smp_call_function_many(), it's the final one (there's two: the other one is part of "csd_lock()"). But that's just a guess.
Anyway, it means that we're waiting for another CPU to finish processing an IPI - either a previous one we sent asynchronously (if it's the earlier csd_lock() case) or the TLB IPI we just sent and are now waiting on to complete.
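To make the shape of that wait concrete, here is a minimal userspace sketch in C11, not the kernel's actual code. It assumes only what the disassembly shows: the waiter polls bit 0 of a flag word and can make progress only once someone else clears it. The names (struct csd_sketch, CSD_FLAG_LOCK, csd_sketch_wait and friends) are hypothetical, and the spin bound exists only so the sketch terminates; the loop in the trace above has no such bound, which is why the watchdog fires.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-call data; only the flag word matters. */
#define CSD_FLAG_LOCK 0x01u

struct csd_sketch {
    atomic_uint flags;
};

/* Sender side: mark the request as in flight before the IPI goes out. */
static void csd_sketch_lock(struct csd_sketch *csd)
{
    assert(csd != NULL);
    atomic_fetch_or_explicit(&csd->flags, CSD_FLAG_LOCK, memory_order_relaxed);
}

/* Receiver side: clear the bit once the requested work has been done. */
static void csd_sketch_unlock(struct csd_sketch *csd)
{
    assert(csd != NULL);
    atomic_fetch_and_explicit(&csd->flags, ~CSD_FLAG_LOCK, memory_order_release);
}

/*
 * Poll bit 0 at most max_spins times.
 * Returns true if the bit was seen clear, false if we gave up.
 * The bound is an assumption of this sketch; an unbounded version
 * spins forever if the other side never clears the bit.
 */
static bool csd_sketch_wait(struct csd_sketch *csd, unsigned long max_spins)
{
    assert(csd != NULL);
    for (unsigned long i = 0; i < max_spins; i++) {
        unsigned int f = atomic_load_explicit(&csd->flags, memory_order_acquire);
        if (!(f & CSD_FLAG_LOCK))
            return true;
        /* The loop in the trace executes "pause" at this point. */
    }
    return false;
}
```

If the receiver never runs csd_sketch_unlock(), csd_sketch_wait() can only give up; nothing the waiter does on its own will change the flag. That is the sense in which the stuck CPU is a victim of whatever the other CPU is (not) doing.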
Not tested, but I see it in v4.17.19 and in v4.18.6-rc2. Turns out it is related to heavy load, not to suspend/resume. At this point I suspect that it may be an AMD/Ryzen-specific problem - it looks like it disappears if I add "kernel.randomize_va_space = 0" to /etc/sysctl.conf. No idea if it is a CPU bug or some AMD-specific code problem. I'll try to analyze it further.
Ouch. Some IPI sending/receiving problem would be very very painful to debug if it's hw related.
Linus