On Tue, Nov 06, 2018 at 01:47:21PM +0100, Henrik Austad wrote:
From: Xunlei Pang <xlpang@redhat.com>
On some of our systems, we notice this error popping up on occasion, completely hanging the system.
[<ffffffc0000ee398>] enqueue_task_dl+0x1f0/0x420
[<ffffffc0000d0f14>] activate_task+0x7c/0x90
[<ffffffc0000edbdc>] push_dl_task+0x164/0x1c8
[<ffffffc0000edc60>] push_dl_tasks+0x20/0x30
[<ffffffc0000cc00c>] __balance_callback+0x44/0x68
[<ffffffc000d2c018>] __schedule+0x6f0/0x728
[<ffffffc000d2c278>] schedule+0x78/0x98
[<ffffffc000d2e76c>] __rt_mutex_slowlock+0x9c/0x108
[<ffffffc000d2e9d0>] rt_mutex_slowlock+0xd8/0x198
[<ffffffc0000f7f28>] rt_mutex_timed_futex_lock+0x30/0x40
[<ffffffc00012c1a8>] futex_lock_pi+0x200/0x3b0
[<ffffffc00012cf84>] do_futex+0x1c4/0x550
The system runs a 4.4 kernel on an arm64 rig. The signature looks suspiciously similar to what Xunlei Pang observed in his crash, and with this fix, my issue goes away (my system has survived approx 1500 reboots and a few nasty tests so far).
Alongside this patch in the tree, there are a few other bits and pieces pertaining to futex, rtmutex and kernel/sched/, but those patches create weird crashes that I have not been able to dissect yet. Once (if) I have been able to figure those out (and test them), I will send them later.
I am sure other users of LTS who also use sched_deadline will run into this issue, so I think it is a good candidate for 4.4-stable. Possibly also for 4.9 and 4.14, but I have not had time to test those versions.
But this patch relies on:
2a1c60299406 ("rtmutex: Deboost before waking up the top waiter")
for pointer stability, but that patch in turn relies on the whole FUTEX_UNLOCK_PI patch set:
$ git log --oneline 499f5aca2cdd5e958b27e2655e7e7f82524f46b1..56222b212e8edb1cf51f5dd73ff645809b082b40
56222b212e8e futex: Drop hb->lock before enqueueing on the rtmutex
bebe5b514345 futex: Futex_unlock_pi() determinism
cfafcd117da0 futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
38d589f2fd08 futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
50809358dd71 futex,rt_mutex: Introduce rt_mutex_init_waiter()
16ffa12d7425 futex: Pull rt_mutex_futex_unlock() out from under hb->lock
73d786bd043e futex: Rework inconsistent rt_mutex/futex_q state
bf92cf3a5100 futex: Cleanup refcounting
734009e96d19 futex: Change locking rules
5293c2efda37 futex,rt_mutex: Provide futex specific rt_mutex API
fffa954fb528 futex: Remove rt_mutex_deadlock_account_*()
1b367ece0d7e futex: Use smp_store_release() in mark_wake_futex()
and possibly some follow-up fixes on that (I have vague memories of that).
As is, just the one patch you propose isn't correct :/
Yes, that was a ginormous amount of work to fix a seemingly simple splat :-(