Hi Naresh!
On Wed 14-07-21 19:22:59, Naresh Kamboju wrote:
My two cents, While running ssuite long running stress testing we have noticed deadlock.
So if you drop that, all works well? I'll go drop that from the queues now.
Let me drop that patch and test it again.
Crash log,
[ 1957.278399] ============================================ [ 1957.283717] WARNING: possible recursive locking detected [ 1957.289031] 5.13.2-rc1 #1 Not tainted [ 1957.292703] -------------------------------------------- [ 1957.298016] kworker/u8:7/236 is trying to acquire lock: [ 1957.303241] ffff8cc203f92c38 (&bfqd->lock){-.-.}-{2:2}, at: bfq_finish_requeue_request+0x55/0x500 [bfq] [ 1957.312643] [ 1957.312643] but task is already holding lock: [ 1957.318467] ffff8cc203f92c38 (&bfqd->lock){-.-.}-{2:2}, at: bfq_insert_requests+0x81/0x1750 [bfq] [ 1957.327334] [ 1957.327334] other info that might help us debug this: [ 1957.333852] Possible unsafe locking scenario: [ 1957.333852] [ 1957.339762] CPU0 [ 1957.342206] ---- [ 1957.344651] lock(&bfqd->lock); [ 1957.347873] lock(&bfqd->lock); [ 1957.351097] [ 1957.351097] *** DEADLOCK *** [ 1957.351097] [ 1957.357008] May be due to missing lock nesting notation [ 1957.357008] [ 1957.363783] 3 locks held by kworker/u8:7/236: [ 1957.368136] #0: ffff8cc2009c5938 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x207/0x5e0 [ 1957.377782] #1: ffff9ba980d57e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x207/0x5e0 [ 1957.388640] #2: ffff8cc203f92c38 (&bfqd->lock){-.-.}-{2:2}, at: bfq_insert_requests+0x81/0x1750 [bfq] [ 1957.397938] [ 1957.397938] stack backtrace: [ 1957.402291] CPU: 1 PID: 236 Comm: kworker/u8:7 Not tainted 5.13.2-rc1 #1 [ 1957.408989] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 1957.416374] Workqueue: writeback wb_workfn (flush-8:0) [ 1957.421513] Call Trace: [ 1957.423966] dump_stack+0x76/0x95 [ 1957.427283] __lock_acquire+0xb70/0x1a50 [ 1957.431203] ? lock_is_held_type+0xa0/0x110 [ 1957.435388] ? bfq_init_rq+0x30e/0x1140 [bfq] [ 1957.439748] lock_acquire+0x258/0x2e0 [ 1957.443413] ? bfq_finish_requeue_request+0x55/0x500 [bfq] [ 1957.448923] ? __lock_acquire+0x4a6/0x1a50 [ 1957.453016] ? __lock_acquire+0x3e0/0x1a50 [ 1957.457107] _raw_spin_lock_irqsave+0x3f/0x60 [ 1957.461466] ? bfq_finish_requeue_request+0x55/0x500 [bfq] [ 1957.466950] bfq_finish_requeue_request+0x55/0x500 [bfq] [ 1957.472256] ? rcu_read_lock_sched_held+0x4f/0x80 [ 1957.476960] blk_mq_free_request+0x3e/0x140 [ 1957.481146] blk_put_request+0xe/0x10 [ 1957.484804] blk_attempt_req_merge+0x1d/0x30 [ 1957.489075] elv_attempt_insert_merge+0x34/0x90 [ 1957.493599] blk_mq_sched_try_insert_merge+0x2c/0x50 [ 1957.498556] bfq_insert_requests+0x8d/0x1750 [bfq] [ 1957.503342] ? find_held_lock+0x35/0xa0 [ 1957.507180] ? writeback_sb_inodes+0x35a/0x550 [ 1957.511618] blk_mq_sched_insert_requests+0xd9/0x2a0 [ 1957.516580] blk_mq_flush_plug_list+0x138/0x270 [ 1957.521110] blk_flush_plug_list+0xd1/0x100 [ 1957.525295] blk_finish_plug+0x2c/0x40 [ 1957.529045] wb_writeback+0x1ab/0x430 [ 1957.532702] ? _raw_spin_unlock_bh+0x30/0x40 [ 1957.536970] wb_workfn+0xcb/0x660 [ 1957.540286] ? wb_workfn+0xcb/0x660 [ 1957.543770] ? lock_acquire+0x258/0x2e0 [ 1957.547600] ? process_one_work+0x207/0x5e0 [ 1957.551778] process_one_work+0x289/0x5e0 [ 1957.555782] ? inode_wait_for_writeback+0x40/0x40 [ 1957.560477] ? process_one_work+0x289/0x5e0 [ 1957.564656] worker_thread+0x3c/0x3f0 [ 1957.568315] ? process_one_work+0x5e0/0x5e0 [ 1957.572500] kthread+0x14c/0x170 [ 1957.575733] ? set_kthread_struct+0x40/0x40 [ 1957.579921] ret_from_fork+0x22/0x30 Waiting for transitory to terminate: 5[0KWaiting for transitory to terminate: 4[0K[ 2106.390977] systemd[1]: systemd-resolved.service: Watchdog timeout (limit 3min)! [ 2106.398454] systemd[1]: systemd-resolved.service: Killing process 349 (systemd-resolve) with signal SIGABRT.
Thanks for testing and the report! So this looks like you didn't have commit fd2ef39cc9a ("blk: Fix lock inversion between ioc lock and bfqd lock") applied. As I was looking into BFQ code indeed commit a921c655f2 ("bfq: Remove merged request already in bfq_requests_merged()") on its own would introduce this deadlock which then gets fixed up by commit fd2ef39cc9a (I didn't realize this when writing the series). So we either need to apply both commits or none of them. Do you see some problems with both commits applied?
Honza