blk_mq_freeze_queue() never terminates if one or more bios are on the plug list and if the block device driver defines a .submit_bio() method. This is the case for device mapper drivers. The deadlock happens because blk_mq_freeze_queue() waits for q_usage_counter to drop to zero, because a queue reference is held by bios on the plug list and because the __bio_queue_enter() call in __submit_bio() waits for the queue to be unfrozen.
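To make the wait cycle described above concrete, here is a minimal userspace sketch in C. It is a toy model and not kernel code: struct toy_queue, toy_enter(), toy_freeze() and toy_deadlocked() are hypothetical names that only mimic the roles of q_usage_counter, bio_queue_enter() and blk_mq_freeze_queue(), and the functions return false at the points where the kernel would sleep.

```c
#include <stdbool.h>

/* Toy stand-in for a request queue; NOT the kernel's struct request_queue. */
struct toy_queue {
	int usage;	/* models q_usage_counter */
	bool frozen;	/* models a freeze that has been started */
};

/* Models bio_queue_enter(): returns false where the kernel would sleep. */
static bool toy_enter(struct toy_queue *q)
{
	if (q->frozen)
		return false;	/* caller would wait for the queue to unfreeze */
	q->usage++;
	return true;
}

/* Models blk_mq_freeze_queue(): returns true only once usage has drained. */
static bool toy_freeze(struct toy_queue *q)
{
	q->frozen = true;
	return q->usage == 0;
}

/* Returns true if neither side of the scenario can make progress. */
bool toy_deadlocked(void)
{
	struct toy_queue q = { 0, false };
	bool freeze_done, submit_done;

	/* A bio is added to the plug list while holding a queue reference. */
	if (!toy_enter(&q))
		return false;

	/* sysfs write path: the freeze starts but cannot finish (usage == 1). */
	freeze_done = toy_freeze(&q);

	/* Plug work: the submit path tries to enter the queue again. */
	submit_done = toy_enter(&q);

	/*
	 * The plug reference is only dropped after submission, which never
	 * completes, so the freeze never sees usage reach zero.
	 */
	return !freeze_done && !submit_done && q.usage == 1;
}
```

Under these assumptions toy_deadlocked() returns true: the freeze waits for the plugged bio's reference, and the submission that would release it waits for the unfreeze.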
This patch fixes the following deadlock:
Workqueue: dm-51_zwplugs blk_zone_wplug_bio_work
Call trace:
 __schedule+0xb08/0x1160
 schedule+0x48/0xc8
 __bio_queue_enter+0xcc/0x1d0
 __submit_bio+0x100/0x1b0
 submit_bio_noacct_nocheck+0x230/0x49c
 blk_zone_wplug_bio_work+0x168/0x250
 process_one_work+0x26c/0x65c
 worker_thread+0x33c/0x498
 kthread+0x110/0x134
 ret_from_fork+0x10/0x20

Call trace:
 __switch_to+0x230/0x410
 __schedule+0xb08/0x1160
 schedule+0x48/0xc8
 blk_mq_freeze_queue_wait+0x78/0xb8
 blk_mq_freeze_queue+0x90/0xa4
 queue_attr_store+0x7c/0xf0
 sysfs_kf_write+0x98/0xc8
 kernfs_fop_write_iter+0x12c/0x1d4
 vfs_write+0x340/0x3ac
 ksys_write+0x78/0xe8

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Yu Kuai <yukuai1@huaweicloud.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: stable@vger.kernel.org
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-core.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4b728fa1c138..e961896a8717 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -621,6 +621,13 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 	return BLK_STS_OK;
 }
 
+/*
+ * Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been
+ * set because this causes blk_mq_freeze_queue() to deadlock if
+ * blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios
+ * on the plug list is not necessary since a q_usage_counter reference is held
+ * while a bio is on the plug list.
+ */
 static void __submit_bio(struct bio *bio)
 {
 	/* If plug is not used, add new plug here to cache nsecs time. */
@@ -633,7 +640,8 @@ static void __submit_bio(struct bio *bio)
 
 	if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
 		blk_mq_submit_bio(bio);
-	} else if (likely(bio_queue_enter(bio) == 0)) {
+	} else if (likely(bio_zone_write_plugging(bio) ||
+			  bio_queue_enter(bio) == 0)) {
 		struct gendisk *disk = bio->bi_bdev->bd_disk;
 
 		if ((bio->bi_opf & REQ_POLLED) &&
@@ -643,7 +651,8 @@ static void __submit_bio(struct bio *bio)
 		} else {
 			disk->fops->submit_bio(bio);
 		}
-		blk_queue_exit(disk->queue);
+		if (!bio_zone_write_plugging(bio))
+			blk_queue_exit(disk->queue);
 	}
 
 	blk_finish_plug(&plug);

On Wed, May 14, 2025 at 01:29:37PM -0700, Bart Van Assche wrote:
> +/*
> + * Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been
> + * set because this causes blk_mq_freeze_queue() to deadlock if
> + * blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios
> + * on the plug list is not necessary since a q_usage_counter reference is held
> + * while a bio is on the plug list.
> + */

How do we end up with BIO_ZONE_WRITE_PLUGGING set here? If that flag was set in a stacking driver we need to clear it before resubmitting the bio I think.
Can you provide a null_blk based reproducer for your testcase to see what happens here?
Either way we can't just simply skip taking q_usage_counter.
On 5/15/25 9:51 PM, Christoph Hellwig wrote:
> On Wed, May 14, 2025 at 01:29:37PM -0700, Bart Van Assche wrote:
>> +/*
>> + * Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been
>> + * set because this causes blk_mq_freeze_queue() to deadlock if
>> + * blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios
>> + * on the plug list is not necessary since a q_usage_counter reference is held
>> + * while a bio is on the plug list.
>> + */
>
> How do we end up with BIO_ZONE_WRITE_PLUGGING set here? If that flag was set in a stacking driver we need to clear it before resubmitting the bio I think.

Hmm ... my understanding is that BIO_ZONE_WRITE_PLUGGING, if set, must remain set until the bio has completed. Here is an example of code in the bio completion path that tests this flag:
static inline void blk_zone_bio_endio(struct bio *bio)
{
	/*
	 * For write BIOs to zoned devices, signal the completion of the BIO so
	 * that the next write BIO can be submitted by zone write plugging.
	 */
	if (bio_zone_write_plugging(bio))
		blk_zone_write_plug_bio_endio(bio);
}

The bio_zone_write_plugging() definition is as follows:
static inline bool bio_zone_write_plugging(struct bio *bio)
{
	return bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
}

> Can you provide a null_blk based reproducer for your testcase to see what happens here?

My attempts so far to build a reproducer for the blktests framework have been unsuccessful. This test script runs fine in the VM that I use for kernel testing:
https://github.com/bvanassche/blktests/blob/master/tests/zbd/013
> Either way we can't just simply skip taking q_usage_counter.

Why not? If BIO_ZONE_WRITE_PLUGGING is set, that guarantees that a queue reference is held and will remain held across the entire disk->fops->submit_bio() invocation, doesn't it? From blk_zone_wplug_bio_work(), below the submit_bio_noacct_nocheck(bio) call:
	if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO))
		blk_queue_exit(bdev->bd_disk->queue);

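The accounting argument above can be illustrated with a small userspace sketch. It is a toy model under stated assumptions and not the kernel implementation: every sim_* name is hypothetical, sim_submit_patched() only mimics the patched branch of __submit_bio(), and sim_plug_work_during_freeze() only mimics the order of calls in blk_zone_wplug_bio_work(). It models the claim that the reference taken at plug time covers the submit_bio() call; it says nothing about the stacking-driver concern raised earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; NOT the kernel's struct request_queue or struct bio. */
struct sim_queue {
	int usage;	/* models q_usage_counter */
	bool frozen;	/* models a freeze that has been started */
};

struct sim_bio {
	bool zone_write_plugging;	/* models BIO_ZONE_WRITE_PLUGGING */
};

/* Models bio_queue_enter(): returns false where the kernel would sleep. */
static bool sim_enter(struct sim_queue *q)
{
	if (q->frozen)
		return false;
	q->usage++;
	return true;
}

/* Models blk_queue_exit(). */
static void sim_exit(struct sim_queue *q)
{
	assert(q->usage > 0);
	q->usage--;
}

/* Models the patched submit path for a driver with a submit_bio method. */
static bool sim_submit_patched(struct sim_queue *q, const struct sim_bio *bio)
{
	/* Plugged bios skip the enter step; they already hold a reference. */
	if (!bio->zone_write_plugging && !sim_enter(q))
		return false;

	/* The driver's submit_bio method would run here, under a reference. */
	assert(q->usage > 0);

	if (!bio->zone_write_plugging)
		sim_exit(q);
	return true;
}

/* Returns the final usage count, or -1 if a step would have slept. */
int sim_plug_work_during_freeze(void)
{
	struct sim_queue q = { 0, false };
	struct sim_bio bio = { true };

	if (!sim_enter(&q))	/* reference taken when the bio was plugged */
		return -1;
	q.frozen = true;	/* a freeze starts while the bio is plugged */

	if (!sim_submit_patched(&q, &bio))
		return -1;
	sim_exit(&q);		/* the plug work drops the plug reference */

	return q.usage;		/* zero means the freeze could now complete */
}
```

In this model sim_plug_work_during_freeze() returns 0: the submission proceeds despite the pending freeze, and the count reaches zero only after the driver call has returned.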
Thanks,
Bart.
On Mon, May 19, 2025 at 03:22:42PM -0700, Bart Van Assche wrote:
> Hmm ... my understanding is that BIO_ZONE_WRITE_PLUGGING, if set, must remain set until the bio has completed. Here is an example of code in the bio completion path that tests this flag:

True. Well, we'll need some other flag to tell lower levels to ignore the flag because it is owned by the higher level.