blk_mq_freeze_queue() never terminates if one or more bios are on the plug list and if the block device driver defines a .submit_bio() method. This is the case for device mapper drivers. The deadlock happens because blk_mq_freeze_queue() waits for q_usage_counter to drop to zero, because a queue reference is held by bios on the plug list and because the __bio_queue_enter() call in __submit_bio() waits for the queue to be unfrozen.
This patch fixes the following deadlock:
Workqueue: dm-51_zwplugs blk_zone_wplug_bio_work Call trace: __schedule+0xb08/0x1160 schedule+0x48/0xc8 __bio_queue_enter+0xcc/0x1d0 __submit_bio+0x100/0x1b0 submit_bio_noacct_nocheck+0x230/0x49c blk_zone_wplug_bio_work+0x168/0x250 process_one_work+0x26c/0x65c worker_thread+0x33c/0x498 kthread+0x110/0x134 ret_from_fork+0x10/0x20
Call trace: __switch_to+0x230/0x410 __schedule+0xb08/0x1160 schedule+0x48/0xc8 blk_mq_freeze_queue_wait+0x78/0xb8 blk_mq_freeze_queue+0x90/0xa4 queue_attr_store+0x7c/0xf0 sysfs_kf_write+0x98/0xc8 kernfs_fop_write_iter+0x12c/0x1d4 vfs_write+0x340/0x3ac ksys_write+0x78/0xe8
Cc: Christoph Hellwig hch@lst.de Cc: Damien Le Moal dlemoal@kernel.org Cc: Yu Kuai yukuai1@huaweicloud.com Cc: Ming Lei ming.lei@redhat.com Cc: stable@vger.kernel.org Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Bart Van Assche bvanassche@acm.org ---
Changes compared to v1: fixed a race condition. Call bio_zone_write_plugging() only before submitting the bio and not after it has been submitted.
block/blk-core.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index b862c66018f2..713fb3865260 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -621,6 +621,13 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q, return BLK_STS_OK; }
+/* + * Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been + * set because this causes blk_mq_freeze_queue() to deadlock if + * blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios + * on the plug list is not necessary since a q_usage_counter reference is held + * while a bio is on the plug list. + */ static void __submit_bio(struct bio *bio) { /* If plug is not used, add new plug here to cache nsecs time. */ @@ -633,8 +640,12 @@ static void __submit_bio(struct bio *bio)
if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { blk_mq_submit_bio(bio); - } else if (likely(bio_queue_enter(bio) == 0)) { + } else { struct gendisk *disk = bio->bi_bdev->bd_disk; + bool zwp = bio_zone_write_plugging(bio); + + if (unlikely(!zwp && bio_queue_enter(bio) != 0)) + goto finish_plug; if ((bio->bi_opf & REQ_POLLED) && !(disk->queue->limits.features & BLK_FEAT_POLL)) { @@ -643,9 +654,12 @@ static void __submit_bio(struct bio *bio) } else { disk->fops->submit_bio(bio); } - blk_queue_exit(disk->queue); + + if (!zwp) + blk_queue_exit(disk->queue); }
+finish_plug: blk_finish_plug(&plug); }
On 5/22/25 11:14 AM, Bart Van Assche wrote:
blk_mq_freeze_queue() never terminates if one or more bios are on the plug list and if the block device driver defines a .submit_bio() method. This is the case for device mapper drivers. The deadlock happens because blk_mq_freeze_queue() waits for q_usage_counter to drop to zero, because a queue reference is held by bios on the plug list and because the __bio_queue_enter() call in __submit_bio() waits for the queue to be unfrozen.
This patch fixes the following deadlock:
Workqueue: dm-51_zwplugs blk_zone_wplug_bio_work Call trace: __schedule+0xb08/0x1160 schedule+0x48/0xc8 __bio_queue_enter+0xcc/0x1d0 __submit_bio+0x100/0x1b0 submit_bio_noacct_nocheck+0x230/0x49c blk_zone_wplug_bio_work+0x168/0x250 process_one_work+0x26c/0x65c worker_thread+0x33c/0x498 kthread+0x110/0x134 ret_from_fork+0x10/0x20
Call trace: __switch_to+0x230/0x410 __schedule+0xb08/0x1160 schedule+0x48/0xc8 blk_mq_freeze_queue_wait+0x78/0xb8 blk_mq_freeze_queue+0x90/0xa4 queue_attr_store+0x7c/0xf0 sysfs_kf_write+0x98/0xc8 kernfs_fop_write_iter+0x12c/0x1d4 vfs_write+0x340/0x3ac ksys_write+0x78/0xe8
Cc: Christoph Hellwig hch@lst.de Cc: Damien Le Moal dlemoal@kernel.org Cc: Yu Kuai yukuai1@huaweicloud.com Cc: Ming Lei ming.lei@redhat.com Cc: stable@vger.kernel.org Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Bart Van Assche bvanassche@acm.org
Changes compared to v1: fixed a race condition. Call bio_zone_write_plugging() only before submitting the bio and not after it has been submitted.
block/blk-core.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index b862c66018f2..713fb3865260 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -621,6 +621,13 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q, return BLK_STS_OK; } +/*
- Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been
- set because this causes blk_mq_freeze_queue() to deadlock if
- blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios
- on the plug list is not necessary since a q_usage_counter reference is held
- while a bio is on the plug list.
- */
static void __submit_bio(struct bio *bio) { /* If plug is not used, add new plug here to cache nsecs time. */ @@ -633,8 +640,12 @@ static void __submit_bio(struct bio *bio) if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { blk_mq_submit_bio(bio);
- } else if (likely(bio_queue_enter(bio) == 0)) {
- } else { struct gendisk *disk = bio->bi_bdev->bd_disk;
bool zwp = bio_zone_write_plugging(bio);
if (unlikely(!zwp && bio_queue_enter(bio) != 0))
if ((bio->bi_opf & REQ_POLLED) && !(disk->queue->limits.features & BLK_FEAT_POLL)) {goto finish_plug;
@@ -643,9 +654,12 @@ static void __submit_bio(struct bio *bio) } else { disk->fops->submit_bio(bio); }
blk_queue_exit(disk->queue);
if (!zwp)
}blk_queue_exit(disk->queue);
This is pretty ugly, and I honestly absolutely hate how there's quite a bit of zoned_whatever sprinkling throughout the core code. What's the reason for not unplugging here, unaligned writes? Because you should presumable have the exact same issues on non-zoned devices if they have IO stuck in a plug (and doesn't get unplugged) while someone is waiting on a freeze.
A somewhat similar case was solved for IOPOLL and queue entering. That would be another thing to look at. Maybe a live enter could work if the plug itself pins it?
On 5/22/25 10:38 AM, Jens Axboe wrote:
On 5/22/25 11:14 AM, Bart Van Assche wrote:
static void __submit_bio(struct bio *bio) { /* If plug is not used, add new plug here to cache nsecs time. */ @@ -633,8 +640,12 @@ static void __submit_bio(struct bio *bio) if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { blk_mq_submit_bio(bio);
- } else if (likely(bio_queue_enter(bio) == 0)) {
- } else { struct gendisk *disk = bio->bi_bdev->bd_disk;
bool zwp = bio_zone_write_plugging(bio);
if (unlikely(!zwp && bio_queue_enter(bio) != 0))
if ((bio->bi_opf & REQ_POLLED) && !(disk->queue->limits.features & BLK_FEAT_POLL)) {goto finish_plug;
@@ -643,9 +654,12 @@ static void __submit_bio(struct bio *bio) } else { disk->fops->submit_bio(bio); }
blk_queue_exit(disk->queue);
if (!zwp)
}blk_queue_exit(disk->queue);
This is pretty ugly, and I honestly absolutely hate how there's quite a bit of zoned_whatever sprinkling throughout the core code. What's the reason for not unplugging here, unaligned writes? Because you should presumable have the exact same issues on non-zoned devices if they have IO stuck in a plug (and doesn't get unplugged) while someone is waiting on a freeze.
A somewhat similar case was solved for IOPOLL and queue entering. That would be another thing to look at. Maybe a live enter could work if the plug itself pins it?
Hi Jens,
q->q_usage_counter is not increased for bios on current->plug_list. q->q_usage_counter is increased before a bio is added to the zoned pluglist. So these two cases are different.
I think it is important to hold a q->q_usage_counter reference for bios on the zoned plug list because bios are added to that list after bio splitting happened. Hence, request queue limits must not change while any bio is on the zoned plug list.
Damien, do you think it would be possible to add a bio to the zoned plug list before it is split and not to hold q->q_usage_counter for bios on the zoned plug list?
Thanks,
Bart.
On Thu, May 22, 2025 at 11:32:58AM -0700, Bart Van Assche wrote:
On 5/22/25 10:38 AM, Jens Axboe wrote:
On 5/22/25 11:14 AM, Bart Van Assche wrote:
static void __submit_bio(struct bio *bio) { /* If plug is not used, add new plug here to cache nsecs time. */ @@ -633,8 +640,12 @@ static void __submit_bio(struct bio *bio) if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { blk_mq_submit_bio(bio);
- } else if (likely(bio_queue_enter(bio) == 0)) {
- } else { struct gendisk *disk = bio->bi_bdev->bd_disk;
bool zwp = bio_zone_write_plugging(bio);
if (unlikely(!zwp && bio_queue_enter(bio) != 0))
if ((bio->bi_opf & REQ_POLLED) && !(disk->queue->limits.features & BLK_FEAT_POLL)) {goto finish_plug;
@@ -643,9 +654,12 @@ static void __submit_bio(struct bio *bio) } else { disk->fops->submit_bio(bio); }
blk_queue_exit(disk->queue);
if (!zwp)
}blk_queue_exit(disk->queue);
This is pretty ugly, and I honestly absolutely hate how there's quite a bit of zoned_whatever sprinkling throughout the core code. What's the reason for not unplugging here, unaligned writes? Because you should presumable have the exact same issues on non-zoned devices if they have IO stuck in a plug (and doesn't get unplugged) while someone is waiting on a freeze.
A somewhat similar case was solved for IOPOLL and queue entering. That would be another thing to look at. Maybe a live enter could work if the plug itself pins it?
Hi Jens,
q->q_usage_counter is not increased for bios on current->plug_list. q->q_usage_counter is increased before a bio is added to the zoned pluglist. So these two cases are different.
I think it is important to hold a q->q_usage_counter reference for bios on the zoned plug list because bios are added to that list after bio splitting happened. Hence, request queue limits must not change while any bio is on the zoned plug list.
Hi Bart,
Can you share why request queue limit can't be changed after bio is added to zoned plug list?
If it is really true, we may have to drain zoned plug list when freezing queue.
Thanks, Ming
On 5/23/25 04:10, Ming Lei wrote:
On Thu, May 22, 2025 at 11:32:58AM -0700, Bart Van Assche wrote:
On 5/22/25 10:38 AM, Jens Axboe wrote:
On 5/22/25 11:14 AM, Bart Van Assche wrote:
static void __submit_bio(struct bio *bio) { /* If plug is not used, add new plug here to cache nsecs time. */ @@ -633,8 +640,12 @@ static void __submit_bio(struct bio *bio) if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { blk_mq_submit_bio(bio);
- } else if (likely(bio_queue_enter(bio) == 0)) {
- } else { struct gendisk *disk = bio->bi_bdev->bd_disk;
bool zwp = bio_zone_write_plugging(bio);
if (unlikely(!zwp && bio_queue_enter(bio) != 0))
if ((bio->bi_opf & REQ_POLLED) && !(disk->queue->limits.features & BLK_FEAT_POLL)) {goto finish_plug;
@@ -643,9 +654,12 @@ static void __submit_bio(struct bio *bio) } else { disk->fops->submit_bio(bio); }
blk_queue_exit(disk->queue);
if (!zwp)
}blk_queue_exit(disk->queue);
This is pretty ugly, and I honestly absolutely hate how there's quite a bit of zoned_whatever sprinkling throughout the core code. What's the reason for not unplugging here, unaligned writes? Because you should presumable have the exact same issues on non-zoned devices if they have IO stuck in a plug (and doesn't get unplugged) while someone is waiting on a freeze.
A somewhat similar case was solved for IOPOLL and queue entering. That would be another thing to look at. Maybe a live enter could work if the plug itself pins it?
Hi Jens,
q->q_usage_counter is not increased for bios on current->plug_list. q->q_usage_counter is increased before a bio is added to the zoned pluglist. So these two cases are different.
I think it is important to hold a q->q_usage_counter reference for bios on the zoned plug list because bios are added to that list after bio splitting happened. Hence, request queue limits must not change while any bio is on the zoned plug list.
Hi Bart,
Can you share why request queue limit can't be changed after bio is added to zoned plug list?
Because BIOs on a zone write plug list have already been split according to the current request queue limits. So until these BIOs are executed, we cannot change the limits as that potentially would require again splitting and that would completely messup the zone write pointer tracking of zone write plugging.
If it is really true, we may have to drain zoned plug list when freezing queue.
Yes, that is what we need. But we currently endup deadlocking on a queue freeze because the internal issuing of plugged zone write BIOs uses submit_bio_noacct_nocheck() which calls __submit_bio() and that function will call blk_queue_enter() again for DM device BIOs. We need to somehow cleanly avoid calling that queue enter without sprinkling around lots of zone stuff in this core block layer submission path.
Thanks, Ming
On 5/22/25 20:32, Bart Van Assche wrote:
On 5/22/25 10:38 AM, Jens Axboe wrote:
On 5/22/25 11:14 AM, Bart Van Assche wrote:
static void __submit_bio(struct bio *bio) { /* If plug is not used, add new plug here to cache nsecs time. */ @@ -633,8 +640,12 @@ static void __submit_bio(struct bio *bio) if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { blk_mq_submit_bio(bio);
- } else if (likely(bio_queue_enter(bio) == 0)) {
- } else { struct gendisk *disk = bio->bi_bdev->bd_disk;
bool zwp = bio_zone_write_plugging(bio);
if (unlikely(!zwp && bio_queue_enter(bio) != 0))
if ((bio->bi_opf & REQ_POLLED) && !(disk->queue->limits.features & BLK_FEAT_POLL)) {goto finish_plug;
@@ -643,9 +654,12 @@ static void __submit_bio(struct bio *bio) } else { disk->fops->submit_bio(bio); }
blk_queue_exit(disk->queue);
if (!zwp)
}blk_queue_exit(disk->queue);
This is pretty ugly, and I honestly absolutely hate how there's quite a bit of zoned_whatever sprinkling throughout the core code. What's the reason for not unplugging here, unaligned writes? Because you should presumable have the exact same issues on non-zoned devices if they have IO stuck in a plug (and doesn't get unplugged) while someone is waiting on a freeze.
A somewhat similar case was solved for IOPOLL and queue entering. That would be another thing to look at. Maybe a live enter could work if the plug itself pins it?
Hi Jens,
q->q_usage_counter is not increased for bios on current->plug_list. q->q_usage_counter is increased before a bio is added to the zoned pluglist. So these two cases are different.
I think it is important to hold a q->q_usage_counter reference for bios on the zoned plug list because bios are added to that list after bio splitting happened. Hence, request queue limits must not change while any bio is on the zoned plug list.
Damien, do you think it would be possible to add a bio to the zoned plug list before it is split and not to hold q->q_usage_counter for bios on the zoned plug list?
I do not think this is the right thing to do as that will completely mess up the queue usage counter value with regard to submit_bio() calls. That value is incremented when a BIO is submitted, and that should stay the same for zoned write BIOs that endup in a zone write plug instead of going straigth to the device.
The issue comes from the fact that blk_zone_wplug_bio_work() calls submit_bio_noacct_nocheck() which eventually calls __submit_bio() and that function calls blk_queue_enter() for DM devices. We still need to preserve that for the first submission of the BIO but should not/do not need to do the blk_queue_enter() call again when a BIO is submitted via the zone write plug work. I am not sure yet how to best deal with that.
I am traveling today and will not have time to cook something. Will take a look over the weekend.
On 5/22/25 19:38, Jens Axboe wrote:
On 5/22/25 11:14 AM, Bart Van Assche wrote:
blk_mq_freeze_queue() never terminates if one or more bios are on the plug list and if the block device driver defines a .submit_bio() method. This is the case for device mapper drivers. The deadlock happens because blk_mq_freeze_queue() waits for q_usage_counter to drop to zero, because a queue reference is held by bios on the plug list and because the __bio_queue_enter() call in __submit_bio() waits for the queue to be unfrozen.
This patch fixes the following deadlock:
Workqueue: dm-51_zwplugs blk_zone_wplug_bio_work Call trace: __schedule+0xb08/0x1160 schedule+0x48/0xc8 __bio_queue_enter+0xcc/0x1d0 __submit_bio+0x100/0x1b0 submit_bio_noacct_nocheck+0x230/0x49c blk_zone_wplug_bio_work+0x168/0x250 process_one_work+0x26c/0x65c worker_thread+0x33c/0x498 kthread+0x110/0x134 ret_from_fork+0x10/0x20
Call trace: __switch_to+0x230/0x410 __schedule+0xb08/0x1160 schedule+0x48/0xc8 blk_mq_freeze_queue_wait+0x78/0xb8 blk_mq_freeze_queue+0x90/0xa4 queue_attr_store+0x7c/0xf0 sysfs_kf_write+0x98/0xc8 kernfs_fop_write_iter+0x12c/0x1d4 vfs_write+0x340/0x3ac ksys_write+0x78/0xe8
Cc: Christoph Hellwig hch@lst.de Cc: Damien Le Moal dlemoal@kernel.org Cc: Yu Kuai yukuai1@huaweicloud.com Cc: Ming Lei ming.lei@redhat.com Cc: stable@vger.kernel.org Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Bart Van Assche bvanassche@acm.org
Changes compared to v1: fixed a race condition. Call bio_zone_write_plugging() only before submitting the bio and not after it has been submitted.
block/blk-core.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index b862c66018f2..713fb3865260 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -621,6 +621,13 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q, return BLK_STS_OK; } +/*
- Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been
- set because this causes blk_mq_freeze_queue() to deadlock if
- blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios
- on the plug list is not necessary since a q_usage_counter reference is held
- while a bio is on the plug list.
- */
static void __submit_bio(struct bio *bio) { /* If plug is not used, add new plug here to cache nsecs time. */ @@ -633,8 +640,12 @@ static void __submit_bio(struct bio *bio) if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { blk_mq_submit_bio(bio);
- } else if (likely(bio_queue_enter(bio) == 0)) {
- } else { struct gendisk *disk = bio->bi_bdev->bd_disk;
bool zwp = bio_zone_write_plugging(bio);
if (unlikely(!zwp && bio_queue_enter(bio) != 0))
if ((bio->bi_opf & REQ_POLLED) && !(disk->queue->limits.features & BLK_FEAT_POLL)) {goto finish_plug;
@@ -643,9 +654,12 @@ static void __submit_bio(struct bio *bio) } else { disk->fops->submit_bio(bio); }
blk_queue_exit(disk->queue);
if (!zwp)
}blk_queue_exit(disk->queue);
This is pretty ugly, and I honestly absolutely hate how there's quite a bit of zoned_whatever sprinkling throughout the core code. What's the reason for not unplugging here, unaligned writes? Because you should presumable have the exact same issues on non-zoned devices if they have IO stuck in a plug (and doesn't get unplugged) while someone is waiting on a freeze.
A somewhat similar case was solved for IOPOLL and queue entering. That would be another thing to look at. Maybe a live enter could work if the plug itself pins it?
What about this patch, completely untested...
diff --git a/block/blk-core.c b/block/blk-core.c index e8cc270a453f..3d2dec088a54 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -621,6 +621,32 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q, return BLK_STS_OK; }
+static inline void disk_submit_bio(struct bio *bio) +{ + struct gendisk *disk = bio->bi_bdev->bd_disk; + bool need_queue_enter = !bio_zone_write_plugging(bio); + + /* + * BIOs that are issued from a zone write plug already hold a reference + * on the device queue usage counter. + */ + if (need_queue_enter) { + if (unlikely(bio_queue_enter(bio))) + return; + } + + if ((bio->bi_opf & REQ_POLLED) && + !(disk->queue->limits.features & BLK_FEAT_POLL)) { + bio->bi_status = BLK_STS_NOTSUPP; + bio_endio(bio); + } else { + disk->fops->submit_bio(bio); + } + + if (need_queue_enter) + blk_queue_exit(disk->queue); +} + static void __submit_bio(struct bio *bio) { /* If plug is not used, add new plug here to cache nsecs time. */ @@ -631,20 +657,10 @@ static void __submit_bio(struct bio *bio)
blk_start_plug(&plug);
- if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { + if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) blk_mq_submit_bio(bio); - } else if (likely(bio_queue_enter(bio) == 0)) { - struct gendisk *disk = bio->bi_bdev->bd_disk; - - if ((bio->bi_opf & REQ_POLLED) && - !(disk->queue->limits.features & BLK_FEAT_POLL)) { - bio->bi_status = BLK_STS_NOTSUPP; - bio_endio(bio); - } else { - disk->fops->submit_bio(bio); - } - blk_queue_exit(disk->queue); - } + else + disk_submit_bio(bio)
blk_finish_plug(&plug); }
On 5/23/25 10:10, Damien Le Moal wrote:
What about this patch, completely untested...
diff --git a/block/blk-core.c b/block/blk-core.c index e8cc270a453f..3d2dec088a54 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -621,6 +621,32 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q, return BLK_STS_OK; }
+static inline void disk_submit_bio(struct bio *bio) +{
struct gendisk *disk = bio->bi_bdev->bd_disk;
bool need_queue_enter = !bio_zone_write_plugging(bio);
/*
* BIOs that are issued from a zone write plug already hold a reference
* on the device queue usage counter.
*/
if (need_queue_enter) {
if (unlikely(bio_queue_enter(bio)))
return;
}
if ((bio->bi_opf & REQ_POLLED) &&
!(disk->queue->limits.features & BLK_FEAT_POLL)) {
bio->bi_status = BLK_STS_NOTSUPP;
bio_endio(bio);
} else {
disk->fops->submit_bio(bio);
}
if (need_queue_enter)
blk_queue_exit(disk->queue);
+}
static void __submit_bio(struct bio *bio) { /* If plug is not used, add new plug here to cache nsecs time. */ @@ -631,20 +657,10 @@ static void __submit_bio(struct bio *bio)
blk_start_plug(&plug);
if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) blk_mq_submit_bio(bio);
} else if (likely(bio_queue_enter(bio) == 0)) {
struct gendisk *disk = bio->bi_bdev->bd_disk;
if ((bio->bi_opf & REQ_POLLED) &&
!(disk->queue->limits.features & BLK_FEAT_POLL)) {
bio->bi_status = BLK_STS_NOTSUPP;
bio_endio(bio);
} else {
disk->fops->submit_bio(bio);
}
blk_queue_exit(disk->queue);
}
else
disk_submit_bio(bio)
Missing ";" here.
blk_finish_plug(&plug);
}
Looking into this deeper, the regular mq path actually likely has the same issue since it will call blk_queue_enter() if we do not have a cached request. So this solution is only partial and not good enough. We need something else.
On Fri, May 23, 2025 at 10:20:35AM +0200, Damien Le Moal wrote:
Looking into this deeper, the regular mq path actually likely has the same issue since it will call blk_queue_enter() if we do not have a cached request. So this solution is only partial and not good enough. We need something else.
The bio_zone_write_plugging case in blk_mq_submit_bio always reused the existing queue reference.
On Fri, May 23, 2025 at 10:10:30AM +0200, Damien Le Moal wrote:
This is pretty ugly, and I honestly absolutely hate how there's quite a bit of zoned_whatever sprinkling throughout the core code. What's the reason for not unplugging here, unaligned writes? Because you should presumable have the exact same issues on non-zoned devices if they have IO stuck in a plug (and doesn't get unplugged) while someone is waiting on a freeze.
A somewhat similar case was solved for IOPOLL and queue entering. That would be another thing to look at. Maybe a live enter could work if the plug itself pins it?
What about this patch, completely untested...
This still looks extremely backwards as it messed up common core code for something that it shouldn't. I'd still love to see an actual reproducer ahead of me, but I think the problem is that blk_zone_wplug_bio_work calls into the still fairly high-level submit_bio_noacct_nocheck for bios that already went through the submit_bio machinery.
Something like this completely untested patch:
diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 8f15d1aa6eb8..6841af8a989c 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -1306,16 +1306,18 @@ static void blk_zone_wplug_bio_work(struct work_struct *work) spin_unlock_irqrestore(&zwplug->lock, flags);
bdev = bio->bi_bdev; - submit_bio_noacct_nocheck(bio); - /* * blk-mq devices will reuse the extra reference on the request queue * usage counter we took when the BIO was plugged, but the submission * path for BIO-based devices will not do that. So drop this extra * reference here. */ - if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) + if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) { + bdev->bd_disk->fops->submit_bio(bio); blk_queue_exit(bdev->bd_disk->queue); + } else { + blk_mq_submit_bio(bio); + }
put_zwplug: /* Drop the reference we took in disk_zone_wplug_schedule_bio_work(). */
On 5/23/25 10:20, Christoph Hellwig wrote:
On Fri, May 23, 2025 at 10:10:30AM +0200, Damien Le Moal wrote:
This is pretty ugly, and I honestly absolutely hate how there's quite a bit of zoned_whatever sprinkling throughout the core code. What's the reason for not unplugging here, unaligned writes? Because you should presumable have the exact same issues on non-zoned devices if they have IO stuck in a plug (and doesn't get unplugged) while someone is waiting on a freeze.
A somewhat similar case was solved for IOPOLL and queue entering. That would be another thing to look at. Maybe a live enter could work if the plug itself pins it?
What about this patch, completely untested...
This still looks extremely backwards as it messed up common core code for something that it shouldn't. I'd still love to see an actual reproducer ahead of me, but I think the problem is that blk_zone_wplug_bio_work calls into the still fairly high-level submit_bio_noacct_nocheck for bios that already went through the submit_bio machinery.
Something like this completely untested patch:
diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 8f15d1aa6eb8..6841af8a989c 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -1306,16 +1306,18 @@ static void blk_zone_wplug_bio_work(struct work_struct *work) spin_unlock_irqrestore(&zwplug->lock, flags); bdev = bio->bi_bdev;
- submit_bio_noacct_nocheck(bio);
- /*
*/
- blk-mq devices will reuse the extra reference on the request queue
- usage counter we took when the BIO was plugged, but the submission
- path for BIO-based devices will not do that. So drop this extra
- reference here.
- if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO))
- if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) {
blk_queue_exit(bdev->bd_disk->queue);bdev->bd_disk->fops->submit_bio(bio);
- } else {
blk_mq_submit_bio(bio);
- }
Indeed, this looks better and simpler. Will give it a spin to check.
On 5/23/25 10:20, Christoph Hellwig wrote:
Something like this completely untested patch:
diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 8f15d1aa6eb8..6841af8a989c 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -1306,16 +1306,18 @@ static void blk_zone_wplug_bio_work(struct work_struct *work) spin_unlock_irqrestore(&zwplug->lock, flags); bdev = bio->bi_bdev;
- submit_bio_noacct_nocheck(bio);
- /*
*/
- blk-mq devices will reuse the extra reference on the request queue
- usage counter we took when the BIO was plugged, but the submission
- path for BIO-based devices will not do that. So drop this extra
- reference here.
- if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO))
- if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) {
blk_queue_exit(bdev->bd_disk->queue);bdev->bd_disk->fops->submit_bio(bio);
- } else {
blk_mq_submit_bio(bio);
- }
put_zwplug: /* Drop the reference we took in disk_zone_wplug_schedule_bio_work(). */
I ran xfs on an SMR drive with this and I had no issues. Will do more test with a device mapper added.
On 5/23/25 1:20 AM, Christoph Hellwig wrote:
Something like this completely untested patch:
diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 8f15d1aa6eb8..6841af8a989c 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -1306,16 +1306,18 @@ static void blk_zone_wplug_bio_work(struct work_struct *work) spin_unlock_irqrestore(&zwplug->lock, flags); bdev = bio->bi_bdev;
- submit_bio_noacct_nocheck(bio);
- /*
*/
- blk-mq devices will reuse the extra reference on the request queue
- usage counter we took when the BIO was plugged, but the submission
- path for BIO-based devices will not do that. So drop this extra
- reference here.
- if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO))
- if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) {
blk_queue_exit(bdev->bd_disk->queue);bdev->bd_disk->fops->submit_bio(bio);
- } else {
blk_mq_submit_bio(bio);
- }
put_zwplug: /* Drop the reference we took in disk_zone_wplug_schedule_bio_work(). */
This patch works fine on my test setup.
Thanks,
Bart.
On 5/23/25 2:10 AM, Damien Le Moal wrote:
On 5/22/25 19:38, Jens Axboe wrote:
On 5/22/25 11:14 AM, Bart Van Assche wrote:
blk_mq_freeze_queue() never terminates if one or more bios are on the plug list and if the block device driver defines a .submit_bio() method. This is the case for device mapper drivers. The deadlock happens because blk_mq_freeze_queue() waits for q_usage_counter to drop to zero, because a queue reference is held by bios on the plug list and because the __bio_queue_enter() call in __submit_bio() waits for the queue to be unfrozen.
This patch fixes the following deadlock:
Workqueue: dm-51_zwplugs blk_zone_wplug_bio_work Call trace: __schedule+0xb08/0x1160 schedule+0x48/0xc8 __bio_queue_enter+0xcc/0x1d0 __submit_bio+0x100/0x1b0 submit_bio_noacct_nocheck+0x230/0x49c blk_zone_wplug_bio_work+0x168/0x250 process_one_work+0x26c/0x65c worker_thread+0x33c/0x498 kthread+0x110/0x134 ret_from_fork+0x10/0x20
Call trace: __switch_to+0x230/0x410 __schedule+0xb08/0x1160 schedule+0x48/0xc8 blk_mq_freeze_queue_wait+0x78/0xb8 blk_mq_freeze_queue+0x90/0xa4 queue_attr_store+0x7c/0xf0 sysfs_kf_write+0x98/0xc8 kernfs_fop_write_iter+0x12c/0x1d4 vfs_write+0x340/0x3ac ksys_write+0x78/0xe8
Cc: Christoph Hellwig hch@lst.de Cc: Damien Le Moal dlemoal@kernel.org Cc: Yu Kuai yukuai1@huaweicloud.com Cc: Ming Lei ming.lei@redhat.com Cc: stable@vger.kernel.org Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Bart Van Assche bvanassche@acm.org
Changes compared to v1: fixed a race condition. Call bio_zone_write_plugging() only before submitting the bio and not after it has been submitted.
block/blk-core.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index b862c66018f2..713fb3865260 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -621,6 +621,13 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q, return BLK_STS_OK; } +/*
- Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been
- set because this causes blk_mq_freeze_queue() to deadlock if
- blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios
- on the plug list is not necessary since a q_usage_counter reference is held
- while a bio is on the plug list.
- */
static void __submit_bio(struct bio *bio) { /* If plug is not used, add new plug here to cache nsecs time. */ @@ -633,8 +640,12 @@ static void __submit_bio(struct bio *bio) if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { blk_mq_submit_bio(bio);
- } else if (likely(bio_queue_enter(bio) == 0)) {
- } else { struct gendisk *disk = bio->bi_bdev->bd_disk;
bool zwp = bio_zone_write_plugging(bio);
if (unlikely(!zwp && bio_queue_enter(bio) != 0))
if ((bio->bi_opf & REQ_POLLED) && !(disk->queue->limits.features & BLK_FEAT_POLL)) {goto finish_plug;
@@ -643,9 +654,12 @@ static void __submit_bio(struct bio *bio) } else { disk->fops->submit_bio(bio); }
blk_queue_exit(disk->queue);
if (!zwp)
}blk_queue_exit(disk->queue);
This is pretty ugly, and I honestly absolutely hate how there's quite a bit of zoned_whatever sprinkling throughout the core code. What's the reason for not unplugging here, unaligned writes? Because you should presumable have the exact same issues on non-zoned devices if they have IO stuck in a plug (and doesn't get unplugged) while someone is waiting on a freeze.
A somewhat similar case was solved for IOPOLL and queue entering. That would be another thing to look at. Maybe a live enter could work if the plug itself pins it?
What about this patch, completely untested...
It's exactly the same thing, more zoned sprinkling in code that should not need to care. Only difference is that it moved to a function.
On Thu, May 22, 2025 at 10:14:05AM -0700, Bart Van Assche wrote:
Changes compared to v1: fixed a race condition. Call bio_zone_write_plugging() only before submitting the bio and not after it has been submitted.
And just like I told you for v1 we can't just bypass queue freezing.
And I'll ask you again to provide a reproducer and explain your findings. Without that you we are just waisting a lot of time and effort.
On 5/22/25 8:10 PM, Christoph Hellwig wrote:
And I'll ask you again to provide a reproducer and explain your findings. Without that you we are just waisting a lot of time and effort.
This link that points at a reproducer was already shared two days ago:
https://github.com/osandov/blktests/pull/171
I will post the reproducer in that pull request as a patch as requested in another email.
Bart.
linux-stable-mirror@lists.linaro.org