submit_bio() may be called recursively. To limit the stack depth, recursive calls result in bios being added to a list (current->bio_list). __submit_bio_noacct() sets up that list and maintains two lists with requests: * bio_list_on_stack[0] is the list with bios submitted by recursive submit_bio() calls from inside the latest __submit_bio() call. * bio_list_on_stack[1] is the list with bios submitted by recursive submit_bio() calls from inside previous __submit_bio() calls.
Make sure that bios are submitted to lower devices in the order these have been submitted by submit_bio() by adding new bios at the end of the list instead of at the front.
This patch fixes unaligned write errors that I encountered with F2FS submitting zoned writes to a dm driver stacked on top of a zoned UFS device.
Cc: Christoph Hellwig hch@lst.de Cc: Damien Le Moal dlemoal@kernel.org Cc: Yu Kuai yukuai1@huaweicloud.com Cc: Ming Lei ming.lei@redhat.com Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche bvanassche@acm.org --- block/blk-core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-core.c b/block/blk-core.c index b862c66018f2..4b728fa1c138 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -704,9 +704,9 @@ static void __submit_bio_noacct(struct bio *bio) /* * Now assemble so we handle the lowest level first. */ + bio_list_on_stack[0] = bio_list_on_stack[1]; bio_list_merge(&bio_list_on_stack[0], &lower); bio_list_merge(&bio_list_on_stack[0], &same); - bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); } while ((bio = bio_list_pop(&bio_list_on_stack[0])));
current->bio_list = NULL;
Hello Bart,
On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote:
submit_bio() may be called recursively. To limit the stack depth, recursive calls result in bios being added to a list (current->bio_list). __submit_bio_noacct() sets up that list and maintains two lists with requests:
- bio_list_on_stack[0] is the list with bios submitted by recursive submit_bio() calls from inside the latest __submit_bio() call.
- bio_list_on_stack[1] is the list with bios submitted by recursive submit_bio() calls from inside previous __submit_bio() calls.
Make sure that bios are submitted to lower devices in the order these have been submitted by submit_bio() by adding new bios at the end of the list instead of at the front.
This patch fixes unaligned write errors that I encountered with F2FS submitting zoned writes to a dm driver stacked on top of a zoned UFS device.
Cc: Christoph Hellwig hch@lst.de Cc: Damien Le Moal dlemoal@kernel.org Cc: Yu Kuai yukuai1@huaweicloud.com Cc: Ming Lei ming.lei@redhat.com Cc: stable@vger.kernel.org
Here you add stable to Cc, but you don't specify either 1) a minimum version e.g. stable@vger.kernel.org # v6.8+ or 2) a Fixes tag.
Kind regards, Niklas
On 5/15/25 12:19 AM, Niklas Cassel wrote:
Hello Bart,
On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote:
submit_bio() may be called recursively. To limit the stack depth, recursive calls result in bios being added to a list (current->bio_list). __submit_bio_noacct() sets up that list and maintains two lists with requests:
- bio_list_on_stack[0] is the list with bios submitted by recursive submit_bio() calls from inside the latest __submit_bio() call.
- bio_list_on_stack[1] is the list with bios submitted by recursive submit_bio() calls from inside previous __submit_bio() calls.
Make sure that bios are submitted to lower devices in the order these have been submitted by submit_bio() by adding new bios at the end of the list instead of at the front.
This patch fixes unaligned write errors that I encountered with F2FS submitting zoned writes to a dm driver stacked on top of a zoned UFS device.
Cc: Christoph Hellwig hch@lst.de Cc: Damien Le Moal dlemoal@kernel.org Cc: Yu Kuai yukuai1@huaweicloud.com Cc: Ming Lei ming.lei@redhat.com Cc: stable@vger.kernel.org
Here you add stable to Cc, but you don't specify either
- a minimum version e.g.
stable@vger.kernel.org # v6.8+ or 2) a Fixes tag.
Hi Niklas,
Let's add the following to this patch:
Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
Neil, since that commit was authored by you: the commit message is elaborate but the names of the drivers that needed that commit have not been mentioned. Which drivers needed that change? Additionally, can you please help with reviewing this patch:
https://lore.kernel.org/linux-block/20250514202937.2058598-2-bvanassche@acm....
Thanks,
Bart.
On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote:
/* * Now assemble so we handle the lowest level first. */
bio_list_merge(&bio_list_on_stack[0], &lower); bio_list_merge(&bio_list_on_stack[0], &same);bio_list_on_stack[0] = bio_list_on_stack[1];
bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
If I read this code correctly, this means that we no keep processing bios that already were on bio_list_on_stack[0] and the beginning of the loop in the next iteration(s) instead of finishing off the ones created by this iteration, which could lead to exhaustion of resources like mempool.
Note that this is a big if - the code is really hard to read, it should really grow a data structure for the on-stack list that has named members for both lists instead of the array magic.. :(
I'm still trying to understand your problem given that it wasn't described much. What I could think it is that bio_split_to_limits through bio_submit_split first re-submits the remainder bio using submit_bio_noacct, which the above should place on the same list and then later the stacking block drivers also submit the bio split off at the beginning, unlike blk-mq drivers that process it directly. But given that this resubmission should be on the lower list above I don't see how it causes problems.
On 5/15/25 9:47 PM, Christoph Hellwig wrote:
On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote:
/* * Now assemble so we handle the lowest level first. */
bio_list_merge(&bio_list_on_stack[0], &lower); bio_list_merge(&bio_list_on_stack[0], &same);bio_list_on_stack[0] = bio_list_on_stack[1];
bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
If I read this code correctly, this means that we no keep processing bios that already were on bio_list_on_stack[0] and the beginning of the loop in the next iteration(s) instead of finishing off the ones created by this iteration, which could lead to exhaustion of resources like mempool.
Note that this is a big if - the code is really hard to read, it should really grow a data structure for the on-stack list that has named members for both lists instead of the array magic.. :(
I'm still trying to understand your problem given that it wasn't described much. What I could think it is that bio_split_to_limits through bio_submit_split first re-submits the remainder bio using submit_bio_noacct, which the above should place on the same list and then later the stacking block drivers also submit the bio split off at the beginning, unlike blk-mq drivers that process it directly. But given that this resubmission should be on the lower list above I don't see how it causes problems.
Agreed that this should be root-caused. To my own frustration I do not yet have a full root-cause analysis. What I have done to obtain more information is to make the kernel issue a warning the first time a bio is added out-of-order at the end of the bio list. The following output appeared (sde is the zoned block device at the bottom of the stack):
[ 71.312492][ T1] bio_list_insert_sorted: inserting in the middle of a bio list [ 71.313483][ T1] print_bio_list(sde) sector 0x1b7520 size 0x10 [ 71.313034][ T1] bio_list_insert_sorted(sde) sector 0x1b7120 size 0x400 [ ... ] [ 71.368117][ T163] WARNING: CPU: 4 PID: 163 at block/blk-core.c:725 bio_list_insert_sorted+0x144/0x18c [ 71.386664][ T163] Workqueue: writeback wb_workfn (flush-253:49) [ 71.387110][ T163] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 71.393772][ T163] Call trace: [ 71.393988][ T163] bio_list_insert_sorted+0x144/0x18c [ 71.394338][ T163] submit_bio_noacct_nocheck+0xd8/0x4f4 [ 71.394696][ T163] submit_bio_noacct+0x32c/0x50c [ 71.395017][ T163] bio_submit_split+0xf0/0x1f8 [ 71.395349][ T163] bio_split_rw+0xdc/0xf0 [ 71.395631][ T163] blk_mq_submit_bio+0x320/0x940 [ 71.395970][ T163] __submit_bio+0xa4/0x1c4 [ 71.396260][ T163] submit_bio_noacct_nocheck+0x1c0/0x4f4 [ 71.396623][ T163] submit_bio_noacct+0x32c/0x50c [ 71.396942][ T163] submit_bio+0x17c/0x198 [ 71.397222][ T163] f2fs_submit_write_bio+0x94/0x154 [ 71.397604][ T163] __submit_merged_bio+0x80/0x204 [ 71.397933][ T163] __submit_merged_write_cond+0xd0/0x1fc [ 71.398297][ T163] f2fs_submit_merged_write+0x24/0x30 [ 71.398646][ T163] f2fs_sync_node_pages+0x5ec/0x64c [ 71.398999][ T163] f2fs_write_node_pages+0xe8/0x1dc [ 71.399338][ T163] do_writepages+0xe4/0x2f8 [ 71.399673][ T163] __writeback_single_inode+0x84/0x6e4 [ 71.400036][ T163] writeback_sb_inodes+0x2cc/0x5c0 [ 71.400369][ T163] wb_writeback+0x134/0x550 [ 71.400662][ T163] wb_workfn+0x154/0x588 [ 71.400937][ T163] process_one_work+0x26c/0x65c [ 71.401271][ T163] worker_thread+0x33c/0x498 [ 71.401575][ T163] kthread+0x110/0x134 [ 71.401844][ T163] ret_from_fork+0x10/0x20
I think that the above call stack indicates the following: f2fs_submit_write_bio() submits a bio to a dm driver, that the dm driver submitted a bio for the lower driver (SCSI core), that the bio for the lower driver is split by bio_split_rw(), and that the second half of the split bio triggers the above out-of-order warning.
This new patch should address the concerns brought up in your latest email:
diff --git a/block/blk-core.c b/block/blk-core.c index 411f005e6b1f..aa270588272a 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -649,6 +649,26 @@ static void __submit_bio(struct bio *bio) blk_finish_plug(&plug); }
+/* + * Insert a bio in LBA order. If no bio for the same bdev with a higher LBA is + * found, append at the end. + */ +static void bio_list_insert_sorted(struct bio_list *bl, struct bio *bio) +{ + struct block_device *bdev = bio->bi_bdev; + struct bio **pprev = &bl->head, *next; + sector_t sector = bio->bi_iter.bi_sector; + + for (next = *pprev; next; pprev = &next->bi_next, next = next->bi_next) + if (next->bi_bdev == bdev && sector < next->bi_iter.bi_sector) + break; + + bio->bi_next = next; + *pprev = bio; + if (!next) + bl->tail = bio; +} + /* * The loop in this function may be a bit non-obvious, and so deserves some * explanation: @@ -706,7 +726,8 @@ static void __submit_bio_noacct(struct bio *bio) */ bio_list_merge(&bio_list_on_stack[0], &lower); bio_list_merge(&bio_list_on_stack[0], &same); - bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); + while ((bio = bio_list_pop(&bio_list_on_stack[1]))) + bio_list_insert_sorted(&bio_list_on_stack[0], bio); } while ((bio = bio_list_pop(&bio_list_on_stack[0])));
current->bio_list = NULL; @@ -746,7 +767,7 @@ void submit_bio_noacct_nocheck(struct bio *bio) * it is active, and then process them after it returned. */ if (current->bio_list) - bio_list_add(¤t->bio_list[0], bio); + bio_list_insert_sorted(¤t->bio_list[0], bio); else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) __submit_bio_noacct_mq(bio); else
On Mon, May 19, 2025 at 03:12:11PM -0700, Bart Van Assche wrote:
This new patch should address the concerns brought up in your latest email:
No, we should never need to do a sort, as mentioned we need to fix how stackable drivers split the I/O. Or maybe even get them out of all the splits that aren't required.
On 5/20/25 6:56 AM, Christoph Hellwig wrote:
On Mon, May 19, 2025 at 03:12:11PM -0700, Bart Van Assche wrote:
This new patch should address the concerns brought up in your latest email:
No, we should never need to do a sort, as mentioned we need to fix how stackable drivers split the I/O. Or maybe even get them out of all the splits that aren't required.
If the sequential write bios are split by the device mapper, sorting bios in the block layer is not necessary. Christoph and Damien, do you agree to replace the bio sorting code in my previous email with the patch below?
Thanks,
Bart.
diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 3d419fd2be57..c41ab294987e 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1792,9 +1792,12 @@ static inline bool dm_zone_bio_needs_split(struct mapped_device *md, { /* * For mapped device that need zone append emulation, we must - * split any large BIO that straddles zone boundaries. + * split any large BIO that straddles zone boundaries. Additionally, + * split sequential writes to prevent that splitting lower in the stack + * causes reordering. */ - return dm_emulate_zone_append(md) && bio_straddles_zones(bio) && + return ((dm_emulate_zone_append(md) && bio_straddles_zones(bio)) || + bio_op(bio) == REQ_OP_WRITE) && !bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING); } static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio)
On Tue, May 20, 2025 at 11:09:15AM -0700, Bart Van Assche wrote:
If the sequential write bios are split by the device mapper, sorting bios in the block layer is not necessary. Christoph and Damien, do you agree to replace the bio sorting code in my previous email with the patch below?
No. First please create a reproducer for your issue using null_blk or scsi_debug, otherwise we have no way to understand what is going on here, and will regress in the future.
Second should very much be able to fix the splitting in dm to place the bios in the right order. As mentioned before I have a theory of how to do it, but we really need a proper reproducer to test this and then to write it up to blktests first.
On 5/20/25 10:53 PM, Christoph Hellwig wrote:
On Tue, May 20, 2025 at 11:09:15AM -0700, Bart Van Assche wrote:
If the sequential write bios are split by the device mapper, sorting bios in the block layer is not necessary. Christoph and Damien, do you agree to replace the bio sorting code in my previous email with the patch below?
No. First please create a reproducer for your issue using null_blk or scsi_debug, otherwise we have no way to understand what is going on here, and will regress in the future.
Second should very much be able to fix the splitting in dm to place the bios in the right order. As mentioned before I have a theory of how to do it, but we really need a proper reproducer to test this and then to write it up to blktests first.
Hi Christoph,
The following pull request includes a test that triggers the deadlock fixed by patch 2/2 reliably:
https://github.com/osandov/blktests/pull/171
I do not yet have a reproducer for the bio reordering but I'm still working on this.
Bart.
On Wed, May 21, 2025 at 02:18:12PM -0700, Bart Van Assche wrote:
The following pull request includes a test that triggers the deadlock fixed by patch 2/2 reliably:
Can you post the pathc please? I've not found any obvious way to download a patch from the above website.
linux-stable-mirror@lists.linaro.org