On 5/15/25 9:47 PM, Christoph Hellwig wrote:
On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote:
/* * Now assemble so we handle the lowest level first. */
bio_list_merge(&bio_list_on_stack[0], &lower); bio_list_merge(&bio_list_on_stack[0], &same);bio_list_on_stack[0] = bio_list_on_stack[1];
bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
If I read this code correctly, this means that we no keep processing bios that already were on bio_list_on_stack[0] and the beginning of the loop in the next iteration(s) instead of finishing off the ones created by this iteration, which could lead to exhaustion of resources like mempool.
Note that this is a big if - the code is really hard to read, it should really grow a data structure for the on-stack list that has named members for both lists instead of the array magic.. :(
I'm still trying to understand your problem given that it wasn't described much. What I could think it is that bio_split_to_limits through bio_submit_split first re-submits the remainder bio using submit_bio_noacct, which the above should place on the same list and then later the stacking block drivers also submit the bio split off at the beginning, unlike blk-mq drivers that process it directly. But given that this resubmission should be on the lower list above I don't see how it causes problems.
Agreed that this should be root-caused. To my own frustration I do not yet have a full root-cause analysis. What I have done to obtain more information is to make the kernel issue a warning the first time a bio is added out-of-order at the end of the bio list. The following output appeared (sde is the zoned block device at the bottom of the stack):
[ 71.312492][ T1] bio_list_insert_sorted: inserting in the middle of a bio list [ 71.313483][ T1] print_bio_list(sde) sector 0x1b7520 size 0x10 [ 71.313034][ T1] bio_list_insert_sorted(sde) sector 0x1b7120 size 0x400 [ ... ] [ 71.368117][ T163] WARNING: CPU: 4 PID: 163 at block/blk-core.c:725 bio_list_insert_sorted+0x144/0x18c [ 71.386664][ T163] Workqueue: writeback wb_workfn (flush-253:49) [ 71.387110][ T163] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 71.393772][ T163] Call trace: [ 71.393988][ T163] bio_list_insert_sorted+0x144/0x18c [ 71.394338][ T163] submit_bio_noacct_nocheck+0xd8/0x4f4 [ 71.394696][ T163] submit_bio_noacct+0x32c/0x50c [ 71.395017][ T163] bio_submit_split+0xf0/0x1f8 [ 71.395349][ T163] bio_split_rw+0xdc/0xf0 [ 71.395631][ T163] blk_mq_submit_bio+0x320/0x940 [ 71.395970][ T163] __submit_bio+0xa4/0x1c4 [ 71.396260][ T163] submit_bio_noacct_nocheck+0x1c0/0x4f4 [ 71.396623][ T163] submit_bio_noacct+0x32c/0x50c [ 71.396942][ T163] submit_bio+0x17c/0x198 [ 71.397222][ T163] f2fs_submit_write_bio+0x94/0x154 [ 71.397604][ T163] __submit_merged_bio+0x80/0x204 [ 71.397933][ T163] __submit_merged_write_cond+0xd0/0x1fc [ 71.398297][ T163] f2fs_submit_merged_write+0x24/0x30 [ 71.398646][ T163] f2fs_sync_node_pages+0x5ec/0x64c [ 71.398999][ T163] f2fs_write_node_pages+0xe8/0x1dc [ 71.399338][ T163] do_writepages+0xe4/0x2f8 [ 71.399673][ T163] __writeback_single_inode+0x84/0x6e4 [ 71.400036][ T163] writeback_sb_inodes+0x2cc/0x5c0 [ 71.400369][ T163] wb_writeback+0x134/0x550 [ 71.400662][ T163] wb_workfn+0x154/0x588 [ 71.400937][ T163] process_one_work+0x26c/0x65c [ 71.401271][ T163] worker_thread+0x33c/0x498 [ 71.401575][ T163] kthread+0x110/0x134 [ 71.401844][ T163] ret_from_fork+0x10/0x20
I think that the above call stack indicates the following: f2fs_submit_write_bio() submits a bio to a dm driver, that the dm driver submitted a bio for the lower driver (SCSI core), that the bio for the lower driver is split by bio_split_rw(), and that the second half of the split bio triggers the above out-of-order warning.
This new patch should address the concerns brought up in your latest email:
diff --git a/block/blk-core.c b/block/blk-core.c index 411f005e6b1f..aa270588272a 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -649,6 +649,26 @@ static void __submit_bio(struct bio *bio) blk_finish_plug(&plug); }
+/* + * Insert a bio in LBA order. If no bio for the same bdev with a higher LBA is + * found, append at the end. + */ +static void bio_list_insert_sorted(struct bio_list *bl, struct bio *bio) +{ + struct block_device *bdev = bio->bi_bdev; + struct bio **pprev = &bl->head, *next; + sector_t sector = bio->bi_iter.bi_sector; + + for (next = *pprev; next; pprev = &next->bi_next, next = next->bi_next) + if (next->bi_bdev == bdev && sector < next->bi_iter.bi_sector) + break; + + bio->bi_next = next; + *pprev = bio; + if (!next) + bl->tail = bio; +} + /* * The loop in this function may be a bit non-obvious, and so deserves some * explanation: @@ -706,7 +726,8 @@ static void __submit_bio_noacct(struct bio *bio) */ bio_list_merge(&bio_list_on_stack[0], &lower); bio_list_merge(&bio_list_on_stack[0], &same); - bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); + while ((bio = bio_list_pop(&bio_list_on_stack[1]))) + bio_list_insert_sorted(&bio_list_on_stack[0], bio); } while ((bio = bio_list_pop(&bio_list_on_stack[0])));
current->bio_list = NULL; @@ -746,7 +767,7 @@ void submit_bio_noacct_nocheck(struct bio *bio) * it is active, and then process them after it returned. */ if (current->bio_list) - bio_list_add(¤t->bio_list[0], bio); + bio_list_insert_sorted(¤t->bio_list[0], bio); else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) __submit_bio_noacct_mq(bio); else