Zoned devices request sequential writing on the same zone. That means if 2 requests on the saem zone, the lower pos request need to dispatch to device first. While different priority has it's own tree & list, request with high priority will be disptch first. So if requestA & requestB are on the same zone. RequestA is BE and pos is X+0. ReqeustB is RT and pos is X+1. RequestB will be disptched before requestA, which got an ERROR from zoned device.
This is found in a practice scenario when using F2FS on zoned device. And it is very easy to reproduce: 1. Use fsstress to run 8 test processes 2. Use ionice to change 4/8 processes to RT priority
Fixes: c807ab520fc3 ("block/mq-deadline: Add I/O priority support") Cc: stable@vger.kernel.org Signed-off-by: Wu Bo bo.wu@vivo.com --- block/mq-deadline.c | 31 +++++++++++++++++++++++++++++++ include/linux/blk-mq.h | 15 +++++++++++++++ 2 files changed, 46 insertions(+)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c index 02a916ba62ee..6a05dd86e8ca 100644 --- a/block/mq-deadline.c +++ b/block/mq-deadline.c @@ -539,6 +539,37 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd, if (started_after(dd, rq, latest_start)) return NULL;
+ if (!blk_rq_is_seq_zoned_write(rq)) + goto skip_check; + /* + * To ensure sequential writing, check the lower priority class to see + * if there is a request on the same zone and need to be dispatched + * first + */ + ioprio_class = dd_rq_ioclass(rq); + prio = ioprio_class_to_prio[ioprio_class]; + prio++; + for (; prio <= DD_PRIO_MAX; prio++) { + struct request *temp_rq; + unsigned long flags; + bool can_dispatch; + + if (!dd_queued(dd, prio)) + continue; + + temp_rq = deadline_from_pos(&dd->per_prio[prio], data_dir, blk_rq_pos(rq)); + if (temp_rq && blk_req_zone_in_one(temp_rq, rq) && + blk_rq_pos(temp_rq) < blk_rq_pos(rq)) { + spin_lock_irqsave(&dd->zone_lock, flags); + can_dispatch = blk_req_can_dispatch_to_zone(temp_rq); + spin_unlock_irqrestore(&dd->zone_lock, flags); + if (!can_dispatch) + return NULL; + rq = temp_rq; + per_prio = &dd->per_prio[prio]; + } + } +skip_check: /* * rq is the selected appropriate request. */ diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index d3d8fd8e229b..bca1e639e0f3 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -1202,6 +1202,15 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq) return true; return !blk_req_zone_is_write_locked(rq); } + +static inline bool blk_req_zone_in_one(struct request *rq_a, + struct request *rq_b) +{ + unsigned int zone_sectors = rq_a->q->limits.chunk_sectors; + + return round_down(blk_rq_pos(rq_a), zone_sectors) == + round_down(blk_rq_pos(rq_b), zone_sectors); +} #else /* CONFIG_BLK_DEV_ZONED */ static inline bool blk_rq_is_seq_zoned_write(struct request *rq) { @@ -1229,6 +1238,12 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq) { return true; } + +static inline bool blk_req_zone_in_one(struct request *rq_a, + struct request *rq_b) +{ + return false; +} #endif /* CONFIG_BLK_DEV_ZONED */
#endif /* BLK_MQ_H */
On 5/16/24 03:28, Wu Bo wrote:
Zoned devices request sequential writing on the same zone. That means if 2 requests on the saem zone, the lower pos request need to dispatch to device first. While different priority has it's own tree & list, request with high priority will be disptch first. So if requestA & requestB are on the same zone. RequestA is BE and pos is X+0. ReqeustB is RT and pos is X+1. RequestB will be disptched before requestA, which got an ERROR from zoned device.
This is found in a practice scenario when using F2FS on zoned device. And it is very easy to reproduce:
- Use fsstress to run 8 test processes
- Use ionice to change 4/8 processes to RT priority
Hi Wu,
I agree that there is a problem related to the interaction of I/O priority and zoned storage. A solution with a lower runtime overhead is available here: https://lore.kernel.org/linux-block/20231218211342.2179689-1-bvanassche@acm....
Are you OK with that alternative solution?
Thanks,
Bart.
On Thu, May 16, 2024 at 07:45:21AM -0600, Bart Van Assche wrote:
On 5/16/24 03:28, Wu Bo wrote:
Zoned devices request sequential writing on the same zone. That means if 2 requests on the saem zone, the lower pos request need to dispatch to device first. While different priority has it's own tree & list, request with high priority will be disptch first. So if requestA & requestB are on the same zone. RequestA is BE and pos is X+0. ReqeustB is RT and pos is X+1. RequestB will be disptched before requestA, which got an ERROR from zoned device.
This is found in a practice scenario when using F2FS on zoned device. And it is very easy to reproduce:
- Use fsstress to run 8 test processes
- Use ionice to change 4/8 processes to RT priority
Hi Wu,
I agree that there is a problem related to the interaction of I/O priority and zoned storage. A solution with a lower runtime overhead is available here: https://lore.kernel.org/linux-block/20231218211342.2179689-1-bvanassche@acm....
Hi Bart,
I have tried to set all seq write requests the same priority:
diff --git a/block/mq-deadline.c b/block/mq-deadline.c index 6a05dd86e8ca..b560846c63cb 100644 --- a/block/mq-deadline.c +++ b/block/mq-deadline.c @@ -841,7 +841,10 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, */ blk_req_zone_write_unlock(rq);
- prio = ioprio_class_to_prio[ioprio_class]; + if (blk_rq_is_seq_zoned_write(rq)) + prio = DD_BE_PRIO; + else + prio = ioprio_class_to_prio[ioprio_class]; per_prio = &dd->per_prio[prio]; if (!rq->elv.priv[0]) { per_prio->stats.inserted++;
I think this is the same effect as the patch you mentioned here. Unfortunatelly, this fix causes another issue. As all write requests are set to the same priority while read requests still have different priotities. This makes f2fs prone to hung when under stress test:
[129412.105440][T1100129] vkhungtaskd: INFO: task "f2fs_ckpt-254:5":769 blocked for more than 193 seconds. [129412.106629][T1100129] vkhungtaskd: 6.1.25-android14-11-maybe-dirty #1 [129412.107624][T1100129] vkhungtaskd: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [129412.108873][T1100129] vkhungtaskd: task:f2fs_ckpt-254:5 state:D stack:10496 pid:769 ppid:2 flags:0x00000408 [129412.110194][T1100129] vkhungtaskd: Call trace: [129412.110769][T1100129] vkhungtaskd: __switch_to+0x174/0x338 [129412.111566][T1100129] vkhungtaskd: __schedule+0x604/0x9e4 [129412.112275][T1100129] vkhungtaskd: schedule+0x7c/0xe8 [129412.112938][T1100129] vkhungtaskd: rwsem_down_write_slowpath+0x4cc/0xf98 [129412.113813][T1100129] vkhungtaskd: down_write+0x38/0x40 [129412.114500][T1100129] vkhungtaskd: __write_checkpoint_sync+0x8c/0x11c [129412.115409][T1100129] vkhungtaskd: __checkpoint_and_complete_reqs+0x54/0x1dc [129412.116323][T1100129] vkhungtaskd: issue_checkpoint_thread+0x8c/0xec [129412.117148][T1100129] vkhungtaskd: kthread+0x110/0x224 [129412.117826][T1100129] vkhungtaskd: ret_from_fork+0x10/0x20 [129412.484027][T1700129] vkhungtaskd: task:f2fs_gc-254:55 state:D stack:10832 pid:771 ppid:2 flags:0x00000408 [129412.485337][T1700129] vkhungtaskd: Call trace: [129412.485906][T1700129] vkhungtaskd: __switch_to+0x174/0x338 [129412.486618][T1700129] vkhungtaskd: __schedule+0x604/0x9e4 [129412.487327][T1700129] vkhungtaskd: schedule+0x7c/0xe8 [129412.487985][T1700129] vkhungtaskd: io_schedule+0x38/0xc4 [129412.488675][T1700129] vkhungtaskd: folio_wait_bit_common+0x3d8/0x4f8 [129412.489496][T1700129] vkhungtaskd: __folio_lock+0x1c/0x2c [129412.490196][T1700129] vkhungtaskd: __folio_lock_io+0x24/0x44 [129412.490936][T1700129] vkhungtaskd: __filemap_get_folio+0x190/0x400 [129412.491736][T1700129] vkhungtaskd: pagecache_get_page+0x1c/0x5c [129412.492501][T1700129] vkhungtaskd: f2fs_wait_on_block_writeback+0x60/0xf8 [129412.493376][T1700129] vkhungtaskd: do_garbage_collect+0x1100/0x223c [129412.494185][T1700129] vkhungtaskd: f2fs_gc+0x284/0x778 [129412.494858][T1700129] vkhungtaskd: gc_thread_func+0x304/0x838 [129412.495603][T1700129] vkhungtaskd: kthread+0x110/0x224 [129412.496271][T1700129] vkhungtaskd: ret_from_fork+0x10/0x20
I think because f2fs is a CoW filesystem. Some threads holding lock need much reading & writing at the same time. Different reading & writing priority of this thread makes this process very long. And other FS operations will be blocked.
So I figured this solution to fix this priority issue on zoned device. It sure raises the overhead but can do fix it.
Thanks, Wu Bo
Are you OK with that alternative solution?
Thanks,
Bart.
On 5/16/24 18:44, Wu Bo wrote:
So I figured this solution to fix this priority issue on zoned device. It sure raises the overhead but can do fix it.
Something I should have realized earlier is that this patch is not necessary with the latest upstream kernel (v6.10-rc1). Damien's zoned write plugging patch series has been merged. Hence, I/O schedulers, including the mq-deadline I/O schedulers, will only see a single zoned write at a time per zone. So it is no longer possible that zoned writes are reordered by the I/O scheduler because of their I/O priorities.
Thanks,
Bart.
On 2024/5/18 01:53, Bart Van Assche wrote:
On 5/16/24 18:44, Wu Bo wrote:
So I figured this solution to fix this priority issue on zoned device. It sure raises the overhead but can do fix it.
Something I should have realized earlier is that this patch is not necessary with the latest upstream kernel (v6.10-rc1). Damien's zoned write plugging patch series has been merged. Hence, I/O schedulers, including the mq-deadline I/O schedulers, will only see a single zoned write at a time per zone. So it is no longer possible that zoned writes are reordered by the I/O scheduler because of their I/O priorities.
Hi Bart,
Yes, I noticed that 'zone write plugging' has been merged to latest branch. But it seems hard to backport to old version which mq-deadline priority feature has been merged. So is it possible to apply this fix to old versions?
Thanks, Wu Bo
Thanks,
Bart.
On 5/17/24 18:52, Wu Bo wrote:
Yes, I noticed that 'zone write plugging' has been merged to latest branch. But it seems hard to backport to old version which mq-deadline priority feature has been merged. So is it possible to apply this fix to old versions?
If you need this change in the Android kernel, please either submit a CL to the Android kernel repository or submit a request to Android Partner Engineering program.
Thanks,
Bart.
linux-stable-mirror@lists.linaro.org