On Thu, May 16, 2024 at 07:45:21AM -0600, Bart Van Assche wrote:
On 5/16/24 03:28, Wu Bo wrote:
Zoned devices require sequential writes within a zone. That means that if two requests target the same zone, the request with the lower position must be dispatched to the device first. Because each priority has its own tree & list, requests with a higher priority are dispatched first. So suppose requestA & requestB are on the same zone: requestA is BE with pos X+0, and requestB is RT with pos X+1. RequestB will be dispatched before requestA, and the zoned device returns an error.
This was found in a real-world scenario when using F2FS on a zoned device, and it is very easy to reproduce:
- Use fsstress to run 8 test processes
- Use ionice to change 4/8 processes to RT priority
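The reordering described above can be sketched with a small user-space model. This is not kernel code: `struct req`, `enum prio` and `dispatch_by_prio()` are hypothetical names, and the model keeps submission order inside each class, whereas mq-deadline also sorts by position. It only illustrates why serving the RT class first breaks the write order inside one zone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not kernel code: priority classes, highest first. */
enum prio { PRIO_RT = 0, PRIO_BE = 1, PRIO_IDLE = 2, PRIO_COUNT = 3 };

struct req {
	enum prio prio;     /* I/O priority class of the submitter */
	unsigned long pos;  /* write position inside the zone */
};

/*
 * Dispatch all requests class by class (RT, then BE, then IDLE), keeping
 * submission order inside each class. Every write is compared with the
 * zone write pointer, which starts at wp. Returns the number of writes
 * that do not land on the write pointer, i.e. the ones a zoned device
 * would reject.
 */
static int dispatch_by_prio(const struct req *reqs, size_t n, unsigned long wp)
{
	int errors = 0;

	assert(reqs != NULL || n == 0);

	for (int p = PRIO_RT; p < PRIO_COUNT; p++) {
		for (size_t i = 0; i < n; i++) {
			if ((int)reqs[i].prio != p)
				continue;
			if (reqs[i].pos == wp)
				wp++;       /* sequential write accepted */
			else
				errors++;   /* unaligned write rejected */
		}
	}
	return errors;
}
```

With requestA as `{ PRIO_BE, 100 }`, requestB as `{ PRIO_RT, 101 }` and a write pointer of 100, the model dispatches requestB first and counts one rejected write. With both requests in the BE class it counts none.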
Hi Wu,
I agree that there is a problem related to the interaction of I/O priority and zoned storage. A solution with a lower runtime overhead is available here: https://lore.kernel.org/linux-block/20231218211342.2179689-1-bvanassche@acm....
Hi Bart,
I have tried setting all sequential write requests to the same priority:
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 6a05dd86e8ca..b560846c63cb 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -841,7 +841,10 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	 */
 	blk_req_zone_write_unlock(rq);
 
-	prio = ioprio_class_to_prio[ioprio_class];
+	if (blk_rq_is_seq_zoned_write(rq))
+		prio = DD_BE_PRIO;
+	else
+		prio = ioprio_class_to_prio[ioprio_class];
 	per_prio = &dd->per_prio[prio];
 	if (!rq->elv.priv[0]) {
 		per_prio->stats.inserted++;
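As a rough illustration of what the change in the diff above does to priority selection, here is a standalone model. The enum names, the `class_to_prio` table and `select_prio()` are hypothetical stand-ins and only assume that the kernel's class-to-priority table maps the "none" class to BE. The point is that sequential zoned writes always land in the BE queue, while every other request, reads included, keeps the queue of its own class.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the I/O priority classes and scheduler queues. */
enum model_class { CLASS_NONE, CLASS_RT, CLASS_BE, CLASS_IDLE };
enum model_prio { MODEL_RT_PRIO, MODEL_BE_PRIO, MODEL_IDLE_PRIO };

/* Assumed mapping from priority class to scheduler queue. */
static const enum model_prio class_to_prio[] = {
	[CLASS_NONE] = MODEL_BE_PRIO,
	[CLASS_RT]   = MODEL_RT_PRIO,
	[CLASS_BE]   = MODEL_BE_PRIO,
	[CLASS_IDLE] = MODEL_IDLE_PRIO,
};

/*
 * Model of the patched selection: a sequential zoned write is forced into
 * the BE queue, anything else uses the queue of its own class.
 */
static enum model_prio select_prio(enum model_class cls, bool seq_zoned_write)
{
	assert((unsigned int)cls <= CLASS_IDLE);

	if (seq_zoned_write)
		return MODEL_BE_PRIO;
	return class_to_prio[cls];
}
```

In this model an RT task gets `MODEL_BE_PRIO` for its zoned writes but `MODEL_RT_PRIO` for its reads, so the reads and writes of one thread end up in different queues.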
I think this has the same effect as the patch you mentioned. Unfortunately, this fix causes another issue: all write requests are set to the same priority while read requests still have different priorities. This makes f2fs prone to hangs under stress testing:
[129412.105440][T1100129] vkhungtaskd: INFO: task "f2fs_ckpt-254:5":769 blocked for more than 193 seconds.
[129412.106629][T1100129] vkhungtaskd: 6.1.25-android14-11-maybe-dirty #1
[129412.107624][T1100129] vkhungtaskd: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[129412.108873][T1100129] vkhungtaskd: task:f2fs_ckpt-254:5 state:D stack:10496 pid:769 ppid:2 flags:0x00000408
[129412.110194][T1100129] vkhungtaskd: Call trace:
[129412.110769][T1100129] vkhungtaskd: __switch_to+0x174/0x338
[129412.111566][T1100129] vkhungtaskd: __schedule+0x604/0x9e4
[129412.112275][T1100129] vkhungtaskd: schedule+0x7c/0xe8
[129412.112938][T1100129] vkhungtaskd: rwsem_down_write_slowpath+0x4cc/0xf98
[129412.113813][T1100129] vkhungtaskd: down_write+0x38/0x40
[129412.114500][T1100129] vkhungtaskd: __write_checkpoint_sync+0x8c/0x11c
[129412.115409][T1100129] vkhungtaskd: __checkpoint_and_complete_reqs+0x54/0x1dc
[129412.116323][T1100129] vkhungtaskd: issue_checkpoint_thread+0x8c/0xec
[129412.117148][T1100129] vkhungtaskd: kthread+0x110/0x224
[129412.117826][T1100129] vkhungtaskd: ret_from_fork+0x10/0x20
[129412.484027][T1700129] vkhungtaskd: task:f2fs_gc-254:55 state:D stack:10832 pid:771 ppid:2 flags:0x00000408
[129412.485337][T1700129] vkhungtaskd: Call trace:
[129412.485906][T1700129] vkhungtaskd: __switch_to+0x174/0x338
[129412.486618][T1700129] vkhungtaskd: __schedule+0x604/0x9e4
[129412.487327][T1700129] vkhungtaskd: schedule+0x7c/0xe8
[129412.487985][T1700129] vkhungtaskd: io_schedule+0x38/0xc4
[129412.488675][T1700129] vkhungtaskd: folio_wait_bit_common+0x3d8/0x4f8
[129412.489496][T1700129] vkhungtaskd: __folio_lock+0x1c/0x2c
[129412.490196][T1700129] vkhungtaskd: __folio_lock_io+0x24/0x44
[129412.490936][T1700129] vkhungtaskd: __filemap_get_folio+0x190/0x400
[129412.491736][T1700129] vkhungtaskd: pagecache_get_page+0x1c/0x5c
[129412.492501][T1700129] vkhungtaskd: f2fs_wait_on_block_writeback+0x60/0xf8
[129412.493376][T1700129] vkhungtaskd: do_garbage_collect+0x1100/0x223c
[129412.494185][T1700129] vkhungtaskd: f2fs_gc+0x284/0x778
[129412.494858][T1700129] vkhungtaskd: gc_thread_func+0x304/0x838
[129412.495603][T1700129] vkhungtaskd: kthread+0x110/0x224
[129412.496271][T1700129] vkhungtaskd: ret_from_fork+0x10/0x20
I think this is because f2fs is a CoW filesystem. Some threads need to do a lot of reading & writing while holding a lock. When the reads and writes of such a thread have different priorities, the operation takes very long, and other FS operations are blocked in the meantime.
So I came up with this solution to fix the priority issue on zoned devices. It does raise the overhead, but it does fix the problem.
Thanks, Wu Bo
Are you OK with that alternative solution?
Thanks,
Bart.