On Thu, May 04, 2023 at 12:34:25AM +0800, Yue Zhao wrote:
Recently we found that a bug related to ext4 buffer heads was fixed by commit 0b73284c564d ("ext4: ext4_read_bh_lock() should submit IO if the buffer isn't uptodate")[1].
This bug has been fixed in some long-term kernel versions, such as 5.10 and 5.15. However, on the 5.4 stable version, we can still easily reproduce it by adding some delay after buffer_migrate_lock_buffers() in __buffer_migrate_page() and running fsstress on the ext4 filesystem. We then get errors in dmesg like:
EXT4-fs error (device pmem1): __ext4_find_entry:1658: inode #73193: comm fsstress: reading directory lblock 0
EXT4-fs error (device pmem1): __ext4_find_entry:1658: inode #75334: comm fsstress: reading directory lblock 0
As for how to fix this bug in 5.4, I currently have three ideas, but I don't know which one is best, or whether there is another feasible way to fix it elegantly on the 5.4 stable branch.
The first idea comes from this thread[2]. In __buffer_migrate_page(), we can make pages that are not uptodate fall back to migrate_page, as fallback_migrate_page() does, since pages that have buffers will probably be read soon. From [3], we can see that this solution is not good enough, because there are other places that lock the buffer without doing IO. Still, I think it can be a candidate fix if we do not want to change much. Also, based on my test results, the ext4 filesystem remains stable after a one-week stress test with this patch applied.
The second idea is to backport a series of commits from upstream, such as:
2d069c0889ef ("ext4: use common helpers in all places reading metadata buffers")
0b73284c564d ("ext4: ext4_read_bh_lock() should submit IO if the buffer isn't uptodate")
79f597842069 ("fs/buffer: remove ll_rw_block() helper")
Backporting the original upstream commits is almost always the correct solution. Please try doing that instead of a one-off patch like this.
thanks,
greg k-h