[BUG] With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash with the following VM_BUG_ON_FOLIO():
BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28 BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28 page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664 aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774" flags: 0x2fffff80004028(uptodate|lru|private|node=0|zone=2|lastcpupid=0xfffff) page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio)) ------------[ cut here ]------------ kernel BUG at mm/page-writeback.c:2992! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G OE 6.12.0-rc7-custom+ #87 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] pc : folio_clear_dirty_for_io+0x128/0x258 lr : folio_clear_dirty_for_io+0x128/0x258 Call trace: folio_clear_dirty_for_io+0x128/0x258 btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs] __process_folios_contig+0x154/0x268 [btrfs] extent_clear_unlock_delalloc+0x5c/0x80 [btrfs] run_delalloc_nocow+0x5f8/0x760 [btrfs] btrfs_run_delalloc_range+0xa8/0x220 [btrfs] writepage_delalloc+0x230/0x4c8 [btrfs] extent_writepage+0xb8/0x358 [btrfs] extent_write_cache_pages+0x21c/0x4e8 [btrfs] btrfs_writepages+0x94/0x150 [btrfs] do_writepages+0x74/0x190 filemap_fdatawrite_wbc+0x88/0xc8 start_delalloc_inodes+0x178/0x3a8 [btrfs] btrfs_start_delalloc_roots+0x174/0x280 [btrfs] shrink_delalloc+0x114/0x280 [btrfs] flush_space+0x250/0x2f8 [btrfs] btrfs_async_reclaim_data_space+0x180/0x228 [btrfs] process_one_work+0x164/0x408 worker_thread+0x25c/0x388 kthread+0x100/0x118 ret_from_fork+0x10/0x20 Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000) ---[ end trace 0000000000000000 ]---
[CAUSE] The first two lines of extra debug messages show the problem is caused by the error handling of run_delalloc_nocow().
E.g. we have the following dirtied range (4K blocksize 4K page size):
0 16K 32K |//////////////////////////////////////| | Pre-allocated |
And the range [0, 16K) has a preallocated extent.
- Enter run_delalloc_nocow() for range [0, 16K) Which found range [0, 16K) is preallocated, can do the proper NOCOW write.
- Enter fallback_to_fow() for range [16K, 32K) Since the range [16K, 32K) is not backed by preallocated extent, we have to go COW.
- cow_file_range() failed for range [16K, 32K) So cow_file_range() will do the clean up by clearing folio dirty, unlock the folios.
Now the folios in range [16K, 32K) is unlocked.
- Enter extent_clear_unlock_delalloc() from run_delalloc_nocow() Which is called with PAGE_START_WRITEBACK to start page writeback. But folios can only be marked writeback when it's properly locked, thus this triggered the VM_BUG_ON_FOLIO().
Furthermore there is another hidden but common bug that run_delalloc_nocow() is not clearing the folio dirty flags in its error handling path. This is the common bug shared between run_delalloc_nocow() and cow_file_range().
[FIX] - Clear folio dirty for range [@start, @cur_offset) Introduce a helper, cleanup_dirty_folios(), which will find and lock the folio in the range, clear the dirty flag and start/end the writeback, with the extra handling for the @locked_folio.
- Introduce a helper to record the last failed COW range end This is to trace which range we should skip, to avoid double unlocking.
- Skip the failed COW range for the error handling
Cc: stable@vger.kernel.org Signed-off-by: Qu Wenruo wqu@suse.com --- fs/btrfs/inode.c | 93 ++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 86 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 19c88b7d0363..bae8aceb3eae 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1961,6 +1961,48 @@ static int can_nocow_file_extent(struct btrfs_path *path, return ret < 0 ? ret : can_nocow; }
+static void cleanup_dirty_folios(struct btrfs_inode *inode, + struct folio *locked_folio, + u64 start, u64 end, int error) +{ + struct btrfs_fs_info *fs_info = inode->root->fs_info; + struct address_space *mapping = inode->vfs_inode.i_mapping; + pgoff_t start_index = start >> PAGE_SHIFT; + pgoff_t end_index = end >> PAGE_SHIFT; + u32 len; + + ASSERT(end + 1 - start < U32_MAX); + ASSERT(IS_ALIGNED(start, fs_info->sectorsize) && + IS_ALIGNED(end + 1, fs_info->sectorsize)); + len = end + 1 - start; + + /* + * Handle the locked folio first. + * btrfs_folio_clamp_*() helpers can handle range out of the folio case. + */ + btrfs_folio_clamp_clear_dirty(fs_info, locked_folio, start, len); + btrfs_folio_clamp_set_writeback(fs_info, locked_folio, start, len); + btrfs_folio_clamp_clear_writeback(fs_info, locked_folio, start, len); + + for (pgoff_t index = start_index; index <= end_index; index++) { + struct folio *folio; + + /* Already handled at the beginning. */ + if (index == locked_folio->index) + continue; + folio = __filemap_get_folio(mapping, index, FGP_LOCK, GFP_NOFS); + /* Cache already dropped, no need to do any cleanup. */ + if (IS_ERR(folio)) + continue; + btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len); + btrfs_folio_clamp_set_writeback(fs_info, folio, start, len); + btrfs_folio_clamp_clear_writeback(fs_info, folio, start, len); + folio_unlock(folio); + folio_put(folio); + } + mapping_set_error(mapping, error); +} + /* * when nowcow writeback call back. This checks for snapshots or COW copies * of the extents that exist in the file, and COWs the file as required. @@ -1976,6 +2018,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, struct btrfs_root *root = inode->root; struct btrfs_path *path; u64 cow_start = (u64)-1; + /* + * If not 0, represents the inclusive end of the last fallback_to_cow() + * range. Only for error handling. + */ + u64 cow_end = 0; u64 cur_offset = start; int ret; bool check_prev = true; @@ -2136,6 +2183,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, found_key.offset - 1); cow_start = (u64)-1; if (ret) { + cow_end = found_key.offset - 1; btrfs_dec_nocow_writers(nocow_bg); goto error; } @@ -2209,11 +2257,12 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, cow_start = cur_offset;
if (cow_start != (u64)-1) { - cur_offset = end; ret = fallback_to_cow(inode, locked_folio, cow_start, end); cow_start = (u64)-1; - if (ret) + if (ret) { + cow_end = end; goto error; + } }
btrfs_free_path(path); @@ -2221,12 +2270,42 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
error: /* - * If an error happened while a COW region is outstanding, cur_offset - * needs to be reset to cow_start to ensure the COW region is unlocked - * as well. + * There are several error cases: + * + * 1) Failed without falling back to COW + * start cur_start end + * |/////////////| | + * + * For range [start, cur_start) the folios are already unlocked (except + * @locked_folio), EXTENT_DELALLOC already removed. + * Only need to clear the dirty flag as they will never be submitted. + * Ordered extent and extent maps are handled by + * btrfs_mark_ordered_io_finished() inside run_delalloc_range(). + * + * 2) Failed with error from fallback_to_cow() + * start cur_start cow_end end + * |/////////////|-----------| | + * + * For range [start, cur_start) it's the same as case 1). + * But for range [cur_start, cow_end), the folios have dirty flag + * cleared and unlocked, EXTENT_DEALLLOC cleared. + * There may or may not be any ordered extents/extent maps allocated. + * + * We should not call extent_clear_unlock_delalloc() on range [cur_start, + * cow_end), as the folios are already unlocked. + * + * So clear the folio dirty flags for [start, cur_offset) first. */ - if (cow_start != (u64)-1) - cur_offset = cow_start; + if (cur_offset > start) + cleanup_dirty_folios(inode, locked_folio, start, cur_offset - 1, ret); + + /* + * If an error happened while a COW region is outstanding, cur_offset + * needs to be reset to @cow_end + 1 to skip the COW range, as + * cow_file_range() will do the proper cleanup at error. + */ + if (cow_end) + cur_offset = cow_end + 1;
/* * We need to lock the extent here because we're clearing DELALLOC and
On Thu, Dec 12, 2024 at 04:43:59PM +1030, Qu Wenruo wrote:
[BUG] With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash with the following VM_BUG_ON_FOLIO():
BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28 BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28 page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664 aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774" flags: 0x2fffff80004028(uptodate|lru|private|node=0|zone=2|lastcpupid=0xfffff) page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio)) ------------[ cut here ]------------ kernel BUG at mm/page-writeback.c:2992! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G OE 6.12.0-rc7-custom+ #87 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] pc : folio_clear_dirty_for_io+0x128/0x258 lr : folio_clear_dirty_for_io+0x128/0x258 Call trace: folio_clear_dirty_for_io+0x128/0x258 btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs] __process_folios_contig+0x154/0x268 [btrfs] extent_clear_unlock_delalloc+0x5c/0x80 [btrfs] run_delalloc_nocow+0x5f8/0x760 [btrfs] btrfs_run_delalloc_range+0xa8/0x220 [btrfs] writepage_delalloc+0x230/0x4c8 [btrfs] extent_writepage+0xb8/0x358 [btrfs] extent_write_cache_pages+0x21c/0x4e8 [btrfs] btrfs_writepages+0x94/0x150 [btrfs] do_writepages+0x74/0x190 filemap_fdatawrite_wbc+0x88/0xc8 start_delalloc_inodes+0x178/0x3a8 [btrfs] btrfs_start_delalloc_roots+0x174/0x280 [btrfs] shrink_delalloc+0x114/0x280 [btrfs] flush_space+0x250/0x2f8 [btrfs] btrfs_async_reclaim_data_space+0x180/0x228 [btrfs] process_one_work+0x164/0x408 worker_thread+0x25c/0x388 kthread+0x100/0x118 ret_from_fork+0x10/0x20 Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000) ---[ end trace 0000000000000000 ]---
[CAUSE] The first two lines of extra debug messages show the problem is caused by the error handling of run_delalloc_nocow().
E.g. we have the following dirtied range (4K blocksize 4K page size):
0 16K 32K |//////////////////////////////////////| | Pre-allocated |
And the range [0, 16K) has a preallocated extent.
Enter run_delalloc_nocow() for range [0, 16K) Which found range [0, 16K) is preallocated, can do the proper NOCOW write.
Enter fallback_to_fow() for range [16K, 32K) Since the range [16K, 32K) is not backed by preallocated extent, we have to go COW.
cow_file_range() failed for range [16K, 32K) So cow_file_range() will do the clean up by clearing folio dirty, unlock the folios.
Now the folios in range [16K, 32K) is unlocked.
Enter extent_clear_unlock_delalloc() from run_delalloc_nocow() Which is called with PAGE_START_WRITEBACK to start page writeback. But folios can only be marked writeback when it's properly locked, thus this triggered the VM_BUG_ON_FOLIO().
Furthermore there is another hidden but common bug that run_delalloc_nocow() is not clearing the folio dirty flags in its error handling path. This is the common bug shared between run_delalloc_nocow() and cow_file_range().
[FIX]
Clear folio dirty for range [@start, @cur_offset) Introduce a helper, cleanup_dirty_folios(), which will find and lock the folio in the range, clear the dirty flag and start/end the writeback, with the extra handling for the @locked_folio.
Introduce a helper to record the last failed COW range end This is to trace which range we should skip, to avoid double unlocking.
Skip the failed COW range for the error handling
Cc: stable@vger.kernel.org
Reviewed-by: Boris Burkov boris@bur.io
Signed-off-by: Qu Wenruo wqu@suse.com
fs/btrfs/inode.c | 93 ++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 86 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 19c88b7d0363..bae8aceb3eae 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1961,6 +1961,48 @@ static int can_nocow_file_extent(struct btrfs_path *path, return ret < 0 ? ret : can_nocow; }
I like this function. Can you add a simple doc with pre and post conditions please?
+static void cleanup_dirty_folios(struct btrfs_inode *inode,
struct folio *locked_folio,
u64 start, u64 end, int error)
+{
- struct btrfs_fs_info *fs_info = inode->root->fs_info;
- struct address_space *mapping = inode->vfs_inode.i_mapping;
- pgoff_t start_index = start >> PAGE_SHIFT;
- pgoff_t end_index = end >> PAGE_SHIFT;
- u32 len;
- ASSERT(end + 1 - start < U32_MAX);
- ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
IS_ALIGNED(end + 1, fs_info->sectorsize));
- len = end + 1 - start;
- /*
* Handle the locked folio first.
* btrfs_folio_clamp_*() helpers can handle range out of the folio case.
*/
- btrfs_folio_clamp_clear_dirty(fs_info, locked_folio, start, len);
- btrfs_folio_clamp_set_writeback(fs_info, locked_folio, start, len);
- btrfs_folio_clamp_clear_writeback(fs_info, locked_folio, start, len);
Could this clear dirty; set writeback; clear writeback sequence benefit from a good name and a helper function too?
- for (pgoff_t index = start_index; index <= end_index; index++) {
struct folio *folio;
/* Already handled at the beginning. */
if (index == locked_folio->index)
continue;
folio = __filemap_get_folio(mapping, index, FGP_LOCK, GFP_NOFS);
/* Cache already dropped, no need to do any cleanup. */
if (IS_ERR(folio))
continue;
btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len);
btrfs_folio_clamp_set_writeback(fs_info, folio, start, len);
btrfs_folio_clamp_clear_writeback(fs_info, folio, start, len);
folio_unlock(folio);
folio_put(folio);
- }
- mapping_set_error(mapping, error);
+}
/*
- when nowcow writeback call back. This checks for snapshots or COW copies
- of the extents that exist in the file, and COWs the file as required.
@@ -1976,6 +2018,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, struct btrfs_root *root = inode->root; struct btrfs_path *path; u64 cow_start = (u64)-1;
- /*
* If not 0, represents the inclusive end of the last fallback_to_cow()
* range. Only for error handling.
*/
- u64 cow_end = 0; u64 cur_offset = start; int ret; bool check_prev = true;
@@ -2136,6 +2183,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, found_key.offset - 1); cow_start = (u64)-1; if (ret) {
cow_end = found_key.offset - 1; btrfs_dec_nocow_writers(nocow_bg); goto error; }
@@ -2209,11 +2257,12 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, cow_start = cur_offset; if (cow_start != (u64)-1) {
ret = fallback_to_cow(inode, locked_folio, cow_start, end); cow_start = (u64)-1;cur_offset = end;
if (ret)
if (ret) {
cow_end = end; goto error;
}}
btrfs_free_path(path); @@ -2221,12 +2270,42 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, error: /*
* If an error happened while a COW region is outstanding, cur_offset
* needs to be reset to cow_start to ensure the COW region is unlocked
* as well.
* There are several error cases:
*
* 1) Failed without falling back to COW
* start cur_start end
* |/////////////| |
*
* For range [start, cur_start) the folios are already unlocked (except
* @locked_folio), EXTENT_DELALLOC already removed.
* Only need to clear the dirty flag as they will never be submitted.
* Ordered extent and extent maps are handled by
* btrfs_mark_ordered_io_finished() inside run_delalloc_range().
*
* 2) Failed with error from fallback_to_cow()
* start cur_start cow_end end
* |/////////////|-----------| |
*
* For range [start, cur_start) it's the same as case 1).
* But for range [cur_start, cow_end), the folios have dirty flag
* cleared and unlocked, EXTENT_DEALLLOC cleared.
* There may or may not be any ordered extents/extent maps allocated.
*
* We should not call extent_clear_unlock_delalloc() on range [cur_start,
* cow_end), as the folios are already unlocked.
*
I think it would be helpful to include cur_offset in your drawings.
*/* So clear the folio dirty flags for [start, cur_offset) first.
- if (cow_start != (u64)-1)
cur_offset = cow_start;
- if (cur_offset > start)
cleanup_dirty_folios(inode, locked_folio, start, cur_offset - 1, ret);
- /*
* If an error happened while a COW region is outstanding, cur_offset
* needs to be reset to @cow_end + 1 to skip the COW range, as
* cow_file_range() will do the proper cleanup at error.
*/
- if (cow_end)
cur_offset = cow_end + 1;
/* * We need to lock the extent here because we're clearing DELALLOC and -- 2.47.1
在 2025/1/10 09:56, Boris Burkov 写道: [...]
I like this function. Can you add a simple doc with pre and post conditions please?
Sure, no problem.
Would be something like this:
/* * To cleanup dirty folios when failed to run a delalloc range. * * When running a delalloc range, we may need to split into * different extents (fragmentation or NOCOW limits), and if * we hit error, previous successfully executed ranges also need * to have their dirty flags cleared, with the address space marked * as error. */
+static void cleanup_dirty_folios(struct btrfs_inode *inode,
struct folio *locked_folio,
u64 start, u64 end, int error)
+{
- struct btrfs_fs_info *fs_info = inode->root->fs_info;
- struct address_space *mapping = inode->vfs_inode.i_mapping;
- pgoff_t start_index = start >> PAGE_SHIFT;
- pgoff_t end_index = end >> PAGE_SHIFT;
- u32 len;
- ASSERT(end + 1 - start < U32_MAX);
- ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
IS_ALIGNED(end + 1, fs_info->sectorsize));
- len = end + 1 - start;
- /*
* Handle the locked folio first.
* btrfs_folio_clamp_*() helpers can handle range out of the folio case.
*/
- btrfs_folio_clamp_clear_dirty(fs_info, locked_folio, start, len);
- btrfs_folio_clamp_set_writeback(fs_info, locked_folio, start, len);
- btrfs_folio_clamp_clear_writeback(fs_info, locked_folio, start, len);
Could this clear dirty; set writeback; clear writeback sequence benefit from a good name and a helper function too?
Sure, what about btrfs_folio_clamp_finish_io()?
- for (pgoff_t index = start_index; index <= end_index; index++) {
struct folio *folio;
/* Already handled at the beginning. */
if (index == locked_folio->index)
continue;
folio = __filemap_get_folio(mapping, index, FGP_LOCK, GFP_NOFS);
/* Cache already dropped, no need to do any cleanup. */
if (IS_ERR(folio))
continue;
btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len);
btrfs_folio_clamp_set_writeback(fs_info, folio, start, len);
btrfs_folio_clamp_clear_writeback(fs_info, folio, start, len);
folio_unlock(folio);
folio_put(folio);
- }
- mapping_set_error(mapping, error);
+}
- /*
- when nowcow writeback call back. This checks for snapshots or COW copies
- of the extents that exist in the file, and COWs the file as required.
@@ -1976,6 +2018,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, struct btrfs_root *root = inode->root; struct btrfs_path *path; u64 cow_start = (u64)-1;
- /*
* If not 0, represents the inclusive end of the last fallback_to_cow()
* range. Only for error handling.
*/
- u64 cow_end = 0; u64 cur_offset = start; int ret; bool check_prev = true;
@@ -2136,6 +2183,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, found_key.offset - 1); cow_start = (u64)-1; if (ret) {
cow_end = found_key.offset - 1; btrfs_dec_nocow_writers(nocow_bg); goto error; }
@@ -2209,11 +2257,12 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, cow_start = cur_offset;
if (cow_start != (u64)-1) {
ret = fallback_to_cow(inode, locked_folio, cow_start, end); cow_start = (u64)-1;cur_offset = end;
if (ret)
if (ret) {
cow_end = end; goto error;
}
}
btrfs_free_path(path);
@@ -2221,12 +2270,42 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
error: /*
* If an error happened while a COW region is outstanding, cur_offset
* needs to be reset to cow_start to ensure the COW region is unlocked
* as well.
* There are several error cases:
*
* 1) Failed without falling back to COW
* start cur_start end
* |/////////////| |
*
* For range [start, cur_start) the folios are already unlocked (except
* @locked_folio), EXTENT_DELALLOC already removed.
* Only need to clear the dirty flag as they will never be submitted.
* Ordered extent and extent maps are handled by
* btrfs_mark_ordered_io_finished() inside run_delalloc_range().
*
* 2) Failed with error from fallback_to_cow()
* start cur_start cow_end end
* |/////////////|-----------| |
*
* For range [start, cur_start) it's the same as case 1).
* But for range [cur_start, cow_end), the folios have dirty flag
* cleared and unlocked, EXTENT_DEALLLOC cleared.
* There may or may not be any ordered extents/extent maps allocated.
*
* We should not call extent_clear_unlock_delalloc() on range [cur_start,
* cow_end), as the folios are already unlocked.
*
I think it would be helpful to include cur_offset in your drawings.
I noticed this when crafting a new patch too, there is no @cur_start at all, but only @cur_offset.
Will fix it in the next update.
Thanks again for the detailed review, Qu
*/* So clear the folio dirty flags for [start, cur_offset) first.
- if (cow_start != (u64)-1)
cur_offset = cow_start;
if (cur_offset > start)
cleanup_dirty_folios(inode, locked_folio, start, cur_offset - 1, ret);
/*
* If an error happened while a COW region is outstanding, cur_offset
* needs to be reset to @cow_end + 1 to skip the COW range, as
* cow_file_range() will do the proper cleanup at error.
*/
if (cow_end)
cur_offset = cow_end + 1;
/*
- We need to lock the extent here because we're clearing DELALLOC and
-- 2.47.1
linux-stable-mirror@lists.linaro.org