The patch below does not apply to the 6.12-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to stable@vger.kernel.org.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.12.y
git checkout FETCH_HEAD
git cherry-pick -x e9e3b22ddfa760762b696ac6417c8d6edd182e49
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to 'stable@vger.kernel.org' --in-reply-to '2026011242-empirical-gullible-4683@gregkh' --subject-prefix 'PATCH 6.12.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e9e3b22ddfa760762b696ac6417c8d6edd182e49 Mon Sep 17 00:00:00 2001
From: Qu Wenruo wqu@suse.com
Date: Thu, 11 Dec 2025 12:45:17 +1030
Subject: [PATCH] btrfs: fix beyond-EOF write handling
[BUG]
For the following write sequence with 64K page size and 4K fs block size, file extent items will be inserted without any data checksum:
mkfs.btrfs -s 4k -f $dev > /dev/null
mount $dev $mnt
xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c "pwrite 60k 64K" \
          -c "truncate 16k" $mnt/foobar
umount $mnt
This will result in the following 2 file extent items being inserted (extra trace point added to insert_ordered_extent_file_extent()):

btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0
btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384
Note for file offset 60K, we're inserting a file extent without any data checksum.
Also note that range [32K, 36K) didn't reach insert_ordered_extent_file_extent(), which is the correct behavior: that OE is fully truncated and should not result in any file extent.
Although the file extent at 60K will later be dropped by btrfs_truncate(), if the transaction gets committed after the file extent is inserted but before it is dropped, there is a small window where we have a file extent beyond EOF without any data checksum.
That will cause "btrfs check" to report an error.
[CAUSE]
The sequence happens like this:
- Buffered write dirtied the page cache and updated isize
  Now the inode size is 64K, with the following page cache layout:

  0             16K           32K           48K           64K
  |/////////////|             |//|                     |//|

- Truncate the inode to 16K
  Which will trigger writeback through:

  btrfs_setsize()
  |- truncate_setsize()
  |  Now the inode size is set to 16K
  |
  |- btrfs_truncate()
     |- btrfs_wait_ordered_range() for [16K, u64(-1)]
        |- btrfs_fdatawrite_range() for [16K, u64(-1)]
           |- extent_writepage() for folio 0
              |- writepage_delalloc()
              |  Generated OE for [0, 16K), [32K, 36K) and [60K, 64K)
              |
              |- extent_writepage_io()
Then inside extent_writepage_io(), the dirty fs blocks are handled differently:
- Submit write for range [0, 16K)
  As they are still inside the inode size (16K).
- Mark OE [32K, 36K) as truncated
  Since we only call btrfs_lookup_first_ordered_range() once, which returned the first OE after file offset 16K.
- Mark all OEs inside range [16K, 64K) as finished
  Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished.
  For OE [32K, 36K), since it's already marked as truncated and its truncated length is 0, no file extent will be inserted.
  For OE [60K, 64K), it has never been submitted and thus has no data checksum, yet we insert the file extent as usual. This is the root cause of the file extent at 60K being inserted without any data checksum.
- Clear dirty flags for range [16K, 64K)
  It is the function btrfs_folio_clear_dirty() which searches and clears any dirty blocks inside that range.
[FIX]
The bug itself was introduced a long time ago, way before subpage and large folio support.
At that time, fs block size must match page size, thus the range [cur, end) is just one fs block.
But later with subpage and large folios, the same range [cur, end) can have multiple blocks and ordered extents.
Later commit 18de34daa7c6 ("btrfs: truncate ordered extent when skipping writeback past i_size") was fixing a bug related to subpage/large folios, but it's still utilizing the old range [cur, end), meaning only the first OE will be marked as truncated.
The proper fix here is to make EOF handling block-by-block, instead of trying to handle the whole range up to @end.
This way we always locate and truncate the OE for every dirty block.
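The difference between the two strategies can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct toy_oe`, `lookup_first_oe()`, `finish_range()` and the two `eof_*()` functions are hypothetical stand-ins, not the kernel implementation. One 64K folio is modelled as 16 blocks of 4K, and ordered extents are kept in a plain array sorted by start block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_BLOCKS 16	/* one 64K folio split into 4K blocks */

/* Toy ordered extent: a block range plus the two flags this sketch tracks. */
struct toy_oe {
	unsigned int start;	/* first block, inclusive */
	unsigned int len;	/* number of blocks */
	bool truncated;
	bool finished;
};

static bool overlaps(const struct toy_oe *oe, unsigned int start, unsigned int len)
{
	return oe->start < start + len && start < oe->start + oe->len;
}

/* Return the first OE overlapping [start, start + len); array sorted by start. */
static struct toy_oe *lookup_first_oe(struct toy_oe *oes, size_t nr,
				      unsigned int start, unsigned int len)
{
	for (size_t i = 0; i < nr; i++)
		if (overlaps(&oes[i], start, len))
			return &oes[i];
	return NULL;
}

/* Mark every OE overlapping [start, start + len) as finished. */
static void finish_range(struct toy_oe *oes, size_t nr,
			 unsigned int start, unsigned int len)
{
	for (size_t i = 0; i < nr; i++)
		if (overlaps(&oes[i], start, len))
			oes[i].finished = true;
}

/* Old behaviour: one lookup for the rest of the folio, then stop. */
static void eof_whole_range(struct toy_oe *oes, size_t nr,
			    const bool *dirty, unsigned int eof_block)
{
	for (unsigned int bit = 0; bit < NR_BLOCKS; bit++) {
		struct toy_oe *oe;

		if (!dirty[bit] || bit < eof_block)
			continue;	/* clean, or inside i_size (would be submitted) */
		oe = lookup_first_oe(oes, nr, bit, NR_BLOCKS - bit);
		assert(oe);
		oe->truncated = true;	/* only the first OE gets truncated */
		finish_range(oes, nr, bit, NR_BLOCKS - bit);
		break;
	}
}

/* Fixed behaviour: handle exactly one block per iteration and keep going. */
static void eof_per_block(struct toy_oe *oes, size_t nr,
			  const bool *dirty, unsigned int eof_block)
{
	for (unsigned int bit = 0; bit < NR_BLOCKS; bit++) {
		struct toy_oe *oe;

		if (!dirty[bit] || bit < eof_block)
			continue;
		oe = lookup_first_oe(oes, nr, bit, 1);
		assert(oe);
		oe->truncated = true;
		finish_range(oes, nr, bit, 1);
	}
}

/* Count OEs beyond EOF that were finished without being truncated. */
static unsigned int count_finished_untruncated(const struct toy_oe *oes, size_t nr,
					       unsigned int eof_block)
{
	unsigned int count = 0;

	for (size_t i = 0; i < nr; i++)
		if (oes[i].start >= eof_block && oes[i].finished && !oes[i].truncated)
			count++;
	return count;
}
```

With OEs at blocks 0-3, 8 and 15 and EOF at block 4 (the layout from the reproducer), the whole-range variant leaves the OE at block 15 finished but not truncated, which corresponds to the file extent at 60K in the report; the per-block variant leaves none.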
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana fdmanana@suse.com
Signed-off-by: Qu Wenruo wqu@suse.com
Signed-off-by: David Sterba dsterba@suse.com
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2d32dfc34ae3..97748d0d54d9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1728,7 +1728,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			struct btrfs_ordered_extent *ordered;
 
 			ordered = btrfs_lookup_first_ordered_range(inode, cur,
-								   folio_end - cur);
+								   fs_info->sectorsize);
 			/*
 			 * We have just run delalloc before getting here, so
 			 * there must be an ordered extent.
@@ -1742,7 +1742,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			btrfs_put_ordered_extent(ordered);
 
 			btrfs_mark_ordered_io_finished(inode, folio, cur,
-						       end - cur, true);
+						       fs_info->sectorsize, true);
 			/*
 			 * This range is beyond i_size, thus we don't need to
 			 * bother writing back.
@@ -1751,8 +1751,8 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			 * writeback the sectors with subpage dirty bits,
 			 * causing writeback without ordered extent.
 			 */
-			btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);
-			break;
+			btrfs_folio_clear_dirty(fs_info, folio, cur, fs_info->sectorsize);
+			continue;
 		}
 		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
 		if (unlikely(ret < 0)) {
From: Qu Wenruo wqu@suse.com
[ Upstream commit a7858d5c36cae52eaf3048490b05c0b19086073b ]
[BUG]
If we failed to compress the range, or cannot reserve a large enough data extent (e.g. too fragmented free space), we will fall back to submit_uncompressed_range().
But inside submit_uncompressed_range(), run_delalloc_cow() can also fail due to -ENOSPC or any other error.
In that case there are 3 bugs in the error handling:
1) Double freeing for the same ordered extent
   This can lead to a crash due to ordered extent double accounting
2) Start/end writeback without updating the subpage writeback bitmap
3) Unlock the folio without clearing the subpage lock bitmap
Both bugs 2) and 3) will crash the kernel if the btrfs block size is smaller than the folio size, as the next time the folio gets writeback/lock updates, subpage will find the bitmap already has the range set, triggering an ASSERT().
[CAUSE]
Bug 1) happens in the following call chain:

submit_uncompressed_range()
|- run_delalloc_cow()
|  |- cow_file_range()
|     |- btrfs_reserve_extent()
|        Failed with -ENOSPC or whatever error
|
|- btrfs_cleanup_ordered_extents()
|  |- btrfs_mark_ordered_io_finished()
|     Which cleans all the ordered extents in the async_extent range.
|
|- btrfs_mark_ordered_io_finished()
   Which cleans the folio range.
The finished ordered extents may not be immediately removed from the ordered io tree, as they are removed inside a work queue.
So the second btrfs_mark_ordered_io_finished() may find the finished but not-yet-removed ordered extents, and double free them.
Furthermore, the second btrfs_mark_ordered_io_finished() is not subpage compatible, as it uses fixed folio_pos() with PAGE_SIZE, which can cover other ordered extents.
Bugs 2) and 3) are more straightforward: btrfs just calls folio_unlock(), folio_start_writeback() and folio_end_writeback() directly, rather than the helpers which handle subpage cases.
[FIX]
For bug 1) since the first btrfs_cleanup_ordered_extents() call is handling the whole range, we should not do the second btrfs_mark_ordered_io_finished() call.
And for the first btrfs_cleanup_ordered_extents(), we no longer need to pass the @locked_page parameter, as we are already in the async extent context, thus will never rely on the error handling inside btrfs_run_delalloc_range().
So just let btrfs_cleanup_ordered_extents() handle every folio equally.
For bug 2) we should not even call folio_start_writeback()/folio_end_writeback() anymore. Per the error handling protocol, cow_file_range() should clear the dirty flag and start/finish the writeback for the whole range passed in.
For bug 3) just change the folio_unlock() to btrfs_folio_end_lock() helper.
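Why a plain folio_unlock() is not enough in the subpage case can be shown with a toy model of the lock bitmap. This is a sketch under assumptions, not btrfs code: `struct toy_subpage`, `toy_set_lock()` and `toy_end_lock()` are hypothetical names, the folio is fixed at 16 blocks, and the kernel's ASSERT() is modelled as a `false` return value.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_BLOCKS_PER_FOLIO 16U	/* assumed: 64K folio, 4K blocks */

/* Toy stand-in for the per-folio lock bitmap (one bit per block). */
struct toy_subpage {
	unsigned long locked;
};

static unsigned long range_mask(unsigned int start_bit, unsigned int nbits)
{
	assert(nbits > 0 && start_bit + nbits <= TOY_BLOCKS_PER_FOLIO);
	return ((1UL << nbits) - 1) << start_bit;
}

/* Lock a block range; returns false where the kernel would hit the ASSERT(). */
static bool toy_set_lock(struct toy_subpage *sp, unsigned int start_bit,
			 unsigned int nbits)
{
	const unsigned long mask = range_mask(start_bit, nbits);

	if (sp->locked & mask)
		return false;	/* range already marked locked */
	sp->locked |= mask;
	return true;
}

/* Subpage-aware unlock: clears the bits before the folio is released. */
static void toy_end_lock(struct toy_subpage *sp, unsigned int start_bit,
			 unsigned int nbits)
{
	sp->locked &= ~range_mask(start_bit, nbits);
}
```

If the error path releases the folio without calling the equivalent of `toy_end_lock()`, the bits stay set and the next `toy_set_lock()` on the same range fails, which mirrors the ASSERT() described for bug 3).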
Reviewed-by: Boris Burkov boris@bur.io
Signed-off-by: Qu Wenruo wqu@suse.com
Signed-off-by: David Sterba dsterba@suse.com
Stable-dep-of: e9e3b22ddfa7 ("btrfs: fix beyond-EOF write handling")
Signed-off-by: Sasha Levin sashal@kernel.org
---
 fs/btrfs/inode.c | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ce13b0ec978ed..38323620b819e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1159,19 +1159,10 @@ static void submit_uncompressed_range(struct btrfs_inode *inode,
 			       &wbc, false);
 	wbc_detach_inode(&wbc);
 	if (ret < 0) {
-		btrfs_cleanup_ordered_extents(inode, locked_folio,
-					      start, end - start + 1);
-		if (locked_folio) {
-			const u64 page_start = folio_pos(locked_folio);
-
-			folio_start_writeback(locked_folio);
-			folio_end_writeback(locked_folio);
-			btrfs_mark_ordered_io_finished(inode, locked_folio,
-						       page_start, PAGE_SIZE,
-						       !ret);
-			mapping_set_error(locked_folio->mapping, ret);
-			folio_unlock(locked_folio);
-		}
+		btrfs_cleanup_ordered_extents(inode, NULL, start, end - start + 1);
+		if (locked_folio)
+			btrfs_folio_end_lock(inode->root->fs_info, locked_folio,
+					     start, async_extent->ram_size);
 	}
 }
 
From: Qu Wenruo wqu@suse.com
[ Upstream commit 61d730731b47eeee42ad11fc71e145d269acab8d ]
For btrfs_folio_assert_not_dirty() and btrfs_folio_set_lock(), we call bitmap_test_range_all_zero() to ensure the involved range has no dirty/lock bit already set.
However with my recent enhanced delalloc range error handling, I was hitting the ASSERT() inside btrfs_folio_set_lock(), and it turns out that some error handling path is not properly updating the folio flags.
So add some extra dumping for the ASSERTs to dump the involved bitmap to help debug.
Reviewed-by: Boris Burkov boris@bur.io
Signed-off-by: Qu Wenruo wqu@suse.com
Reviewed-by: David Sterba dsterba@suse.com
Signed-off-by: David Sterba dsterba@suse.com
Stable-dep-of: e9e3b22ddfa7 ("btrfs: fix beyond-EOF write handling")
Signed-off-by: Sasha Levin sashal@kernel.org
---
 fs/btrfs/subpage.c | 41 ++++++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 71a56aaac7ad2..a1a358addc581 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -652,6 +652,28 @@ IMPLEMENT_BTRFS_PAGE_OPS(ordered, folio_set_ordered, folio_clear_ordered,
 IMPLEMENT_BTRFS_PAGE_OPS(checked, folio_set_checked, folio_clear_checked,
 			 folio_test_checked);
 
+#define GET_SUBPAGE_BITMAP(subpage, fs_info, name, dst)			\
+{									\
+	const int sectors_per_page = fs_info->sectors_per_page;		\
+									\
+	ASSERT(sectors_per_page < BITS_PER_LONG);			\
+	*dst = bitmap_read(subpage->bitmaps,				\
+			   sectors_per_page * btrfs_bitmap_nr_##name,	\
+			   sectors_per_page);				\
+}
+
+#define SUBPAGE_DUMP_BITMAP(fs_info, folio, name, start, len)		\
+{									\
+	const struct btrfs_subpage *subpage = folio_get_private(folio);	\
+	unsigned long bitmap;						\
+									\
+	GET_SUBPAGE_BITMAP(subpage, fs_info, name, &bitmap);		\
+	btrfs_warn(fs_info,						\
+	"dumpping bitmap start=%llu len=%u folio=%llu " #name "_bitmap=%*pbl", \
+		   start, len, folio_pos(folio),			\
+		   fs_info->sectors_per_page, &bitmap);			\
+}
+
 /*
  * Make sure not only the page dirty bit is cleared, but also subpage dirty bit
  * is cleared.
@@ -677,6 +699,10 @@ void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info,
 	subpage = folio_get_private(folio);
 	ASSERT(subpage);
 	spin_lock_irqsave(&subpage->lock, flags);
+	if (unlikely(!bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits))) {
+		SUBPAGE_DUMP_BITMAP(fs_info, folio, dirty, start, len);
+		ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
+	}
 	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
@@ -706,23 +732,16 @@ void btrfs_folio_set_lock(const struct btrfs_fs_info *fs_info,
 	nbits = len >> fs_info->sectorsize_bits;
 	spin_lock_irqsave(&subpage->lock, flags);
 	/* Target range should not yet be locked. */
-	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
+	if (unlikely(!bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits))) {
+		SUBPAGE_DUMP_BITMAP(fs_info, folio, locked, start, len);
+		ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
+	}
 	bitmap_set(subpage->bitmaps, start_bit, nbits);
 	ret = atomic_add_return(nbits, &subpage->nr_locked);
 	ASSERT(ret <= fs_info->sectors_per_page);
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
-#define GET_SUBPAGE_BITMAP(subpage, fs_info, name, dst)			\
-{									\
-	const int sectors_per_page = fs_info->sectors_per_page;		\
-									\
-	ASSERT(sectors_per_page < BITS_PER_LONG);			\
-	*dst = bitmap_read(subpage->bitmaps,				\
-			   sectors_per_page * btrfs_bitmap_nr_##name,	\
-			   sectors_per_page);				\
-}
-
 void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
 				      struct folio *folio, u64 start, u32 len)
 {
From: Qu Wenruo wqu@suse.com
[ Upstream commit 975a6a8855f45729a0fbfe2a8f2df2d3faef2a97 ]
All the error handling bugs I hit so far are all -ENOSPC from either:
- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
Previously when those functions failed, there was no error message at all, making the debugging much harder.
So here we introduce extra error messages for:
- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
- writepage_delalloc() when btrfs_run_delalloc_range() failed
- extent_writepage() when extent_writepage_io() failed
One example of the new debug error messages is the following one:
run fstests generic/750 at 2024-12-08 12:41:41
BTRFS: device fsid 461b25f5-e240-4543-8deb-e7c2bd01a6d3 devid 1 transid 8 /dev/mapper/test-scratch1 (253:4) scanned by mount (2436600)
BTRFS info (device dm-4): first mount of filesystem 461b25f5-e240-4543-8deb-e7c2bd01a6d3
BTRFS info (device dm-4): using crc32c (crc32c-arm64) checksum algorithm
BTRFS info (device dm-4): forcing free space tree for sector size 4096 with page size 65536
BTRFS info (device dm-4): using free-space-tree
BTRFS warning (device dm-4): read-write for sector size 4096 with page size 65536 is experimental
BTRFS info (device dm-4): checking UUID tree
BTRFS error (device dm-4): cow_file_range failed, root=363 inode=412 start=503808 len=98304: -28
BTRFS error (device dm-4): run_delalloc_nocow failed, root=363 inode=412 start=503808 len=98304: -28
BTRFS error (device dm-4): failed to run delalloc range, root=363 ino=412 folio=458752 submit_bitmap=11-15 start=503808 len=98304: -28
Which shows an error from cow_file_range() which is called inside a nocow write attempt, along with the extra bitmap from writepage_delalloc().
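As a reading aid, the numbers in the last log line can be related to each other with a little arithmetic. The helper below is a hypothetical sketch (`range_to_folio_bits()` is not a kernel function); it maps a byte range onto block indexes inside one folio, clamping at the folio end, using the 4096-byte sector size and 65536-byte page size reported in the log.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive range of block indexes inside one folio. */
struct bit_range {
	unsigned int first;
	unsigned int last;
};

/*
 * Map the byte range [start, start + len) onto block indexes of the folio
 * starting at folio_pos, clamping the range at the end of the folio.
 */
static struct bit_range range_to_folio_bits(uint64_t folio_pos, uint64_t folio_size,
					    uint32_t blocksize,
					    uint64_t start, uint64_t len)
{
	const uint64_t folio_end = folio_pos + folio_size;
	const uint64_t end = (start + len < folio_end) ? start + len : folio_end;
	struct bit_range r;

	assert(len > 0 && start >= folio_pos && start < folio_end);
	r.first = (unsigned int)((start - folio_pos) / blocksize);
	r.last = (unsigned int)((end - 1 - folio_pos) / blocksize);
	return r;
}
```

For folio=458752, start=503808 and len=98304, the range begins at block (503808 - 458752) / 4096 = 11 and runs past the end of the folio, so it covers blocks 11 through 15 of this folio. That happens to match submit_bitmap=11-15 in this log; in general the submit bitmap reflects the dirty blocks queued for submission and need not coincide with the failed range.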
Reviewed-by: Boris Burkov boris@bur.io
Signed-off-by: Qu Wenruo wqu@suse.com
Reviewed-by: David Sterba dsterba@suse.com
Signed-off-by: David Sterba dsterba@suse.com
Stable-dep-of: e9e3b22ddfa7 ("btrfs: fix beyond-EOF write handling")
Signed-off-by: Sasha Levin sashal@kernel.org
---
 fs/btrfs/extent_io.c | 15 +++++++++++++++
 fs/btrfs/inode.c     | 12 ++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d8d9f4c95c7ab..4c5288251f78f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1322,6 +1322,15 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 						       wbc);
 			if (ret >= 0)
 				last_finished_delalloc_end = found_start + found_len;
+			if (unlikely(ret < 0))
+				btrfs_err_rl(fs_info,
+"failed to run delalloc range, root=%lld ino=%llu folio=%llu submit_bitmap=%*pbl start=%llu len=%u: %d",
+					     btrfs_root_id(inode->root),
+					     btrfs_ino(inode),
+					     folio_pos(folio),
+					     fs_info->sectors_per_page,
+					     &bio_ctrl->submit_bitmap,
+					     found_start, found_len, ret);
 		} else {
 			/*
 			 * We've hit an error during previous delalloc range,
@@ -1621,6 +1630,12 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
 				  PAGE_SIZE, bio_ctrl, i_size);
 	if (ret == 1)
 		return 0;
+	if (ret < 0)
+		btrfs_err_rl(fs_info,
+"failed to submit blocks, root=%lld inode=%llu folio=%llu submit_bitmap=%*pbl: %d",
+			     btrfs_root_id(inode->root), btrfs_ino(inode),
+			     folio_pos(folio), fs_info->sectors_per_page,
+			     &bio_ctrl->submit_bitmap, ret);
 
 	bio_ctrl->wbc->nr_to_write--;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 38323620b819e..b1d450459f736 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1163,6 +1163,10 @@ static void submit_uncompressed_range(struct btrfs_inode *inode,
 		if (locked_folio)
 			btrfs_folio_end_lock(inode->root->fs_info, locked_folio,
 					     start, async_extent->ram_size);
+		btrfs_err_rl(inode->root->fs_info,
+			     "%s failed, root=%llu inode=%llu start=%llu len=%llu: %d",
+			     __func__, btrfs_root_id(inode->root),
+			     btrfs_ino(inode), start, async_extent->ram_size, ret);
 	}
 }
 
@@ -1623,6 +1627,10 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 					     &cached, clear_bits, page_ops);
 		btrfs_qgroup_free_data(inode, NULL, start, end - start + 1, NULL);
 	}
+	btrfs_err_rl(fs_info,
+		     "%s failed, root=%llu inode=%llu start=%llu len=%llu: %d",
+		     __func__, btrfs_root_id(inode->root),
+		     btrfs_ino(inode), orig_start, end + 1 - orig_start, ret);
 	return ret;
 }
 
@@ -2373,6 +2381,10 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 		btrfs_qgroup_free_data(inode, NULL, cur_offset, end - cur_offset + 1, NULL);
 	}
 	btrfs_free_path(path);
+	btrfs_err_rl(fs_info,
+		     "%s failed, root=%llu inode=%llu start=%llu len=%llu: %d",
+		     __func__, btrfs_root_id(inode->root),
+		     btrfs_ino(inode), start, end + 1 - start, ret);
 	return ret;
 }
 
From: Qu Wenruo wqu@suse.com
[ Upstream commit 619611e87fcca1fdaa67c2bf6b030863ab90216e ]
For the future large folio support, our filemap can have folios with different sizes, thus we can no longer rely on a fixed blocks_per_page value.
To prepare for that future, here we do:
- Remove btrfs_fs_info::sectors_per_page
- Introduce a helper, btrfs_blocks_per_folio()
  Which uses the folio size to calculate the number of blocks for each folio.
- Migrate the existing btrfs_fs_info::sectors_per_page to use that helper
  There are some exceptions:
  * Metadata nodesize < page size support
    In the future, even if we support large folios, we will only allocate a folio that matches our nodesize. Thus we won't have a folio covering multiple metadata unless nodesize < page size.
  * Existing subpage bitmap dump
    We use a single unsigned long to store the bitmap. That means until we change the bitmap dumping code, our upper limit for folio size will only be 256K (4K block size, 64 bit unsigned long).
  * btrfs_is_subpage() check
    This will be migrated into a future patch.
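The helper and the 256K figure above come down to two shifts. The following is a userspace sketch, not the kernel code: `blocks_per_folio()` mirrors the shape of btrfs_blocks_per_folio() from this patch but takes plain integers, and `max_folio_size_for_one_long()` is a hypothetical name introduced only to show where the limit comes from.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Number of fs blocks in a folio: folio size divided by the block size. */
static unsigned int blocks_per_folio(uint64_t folio_size, unsigned int sectorsize_bits)
{
	assert(sectorsize_bits < 64);
	return (unsigned int)(folio_size >> sectorsize_bits);
}

/*
 * Folio size at which the per-state bitmap needs one bit for every bit
 * of an unsigned long, i.e. the most a single-long dump can describe.
 */
static uint64_t max_folio_size_for_one_long(unsigned int sectorsize_bits)
{
	const uint64_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;

	return bits_per_long << sectorsize_bits;
}
```

With a 4K block size (sectorsize_bits = 12), a 64K folio holds 16 blocks, and a 64-bit unsigned long gives 64 << 12 = 262144 bytes, the 256K stated above. Note that the ASSERT() in GET_SUBPAGE_BITMAP uses a strict `<` against BITS_PER_LONG, so a folio of exactly that size would still trip it as the code stands.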
Signed-off-by: Qu Wenruo wqu@suse.com
Reviewed-by: David Sterba dsterba@suse.com
Signed-off-by: David Sterba dsterba@suse.com
Stable-dep-of: e9e3b22ddfa7 ("btrfs: fix beyond-EOF write handling")
Signed-off-by: Sasha Levin sashal@kernel.org
---
 fs/btrfs/disk-io.c   |   1 -
 fs/btrfs/extent_io.c |  26 ++++++-----
 fs/btrfs/fs.h        |   7 ++-
 fs/btrfs/subpage.c   | 104 ++++++++++++++++++++++++++-----------------
 4 files changed, 84 insertions(+), 54 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3a73d218af464..39fe4385ed361 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3320,7 +3320,6 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	fs_info->nodesize = nodesize;
 	fs_info->sectorsize = sectorsize;
 	fs_info->sectorsize_bits = ilog2(sectorsize);
-	fs_info->sectors_per_page = (PAGE_SIZE >> fs_info->sectorsize_bits);
 	fs_info->csums_per_leaf = BTRFS_MAX_ITEM_SIZE(fs_info) / fs_info->csum_size;
 	fs_info->stripesize = stripesize;
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 4c5288251f78f..1ca890f02da91 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1182,7 +1182,7 @@ static bool find_next_delalloc_bitmap(struct folio *folio, { struct btrfs_fs_info *fs_info = folio_to_fs_info(folio); const u64 folio_start = folio_pos(folio); - const unsigned int bitmap_size = fs_info->sectors_per_page; + const unsigned int bitmap_size = btrfs_blocks_per_folio(fs_info, folio); unsigned int start_bit; unsigned int first_zero; unsigned int first_set; @@ -1224,6 +1224,7 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode, const bool is_subpage = btrfs_is_subpage(fs_info, folio->mapping); const u64 page_start = folio_pos(folio); const u64 page_end = page_start + folio_size(folio) - 1; + const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio); unsigned long delalloc_bitmap = 0; /* * Save the last found delalloc end. As the delalloc end can go beyond @@ -1249,13 +1250,13 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
/* Save the dirty bitmap as our submission bitmap will be a subset of it. */ if (btrfs_is_subpage(fs_info, inode->vfs_inode.i_mapping)) { - ASSERT(fs_info->sectors_per_page > 1); + ASSERT(blocks_per_folio > 1); btrfs_get_subpage_dirty_bitmap(fs_info, folio, &bio_ctrl->submit_bitmap); } else { bio_ctrl->submit_bitmap = 1; }
- for_each_set_bit(bit, &bio_ctrl->submit_bitmap, fs_info->sectors_per_page) { + for_each_set_bit(bit, &bio_ctrl->submit_bitmap, blocks_per_folio) { u64 start = page_start + (bit << fs_info->sectorsize_bits);
btrfs_folio_set_lock(fs_info, folio, start, fs_info->sectorsize); @@ -1328,7 +1329,7 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode, btrfs_root_id(inode->root), btrfs_ino(inode), folio_pos(folio), - fs_info->sectors_per_page, + blocks_per_folio, &bio_ctrl->submit_bitmap, found_start, found_len, ret); } else { @@ -1373,7 +1374,7 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode, unsigned int bitmap_size = min( (last_finished_delalloc_end - page_start) >> fs_info->sectorsize_bits, - fs_info->sectors_per_page); + blocks_per_folio);
for_each_set_bit(bit, &bio_ctrl->submit_bitmap, bitmap_size) btrfs_mark_ordered_io_finished(inode, folio, @@ -1397,7 +1398,7 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode, * If all ranges are submitted asynchronously, we just need to account * for them here. */ - if (bitmap_empty(&bio_ctrl->submit_bitmap, fs_info->sectors_per_page)) { + if (bitmap_empty(&bio_ctrl->submit_bitmap, blocks_per_folio)) { wbc->nr_to_write -= delalloc_to_write; return 1; } @@ -1498,6 +1499,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, bool submitted_io = false; int found_error = 0; const u64 folio_start = folio_pos(folio); + const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio); u64 cur; int bit; int ret = 0; @@ -1516,11 +1518,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, for (cur = start; cur < start + len; cur += fs_info->sectorsize) set_bit((cur - folio_start) >> fs_info->sectorsize_bits, &range_bitmap); bitmap_and(&bio_ctrl->submit_bitmap, &bio_ctrl->submit_bitmap, &range_bitmap, - fs_info->sectors_per_page); + blocks_per_folio);
bio_ctrl->end_io_func = end_bbio_data_write;
- for_each_set_bit(bit, &bio_ctrl->submit_bitmap, fs_info->sectors_per_page) { + for_each_set_bit(bit, &bio_ctrl->submit_bitmap, blocks_per_folio) { cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits);
if (cur >= i_size) { @@ -1595,6 +1597,7 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl size_t pg_offset; loff_t i_size = i_size_read(&inode->vfs_inode); unsigned long end_index = i_size >> PAGE_SHIFT; + const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio);
trace_extent_writepage(folio, &inode->vfs_inode, bio_ctrl->wbc);
@@ -1634,7 +1637,7 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl btrfs_err_rl(fs_info, "failed to submit blocks, root=%lld inode=%llu folio=%llu submit_bitmap=%*pbl: %d", btrfs_root_id(inode->root), btrfs_ino(inode), - folio_pos(folio), fs_info->sectors_per_page, + folio_pos(folio), blocks_per_folio, &bio_ctrl->submit_bitmap, ret);
bio_ctrl->wbc->nr_to_write--; @@ -1929,9 +1932,10 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc) u64 folio_start = folio_pos(folio); int bit_start = 0; int sectors_per_node = fs_info->nodesize >> fs_info->sectorsize_bits; + const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio);
/* Lock and write each dirty extent buffers in the range */ - while (bit_start < fs_info->sectors_per_page) { + while (bit_start < blocks_per_folio) { struct btrfs_subpage *subpage = folio_get_private(folio); struct extent_buffer *eb; unsigned long flags; @@ -1947,7 +1951,7 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc) break; } spin_lock_irqsave(&subpage->lock, flags); - if (!test_bit(bit_start + btrfs_bitmap_nr_dirty * fs_info->sectors_per_page, + if (!test_bit(bit_start + btrfs_bitmap_nr_dirty * blocks_per_folio, subpage->bitmaps)) { spin_unlock_irqrestore(&subpage->lock, flags); spin_unlock(&folio->mapping->i_private_lock); diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h index 374843aca60d8..5c8d6149e1421 100644 --- a/fs/btrfs/fs.h +++ b/fs/btrfs/fs.h @@ -708,7 +708,6 @@ struct btrfs_fs_info { * running. */ refcount_t scrub_workers_refcnt; - u32 sectors_per_page; struct workqueue_struct *scrub_workers;
struct btrfs_discard_ctl discard_ctl; @@ -976,6 +975,12 @@ static inline u32 count_max_extents(const struct btrfs_fs_info *fs_info, u64 siz return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size); }
+static inline unsigned int btrfs_blocks_per_folio(const struct btrfs_fs_info *fs_info, + const struct folio *folio) +{ + return folio_size(folio) >> fs_info->sectorsize_bits; +} + bool btrfs_exclop_start(struct btrfs_fs_info *fs_info, enum btrfs_exclusive_operation type); bool btrfs_exclop_start_try_lock(struct btrfs_fs_info *fs_info, diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c index a1a358addc581..7e5ecc12b732a 100644 --- a/fs/btrfs/subpage.c +++ b/fs/btrfs/subpage.c @@ -93,6 +93,9 @@ int btrfs_attach_subpage(const struct btrfs_fs_info *fs_info, { struct btrfs_subpage *subpage;
+ /* For metadata we don't support large folio yet. */ + ASSERT(!folio_test_large(folio)); + /* * We have cases like a dummy extent buffer page, which is not mapped * and doesn't need to be locked. @@ -134,7 +137,8 @@ struct btrfs_subpage *btrfs_alloc_subpage(const struct btrfs_fs_info *fs_info, ASSERT(fs_info->sectorsize < PAGE_SIZE);
real_size = struct_size(ret, bitmaps, - BITS_TO_LONGS(btrfs_bitmap_nr_max * fs_info->sectors_per_page)); + BITS_TO_LONGS(btrfs_bitmap_nr_max * + (PAGE_SIZE >> fs_info->sectorsize_bits))); ret = kzalloc(real_size, GFP_NOFS); if (!ret) return ERR_PTR(-ENOMEM); @@ -211,11 +215,13 @@ static void btrfs_subpage_assert(const struct btrfs_fs_info *fs_info,
#define subpage_calc_start_bit(fs_info, folio, name, start, len) \ ({ \ - unsigned int __start_bit; \ + unsigned int __start_bit; \ + const unsigned int blocks_per_folio = \ + btrfs_blocks_per_folio(fs_info, folio); \ \ btrfs_subpage_assert(fs_info, folio, start, len); \ __start_bit = offset_in_page(start) >> fs_info->sectorsize_bits; \ - __start_bit += fs_info->sectors_per_page * btrfs_bitmap_nr_##name; \ + __start_bit += blocks_per_folio * btrfs_bitmap_nr_##name; \ __start_bit; \ })
@@ -323,7 +329,8 @@ void btrfs_folio_end_lock_bitmap(const struct btrfs_fs_info *fs_info, struct folio *folio, unsigned long bitmap) { struct btrfs_subpage *subpage = folio_get_private(folio); - const int start_bit = fs_info->sectors_per_page * btrfs_bitmap_nr_locked; + const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio); + const int start_bit = blocks_per_folio * btrfs_bitmap_nr_locked; unsigned long flags; bool last = false; int cleared = 0; @@ -341,7 +348,7 @@ void btrfs_folio_end_lock_bitmap(const struct btrfs_fs_info *fs_info, }
spin_lock_irqsave(&subpage->lock, flags); - for_each_set_bit(bit, &bitmap, fs_info->sectors_per_page) { + for_each_set_bit(bit, &bitmap, blocks_per_folio) { if (test_and_clear_bit(bit + start_bit, subpage->bitmaps)) cleared++; } @@ -352,15 +359,27 @@ void btrfs_folio_end_lock_bitmap(const struct btrfs_fs_info *fs_info, folio_unlock(folio); }
-#define subpage_test_bitmap_all_set(fs_info, subpage, name) \ +#define subpage_test_bitmap_all_set(fs_info, folio, name) \ +({ \ + struct btrfs_subpage *subpage = folio_get_private(folio); \ + const unsigned int blocks_per_folio = \ + btrfs_blocks_per_folio(fs_info, folio); \ + \ bitmap_test_range_all_set(subpage->bitmaps, \ - fs_info->sectors_per_page * btrfs_bitmap_nr_##name, \ - fs_info->sectors_per_page) + blocks_per_folio * btrfs_bitmap_nr_##name, \ + blocks_per_folio); \ +})
-#define subpage_test_bitmap_all_zero(fs_info, subpage, name) \ +#define subpage_test_bitmap_all_zero(fs_info, folio, name) \ +({ \ + struct btrfs_subpage *subpage = folio_get_private(folio); \ + const unsigned int blocks_per_folio = \ + btrfs_blocks_per_folio(fs_info, folio); \ + \ bitmap_test_range_all_zero(subpage->bitmaps, \ - fs_info->sectors_per_page * btrfs_bitmap_nr_##name, \ - fs_info->sectors_per_page) + blocks_per_folio * btrfs_bitmap_nr_##name, \ + blocks_per_folio); \ +})
void btrfs_subpage_set_uptodate(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) @@ -372,7 +391,7 @@ void btrfs_subpage_set_uptodate(const struct btrfs_fs_info *fs_info,
spin_lock_irqsave(&subpage->lock, flags); bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits); - if (subpage_test_bitmap_all_set(fs_info, subpage, uptodate)) + if (subpage_test_bitmap_all_set(fs_info, folio, uptodate)) folio_mark_uptodate(folio); spin_unlock_irqrestore(&subpage->lock, flags); } @@ -426,7 +445,7 @@ bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
spin_lock_irqsave(&subpage->lock, flags); bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits); - if (subpage_test_bitmap_all_zero(fs_info, subpage, dirty)) + if (subpage_test_bitmap_all_zero(fs_info, folio, dirty)) last = true; spin_unlock_irqrestore(&subpage->lock, flags); return last; @@ -484,7 +503,7 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
spin_lock_irqsave(&subpage->lock, flags); bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits); - if (subpage_test_bitmap_all_zero(fs_info, subpage, writeback)) { + if (subpage_test_bitmap_all_zero(fs_info, folio, writeback)) { ASSERT(folio_test_writeback(folio)); folio_end_writeback(folio); } @@ -515,7 +534,7 @@ void btrfs_subpage_clear_ordered(const struct btrfs_fs_info *fs_info,
spin_lock_irqsave(&subpage->lock, flags); bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits); - if (subpage_test_bitmap_all_zero(fs_info, subpage, ordered)) + if (subpage_test_bitmap_all_zero(fs_info, folio, ordered)) folio_clear_ordered(folio); spin_unlock_irqrestore(&subpage->lock, flags); } @@ -530,7 +549,7 @@ void btrfs_subpage_set_checked(const struct btrfs_fs_info *fs_info,
spin_lock_irqsave(&subpage->lock, flags); bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits); - if (subpage_test_bitmap_all_set(fs_info, subpage, checked)) + if (subpage_test_bitmap_all_set(fs_info, folio, checked)) folio_set_checked(folio); spin_unlock_irqrestore(&subpage->lock, flags); } @@ -652,26 +671,29 @@ IMPLEMENT_BTRFS_PAGE_OPS(ordered, folio_set_ordered, folio_clear_ordered, IMPLEMENT_BTRFS_PAGE_OPS(checked, folio_set_checked, folio_clear_checked, folio_test_checked);
-#define GET_SUBPAGE_BITMAP(subpage, fs_info, name, dst)	\
+#define GET_SUBPAGE_BITMAP(fs_info, folio, name, dst)	\
 {	\
-	const int sectors_per_page = fs_info->sectors_per_page;	\
+	const unsigned int blocks_per_folio =	\
+				btrfs_blocks_per_folio(fs_info, folio);	\
+	const struct btrfs_subpage *subpage = folio_get_private(folio);	\
 	\
-	ASSERT(sectors_per_page < BITS_PER_LONG);	\
+	ASSERT(blocks_per_folio < BITS_PER_LONG);	\
 	*dst = bitmap_read(subpage->bitmaps,	\
-			   sectors_per_page * btrfs_bitmap_nr_##name,	\
-			   sectors_per_page);	\
+			   blocks_per_folio * btrfs_bitmap_nr_##name,	\
+			   blocks_per_folio);	\
 }

 #define SUBPAGE_DUMP_BITMAP(fs_info, folio, name, start, len)	\
 {	\
-	const struct btrfs_subpage *subpage = folio_get_private(folio);	\
 	unsigned long bitmap;	\
+	const unsigned int blocks_per_folio =	\
+				btrfs_blocks_per_folio(fs_info, folio);	\
 	\
-	GET_SUBPAGE_BITMAP(subpage, fs_info, name, &bitmap);	\
+	GET_SUBPAGE_BITMAP(fs_info, folio, name, &bitmap);	\
 	btrfs_warn(fs_info,	\
 	"dumpping bitmap start=%llu len=%u folio=%llu " #name "_bitmap=%*pbl", \
 		   start, len, folio_pos(folio),	\
-		   fs_info->sectors_per_page, &bitmap);	\
+		   blocks_per_folio, &bitmap);	\
 }
 /*
@@ -738,7 +760,7 @@ void btrfs_folio_set_lock(const struct btrfs_fs_info *fs_info,
 	}
 	bitmap_set(subpage->bitmaps, start_bit, nbits);
 	ret = atomic_add_return(nbits, &subpage->nr_locked);
-	ASSERT(ret <= fs_info->sectors_per_page);
+	ASSERT(ret <= btrfs_blocks_per_folio(fs_info, folio));
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
@@ -746,7 +768,7 @@ void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
 				      struct folio *folio, u64 start, u32 len)
 {
 	struct btrfs_subpage *subpage;
-	const u32 sectors_per_page = fs_info->sectors_per_page;
+	const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio);
 	unsigned long uptodate_bitmap;
 	unsigned long dirty_bitmap;
 	unsigned long writeback_bitmap;
@@ -756,28 +778,28 @@ void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
 	unsigned long flags;
 
 	ASSERT(folio_test_private(folio) && folio_get_private(folio));
-	ASSERT(sectors_per_page > 1);
+	ASSERT(blocks_per_folio > 1);
 	subpage = folio_get_private(folio);
 
 	spin_lock_irqsave(&subpage->lock, flags);
-	GET_SUBPAGE_BITMAP(subpage, fs_info, uptodate, &uptodate_bitmap);
-	GET_SUBPAGE_BITMAP(subpage, fs_info, dirty, &dirty_bitmap);
-	GET_SUBPAGE_BITMAP(subpage, fs_info, writeback, &writeback_bitmap);
-	GET_SUBPAGE_BITMAP(subpage, fs_info, ordered, &ordered_bitmap);
-	GET_SUBPAGE_BITMAP(subpage, fs_info, checked, &checked_bitmap);
-	GET_SUBPAGE_BITMAP(subpage, fs_info, locked, &locked_bitmap);
+	GET_SUBPAGE_BITMAP(fs_info, folio, uptodate, &uptodate_bitmap);
+	GET_SUBPAGE_BITMAP(fs_info, folio, dirty, &dirty_bitmap);
+	GET_SUBPAGE_BITMAP(fs_info, folio, writeback, &writeback_bitmap);
+	GET_SUBPAGE_BITMAP(fs_info, folio, ordered, &ordered_bitmap);
+	GET_SUBPAGE_BITMAP(fs_info, folio, checked, &checked_bitmap);
+	GET_SUBPAGE_BITMAP(fs_info, folio, locked, &locked_bitmap);
 	spin_unlock_irqrestore(&subpage->lock, flags);
 
 	dump_page(folio_page(folio, 0), "btrfs subpage dump");
 	btrfs_warn(fs_info,
 "start=%llu len=%u page=%llu, bitmaps uptodate=%*pbl dirty=%*pbl locked=%*pbl writeback=%*pbl ordered=%*pbl checked=%*pbl",
 		    start, len, folio_pos(folio),
-		    sectors_per_page, &uptodate_bitmap,
-		    sectors_per_page, &dirty_bitmap,
-		    sectors_per_page, &locked_bitmap,
-		    sectors_per_page, &writeback_bitmap,
-		    sectors_per_page, &ordered_bitmap,
-		    sectors_per_page, &checked_bitmap);
+		    blocks_per_folio, &uptodate_bitmap,
+		    blocks_per_folio, &dirty_bitmap,
+		    blocks_per_folio, &locked_bitmap,
+		    blocks_per_folio, &writeback_bitmap,
+		    blocks_per_folio, &ordered_bitmap,
+		    blocks_per_folio, &checked_bitmap);
 }
 
 void btrfs_get_subpage_dirty_bitmap(struct btrfs_fs_info *fs_info,
@@ -788,10 +810,10 @@ void btrfs_get_subpage_dirty_bitmap(struct btrfs_fs_info *fs_info,
 	unsigned long flags;
 
 	ASSERT(folio_test_private(folio) && folio_get_private(folio));
-	ASSERT(fs_info->sectors_per_page > 1);
+	ASSERT(btrfs_blocks_per_folio(fs_info, folio) > 1);
 	subpage = folio_get_private(folio);
 
 	spin_lock_irqsave(&subpage->lock, flags);
-	GET_SUBPAGE_BITMAP(subpage, fs_info, dirty, ret_bitmap);
+	GET_SUBPAGE_BITMAP(fs_info, folio, dirty, ret_bitmap);
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
From: Filipe Manana <fdmanana@suse.com>
[ Upstream commit 18de34daa7c62c830be533aace6b7c271e8e95cf ]
While running test case btrfs/192 from fstests with support for large folios (needs CONFIG_BTRFS_EXPERIMENTAL=y) I ended up getting very sporadic btrfs check failures reporting that csum items were missing. Looking into the issue it turned out that btrfs check searches for csum items of a file extent item with a range that spans beyond the i_size of a file and we don't have any, because the kernel's writeback code skips submitting bios for ranges beyond eof. It's not expected however to find a file extent item that crosses the rounded up (by the sector size) i_size value, but there is a short time window where we can end up with a transaction commit leaving this small inconsistency between the i_size and the last file extent item.
Example btrfs check output when this happens:
  $ btrfs check /dev/sdc
  Opening filesystem to check...
  Checking filesystem on /dev/sdc
  UUID: 69642c61-5efb-4367-aa31-cdfd4067f713
  [1/8] checking log skipped (none written)
  [2/8] checking root items
  [3/8] checking extents
  [4/8] checking free space tree
  [5/8] checking fs roots
  root 5 inode 332 errors 1000, some csum missing
  ERROR: errors found in fs roots
  (...)
Looking at a tree dump of the fs tree (root 5) for inode 332 we have:
  $ btrfs inspect-internal dump-tree -t 5 /dev/sdc
  (...)
        item 28 key (332 INODE_ITEM 0) itemoff 2006 itemsize 160
                generation 17 transid 19 size 610969 nbytes 86016
                block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0
                sequence 11 flags 0x0(none)
                atime 1759851068.391327881 (2025-10-07 16:31:08)
                ctime 1759851068.410098267 (2025-10-07 16:31:08)
                mtime 1759851068.410098267 (2025-10-07 16:31:08)
                otime 1759851068.391327881 (2025-10-07 16:31:08)
        item 29 key (332 INODE_REF 340) itemoff 1993 itemsize 13
                index 2 namelen 3 name: f1f
        item 30 key (332 EXTENT_DATA 589824) itemoff 1940 itemsize 53
                generation 19 type 1 (regular)
                extent data disk byte 21745664 nr 65536
                extent data offset 0 nr 65536 ram 65536
                extent compression 0 (none)
  (...)
We can see that the file extent item for file offset 589824 has a length of 64K and its number of bytes is 64K. Looking at the inode item we see that its i_size is 610969 bytes which falls within the range of that file extent item [589824, 655360[.
Looking into the csum tree:
  $ btrfs inspect-internal dump-tree /dev/sdc
  (...)
        item 15 key (EXTENT_CSUM EXTENT_CSUM 21565440) itemoff 991 itemsize 200
                range start 21565440 end 21770240 length 204800
        item 16 key (EXTENT_CSUM EXTENT_CSUM 1104576512) itemoff 983 itemsize 8
                range start 1104576512 end 1104584704 length 8192
  (...)
We see that the csum item number 15 covers the first 24K of the file extent item - it ends at offset 21770240 and the extent's disk_bytenr is 21745664, so we have:
21770240 - 21745664 = 24K
We see that the next csum item (number 16) is completely outside the range, so the remaining 40K of the extent doesn't have csum items in the tree.
If we round up the i_size to the sector size, we get:
round_up(610969, 4096) = 614400
If we subtract from that the file offset for the extent item we get:
614400 - 589824 = 24K
So the missing 40K corresponds to the end of the file extent item's range minus the rounded up i_size:
655360 - 614400 = 40K
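The arithmetic above can be reproduced with a small userspace sketch. It is only an illustration: the helper names `round_up_u64()` and `bytes_beyond_eof()` are hypothetical and are not kernel or btrfs-progs functions, and the sector size is assumed to be a power of two (4096 in this example).

```c
#include <stdint.h>

/* Round x up to a multiple of align (align must be a power of two). */
uint64_t round_up_u64(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: how many bytes of the file extent item
 * [file_offset, file_offset + num_bytes) lie beyond i_size rounded up
 * to the sector size. Writeback never submits that part, so it has no
 * csum items.
 */
uint64_t bytes_beyond_eof(uint64_t file_offset, uint64_t num_bytes,
			  uint64_t i_size, uint64_t sectorsize)
{
	const uint64_t extent_end = file_offset + num_bytes;
	const uint64_t eof = round_up_u64(i_size, sectorsize);

	if (extent_end <= eof)
		return 0;
	if (file_offset >= eof)
		return num_bytes;
	return extent_end - eof;
}
```

With the values from the tree dump (file offset 589824, 65536 bytes, i_size 610969, sector size 4096), `bytes_beyond_eof()` yields the 40K that btrfs check finds no checksums for.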
Normally we don't expect a file extent item to span over the rounded up i_size of an inode, since when truncating, doing hole punching and other operations that trim a file extent item, the number of bytes is adjusted.
There is however a short time window where the kernel can end up, temporarily, persisting an inode with an i_size that falls in the middle of the last file extent item while the file extent item has not yet been trimmed (its number of bytes reduced so that it doesn't cross i_size rounded up by the sector size).
The steps (in the kernel) that lead to such scenario are the following:
1) We have inode I as an empty file, no allocated extents, i_size is 0;
2) A buffered write is done for file range [589824, 655360[ (length of 64K) and the i_size is updated to 655360. Note that we got a single large folio for the range (64K);
3) A truncate operation starts that reduces the inode's i_size down to 610969 bytes. The truncate sets the inode's new i_size at btrfs_setsize() by calling truncate_setsize() and before calling btrfs_truncate();
4) At btrfs_truncate() we trigger writeback for the range starting at 610304 (which is the new i_size rounded down to the sector size) and ending at (u64)-1;
5) During the writeback, at extent_write_cache_pages(), we get from the call to filemap_get_folios_tag(), the 64K folio that starts at file offset 589824 since it contains the start offset of the writeback range (610304);
6) At writepage_delalloc() we find the whole range of the folio is dirty and therefore we run delalloc for that 64K range ([589824, 655360[), reserving a 64K extent, creating an ordered extent, etc;
7) At extent_writepage_io() we submit IO only for subrange [589824, 614400[ because the inode's i_size is 610969 bytes (rounded up by sector size is 614400). There, in the while loop we intentionally skip IO beyond i_size to avoid any unnecessary work and just call btrfs_mark_ordered_io_finished() for the range [614400, 655360[ (which has a 40K length);
8) Once the IO finishes we finish the ordered extent by ending up at btrfs_finish_one_ordered(), join transaction N, insert a file extent item in the inode's subvolume tree for file offset 589824 with a number of bytes of 64K, and update the inode's delayed inode item or directly the inode item with a call to btrfs_update_inode_fallback(), which results in storing the new i_size of 610969 bytes;
9) Transaction N is committed, either by the transaction kthread or by some other task (in response to a sync or fsync for example).
At this point we have inode I persisted with an i_size of 610969 bytes and a file extent item that starts at file offset 589824 and has a number of bytes of 64K, ending at an offset of 655360 which is beyond the i_size rounded up to the sector size (614400).
--> So after a crash or power failure here, the btrfs check program reports that error about missing checksum items for this inode, as it tries to look up checksums covering the whole range of the extent;
10) Only after transaction N is committed does the call to btrfs_start_transaction() at btrfs_truncate() start a new transaction, N + 1, instead of joining transaction N. It's with transaction N + 1 that it calls btrfs_truncate_inode_items() which updates the file extent item at file offset 589824 to reduce its number of bytes from 64K down to 24K, so that the file extent item's range ends at the i_size rounded up to the sector size (614400 bytes).
Fix this by truncating the ordered extent at extent_writepage_io() when we skip writeback because the current offset in the folio is beyond i_size. This ensures we don't ever persist a file extent item with a number of bytes beyond the rounded up (by sector size) value of the i_size.
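The effect of truncating the ordered extent can be sketched in userspace as below. This is a simplified model, not the kernel code: `struct oe_model` is a hypothetical stand-in for struct btrfs_ordered_extent, locking is omitted, and the model assumes that an untruncated ordered extent carries a truncated_len of UINT64_MAX and that a truncated one produces a file extent item of truncated_len bytes.

```c
#include <stdint.h>

/* Simplified, hypothetical stand-in for struct btrfs_ordered_extent. */
struct oe_model {
	uint64_t file_offset;
	uint64_t num_bytes;
	uint64_t truncated_len;	/* UINT64_MAX while not truncated (model assumption) */
	int truncated;		/* models the BTRFS_ORDERED_TRUNCATED bit */
};

/*
 * Models what the fix does when the block at @cur is at or beyond i_size:
 * flag the ordered extent as truncated and clamp its length so it ends
 * at @cur. Assumes @cur >= oe->file_offset.
 */
void oe_truncate_at(struct oe_model *oe, uint64_t cur)
{
	const uint64_t new_len = cur - oe->file_offset;

	oe->truncated = 1;
	if (new_len < oe->truncated_len)
		oe->truncated_len = new_len;
}

/* Bytes the file extent item would cover once the ordered extent completes. */
uint64_t oe_file_extent_len(const struct oe_model *oe)
{
	if (oe->truncated && oe->truncated_len < oe->num_bytes)
		return oe->truncated_len;
	return oe->num_bytes;
}
```

For the scenario above, an ordered extent at file offset 589824 with 64K bytes that is truncated at cur = 614400 ends up with a length of 24K, which matches the i_size rounded up to the sector size.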
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: e9e3b22ddfa7 ("btrfs: fix beyond-EOF write handling")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/extent_io.c    | 21 +++++++++++++++++++--
 fs/btrfs/ordered-data.c |  5 +++--
 2 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1ca890f02da91..3658b74a97adb 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1499,13 +1499,13 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	bool submitted_io = false;
 	int found_error = 0;
 	const u64 folio_start = folio_pos(folio);
+	const u64 folio_end = folio_start + folio_size(folio);
 	const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio);
 	u64 cur;
 	int bit;
 	int ret = 0;
 
-	ASSERT(start >= folio_start &&
-	       start + len <= folio_start + folio_size(folio));
+	ASSERT(start >= folio_start && start + len <= folio_end);
 
 	ret = btrfs_writepage_cow_fixup(folio);
 	if (ret) {
@@ -1526,6 +1526,23 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 		cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits);
 
 		if (cur >= i_size) {
+			struct btrfs_ordered_extent *ordered;
+			unsigned long flags;
+
+			ordered = btrfs_lookup_first_ordered_range(inode, cur,
+								   folio_end - cur);
+			/*
+			 * We have just run delalloc before getting here, so
+			 * there must be an ordered extent.
+			 */
+			ASSERT(ordered != NULL);
+			spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
+			ordered->truncated_len = min(ordered->truncated_len,
+						     cur - ordered->file_offset);
+			spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+			btrfs_put_ordered_extent(ordered);
+
 			btrfs_mark_ordered_io_finished(inode, folio, cur,
 						       start + len - cur, true);
 			/*
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 880f9553d79d3..6ac254a529073 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1080,8 +1080,9 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
 	struct rb_node *prev;
 	struct rb_node *next;
 	struct btrfs_ordered_extent *entry = NULL;
+	unsigned long flags;
 
-	spin_lock_irq(&inode->ordered_tree_lock);
+	spin_lock_irqsave(&inode->ordered_tree_lock, flags);
 	node = inode->ordered_tree.rb_node;
 	/*
 	 * Here we don't want to use tree_search() which will use tree->last
@@ -1136,7 +1137,7 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
 		trace_btrfs_ordered_extent_lookup_first_range(inode, entry);
 	}
 
-	spin_unlock_irq(&inode->ordered_tree_lock);
+	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
 	return entry;
 }
From: Filipe Manana <fdmanana@suse.com>
[ Upstream commit 46a23908598f4b8e61483f04ea9f471b2affc58a ]
Instead of repeating the expression "start + len" multiple times, store it in a variable and use it where needed.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: e9e3b22ddfa7 ("btrfs: fix beyond-EOF write handling")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/extent_io.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3658b74a97adb..657c4652f8b48 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1498,6 +1498,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	unsigned long range_bitmap = 0;
 	bool submitted_io = false;
 	int found_error = 0;
+	const u64 end = start + len;
 	const u64 folio_start = folio_pos(folio);
 	const u64 folio_end = folio_start + folio_size(folio);
 	const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio);
@@ -1505,7 +1506,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	int bit;
 	int ret = 0;
 
-	ASSERT(start >= folio_start && start + len <= folio_end);
+	ASSERT(start >= folio_start && end <= folio_end);
 
 	ret = btrfs_writepage_cow_fixup(folio);
 	if (ret) {
@@ -1515,7 +1516,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 		return 1;
 	}
 
-	for (cur = start; cur < start + len; cur += fs_info->sectorsize)
+	for (cur = start; cur < end; cur += fs_info->sectorsize)
 		set_bit((cur - folio_start) >> fs_info->sectorsize_bits, &range_bitmap);
 	bitmap_and(&bio_ctrl->submit_bitmap, &bio_ctrl->submit_bitmap, &range_bitmap,
 		   blocks_per_folio);
@@ -1544,7 +1545,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			btrfs_put_ordered_extent(ordered);
 
 			btrfs_mark_ordered_io_finished(inode, folio, cur,
-						       start + len - cur, true);
+						       end - cur, true);
 			/*
 			 * This range is beyond i_size, thus we don't need to
 			 * bother writing back.
@@ -1553,8 +1554,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			 * writeback the sectors with subpage dirty bits,
 			 * causing writeback without ordered extent.
 			 */
-			btrfs_folio_clear_dirty(fs_info, folio, cur,
-						start + len - cur);
+			btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);
 			break;
 		}
 		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
From: Qu Wenruo <wqu@suse.com>
[ Upstream commit e9e3b22ddfa760762b696ac6417c8d6edd182e49 ]
[BUG] With 64K page size and 4K fs block size, the following write sequence leads to file extent items being inserted without any data checksum:
  mkfs.btrfs -s 4k -f $dev > /dev/null
  mount $dev $mnt
  xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c "pwrite 60k 64K" \
            -c "truncate 16k" $mnt/foobar
  umount $mnt
This results in the following 2 file extent items being inserted (extra trace point added to insert_ordered_extent_file_extent()):
  btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0
  btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384
Note for file offset 60K, we're inserting a file extent without any data checksum.
Also note that range [32K, 36K) didn't reach insert_ordered_extent_file_extent(), which is the correct behavior: that OE is fully truncated and should not result in any file extent.
Although the file extent at 60K will later be dropped by btrfs_truncate(), if the transaction gets committed after the file extent is inserted but before it is dropped, there is a small window where we have a file extent beyond EOF and without any data checksum.
That will cause "btrfs check" to report an error.
[CAUSE] The sequence happens like this:
- Buffered write dirtied the page cache and updated i_size
Now the inode size is 64K, with the following page cache layout:
  0             16K           32K           48K           64K
  |/////////////|             |//|                     |//|
- Truncate the inode to 16K
  Which will trigger writeback through:
  btrfs_setsize()
  |- truncate_setsize()
  |  Now the inode size is set to 16K
  |
  |- btrfs_truncate()
     |- btrfs_wait_ordered_range() for [16K, u64(-1)]
        |- btrfs_fdatawrite_range() for [16K, u64(-1)]
           |- extent_writepage() for folio 0
              |- writepage_delalloc()
              |  Generated OE for [0, 16K), [32K, 36K) and [60K, 64K)
              |
              |- extent_writepage_io()
Then inside extent_writepage_io(), the dirty fs blocks are handled differently:
- Submit write for range [0, 16K)
  As those blocks are still inside the inode size (16K).

- Mark OE [32K, 36K) as truncated
  Since we only call btrfs_lookup_first_ordered_range() once, which returns the first OE after file offset 16K.

- Mark all OEs inside range [16K, 64K) as finished
  Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished.

  For OE [32K, 36K), since it's already marked as truncated and its truncated length is 0, no file extent will be inserted.

  OE [60K, 64K) has never been submitted and thus has no data checksum, yet we insert the file extent as usual. This is the root cause of the file extent at 60K being inserted without any data checksum.

- Clear dirty flags for range [16K, 64K)
  This is done by btrfs_folio_clear_dirty(), which searches for and clears any dirty blocks inside that range.
[FIX] The bug itself was introduced a long time ago, way before subpage and large folio support.
At that time, the fs block size had to match the page size, thus the range [cur, end) was just one fs block.

But later, with subpage and large folios, the same range [cur, end) can contain multiple blocks and ordered extents.

Later, commit 18de34daa7c6 ("btrfs: truncate ordered extent when skipping writeback past i_size") fixed a bug related to subpage/large folios, but it still uses the old range [cur, end), meaning only the first OE will be marked as truncated.
The proper fix here is to make EOF handling block-by-block, not trying to handle the whole range to @end.
This way we always locate and truncate the OE for every dirty block.
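The difference between the range-based and the block-by-block EOF handling can be illustrated with the following userspace sketch. It is a simplified model under stated assumptions, not the kernel implementation: `struct oe_model`, `first_oe_in()` and `writepage_io_model()` are hypothetical names, the folio is assumed to start at file offset 0 and hold at most 32 blocks of 4K, ordered extents are kept in a sorted array, and an untruncated ordered extent is modelled with a truncated_len of UINT64_MAX.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096u

/* Simplified, hypothetical stand-in for an ordered extent (OE). */
struct oe_model {
	uint64_t start;
	uint64_t len;
	uint64_t truncated_len;	/* UINT64_MAX means "not truncated" */
	uint64_t csum_bytes;	/* grows only for blocks that were submitted */
};

/* First OE (array sorted by start) overlapping [start, start + len). */
struct oe_model *first_oe_in(struct oe_model *oes, size_t n,
			     uint64_t start, uint64_t len)
{
	for (size_t i = 0; i < n; i++) {
		if (oes[i].start < start + len &&
		    oes[i].start + oes[i].len > start)
			return &oes[i];
	}
	return NULL;
}

/*
 * Walk the dirty blocks of one folio starting at file offset 0.
 * per_block == 0 models the old handling (one lookup over [cur, folio_end),
 * then stop); per_block != 0 models the fixed handling (one lookup per
 * block, then continue). Returns how many OEs would insert a file extent
 * item without any csum.
 */
int writepage_io_model(struct oe_model *oes, size_t n, uint32_t dirty_bitmap,
		       uint64_t folio_end, uint64_t i_size, int per_block)
{
	int bad = 0;

	for (unsigned int bit = 0; bit < 32; bit++) {
		const uint64_t cur = (uint64_t)bit * BLOCK_SIZE;
		struct oe_model *oe;

		if (cur >= folio_end)
			break;
		if (!(dirty_bitmap & (1u << bit)))
			continue;
		if (cur < i_size) {
			/* Block inside i_size: submitted, so it gets a csum. */
			oe = first_oe_in(oes, n, cur, BLOCK_SIZE);
			if (oe)
				oe->csum_bytes += BLOCK_SIZE;
			continue;
		}
		/* Block beyond i_size: truncate the OE found by the lookup. */
		oe = first_oe_in(oes, n, cur,
				 per_block ? BLOCK_SIZE : folio_end - cur);
		if (oe && oe->start <= cur &&
		    cur - oe->start < oe->truncated_len)
			oe->truncated_len = cur - oe->start;
		if (!per_block)
			break;
	}

	/* Every OE is eventually finished; check what it would insert. */
	for (size_t i = 0; i < n; i++) {
		const uint64_t extent_len = oes[i].truncated_len < oes[i].len ?
					    oes[i].truncated_len : oes[i].len;

		if (extent_len > 0 && oes[i].csum_bytes == 0)
			bad++;
	}
	return bad;
}
```

Feeding the model the layout from the [CAUSE] section (OEs at [0, 16K), [32K, 36K) and [60K, 64K), i_size of 16K), the old handling leaves the OE at 60K untruncated and without checksums, while the per-block handling truncates both OEs beyond EOF to a length of 0.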
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/extent_io.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 657c4652f8b48..1e855c5854ce5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1531,7 +1531,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			unsigned long flags;
 
 			ordered = btrfs_lookup_first_ordered_range(inode, cur,
-								   folio_end - cur);
+								   fs_info->sectorsize);
 			/*
 			 * We have just run delalloc before getting here, so
 			 * there must be an ordered extent.
@@ -1545,7 +1545,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			btrfs_put_ordered_extent(ordered);
 
 			btrfs_mark_ordered_io_finished(inode, folio, cur,
-						       end - cur, true);
+						       fs_info->sectorsize, true);
 			/*
 			 * This range is beyond i_size, thus we don't need to
 			 * bother writing back.
@@ -1554,8 +1554,8 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			 * writeback the sectors with subpage dirty bits,
 			 * causing writeback without ordered extent.
 			 */
-			btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);
-			break;
+			btrfs_folio_clear_dirty(fs_info, folio, cur, fs_info->sectorsize);
+			continue;
 		}
 		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
 		if (unlikely(ret < 0)) {