在 2025/12/11 21:56, Filipe Manana 写道:
On Thu, Dec 11, 2025 at 2:15 AM Qu Wenruo wqu@suse.com wrote:
[BUG] For the following write sequence with 64K page size and 4K fs block size, it will lead to file extent items to be inserted without any data checksum:
mkfs.btrfs -s 4k -f $dev > /dev/null mount $dev $mnt xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c pwrite "60k 64K" \ -c "truncate 16k" $mnt/foobar umount $mnt
If you can, add this to fstests.
The problem is, it's very hard to cause any observable result.
For most if not all cases, the file extent item insertion and truncation happen in the same transaction thus no error can be detected.
We need either a way to force all btrfs_end_transaction() to commit the trans, or very heavy workload to trigger enough transactions, then along with log-replay to detect such problem.
Thanks, Qu
This will result the following 2 file extent items to be inserted (extra trace point added to insert_ordered_extent_file_extent()):
btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0 btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384Note for file offset 60K, we're inserting an file extent without any
an file -> a file
data checksum.
Also note that range [32K, 36K) didn't reach insert_ordered_extent_file_extent(), which is the correct behavior as that OE is fully truncated, should not result any file extent.
Although file extent at 60K will be later dropped by btrfs_truncate(), if the transaction got committed after file extent inserted but before the file extent dropping, we will have a small window where we have a file extent beyond EOF and without any data checksum.
That will cause "btrfs check" to report error.
[CAUSE] The sequence happens like this:
Buffered write dirtied the page cache and updated isize
Now the inode size is 64K, with the following page cache layout:
0 16K 32K 48K 64K |/////////////| |//| |//|
Truncate the inode to 16K Which will trigger writeback through:
btrfs_setsize() |- truncate_setsize() | Now the inode size is set to 16K | |- btrfs_truncacte()
truncacte -> truncate
It looks good otherwise, so:
Reviewed-by: Filipe Manana fdmanana@suse.com
Thanks.
|- btrfs_wait_ordered_range() for [16K, u64(-1)] |- btrfs_fdatawrite_range() for [16K, u64(-1)} |- extent_writepage() for folio 0 |- writepage_delalloc() | Generated OE for [0, 16K), [32K, 36K] and [60K, 64K) | |- extent_writepage_io()Then inside extent_writepage_io(), the dirty fs blocks are handled differently:
Submit write for range [0, 16K) As they are still inside the inode size (16K).
Mark OE [32K, 36K) as truncated Since we only call btrfs_lookup_first_ordered_range() once, which returned the first OE after file offset 16K.
Mark all OEs inside range [16K, 64K) as finished Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished.
For OE [32K, 36K) since it's already marked as truncated, and its truncated length is 0, no file extent will be inserted.
For OE [60K, 64K) it has never been submitted thus has no data checksum, and we insert the file extent as usual. This is the root cause of file extent at 60K to be inserted without any data checksum.
Clear dirty flags for range [16K, 64K) It is the function btrfs_folio_clear_dirty() which search and clear any dirty blocks inside that range.
[FIX] The bug itself is introduced a long time ago, way before subpage and larger folio support.
At that time, fs block size must match page size, thus the range [cur, end) is just one fs block.
But later with subpage and larger folios, the same range [cur, end) can have multiple blocks and ordered extents.
Later commit 18de34daa7c6 ("btrfs: truncate ordered extent when skipping writeback past i_size") is fixing a bug related to subpage/larger folios, but it's still utilizing the old range [cur, end), meaning only the first OE will be marked as truncated.
The proper fix here is to make EOF handling block-by-block, not trying to handle the whole range to @end.
By this we always locate and truncate the OE for every dirty block.
Cc: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo wqu@suse.com
fs/btrfs/extent_io.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 629fd5af4286..a4b74023618d 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1728,7 +1728,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, struct btrfs_ordered_extent *ordered;
ordered = btrfs_lookup_first_ordered_range(inode, cur,
folio_end - cur);
fs_info->sectorsize); /* * We have just run delalloc before getting here, so * there must be an ordered extent.@@ -1742,7 +1742,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, btrfs_put_ordered_extent(ordered);
btrfs_mark_ordered_io_finished(inode, folio, cur,
end - cur, true);
fs_info->sectorsize, true); /* * This range is beyond i_size, thus we don't need to * bother writing back.@@ -1751,8 +1751,8 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, * writeback the sectors with subpage dirty bits, * causing writeback without ordered extent. */
btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);break;
btrfs_folio_clear_dirty(fs_info, folio, cur, fs_info->sectorsize);continue; } ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size); if (unlikely(ret < 0)) {-- 2.52.0