Re: [PATCH 1/2] btrfs: fix beyond-EOF write handling

11 Dec 2025

在 2025/12/11 21:56, Filipe Manana 写道:
...
On Thu, Dec 11, 2025 at 2:15 AM Qu Wenruo wqu@suse.com wrote:
...
[BUG]
For the following write sequence with 64K page size and 4K fs block size,
it will lead to file extent items to be inserted without any data
checksum:
mkfs.btrfs -s 4k -f $dev > /dev/null
   mount $dev $mnt
   xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c pwrite "60k 64K" \
             -c "truncate 16k" $mnt/foobar
   umount $mnt
If you can, add this to fstests.
The problem is, it's very hard to cause any observable result.
For most if not all cases, the file extent item insertion and truncation 
happen in the same transaction thus no error can be detected.
We need either a way to force all btrfs_end_transaction() to commit the 
trans, or very heavy workload to trigger enough transactions, then along 
with log-replay to detect such problem.
Thanks,
Qu
...
...
This will result the following 2 file extent items to be inserted (extra
trace point added to insert_ordered_extent_file_extent()):
btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0
btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384


Note for file offset 60K, we're inserting an file extent without any
an file -> a file
...
data checksum.
Also note that range [32K, 36K) didn't reach
insert_ordered_extent_file_extent(), which is the correct behavior as
that OE is fully truncated, should not result any file extent.
Although file extent at 60K will be later dropped by btrfs_truncate(),
if the transaction got committed after file extent inserted but before
the file extent dropping, we will have a small window where we have a
file extent beyond EOF and without any data checksum.
That will cause "btrfs check" to report error.
[CAUSE]
The sequence happens like this:

Buffered write dirtied the page cache and updated isize
Now the inode size is 64K, with the following page cache layout:
0             16K             32K              48K           64K
 |/////////////|               |//|                        |//|

Truncate the inode to 16K
 Which will trigger writeback through:
btrfs_setsize()
 |- truncate_setsize()
 |  Now the inode size is set to 16K
 |
 |- btrfs_truncacte()


truncacte -> truncate
It looks good otherwise, so:
Reviewed-by: Filipe Manana fdmanana@suse.com
Thanks.
...
  |- btrfs_wait_ordered_range() for [16K, u64(-1)]
     |- btrfs_fdatawrite_range() for [16K, u64(-1)}
        |- extent_writepage() for folio 0
           |- writepage_delalloc()
           |  Generated OE for [0, 16K), [32K, 36K] and [60K, 64K)
           |
           |- extent_writepage_io()


Then inside extent_writepage_io(), the dirty fs blocks are handled
   differently:

Submit write for range [0, 16K)
As they are still inside the inode size (16K).

Mark OE [32K, 36K) as truncated
Since we only call btrfs_lookup_first_ordered_range() once, which
returned the first OE after file offset 16K.

Mark all OEs inside range [16K, 64K) as finished
Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished.
For OE [32K, 36K) since it's already marked as truncated, and its
truncated length is 0, no file extent will be inserted.
For OE [60K, 64K) it has never been submitted thus has no data
checksum, and we insert the file extent as usual.
This is the root cause of file extent at 60K to be inserted without
any data checksum.

Clear dirty flags for range [16K, 64K)
It is the function btrfs_folio_clear_dirty() which search and clear
any dirty blocks inside that range.


[FIX]
The bug itself is introduced a long time ago, way before subpage and
larger folio support.
At that time, fs block size must match page size, thus the range
[cur, end) is just one fs block.
But later with subpage and larger folios, the same range [cur, end)
can have multiple blocks and ordered extents.
Later commit 18de34daa7c6 ("btrfs: truncate ordered extent when skipping
writeback past i_size") is fixing a bug related to subpage/larger
folios, but it's still utilizing the old range [cur, end), meaning only
the first OE will be marked as truncated.
The proper fix here is to make EOF handling block-by-block, not trying
to handle the whole range to @end.
By this we always locate and truncate the OE for every dirty block.
Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo wqu@suse.com

fs/btrfs/extent_io.c | 8 ++++----
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 629fd5af4286..a4b74023618d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1728,7 +1728,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
                         struct btrfs_ordered_extent *ordered;
                     ordered = btrfs_lookup_first_ordered_range(inode, cur,


                                                             folio_end - cur);




                                                             fs_info->sectorsize);
                   /*
                    * We have just run delalloc before getting here, so
                    * there must be an ordered extent.



@@ -1742,7 +1742,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
                         btrfs_put_ordered_extent(ordered);
                     btrfs_mark_ordered_io_finished(inode, folio, cur,


                                                 end - cur, true);




                                                 fs_info->sectorsize, true);
                   /*
                    * This range is beyond i_size, thus we don't need to
                    * bother writing back.



@@ -1751,8 +1751,8 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
                          * writeback the sectors with subpage dirty bits,
                          * causing writeback without ordered extent.
                          */

                  btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);


                  break;




                  btrfs_folio_clear_dirty(fs_info, folio, cur, fs_info->sectorsize);


                  continue;
           }
           ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
           if (unlikely(ret < 0)) {



--
2.52.0

    

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH 1/2] btrfs: fix beyond-EOF write handling