[BUG] During development of a minor feature (make sure all btrfs_bio::end_io() is called in task context), I noticed a crash in generic/388, where metadata writes triggered new works after btrfs_stop_all_workers().
It turns out that it can even happen without any code modification, just using RAID5 for metadata and the same workload from generic/388 is going to trigger the use-after-free.
[CAUSE] If btrfs hits an error, the fs is marked as error, no new transaction is allowed thus metadata is in a frozen state.
But there are some metadata modifications before that error, and they are still in the btree inode page cache.
Since there will be no real transaction commit, all those dirty folios are just kept as is in the page cache, and they can not be invalidated by invalidate_inode_pages2() call inside close_ctree(), because they are dirty.
And finally after btrfs_stop_all_workers(), we call iput() on btree inode, which triggers writeback of those dirty metadata.
And if the fs is using RAID56 metadata, this will trigger RMW and queue new works into rmw_workers, which is already stopped, causing warning from queue_work() and use-after-free.
[FIX] Add a special handling for write_one_eb(), that if the fs is already in an error state, immediately mark the bbio as failure, instead of really submitting them.
Then during close_ctree(), iput() will just discard all those dirty tree blocks without really writing them back, thus no more new jobs for already stopped-and-freed workqueues.
The extra discard in write_one_eb() also acts as an extra safenet. E.g. the transaction abort is triggered by some extent/free space tree corruptions, and since extent/free space tree is already corrupted some tree blocks may be allocated where they shouldn't be (overwriting existing tree blocks). In that case writing them back will further corrupting the fs.
CC: stable@vger.kernel.org #6.6 Signed-off-by: Qu Wenruo wqu@suse.com --- Changelog v2: - Various grammar and newline fixes
- A shorter title line
- Enhance the [FIX] part, explain the full fix
- Limit the backport for 6.6 v6.1 code base is very different compared to the current one, thus backporting to v6.6 would be the limit.
- Explain more why discarding bios at write_one_eb() is safer
- Remove the extra flushing part inside close_ctree() There is no difference flushing the dirty folios manually or by iput(), as dirty folios are discarded anyway, no new job will be created. --- fs/btrfs/extent_io.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 870584dde575..8f6b8baba003 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2246,6 +2246,14 @@ static noinline_for_stack void write_one_eb(struct extent_buffer *eb, wbc_account_cgroup_owner(wbc, folio, range_len); folio_unlock(folio); } + /* + * If the fs is already in error status, do not submit any writeback + * but immediately finish it. + */ + if (unlikely(BTRFS_FS_ERROR(fs_info))) { + btrfs_bio_end_io(bbio, errno_to_blk_status(-EROFS)); + return; + } btrfs_submit_bbio(bbio, 0); }
On Fri, Oct 24, 2025 at 12:42 AM Qu Wenruo wqu@suse.com wrote:
[BUG] During development of a minor feature (make sure all btrfs_bio::end_io() is called in task context), I noticed a crash in generic/388, where metadata writes triggered new works after btrfs_stop_all_workers().
It turns out that it can even happen without any code modification, just using RAID5 for metadata and the same workload from generic/388 is going to trigger the use-after-free.
[CAUSE] If btrfs hits an error, the fs is marked as error, no new transaction is allowed thus metadata is in a frozen state.
But there are some metadata modifications before that error, and they are still in the btree inode page cache.
Since there will be no real transaction commit, all those dirty folios are just kept as is in the page cache, and they can not be invalidated by invalidate_inode_pages2() call inside close_ctree(), because they are dirty.
And finally after btrfs_stop_all_workers(), we call iput() on btree inode, which triggers writeback of those dirty metadata.
And if the fs is using RAID56 metadata, this will trigger RMW and queue new works into rmw_workers, which is already stopped, causing warning from queue_work() and use-after-free.
[FIX] Add a special handling for write_one_eb(), that if the fs is already in an error state, immediately mark the bbio as failure, instead of really submitting them.
Then during close_ctree(), iput() will just discard all those dirty tree blocks without really writing them back, thus no more new jobs for already stopped-and-freed workqueues.
The extra discard in write_one_eb() also acts as an extra safenet. E.g. the transaction abort is triggered by some extent/free space tree corruptions, and since extent/free space tree is already corrupted some tree blocks may be allocated where they shouldn't be (overwriting existing tree blocks). In that case writing them back will further corrupting the fs.
CC: stable@vger.kernel.org #6.6
The correct syntax is:
stable@vger.kernel.org # 6.6+
Signed-off-by: Qu Wenruo wqu@suse.com
Changelog v2:
Various grammar and newline fixes
A shorter title line
Enhance the [FIX] part, explain the full fix
Limit the backport for 6.6 v6.1 code base is very different compared to the current one, thus backporting to v6.6 would be the limit.
Explain more why discarding bios at write_one_eb() is safer
Remove the extra flushing part inside close_ctree() There is no difference flushing the dirty folios manually or by iput(), as dirty folios are discarded anyway, no new job will be created.
fs/btrfs/extent_io.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 870584dde575..8f6b8baba003 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2246,6 +2246,14 @@ static noinline_for_stack void write_one_eb(struct extent_buffer *eb, wbc_account_cgroup_owner(wbc, folio, range_len); folio_unlock(folio); }
/** If the fs is already in error status, do not submit any writeback* but immediately finish it.*/if (unlikely(BTRFS_FS_ERROR(fs_info))) {btrfs_bio_end_io(bbio, errno_to_blk_status(-EROFS));
Btw, BTRFS_FS_ERROR() returns the error that has caused the transaction to abort. So instead of -EROFS we can pass BTRFS_FS_ERROR(fs_info).
In the end it doesn't make any difference since the error is not returned to userspace.
Reviewed-by: Filipe Manana fdmanana@suse.com
return;} btrfs_submit_bbio(bbio, 0);}
-- 2.51.0
linux-stable-mirror@lists.linaro.org