[BUG] During development of a minor feature (make sure all btrfs_bio::end_io() is called in task context), I noticed a crash in generic/388, where metadata writes triggered new works after btrfs_stop_all_workers().
It turns out that it can even happen without any code modification: just using RAID5 for metadata with the same workload from generic/388 will trigger the use-after-free.
[CAUSE] If btrfs hits an error, the fs is marked as error and no new transaction is allowed, thus metadata is in a frozen state.
But there are some metadata modifications before that error, and they are still in the btree inode page cache.
Since there will be no real transaction commit, all those dirty folios are just kept as-is in the page cache, and they cannot be invalidated by the invalidate_inode_pages2() call inside close_ctree(), because they are dirty.
And finally, after btrfs_stop_all_workers(), we call iput() on the btree inode, which triggers writeback of that dirty metadata.
And if the fs is using RAID56 metadata, this will trigger RMW and queue new work into rmw_workers, which is already stopped, causing a warning from queue_work() and a use-after-free.
[FIX] Add special handling to write_one_eb(): if the fs is already in an error state, immediately mark the bbio as failed instead of actually submitting it to the device.
This means for test cases like generic/388, at iput() time those dirty folios will just be discarded without triggering IO.
CC: stable@vger.kernel.org
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c   | 12 ++++++++++++
 fs/btrfs/extent_io.c | 11 +++++++++++
 2 files changed, 23 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0aa7e5d1b05f..8b0fc2df85f1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4402,11 +4402,23 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
btrfs_put_block_group_cache(fs_info);
+	/*
+	 * If the fs is already in trans aborted case, also trigger writeback of all
+	 * dirty metadata folios.
+	 * Those folios will not reach disk but are discarded directly.
+	 * This is to make sure no dirty folios before iput(), or iput() will
+	 * trigger writeback again, and may even cause extra works queued
+	 * into workqueue.
+	 */
+	if (unlikely(BTRFS_FS_ERROR(fs_info)))
+		filemap_write_and_wait(fs_info->btree_inode->i_mapping);
+
 	/*
 	 * we must make sure there is not any read request to
 	 * submit after we stopping all workers.
 	 */
 	invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
+
 	btrfs_stop_all_workers(fs_info);

 	/* We shouldn't have any transaction open at this point */

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 870584dde575..a8a53409bb3f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2246,6 +2246,17 @@ static noinline_for_stack void write_one_eb(struct extent_buffer *eb,
 		wbc_account_cgroup_owner(wbc, folio, range_len);
 		folio_unlock(folio);
 	}
+	/*
+	 * If the fs is already in error status, do not submit any writeback
+	 * but immediately finish it.
+	 * This is to avoid iput() triggering dirty folio writeback for
+	 * transaction aborted fses, which can cause extra works into
+	 * already stopped workqueues.
+	 */
+	if (unlikely(BTRFS_FS_ERROR(fs_info))) {
+		btrfs_bio_end_io(bbio, errno_to_blk_status(-EROFS));
+		return;
+	}
 	btrfs_submit_bbio(bbio, 0);
 }
On Thu, Oct 23, 2025 at 10:33 AM Qu Wenruo wqu@suse.com wrote:
[BUG] During development of a minor feature (make sure all btrfs_bio::end_io() is called in task context), I noticed a crash in generic/388, where metadata writes triggered new works after btrfs_stop_all_workers().
It turns out that it can even happen without any code modification, just using RAID5 for metadata and the same workload from generic/388 is going to trigger the use-after-free.
[CAUSE] If btrfs hits an error, the fs is marked as error, no new transaction is allowed thus metadata is in a frozen state.
But there are some metadata modification before that error, and they are
modification -> modifications
still in the btree inode page cache.
Since there will be no real transaction commitment, all those dirty
commitment -> commit
folios are just kept as is in the page cache, and they can not be invalidated by invalidate_inode_pages2() call inside close_ctree(), because they are dirty.
And finally after btrfs_stop_all_workers(), we call iput() on btree inode, which triggers writeback of those dirty metadata.
And if the fs is using RAID56 metadata, this will trigger RMW and queue new works into rmw_workers, which is already stopped, causing warning from queue_work() and use-after-free.
[FIX] Add a special handling for write_one_eb(), that if the fs is already in an error state immediately mark the bbio as failure, instead of really submitting them into the device.
Ok, so you are describing half of the fix here, but no mention of the flushing done in close_ctree().
This means for test case like generic/388, at iput() those dirty folios will just be discarded without triggering IO.
CC: stable@vger.kernel.org
Can we get a minimum version here or a Fixes tag?
Signed-off-by: Qu Wenruo wqu@suse.com
 fs/btrfs/disk-io.c   | 12 ++++++++++++
 fs/btrfs/extent_io.c | 11 +++++++++++
 2 files changed, 23 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0aa7e5d1b05f..8b0fc2df85f1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4402,11 +4402,23 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
btrfs_put_block_group_cache(fs_info);
+	/*
+	 * If the fs is already in trans aborted case, also trigger writeback of all
+	 * dirty metadata folios.
+	 * Those folios will not reach disk but dicarded directly.
dicarded -> discarded
+	 * This is to make sure no dirty folios before iput(), or iput() will
+	 * trigger writeback again, and may even cause extra works queued
+	 * into workqueue.
A bit convoluted and confusing. How about:
"This is to make sure the final iput on the btree inode, done after stopping and freeing all workqueues, does not trigger writeback of dirty extent buffers and attempt to queue jobs, which would result in a use-after-free against a workqueue."
+	 */
+	if (unlikely(BTRFS_FS_ERROR(fs_info)))
+		filemap_write_and_wait(fs_info->btree_inode->i_mapping);
So two suggestions:
1) Move this into btrfs_error_commit_super(), to have all code related to fs in error state in one single place.
2) Instead of calling filemap_write_and_wait(), make this simpler by doing the iput() of the btree inode right before calling btrfs_stop_all_workers() and removing the call to invalidate_inode_pages2(), which is irrelevant since the final iput() removes everything from the page cache except dirty pages (but the iput() already triggered writeback of them).
In fact for this scenario the call to invalidate_inode_pages2() must be returning -EBUSY due to the dirty pages, but we have always ignored its return value.
From a quick glance, it seems to me that suggestion 2 should work.
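For illustration, suggestion 2 could look roughly like the following inside close_ctree() (a hypothetical sketch, not a tested patch — the exact placement, and whether fs_info->btree_inode needs to be cleared for later code, are assumptions):

```c
	/*
	 * Sketch (suggestion 2): do the final iput() on the btree inode while
	 * the workqueues are still alive.  The final iput() writes back any
	 * remaining dirty extent buffers and then drops all folios from the
	 * page cache, so the invalidate_inode_pages2() call becomes redundant.
	 */
	iput(fs_info->btree_inode);
	fs_info->btree_inode = NULL;

	btrfs_stop_all_workers(fs_info);
```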
 	/*
 	 * we must make sure there is not any read request to
 	 * submit after we stopping all workers.
 	 */
 	invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
+
Stray new line here.
 	btrfs_stop_all_workers(fs_info);

 	/* We shouldn't have any transaction open at this point */

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 870584dde575..a8a53409bb3f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2246,6 +2246,17 @@ static noinline_for_stack void write_one_eb(struct extent_buffer *eb,
 		wbc_account_cgroup_owner(wbc, folio, range_len);
 		folio_unlock(folio);
 	}
+	/*
+	 * If the fs is already in error status, do not submit any writeback
+	 * but immediately finish it.
+	 * This is to avoid iput() triggering dirty folio writeback for
+	 * transaction aborted fses, which can cause extra works into
+	 * already stopped workqueues.
Confusing as before. Better say it's to avoid a use-after-free on the workqueues, triggered by the final iput() of the btree inode, which happens after stopping and freeing all workqueues.
The subject is also kind of too long. Something much shorter which says the same:
"btrfs: ensure no dirty metadata before stopping work queues during unmount"
Thanks.
+	 */
+	if (unlikely(BTRFS_FS_ERROR(fs_info))) {
+		btrfs_bio_end_io(bbio, errno_to_blk_status(-EROFS));
+		return;
+	}
 	btrfs_submit_bbio(bbio, 0);
 }
-- 
2.51.0
On 2025/10/24 01:56, Filipe Manana wrote:
On Thu, Oct 23, 2025 at 10:33 AM Qu Wenruo wqu@suse.com wrote:
[...]
So two suggestions:
- Move this into btrfs_error_commit_super(), to have all code related
to fs in error state in one single place.
- Instead of calling filemap_write_and_wait(), make this simpler
by doing the iput() of the btree inode right before calling btrfs_stop_all_workers() and removing the call to invalidate_inode_pages2() which is irrelevant since the final iput() removes everything from the page cache except dirty pages (but the iput() already triggered writeback of them).
In fact for this scenario the call to invalidate_inode_pages2() must be returning -EBUSY due to the dirty pages, but we have always ignored its return value.
From a quick glance, it seems to me that suggestion 2 should work.
Yes, that's the original workaround I went with; the problem is we're still submitting metadata writes after a trans abort.
I don't feel comfortable writing back metadata in that situation. Maybe the trans abort was triggered by a corrupted extent/free space tree which allows us to allocate new tree blocks where we shouldn't (aka, metadata COW is already broken).
Thus I consider delaying btrfs_stop_all_workers() until after the iput() to be only a workaround; it still allows us to submit unnecessary writes.
I'd prefer solution 1) in this case, still with the extra handling in write_one_eb().
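A rough sketch of what solution 1 might look like (assuming btrfs_error_commit_super() is reached during close_ctree() for an aborted fs; the function body shown here is illustrative, not the real one):

```c
/* Sketch only: flush, and thereby discard, dirty metadata of an aborted fs. */
static void btrfs_error_commit_super(struct btrfs_fs_info *fs_info)
{
	/* ... existing error-state cleanup ... */

	/*
	 * Write back all dirty btree folios.  With write_one_eb() failing
	 * the bbios early under BTRFS_FS_ERROR, nothing actually reaches
	 * the disk; the folios are simply cleaned, so the final iput() has
	 * no dirty metadata left and cannot queue work on stopped
	 * workqueues.
	 */
	filemap_write_and_wait(fs_info->btree_inode->i_mapping);
}
```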
Thanks for the review and suggestion, will follow the advice of the remaining part.
Thanks, Qu
On Thu, Oct 23, 2025 at 9:44 PM Qu Wenruo wqu@suse.com wrote:
On 2025/10/24 01:56, Filipe Manana wrote:
On Thu, Oct 23, 2025 at 10:33 AM Qu Wenruo wqu@suse.com wrote:
[...]
So two suggestions:
- Move this into btrfs_error_commit_super(), to have all code related
to fs in error state in one single place.
- Instead of calling filemap_write_and_wait(), make this simpler
by doing the iput() of the btree inode right before calling btrfs_stop_all_workers() and removing the call to invalidate_inode_pages2() which is irrelevant since the final iput() removes everything from the page cache except dirty pages (but the iput() already triggered writeback of them).
In fact for this scenario the call to invalidate_inode_pages2() must be returning -EBUSY due to the dirty pages, but we have always ignored its return value.
From a quick glance, it seems to me that suggestion 2 should work.
Yes, that's the original workaround I went with; the problem is we're still submitting metadata writes after a trans abort.
I don't feel comfortable writing back metadata in that situation. Maybe the trans abort was triggered by a corrupted extent/free space tree which allows us to allocate new tree blocks where we shouldn't (aka, metadata COW is already broken).
Metadata COW is broken why??
Even after a transaction aborts, it's ok, but pointless, to allocate extents and trigger writeback for them. As long as we don't allow the transaction to be committed and new transactions to be started, we are safe - in fact that's the only thing a transaction abort guarantees.
We may have many places that could check if a transaction was aborted and avoid extent allocation and writeback, but that's ok, as long as we don't allow transaction commits.
Thus I consider delaying btrfs_stop_all_workers() until iput() is only a workaround, it still allows us to submit unnecessary writes.
I'd prefer the solution 1) in this case, still with the extra handling in write_one_eb().
Thanks for the review and suggestion, will follow the advice of the remaining part.
Thanks, Qu
On 2025/10/24 18:37, Filipe Manana wrote:
On Thu, Oct 23, 2025 at 9:44 PM Qu Wenruo wqu@suse.com wrote:
On 2025/10/24 01:56, Filipe Manana wrote:
On Thu, Oct 23, 2025 at 10:33 AM Qu Wenruo wqu@suse.com wrote:
[...]
So two suggestions:
- Move this into btrfs_error_commit_super(), to have all code related
to fs in error state in one single place.
- Instead of calling filemap_write_and_wait(), make this simpler
by doing the iput() of the btree inode right before calling btrfs_stop_all_workers() and removing the call to invalidate_inode_pages2() which is irrelevant since the final iput() removes everything from the page cache except dirty pages (but the iput() already triggered writeback of them).
In fact for this scenario the call to invalidate_inode_pages2() must be returning -EBUSY due to the dirty pages, but we have always ignored its return value.
From a quick glance, it seems to me that suggestion 2 should work.
Yes, that's the original workaround I went with; the problem is we're still submitting metadata writes after a trans abort.
I don't feel comfortable writing back metadata in that situation. Maybe the trans abort was triggered by a corrupted extent/free space tree which allows us to allocate new tree blocks where we shouldn't (aka, metadata COW is already broken).
Metadata COW is broken why??
Consider this situation: there is no free space cache, and the extent tree somehow lacks the backref item for tree block A.
Then a new transaction is started, and the allocator chooses the bytenr of tree block A for a new tree block B.
At this stage, tree block B will overwrite tree block A, but there is no writeback yet, and at this time the transaction is not yet aborted.
Then somehow tree block A gets its reference dropped by one (e.g. subvolume deletion), but the extent tree is corrupted and fails to find the backref item, and the transaction is aborted.
Even after a transaction aborts, it's ok, but pointless, to allocate extents and trigger writeback for them.
Writeback can be triggered for a lot of other reasons, e.g. memory pressure, and in that case if we try to write back tree block B, it will overwrite tree block A even after the transaction abort.
Not to mention the final iput() in close_ctree().
Thanks, Qu
As long as we don't allow the transaction to be committed and new transactions to be started, we are safe - in fact that's the only thing a transaction abort guarantees.
We may have many places that could check if a transaction was aborted and avoid extent allocation and writeback, but that's ok, as long as we don't allow transaction commits.
Thus I consider delaying btrfs_stop_all_workers() until iput() is only a workaround, it still allows us to submit unnecessary writes.
I'd prefer the solution 1) in this case, still with the extra handling in write_one_eb().
Thanks for the review and suggestion, will follow the advice of the remaining part.
Thanks, Qu
On Fri, Oct 24, 2025 at 9:37 AM Qu Wenruo quwenruo.btrfs@gmx.com wrote:
On 2025/10/24 18:37, Filipe Manana wrote:
On Thu, Oct 23, 2025 at 9:44 PM Qu Wenruo wqu@suse.com wrote:
On 2025/10/24 01:56, Filipe Manana wrote:
On Thu, Oct 23, 2025 at 10:33 AM Qu Wenruo wqu@suse.com wrote:
[...]
So two suggestions:
- Move this into btrfs_error_commit_super(), to have all code related
to fs in error state in one single place.
- Instead of calling filemap_write_and_wait(), make this simpler
by doing the iput() of the btree inode right before calling btrfs_stop_all_workers() and removing the call to invalidate_inode_pages2() which is irrelevant since the final iput() removes everything from the page cache except dirty pages (but the iput() already triggered writeback of them).
In fact for this scenario the call to invalidate_inode_pages2() must be returning -EBUSY due to the dirty pages, but we have always ignored its return value.
From a quick glance, it seems to me that suggestion 2 should work.
Yes, that's the original workaround I went with; the problem is we're still submitting metadata writes after a trans abort.
I don't feel comfortable writing back metadata in that situation. Maybe the trans abort was triggered by a corrupted extent/free space tree which allows us to allocate new tree blocks where we shouldn't (aka, metadata COW is already broken).
Metadata COW is broken why??
Consider this situation: there is no free space cache, and the extent tree somehow lacks the backref item for tree block A.
If there's no backref in the extent tree and no delayed ref for tree block A then we already have a corrupted fs.
Then a new transaction is started, and the allocator chooses the bytenr of tree block A for a new tree block B.
At this stage, tree block B will overwrite tree block A, but there is no writeback yet, and at this time the transaction is not yet aborted.
Then somehow tree block A gets its reference dropped by one (e.g. subvolume deletion), but the extent tree is corrupted and fails to find the backref item, and the transaction is aborted.
Even after a transaction aborts, it's ok, but pointless, to allocate extents and trigger writeback for them.
Writeback can be triggered for a lot of other reasons, e.g. memory pressure, and in that case if we try to write back tree block B, it will overwrite tree block A even after the transaction abort.
It doesn't matter why and when the writeback is triggered; that was not my point. My point was that if a transaction was aborted, what matters is that we can't commit that transaction or start a new one, and can't overwrite existing extents (either they are used or pinned).
Not to mention the final iput() in close_ctree().
By the time of the final iput() we are not supposed to have dirty metadata, as we have already committed any open transaction. The exception here was the transaction abort case, but other than the use-after-free, it can't cause any damage to the fs.
Thanks, Qu
As long as we don't allow the transaction to be committed and new transactions to be started, we are safe - in fact that's the only thing a transaction abort guarantees.
We may have many places that could check if a transaction was aborted and avoid extent allocation and writeback, but that's ok, as long as we don't allow transaction commits.
Thus I consider delaying btrfs_stop_all_workers() until iput() is only a workaround, it still allows us to submit unnecessary writes.
I'd prefer the solution 1) in this case, still with the extra handling in write_one_eb().
Thanks for the review and suggestion, will follow the advice of the remaining part.
Thanks, Qu
On 2025/10/24 19:25, Filipe Manana wrote: [...]
If there's no backref in the extent tree and no delayed ref for tree block A then we already have a corrupted fs.
Yes I know.
Even after a transaction aborts, it's ok, but pointless, to allocate extents and trigger writeback for them.
Writeback can be triggered by a lot of other reasons, e.g. memory pressure, and in that case if we try to writeback tree block B, it will overwrite tree block A even if it's after transaction abort.
It doesn't matter why and when the writeback is triggered; that was not my point. My point was that if a transaction was aborted, what matters is that we can't commit that transaction or start a new one, and can't overwrite existing extents (either they are used or pinned).
My point is, your last point (can't overwrite existing extents) is not guaranteed, especially after a transaction is aborted, which can be caused by a corrupted fs.
Yes, the initial fs can be corrupted, but it doesn't mean we're free to make it more corrupted.
Thanks, Qu
On Fri, Oct 24, 2025 at 10:10 AM Qu Wenruo quwenruo.btrfs@gmx.com wrote:
On 2025/10/24 19:25, Filipe Manana wrote: [...]
If there's no backref in the extent tree and no delayed ref for tree block A then we already have a corrupted fs.
Yes I know.
Even after a transaction aborts, it's ok, but pointless, to allocate extents and trigger writeback for them.
Writeback can be triggered by a lot of other reasons, e.g. memory pressure, and in that case if we try to writeback tree block B, it will overwrite tree block A even if it's after transaction abort.
It doesn't matter why and when the writeback is triggered; that was not my point. My point was that if a transaction was aborted, what matters is that we can't commit that transaction or start a new one, and can't overwrite existing extents (either they are used or pinned).
My point is, your last point (can't overwrite existing extents) is not guaranteed, especially after a transaction is aborted, which can be caused by a corrupted fs.
If there's no backref for tree block A, or a delayed ref, then we already have a high risk of overwriting tree block A. The transaction abort is irrelevant... if there's no abort, we overwrite it (and silently) - so we're in a worse situation than if an abort happened.
We can never guarantee a fs is healthy if it's already corrupted.
Yes, the initial fs can be corrupted, but it doesn't mean we're free to make it more corrupted.
Thanks, Qu