[BUG]
There are several double accounting cases where the WARN_ON_ONCE() inside
can_finish_ordered_extent() is triggered.

All such cases point back to the btrfs_mark_ordered_io_finished() call
inside extent_writepage() when it hits an error.
[CAUSE]
With extra debug patches to show where the error comes from, it turns out
that btrfs_run_delalloc_range() can fail with -ENOSPC.

Such a failure is itself already a symptom of bad data/metadata space
reservation, but here we need to focus on the error handling part.
For example, we have the following dirty page layout (4K sector size and
4K page size):

    0                       16K                     32K
    |/////|/////|/////|/////|/////|/////|/////|/////|

Where the range [0, 32K) is dirty and we need to write all the 8 pages
back.
When handling the first page 0, we go through the following sequence:

- btrfs_run_delalloc_range() for range [0, 32K)
  We enter cow_file_range() for [0, 32K).

- btrfs_reserve_extent() only returned a 16K data extent
  This can be caused by fragmentation, and it's already an indication
  that we're almost running out of space.

  Now we have the following layout:

    0                       16K                     32K
    |<----- Reserved ------>|/////|/////|/////|/////|

  The range [0, 16K) has an ordered extent allocated.

- btrfs_reserve_extent() returned -ENOSPC
  We have really run out of space.
  But since we have reserved space for range [0, 16K), we need to
  clean it up.

  That cleanup of the ordered extent only happens inside
  btrfs_run_delalloc_range().

- btrfs_run_delalloc_range() cleans up the reserved ordered extent
  By calling btrfs_mark_ordered_io_finished() for range [0, 32K).

  It will locate the ordered extent [0, 16K) and mark it as IOERR.
  Also, since the ordered extent is only 16K, we're finishing the whole
  ordered extent.

  Thus we call btrfs_queue_ordered_fn() to queue the finishing of the
  ordered extent.
  But the ordered extent [0, 16K) is still in
  btrfs_inode::ordered_tree.

- extent_writepage() cleans up the ordered extent inside the folio
  We call btrfs_mark_ordered_io_finished() for range [0, 4K).

  Since the finished ordered extent [0, 16K) is not yet removed (racy,
  depends on when btrfs_finish_one_ordered() is called), if
  btrfs_mark_ordered_io_finished() is called before
  btrfs_finish_one_ordered(), we will double account and trigger the
  warning inside can_finish_ordered_extent().
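The sequence above can be replayed in a small user-space toy model. This is
only a sketch: struct toy_ordered_extent, toy_mark_finished() and
toy_replay() are hypothetical names that do not exist in the kernel, and the
"bytes_left" accounting is a simplified assumption about what
can_finish_ordered_extent() checks, not a copy of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for an ordered extent; NOT struct btrfs_ordered_extent. */
struct toy_ordered_extent {
	uint64_t file_offset;	/* start of the covered file range */
	uint64_t num_bytes;	/* length of the covered file range */
	uint64_t bytes_left;	/* bytes not yet accounted as finished */
	bool in_tree;		/* still findable through the ordered tree */
	unsigned int warnings;	/* counts would-be WARN_ON_ONCE() hits */
};

/*
 * Account [start, start + len) as finished, clamped to the extent.
 * Returns true when this call brings bytes_left down to zero, which is
 * where the real code would queue the ordered extent for finishing.
 */
bool toy_mark_finished(struct toy_ordered_extent *oe, uint64_t start,
		       uint64_t len)
{
	uint64_t oe_end = oe->file_offset + oe->num_bytes;
	uint64_t end = start + len;
	uint64_t overlap;

	assert(len > 0);
	/* Nothing to do if the extent is gone or the ranges do not overlap. */
	if (!oe->in_tree || start >= oe_end || end <= oe->file_offset)
		return false;
	if (start < oe->file_offset)
		start = oe->file_offset;
	if (end > oe_end)
		end = oe_end;
	overlap = end - start;

	if (overlap > oe->bytes_left) {
		/* Double accounting: more bytes finished than were left. */
		oe->warnings++;
		return false;
	}
	oe->bytes_left -= overlap;
	return oe->bytes_left == 0;
}

/*
 * Replay the failing sequence for an ordered extent covering [0, 16K).
 * @worker_ran_first models the race: true means the finish worker removed
 * the extent from the tree before the second cleanup call.
 * Returns the number of would-be warnings.
 */
unsigned int toy_replay(bool worker_ran_first)
{
	struct toy_ordered_extent oe = {
		.file_offset = 0,
		.num_bytes = 16 * 1024,
		.bytes_left = 16 * 1024,
		.in_tree = true,
		.warnings = 0,
	};

	/* Error path of btrfs_run_delalloc_range(): cleanup for [0, 32K). */
	toy_mark_finished(&oe, 0, 32 * 1024);
	if (worker_ran_first)
		oe.in_tree = false;
	/* Error path of extent_writepage(): cleanup for [0, 4K). */
	toy_mark_finished(&oe, 0, 4 * 1024);
	return oe.warnings;
}
```

In this model toy_replay(false) reports one warning, because the second
cleanup still finds the already fully accounted extent, while
toy_replay(true) reports none. Whether the warning fires depends only on the
ordering of the two events, which is the race described above.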
So the root cause is that we're relying on
btrfs_mark_ordered_io_finished() to handle ranges which have already
been cleaned up.

Unfortunately the bug dates back to the early days, when
btrfs_mark_ordered_io_finished() was introduced as a no-brainer choice
for error paths. But such a no-brainer solution just hides all the races
and makes us less cautious when handling errors.
[FIX]
Instead of relying on the btrfs_mark_ordered_io_finished() call to
clean up the whole folio range, record the end of the last successfully
run delalloc range.

Combine that with bio_ctrl->submit_bitmap to properly clean up any newly
created ordered extents.

Since we have cleaned up the ordered extents in the range, we should no
longer rely on the btrfs_mark_ordered_io_finished() call inside
extent_writepage().

This ensures btrfs_mark_ordered_io_finished() is only called once when
writepage_delalloc() fails.
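The range selection used by the new error path can be sketched in user
space as follows. This is only an illustration under assumptions:
toy_cleanup_mask() is a hypothetical helper, the submit bitmap is reduced to
a single unsigned long, and a plain loop stands in for the kernel's
for_each_set_bit().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return a mask of the sectors inside the folio that the error path would
 * pass to the cleanup call: sectors that are set in @submit_bitmap and lie
 * before @last_finished (exclusive), clamped to the folio.
 */
unsigned long toy_cleanup_mask(uint64_t page_start, uint64_t last_finished,
			       unsigned long submit_bitmap,
			       unsigned int sectorsize_bits,
			       unsigned int sectors_per_page)
{
	uint64_t nbits;
	unsigned long cleaned = 0;
	unsigned int bit;

	assert(last_finished >= page_start);
	assert(sectors_per_page <= sizeof(unsigned long) * 8);

	/* Number of sectors covered by successfully run delalloc ranges. */
	nbits = (last_finished - page_start) >> sectorsize_bits;
	/* A delalloc range may extend past the folio, so clamp to it. */
	if (nbits > sectors_per_page)
		nbits = sectors_per_page;

	for (bit = 0; bit < nbits; bit++) {
		if (submit_bitmap & (1UL << bit))
			cleaned |= 1UL << bit;
	}
	return cleaned;
}
```

For the example in [CAUSE], the failing range starts at page_start, so
last_finished never advances and the mask is empty: the only cleanup is the
one already done inside btrfs_run_delalloc_range(). With a 64K page and 4K
sectors, a successful range [0, 32K) followed by a failing one yields the
first 8 sectors.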
Cc: stable@vger.kernel.org # 5.15+
Fixes: e65f152e4348 ("btrfs: refactor how we finish ordered extent io for endio functions")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9725ff7f274d..417c710c55ca 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1167,6 +1167,12 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 	 * last delalloc end.
 	 */
 	u64 last_delalloc_end = 0;
+	/*
+	 * Save the last successfully ran delalloc range end (exclusive).
+	 * This is for error handling to avoid ranges with ordered extent created
+	 * but no IO will be submitted due to error.
+	 */
+	u64 last_finished = page_start;
 	u64 delalloc_start = page_start;
 	u64 delalloc_end = page_end;
 	u64 delalloc_to_write = 0;
@@ -1235,11 +1241,19 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 			found_len = last_delalloc_end + 1 - found_start;
 
 		if (ret >= 0) {
+			/*
+			 * Some delalloc range may be created by previous folios.
+			 * Thus we still need to clean those range up during error
+			 * handling.
+			 */
+			last_finished = found_start;
 			/* No errors hit so far, run the current delalloc range. */
 			ret = btrfs_run_delalloc_range(inode, folio,
 						       found_start,
 						       found_start + found_len - 1,
 						       wbc);
+			if (ret >= 0)
+				last_finished = found_start + found_len;
 		} else {
 			/*
 			 * We've hit an error during previous delalloc range,
@@ -1274,8 +1288,21 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 
 		delalloc_start = found_start + found_len;
 	}
-	if (ret < 0)
+	/*
+	 * It's possible we have some ordered extents created before we hit
+	 * an error, cleanup non-async successfully created delalloc ranges.
+	 */
+	if (unlikely(ret < 0)) {
+		unsigned int bitmap_size = min(
+				(last_finished - page_start) >> fs_info->sectorsize_bits,
+				fs_info->sectors_per_page);
+
+		for_each_set_bit(bit, &bio_ctrl->submit_bitmap, bitmap_size)
+			btrfs_mark_ordered_io_finished(inode, folio,
+				page_start + (bit << fs_info->sectorsize_bits),
+				fs_info->sectorsize, false);
 		return ret;
+	}
 out:
 	if (last_delalloc_end)
 		delalloc_end = last_delalloc_end;
@@ -1509,13 +1536,13 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
 
 	bio_ctrl->wbc->nr_to_write--;
 
-done:
-	if (ret) {
+	if (ret)
 		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
 					       page_start, PAGE_SIZE, !ret);
-		mapping_set_error(folio->mapping, ret);
-	}
 
+done:
+	if (ret < 0)
+		mapping_set_error(folio->mapping, ret);
 	/*
 	 * Only unlock ranges that are submitted. As there can be some async
 	 * submitted ranges inside the folio.