The patch below does not apply to the 6.16-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to stable@vger.kernel.org.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.16.y
git checkout FETCH_HEAD
git cherry-pick -x b1511360c8ac882b0c52caa263620538e8d73220
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to 'stable@vger.kernel.org' --in-reply-to '2025082101-survive-mannish-1c90@gregkh' --subject-prefix 'PATCH 6.16.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b1511360c8ac882b0c52caa263620538e8d73220 Mon Sep 17 00:00:00 2001
From: Naohiro Aota <naohiro.aota@wdc.com>
Date: Thu, 31 Jul 2025 12:46:56 +0900
Subject: [PATCH] btrfs: subpage: keep TOWRITE tag until folio is cleaned
btrfs_subpage_set_writeback() calls folio_start_writeback() the first time a folio is written back, and it also clears the PAGECACHE_TAG_TOWRITE tag even if there are still dirty blocks in the folio. This can break ordering guarantees, such as those required by btrfs_wait_ordered_extents().
That ordering breakage leads to a real failure. For example, running generic/464 on a zoned setup will hit the following ASSERT. This happens because the broken ordering fails to flush existing dirty pages before the file size is truncated.
  assertion failed: !list_empty(&ordered->list) :: 0, in fs/btrfs/zoned.c:1899
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/zoned.c:1899!
  Oops: invalid opcode: 0000 [#1] SMP NOPTI
  CPU: 2 UID: 0 PID: 1906169 Comm: kworker/u130:2 Kdump: loaded Not tainted 6.16.0-rc6-BTRFS-ZNS+ #554 PREEMPT(voluntary)
  Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
  Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
  RIP: 0010:btrfs_finish_ordered_zoned.cold+0x50/0x52 [btrfs]
  RSP: 0018:ffffc9002efdbd60 EFLAGS: 00010246
  RAX: 000000000000004c RBX: ffff88811923c4e0 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffffffff827e38b1 RDI: 00000000ffffffff
  RBP: ffff88810005d000 R08: 00000000ffffdfff R09: ffffffff831051c8
  R10: ffffffff83055220 R11: 0000000000000000 R12: ffff8881c2458c00
  R13: ffff88811923c540 R14: ffff88811923c5e8 R15: ffff8881c1bd9680
  FS:  0000000000000000(0000) GS:ffff88a04acd0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f907c7a918c CR3: 0000000004024000 CR4: 0000000000350ef0
  Call Trace:
   <TASK>
   ? srso_return_thunk+0x5/0x5f
   btrfs_finish_ordered_io+0x4a/0x60 [btrfs]
   btrfs_work_helper+0xf9/0x490 [btrfs]
   process_one_work+0x204/0x590
   ? srso_return_thunk+0x5/0x5f
   worker_thread+0x1d6/0x3d0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x118/0x230
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x205/0x260
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
Consider process A calling writepages() with WB_SYNC_NONE. In zoned mode or for compressed writes, it locks several folios for delalloc and starts writing them out. Let's call the last locked folio folio X. Suppose the write range only partially covers folio X, leaving some pages dirty. Process A calls btrfs_subpage_set_writeback() when building a bio. This call clears the TOWRITE tag of folio X, whose size is 8K with a block size of 4K. The folio is in the following state:
  0     4K    8K
  |/////|/////|      (flag: DIRTY, tag: DIRTY)
  <----->             Process A will write this range.
Now suppose process B concurrently calls writepages() with WB_SYNC_ALL. It calls tag_pages_for_writeback() to tag dirty folios with PAGECACHE_TAG_TOWRITE. Since folio X is still dirty, it gets tagged. Then, B collects tagged folios using filemap_get_folios_tag() and must wait for folio X to be written before returning from writepages().
  0     4K    8K
  |/////|/////|      (flag: DIRTY, tag: DIRTY|TOWRITE)
However, between tagging and collecting, process A may call btrfs_subpage_set_writeback() and clear folio X's TOWRITE tag.

  0     4K    8K
  |     |/////|      (flag: DIRTY|WRITEBACK, tag: DIRTY)
As a result, process B won't see folio X in its batch, and returns without waiting for it. This breaks the WB_SYNC_ALL ordering requirement.
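To make the missed wait concrete, the WB_SYNC_ALL side is essentially doing a tag-and-collect loop like the sketch below. This is a simplification of the generic write_cache_pages()-style pattern, not the exact btrfs code, and the helper name wait_towrite_range() is made up for illustration:

#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/writeback.h>

/* Sketch only: tag dirty folios, then collect the tagged ones and wait. */
static void wait_towrite_range(struct address_space *mapping,
                               pgoff_t index, pgoff_t end)
{
        struct folio_batch fbatch;
        unsigned int i;

        /* WB_SYNC_ALL: promote DIRTY folios to DIRTY|TOWRITE. */
        tag_pages_for_writeback(mapping, index, end);

        folio_batch_init(&fbatch);
        while (filemap_get_folios_tag(mapping, &index, end,
                                      PAGECACHE_TAG_TOWRITE, &fbatch)) {
                for (i = 0; i < folio_batch_count(&fbatch); i++) {
                        /* Stand-in for "write the folio and/or wait on it". */
                        folio_wait_writeback(fbatch.folios[i]);
                }
                folio_batch_release(&fbatch);
        }
        /*
         * A folio whose TOWRITE tag was cleared between the tagging and the
         * lookup above (folio X here) is never returned by
         * filemap_get_folios_tag(), so it is never waited on.
         */
}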
Fix this by calling __folio_start_writeback() with keep_write set to true, which retains the TOWRITE tag, and clear the tag manually via an xas operation only once the folio has no more dirty blocks.
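For context, folio_start_writeback() is essentially __folio_start_writeback(folio, false); the keep_write argument decides whether PAGECACHE_TAG_TOWRITE survives the transition to writeback. Very roughly, the tag handling looks like the sketch below (a simplification of the mm behaviour; accounting and the PG_writeback flag handling are omitted):

#include <linux/pagemap.h>

/* Sketch only: how starting writeback affects the i_pages xarray tags. */
static void start_writeback_tags_sketch(struct folio *folio, bool keep_write)
{
        struct address_space *mapping = folio_mapping(folio);
        XA_STATE(xas, &mapping->i_pages, folio->index);
        unsigned long flags;

        xas_lock_irqsave(&xas, flags);
        xas_load(&xas);
        xas_set_mark(&xas, PAGECACHE_TAG_WRITEBACK);
        if (!folio_test_dirty(folio))
                xas_clear_mark(&xas, PAGECACHE_TAG_DIRTY);
        if (!keep_write) {
                /* This is the clear that makes WB_SYNC_ALL miss folio X. */
                xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
        }
        xas_unlock_irqrestore(&xas, flags);
}

With keep_write == true the TOWRITE tag stays set, which is why the patch then clears it by hand, but only once the whole folio is clean.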
Fixes: 3470da3b7d87 ("btrfs: subpage: introduce helpers for writeback status")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index c9b3821957f7..cb4f97833dc3 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -448,8 +448,25 @@ void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info,
 	spin_lock_irqsave(&bfs->lock, flags);
 	bitmap_set(bfs->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
+
+	/*
+	 * Don't clear the TOWRITE tag when starting writeback on a still-dirty
+	 * folio. Doing so can cause WB_SYNC_ALL writepages() to overlook it,
+	 * assume writeback is complete, and exit too early, violating sync
+	 * ordering guarantees.
+	 */
 	if (!folio_test_writeback(folio))
-		folio_start_writeback(folio);
+		__folio_start_writeback(folio, true);
+	if (!folio_test_dirty(folio)) {
+		struct address_space *mapping = folio_mapping(folio);
+		XA_STATE(xas, &mapping->i_pages, folio->index);
+		unsigned long flags;
+
+		xas_lock_irqsave(&xas, flags);
+		xas_load(&xas);
+		xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
+		xas_unlock_irqrestore(&xas, flags);
+	}
 	spin_unlock_irqrestore(&bfs->lock, flags);
 }
From: Qu Wenruo <wqu@suse.com>
[ Upstream commit 1e17738d6b76cdc76d240d64de87fa66ba2365f7 ]
Unlike the iomap_folio_state structure, the btrfs_subpage structure has a lot of extra sub-bitmaps, namely:
- writeback sub-bitmap
- locked sub-bitmap

iomap_folio_state uses an atomic for writeback tracking, while it has no per-block locked tracking.
This is because iomap always locks a single folio, and submits dirty blocks with that folio locked.
But btrfs has async delalloc ranges (for compression), which are queued with their range locked until the compression is done; only then is the involved range marked writeback and unlocked.

This means a range can be unlocked and marked writeback at a seemingly random time, so the extra tracking is needed.

Removing or simplifying these two sub-bitmaps will require a major rework of the async delalloc range lifespan.
- ordered sub-bitmap
- checked sub-bitmap

These are for COW-fixup. As I mentioned in the past, COW-fixup is not really needed anymore; these two flags are already marked deprecated and will be removed in the near future after comprehensive tests.
Add related comments to indicate we're actively trying to align the sub-bitmaps to the iomap ones.
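As a reminder of how these sub-bitmaps are laid out, each flag gets its own blocks_per_folio-wide slice of the bitmaps[] array, back to back. Roughly, the bit number is computed as in the sketch below (illustration only; the in-tree helper is subpage_calc_start_bit(), which additionally accounts for the offset of the range inside the folio):

/* Illustration only: bit number of block @block in sub-bitmap @which. */
static unsigned int folio_state_bit_nr(unsigned int blocks_per_folio,
                                       unsigned int which,  /* btrfs_bitmap_nr_* */
                                       unsigned int block)  /* block index within the folio */
{
        return blocks_per_folio * which + block;
}

For example, test_bit(folio_state_bit_nr(blocks_per_folio, btrfs_bitmap_nr_dirty, 2), bfs->bitmaps) would test whether the third block of the folio is dirty.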
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: b1511360c8ac ("btrfs: subpage: keep TOWRITE tag until folio is cleaned")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/subpage.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 3042c5ea840a..52546e0e97ce 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -33,8 +33,22 @@ enum {
 	btrfs_bitmap_nr_uptodate = 0,
 	btrfs_bitmap_nr_dirty,
 	btrfs_bitmap_nr_writeback,
+	/*
+	 * The ordered and checked flags are for COW fixup, already marked
+	 * deprecated, and will be removed eventually.
+	 */
 	btrfs_bitmap_nr_ordered,
 	btrfs_bitmap_nr_checked,
+
+	/*
+	 * The locked bit is for async delalloc range (compression), currently
+	 * async extent is queued with the range locked, until the compression
+	 * is done.
+	 * So an async extent can unlock the range at any random timing.
+	 *
+	 * This will need a rework on the async extent lifespan (mark writeback
+	 * and do compression) before deprecating this flag.
+	 */
 	btrfs_bitmap_nr_locked,
 	btrfs_bitmap_nr_max
 };
From: Qu Wenruo <wqu@suse.com>
[ Upstream commit 582cd4bad4332cca95c578e99442eb148366eb82 ]
With the incoming support for large data folios, the structure name btrfs_subpage is no longer accurate: a large folio can contain multiple blocks even when the block size is the same as the page size.

So, to follow the naming scheme of iomap, rename btrfs_subpage to btrfs_folio_state, along with the involved enums.

There are still exported functions with the "btrfs_subpage_" prefix, and I believe that for metadata the name "subpage" will stay forever, as we will never allocate a folio larger than nodesize anyway.
The full cleanup of the word "subpage" will happen in much smaller steps in the future.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: b1511360c8ac ("btrfs: subpage: keep TOWRITE tag until folio is cleaned")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/extent_io.c |  28 ++---
 fs/btrfs/inode.c     |   8 +-
 fs/btrfs/subpage.c   | 239 +++++++++++++++++++++----------------------
 fs/btrfs/subpage.h   |  31 ++++--
 4 files changed, 156 insertions(+), 150 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 1dc931c4937f..d11e1d192784 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -782,7 +782,7 @@ static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
static int attach_extent_buffer_folio(struct extent_buffer *eb, struct folio *folio, - struct btrfs_subpage *prealloc) + struct btrfs_folio_state *prealloc) { struct btrfs_fs_info *fs_info = eb->fs_info; int ret = 0; @@ -806,7 +806,7 @@ static int attach_extent_buffer_folio(struct extent_buffer *eb,
/* Already mapped, just free prealloc */ if (folio_test_private(folio)) { - btrfs_free_subpage(prealloc); + btrfs_free_folio_state(prealloc); return 0; }
@@ -815,7 +815,7 @@ static int attach_extent_buffer_folio(struct extent_buffer *eb, folio_attach_private(folio, prealloc); else /* Do new allocation to attach subpage */ - ret = btrfs_attach_subpage(fs_info, folio, BTRFS_SUBPAGE_METADATA); + ret = btrfs_attach_folio_state(fs_info, folio, BTRFS_SUBPAGE_METADATA); return ret; }
@@ -831,7 +831,7 @@ int set_folio_extent_mapped(struct folio *folio) fs_info = folio_to_fs_info(folio);
if (btrfs_is_subpage(fs_info, folio)) - return btrfs_attach_subpage(fs_info, folio, BTRFS_SUBPAGE_DATA); + return btrfs_attach_folio_state(fs_info, folio, BTRFS_SUBPAGE_DATA);
folio_attach_private(folio, (void *)EXTENT_FOLIO_PRIVATE); return 0; @@ -848,7 +848,7 @@ void clear_folio_extent_mapped(struct folio *folio)
fs_info = folio_to_fs_info(folio); if (btrfs_is_subpage(fs_info, folio)) - return btrfs_detach_subpage(fs_info, folio, BTRFS_SUBPAGE_DATA); + return btrfs_detach_folio_state(fs_info, folio, BTRFS_SUBPAGE_DATA);
folio_detach_private(folio); } @@ -2731,13 +2731,13 @@ static int extent_buffer_under_io(const struct extent_buffer *eb)
static bool folio_range_has_eb(struct folio *folio) { - struct btrfs_subpage *subpage; + struct btrfs_folio_state *bfs;
lockdep_assert_held(&folio->mapping->i_private_lock);
if (folio_test_private(folio)) { - subpage = folio_get_private(folio); - if (atomic_read(&subpage->eb_refs)) + bfs = folio_get_private(folio); + if (atomic_read(&bfs->eb_refs)) return true; } return false; @@ -2787,7 +2787,7 @@ static void detach_extent_buffer_folio(const struct extent_buffer *eb, struct fo * attached to one dummy eb, no sharing. */ if (!mapped) { - btrfs_detach_subpage(fs_info, folio, BTRFS_SUBPAGE_METADATA); + btrfs_detach_folio_state(fs_info, folio, BTRFS_SUBPAGE_METADATA); return; }
@@ -2798,7 +2798,7 @@ static void detach_extent_buffer_folio(const struct extent_buffer *eb, struct fo * page range and no unfinished IO. */ if (!folio_range_has_eb(folio)) - btrfs_detach_subpage(fs_info, folio, BTRFS_SUBPAGE_METADATA); + btrfs_detach_folio_state(fs_info, folio, BTRFS_SUBPAGE_METADATA);
spin_unlock(&mapping->i_private_lock); } @@ -3141,7 +3141,7 @@ static bool check_eb_alignment(struct btrfs_fs_info *fs_info, u64 start) * The caller needs to free the existing folios and retry using the same order. */ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i, - struct btrfs_subpage *prealloc, + struct btrfs_folio_state *prealloc, struct extent_buffer **found_eb_ret) {
@@ -3224,7 +3224,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info, int attached = 0; struct extent_buffer *eb; struct extent_buffer *existing_eb = NULL; - struct btrfs_subpage *prealloc = NULL; + struct btrfs_folio_state *prealloc = NULL; u64 lockdep_owner = owner_root; bool page_contig = true; int uptodate = 1; @@ -3269,7 +3269,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info, * manually if we exit earlier. */ if (btrfs_meta_is_subpage(fs_info)) { - prealloc = btrfs_alloc_subpage(fs_info, PAGE_SIZE, BTRFS_SUBPAGE_METADATA); + prealloc = btrfs_alloc_folio_state(fs_info, PAGE_SIZE, BTRFS_SUBPAGE_METADATA); if (IS_ERR(prealloc)) { ret = PTR_ERR(prealloc); goto out; @@ -3280,7 +3280,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info, /* Allocate all pages first. */ ret = alloc_eb_folio_array(eb, true); if (ret < 0) { - btrfs_free_subpage(prealloc); + btrfs_free_folio_state(prealloc); goto out; }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 4ae34c22ff1d..df4c8312aae3 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7364,13 +7364,13 @@ struct extent_map *btrfs_create_io_em(struct btrfs_inode *inode, u64 start, static void wait_subpage_spinlock(struct folio *folio) { struct btrfs_fs_info *fs_info = folio_to_fs_info(folio); - struct btrfs_subpage *subpage; + struct btrfs_folio_state *bfs;
if (!btrfs_is_subpage(fs_info, folio)) return;
ASSERT(folio_test_private(folio) && folio_get_private(folio)); - subpage = folio_get_private(folio); + bfs = folio_get_private(folio);
/* * This may look insane as we just acquire the spinlock and release it, @@ -7383,8 +7383,8 @@ static void wait_subpage_spinlock(struct folio *folio) * Here we just acquire the spinlock so that all existing callers * should exit and we're safe to release/invalidate the page. */ - spin_lock_irq(&subpage->lock); - spin_unlock_irq(&subpage->lock); + spin_lock_irq(&bfs->lock); + spin_unlock_irq(&bfs->lock); }
static int btrfs_launder_folio(struct folio *folio) diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c index d4f019233493..9b63a4d1c989 100644 --- a/fs/btrfs/subpage.c +++ b/fs/btrfs/subpage.c @@ -49,7 +49,7 @@ * Implementation: * * - Common - * Both metadata and data will use a new structure, btrfs_subpage, to + * Both metadata and data will use a new structure, btrfs_folio_state, to * record the status of each sector inside a page. This provides the extra * granularity needed. * @@ -63,10 +63,10 @@ * This means a slightly higher tree locking latency. */
-int btrfs_attach_subpage(const struct btrfs_fs_info *fs_info, - struct folio *folio, enum btrfs_subpage_type type) +int btrfs_attach_folio_state(const struct btrfs_fs_info *fs_info, + struct folio *folio, enum btrfs_folio_type type) { - struct btrfs_subpage *subpage; + struct btrfs_folio_state *bfs;
/* For metadata we don't support large folio yet. */ if (type == BTRFS_SUBPAGE_METADATA) @@ -87,18 +87,18 @@ int btrfs_attach_subpage(const struct btrfs_fs_info *fs_info, if (type == BTRFS_SUBPAGE_DATA && !btrfs_is_subpage(fs_info, folio)) return 0;
- subpage = btrfs_alloc_subpage(fs_info, folio_size(folio), type); - if (IS_ERR(subpage)) - return PTR_ERR(subpage); + bfs = btrfs_alloc_folio_state(fs_info, folio_size(folio), type); + if (IS_ERR(bfs)) + return PTR_ERR(bfs);
- folio_attach_private(folio, subpage); + folio_attach_private(folio, bfs); return 0; }
-void btrfs_detach_subpage(const struct btrfs_fs_info *fs_info, struct folio *folio, - enum btrfs_subpage_type type) +void btrfs_detach_folio_state(const struct btrfs_fs_info *fs_info, struct folio *folio, + enum btrfs_folio_type type) { - struct btrfs_subpage *subpage; + struct btrfs_folio_state *bfs;
/* Either not subpage, or the folio already has private attached. */ if (!folio_test_private(folio)) @@ -108,15 +108,15 @@ void btrfs_detach_subpage(const struct btrfs_fs_info *fs_info, struct folio *fol if (type == BTRFS_SUBPAGE_DATA && !btrfs_is_subpage(fs_info, folio)) return;
- subpage = folio_detach_private(folio); - ASSERT(subpage); - btrfs_free_subpage(subpage); + bfs = folio_detach_private(folio); + ASSERT(bfs); + btrfs_free_folio_state(bfs); }
-struct btrfs_subpage *btrfs_alloc_subpage(const struct btrfs_fs_info *fs_info, - size_t fsize, enum btrfs_subpage_type type) +struct btrfs_folio_state *btrfs_alloc_folio_state(const struct btrfs_fs_info *fs_info, + size_t fsize, enum btrfs_folio_type type) { - struct btrfs_subpage *ret; + struct btrfs_folio_state *ret; unsigned int real_size;
ASSERT(fs_info->sectorsize < fsize); @@ -136,11 +136,6 @@ struct btrfs_subpage *btrfs_alloc_subpage(const struct btrfs_fs_info *fs_info, return ret; }
-void btrfs_free_subpage(struct btrfs_subpage *subpage) -{ - kfree(subpage); -} - /* * Increase the eb_refs of current subpage. * @@ -152,7 +147,7 @@ void btrfs_free_subpage(struct btrfs_subpage *subpage) */ void btrfs_folio_inc_eb_refs(const struct btrfs_fs_info *fs_info, struct folio *folio) { - struct btrfs_subpage *subpage; + struct btrfs_folio_state *bfs;
if (!btrfs_meta_is_subpage(fs_info)) return; @@ -160,13 +155,13 @@ void btrfs_folio_inc_eb_refs(const struct btrfs_fs_info *fs_info, struct folio * ASSERT(folio_test_private(folio) && folio->mapping); lockdep_assert_held(&folio->mapping->i_private_lock);
- subpage = folio_get_private(folio); - atomic_inc(&subpage->eb_refs); + bfs = folio_get_private(folio); + atomic_inc(&bfs->eb_refs); }
void btrfs_folio_dec_eb_refs(const struct btrfs_fs_info *fs_info, struct folio *folio) { - struct btrfs_subpage *subpage; + struct btrfs_folio_state *bfs;
if (!btrfs_meta_is_subpage(fs_info)) return; @@ -174,9 +169,9 @@ void btrfs_folio_dec_eb_refs(const struct btrfs_fs_info *fs_info, struct folio * ASSERT(folio_test_private(folio) && folio->mapping); lockdep_assert_held(&folio->mapping->i_private_lock);
- subpage = folio_get_private(folio); - ASSERT(atomic_read(&subpage->eb_refs)); - atomic_dec(&subpage->eb_refs); + bfs = folio_get_private(folio); + ASSERT(atomic_read(&bfs->eb_refs)); + atomic_dec(&bfs->eb_refs); }
static void btrfs_subpage_assert(const struct btrfs_fs_info *fs_info, @@ -228,7 +223,7 @@ static void btrfs_subpage_clamp_range(struct folio *folio, u64 *start, u32 *len) static bool btrfs_subpage_end_and_test_lock(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) { - struct btrfs_subpage *subpage = folio_get_private(folio); + struct btrfs_folio_state *bfs = folio_get_private(folio); const int start_bit = subpage_calc_start_bit(fs_info, folio, locked, start, len); const int nbits = (len >> fs_info->sectorsize_bits); unsigned long flags; @@ -238,7 +233,7 @@ static bool btrfs_subpage_end_and_test_lock(const struct btrfs_fs_info *fs_info,
btrfs_subpage_assert(fs_info, folio, start, len);
- spin_lock_irqsave(&subpage->lock, flags); + spin_lock_irqsave(&bfs->lock, flags); /* * We have call sites passing @lock_page into * extent_clear_unlock_delalloc() for compression path. @@ -246,18 +241,18 @@ static bool btrfs_subpage_end_and_test_lock(const struct btrfs_fs_info *fs_info, * This @locked_page is locked by plain lock_page(), thus its * subpage::locked is 0. Handle them in a special way. */ - if (atomic_read(&subpage->nr_locked) == 0) { - spin_unlock_irqrestore(&subpage->lock, flags); + if (atomic_read(&bfs->nr_locked) == 0) { + spin_unlock_irqrestore(&bfs->lock, flags); return true; }
- for_each_set_bit_from(bit, subpage->bitmaps, start_bit + nbits) { - clear_bit(bit, subpage->bitmaps); + for_each_set_bit_from(bit, bfs->bitmaps, start_bit + nbits) { + clear_bit(bit, bfs->bitmaps); cleared++; } - ASSERT(atomic_read(&subpage->nr_locked) >= cleared); - last = atomic_sub_and_test(cleared, &subpage->nr_locked); - spin_unlock_irqrestore(&subpage->lock, flags); + ASSERT(atomic_read(&bfs->nr_locked) >= cleared); + last = atomic_sub_and_test(cleared, &bfs->nr_locked); + spin_unlock_irqrestore(&bfs->lock, flags); return last; }
@@ -280,7 +275,7 @@ static bool btrfs_subpage_end_and_test_lock(const struct btrfs_fs_info *fs_info, void btrfs_folio_end_lock(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) { - struct btrfs_subpage *subpage = folio_get_private(folio); + struct btrfs_folio_state *bfs = folio_get_private(folio);
ASSERT(folio_test_locked(folio));
@@ -296,7 +291,7 @@ void btrfs_folio_end_lock(const struct btrfs_fs_info *fs_info, * Since we own the page lock, no one else could touch subpage::locked * and we are safe to do several atomic operations without spinlock. */ - if (atomic_read(&subpage->nr_locked) == 0) { + if (atomic_read(&bfs->nr_locked) == 0) { /* No subpage lock, locked by plain lock_page(). */ folio_unlock(folio); return; @@ -310,7 +305,7 @@ void btrfs_folio_end_lock(const struct btrfs_fs_info *fs_info, void btrfs_folio_end_lock_bitmap(const struct btrfs_fs_info *fs_info, struct folio *folio, unsigned long bitmap) { - struct btrfs_subpage *subpage = folio_get_private(folio); + struct btrfs_folio_state *bfs = folio_get_private(folio); const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio); const int start_bit = blocks_per_folio * btrfs_bitmap_nr_locked; unsigned long flags; @@ -323,42 +318,42 @@ void btrfs_folio_end_lock_bitmap(const struct btrfs_fs_info *fs_info, return; }
- if (atomic_read(&subpage->nr_locked) == 0) { + if (atomic_read(&bfs->nr_locked) == 0) { /* No subpage lock, locked by plain lock_page(). */ folio_unlock(folio); return; }
- spin_lock_irqsave(&subpage->lock, flags); + spin_lock_irqsave(&bfs->lock, flags); for_each_set_bit(bit, &bitmap, blocks_per_folio) { - if (test_and_clear_bit(bit + start_bit, subpage->bitmaps)) + if (test_and_clear_bit(bit + start_bit, bfs->bitmaps)) cleared++; } - ASSERT(atomic_read(&subpage->nr_locked) >= cleared); - last = atomic_sub_and_test(cleared, &subpage->nr_locked); - spin_unlock_irqrestore(&subpage->lock, flags); + ASSERT(atomic_read(&bfs->nr_locked) >= cleared); + last = atomic_sub_and_test(cleared, &bfs->nr_locked); + spin_unlock_irqrestore(&bfs->lock, flags); if (last) folio_unlock(folio); }
#define subpage_test_bitmap_all_set(fs_info, folio, name) \ ({ \ - struct btrfs_subpage *subpage = folio_get_private(folio); \ + struct btrfs_folio_state *bfs = folio_get_private(folio); \ const unsigned int blocks_per_folio = \ btrfs_blocks_per_folio(fs_info, folio); \ \ - bitmap_test_range_all_set(subpage->bitmaps, \ + bitmap_test_range_all_set(bfs->bitmaps, \ blocks_per_folio * btrfs_bitmap_nr_##name, \ blocks_per_folio); \ })
#define subpage_test_bitmap_all_zero(fs_info, folio, name) \ ({ \ - struct btrfs_subpage *subpage = folio_get_private(folio); \ + struct btrfs_folio_state *bfs = folio_get_private(folio); \ const unsigned int blocks_per_folio = \ btrfs_blocks_per_folio(fs_info, folio); \ \ - bitmap_test_range_all_zero(subpage->bitmaps, \ + bitmap_test_range_all_zero(bfs->bitmaps, \ blocks_per_folio * btrfs_bitmap_nr_##name, \ blocks_per_folio); \ }) @@ -366,43 +361,43 @@ void btrfs_folio_end_lock_bitmap(const struct btrfs_fs_info *fs_info, void btrfs_subpage_set_uptodate(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) { - struct btrfs_subpage *subpage = folio_get_private(folio); + struct btrfs_folio_state *bfs = folio_get_private(folio); unsigned int start_bit = subpage_calc_start_bit(fs_info, folio, uptodate, start, len); unsigned long flags;
- spin_lock_irqsave(&subpage->lock, flags); - bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits); + spin_lock_irqsave(&bfs->lock, flags); + bitmap_set(bfs->bitmaps, start_bit, len >> fs_info->sectorsize_bits); if (subpage_test_bitmap_all_set(fs_info, folio, uptodate)) folio_mark_uptodate(folio); - spin_unlock_irqrestore(&subpage->lock, flags); + spin_unlock_irqrestore(&bfs->lock, flags); }
void btrfs_subpage_clear_uptodate(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) { - struct btrfs_subpage *subpage = folio_get_private(folio); + struct btrfs_folio_state *bfs = folio_get_private(folio); unsigned int start_bit = subpage_calc_start_bit(fs_info, folio, uptodate, start, len); unsigned long flags;
- spin_lock_irqsave(&subpage->lock, flags); - bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits); + spin_lock_irqsave(&bfs->lock, flags); + bitmap_clear(bfs->bitmaps, start_bit, len >> fs_info->sectorsize_bits); folio_clear_uptodate(folio); - spin_unlock_irqrestore(&subpage->lock, flags); + spin_unlock_irqrestore(&bfs->lock, flags); }
void btrfs_subpage_set_dirty(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) { - struct btrfs_subpage *subpage = folio_get_private(folio); + struct btrfs_folio_state *bfs = folio_get_private(folio); unsigned int start_bit = subpage_calc_start_bit(fs_info, folio, dirty, start, len); unsigned long flags;
- spin_lock_irqsave(&subpage->lock, flags); - bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits); - spin_unlock_irqrestore(&subpage->lock, flags); + spin_lock_irqsave(&bfs->lock, flags); + bitmap_set(bfs->bitmaps, start_bit, len >> fs_info->sectorsize_bits); + spin_unlock_irqrestore(&bfs->lock, flags); folio_mark_dirty(folio); }
@@ -419,17 +414,17 @@ void btrfs_subpage_set_dirty(const struct btrfs_fs_info *fs_info, bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) { - struct btrfs_subpage *subpage = folio_get_private(folio); + struct btrfs_folio_state *bfs = folio_get_private(folio); unsigned int start_bit = subpage_calc_start_bit(fs_info, folio, dirty, start, len); unsigned long flags; bool last = false;
- spin_lock_irqsave(&subpage->lock, flags); - bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits); + spin_lock_irqsave(&bfs->lock, flags); + bitmap_clear(bfs->bitmaps, start_bit, len >> fs_info->sectorsize_bits); if (subpage_test_bitmap_all_zero(fs_info, folio, dirty)) last = true; - spin_unlock_irqrestore(&subpage->lock, flags); + spin_unlock_irqrestore(&bfs->lock, flags); return last; }
@@ -446,91 +441,91 @@ void btrfs_subpage_clear_dirty(const struct btrfs_fs_info *fs_info, void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) { - struct btrfs_subpage *subpage = folio_get_private(folio); + struct btrfs_folio_state *bfs = folio_get_private(folio); unsigned int start_bit = subpage_calc_start_bit(fs_info, folio, writeback, start, len); unsigned long flags;
- spin_lock_irqsave(&subpage->lock, flags); - bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits); + spin_lock_irqsave(&bfs->lock, flags); + bitmap_set(bfs->bitmaps, start_bit, len >> fs_info->sectorsize_bits); if (!folio_test_writeback(folio)) folio_start_writeback(folio); - spin_unlock_irqrestore(&subpage->lock, flags); + spin_unlock_irqrestore(&bfs->lock, flags); }
void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) { - struct btrfs_subpage *subpage = folio_get_private(folio); + struct btrfs_folio_state *bfs = folio_get_private(folio); unsigned int start_bit = subpage_calc_start_bit(fs_info, folio, writeback, start, len); unsigned long flags;
- spin_lock_irqsave(&subpage->lock, flags); - bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits); + spin_lock_irqsave(&bfs->lock, flags); + bitmap_clear(bfs->bitmaps, start_bit, len >> fs_info->sectorsize_bits); if (subpage_test_bitmap_all_zero(fs_info, folio, writeback)) { ASSERT(folio_test_writeback(folio)); folio_end_writeback(folio); } - spin_unlock_irqrestore(&subpage->lock, flags); + spin_unlock_irqrestore(&bfs->lock, flags); }
void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) { - struct btrfs_subpage *subpage = folio_get_private(folio); + struct btrfs_folio_state *bfs = folio_get_private(folio); unsigned int start_bit = subpage_calc_start_bit(fs_info, folio, ordered, start, len); unsigned long flags;
- spin_lock_irqsave(&subpage->lock, flags); - bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits); + spin_lock_irqsave(&bfs->lock, flags); + bitmap_set(bfs->bitmaps, start_bit, len >> fs_info->sectorsize_bits); folio_set_ordered(folio); - spin_unlock_irqrestore(&subpage->lock, flags); + spin_unlock_irqrestore(&bfs->lock, flags); }
void btrfs_subpage_clear_ordered(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) { - struct btrfs_subpage *subpage = folio_get_private(folio); + struct btrfs_folio_state *bfs = folio_get_private(folio); unsigned int start_bit = subpage_calc_start_bit(fs_info, folio, ordered, start, len); unsigned long flags;
- spin_lock_irqsave(&subpage->lock, flags); - bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits); + spin_lock_irqsave(&bfs->lock, flags); + bitmap_clear(bfs->bitmaps, start_bit, len >> fs_info->sectorsize_bits); if (subpage_test_bitmap_all_zero(fs_info, folio, ordered)) folio_clear_ordered(folio); - spin_unlock_irqrestore(&subpage->lock, flags); + spin_unlock_irqrestore(&bfs->lock, flags); }
void btrfs_subpage_set_checked(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) { - struct btrfs_subpage *subpage = folio_get_private(folio); + struct btrfs_folio_state *bfs = folio_get_private(folio); unsigned int start_bit = subpage_calc_start_bit(fs_info, folio, checked, start, len); unsigned long flags;
- spin_lock_irqsave(&subpage->lock, flags); - bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits); + spin_lock_irqsave(&bfs->lock, flags); + bitmap_set(bfs->bitmaps, start_bit, len >> fs_info->sectorsize_bits); if (subpage_test_bitmap_all_set(fs_info, folio, checked)) folio_set_checked(folio); - spin_unlock_irqrestore(&subpage->lock, flags); + spin_unlock_irqrestore(&bfs->lock, flags); }
void btrfs_subpage_clear_checked(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) { - struct btrfs_subpage *subpage = folio_get_private(folio); + struct btrfs_folio_state *bfs = folio_get_private(folio); unsigned int start_bit = subpage_calc_start_bit(fs_info, folio, checked, start, len); unsigned long flags;
- spin_lock_irqsave(&subpage->lock, flags); - bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits); + spin_lock_irqsave(&bfs->lock, flags); + bitmap_clear(bfs->bitmaps, start_bit, len >> fs_info->sectorsize_bits); folio_clear_checked(folio); - spin_unlock_irqrestore(&subpage->lock, flags); + spin_unlock_irqrestore(&bfs->lock, flags); }
/* @@ -541,16 +536,16 @@ void btrfs_subpage_clear_checked(const struct btrfs_fs_info *fs_info, bool btrfs_subpage_test_##name(const struct btrfs_fs_info *fs_info, \ struct folio *folio, u64 start, u32 len) \ { \ - struct btrfs_subpage *subpage = folio_get_private(folio); \ + struct btrfs_folio_state *bfs = folio_get_private(folio); \ unsigned int start_bit = subpage_calc_start_bit(fs_info, folio, \ name, start, len); \ unsigned long flags; \ bool ret; \ \ - spin_lock_irqsave(&subpage->lock, flags); \ - ret = bitmap_test_range_all_set(subpage->bitmaps, start_bit, \ + spin_lock_irqsave(&bfs->lock, flags); \ + ret = bitmap_test_range_all_set(bfs->bitmaps, start_bit, \ len >> fs_info->sectorsize_bits); \ - spin_unlock_irqrestore(&subpage->lock, flags); \ + spin_unlock_irqrestore(&bfs->lock, flags); \ return ret; \ } IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(uptodate); @@ -662,10 +657,10 @@ IMPLEMENT_BTRFS_PAGE_OPS(checked, folio_set_checked, folio_clear_checked, { \ const unsigned int blocks_per_folio = \ btrfs_blocks_per_folio(fs_info, folio); \ - const struct btrfs_subpage *subpage = folio_get_private(folio); \ + const struct btrfs_folio_state *bfs = folio_get_private(folio); \ \ ASSERT(blocks_per_folio <= BITS_PER_LONG); \ - *dst = bitmap_read(subpage->bitmaps, \ + *dst = bitmap_read(bfs->bitmaps, \ blocks_per_folio * btrfs_bitmap_nr_##name, \ blocks_per_folio); \ } @@ -690,7 +685,7 @@ IMPLEMENT_BTRFS_PAGE_OPS(checked, folio_set_checked, folio_clear_checked, void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) { - struct btrfs_subpage *subpage; + struct btrfs_folio_state *bfs; unsigned int start_bit; unsigned int nbits; unsigned long flags; @@ -705,15 +700,15 @@ void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info,
start_bit = subpage_calc_start_bit(fs_info, folio, dirty, start, len); nbits = len >> fs_info->sectorsize_bits; - subpage = folio_get_private(folio); - ASSERT(subpage); - spin_lock_irqsave(&subpage->lock, flags); - if (unlikely(!bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits))) { + bfs = folio_get_private(folio); + ASSERT(bfs); + spin_lock_irqsave(&bfs->lock, flags); + if (unlikely(!bitmap_test_range_all_zero(bfs->bitmaps, start_bit, nbits))) { SUBPAGE_DUMP_BITMAP(fs_info, folio, dirty, start, len); - ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits)); + ASSERT(bitmap_test_range_all_zero(bfs->bitmaps, start_bit, nbits)); } - ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits)); - spin_unlock_irqrestore(&subpage->lock, flags); + ASSERT(bitmap_test_range_all_zero(bfs->bitmaps, start_bit, nbits)); + spin_unlock_irqrestore(&bfs->lock, flags); }
/* @@ -726,7 +721,7 @@ void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info, void btrfs_folio_set_lock(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) { - struct btrfs_subpage *subpage; + struct btrfs_folio_state *bfs; unsigned long flags; unsigned int start_bit; unsigned int nbits; @@ -736,19 +731,19 @@ void btrfs_folio_set_lock(const struct btrfs_fs_info *fs_info, if (unlikely(!fs_info) || !btrfs_is_subpage(fs_info, folio)) return;
- subpage = folio_get_private(folio); + bfs = folio_get_private(folio); start_bit = subpage_calc_start_bit(fs_info, folio, locked, start, len); nbits = len >> fs_info->sectorsize_bits; - spin_lock_irqsave(&subpage->lock, flags); + spin_lock_irqsave(&bfs->lock, flags); /* Target range should not yet be locked. */ - if (unlikely(!bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits))) { + if (unlikely(!bitmap_test_range_all_zero(bfs->bitmaps, start_bit, nbits))) { SUBPAGE_DUMP_BITMAP(fs_info, folio, locked, start, len); - ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits)); + ASSERT(bitmap_test_range_all_zero(bfs->bitmaps, start_bit, nbits)); } - bitmap_set(subpage->bitmaps, start_bit, nbits); - ret = atomic_add_return(nbits, &subpage->nr_locked); + bitmap_set(bfs->bitmaps, start_bit, nbits); + ret = atomic_add_return(nbits, &bfs->nr_locked); ASSERT(ret <= btrfs_blocks_per_folio(fs_info, folio)); - spin_unlock_irqrestore(&subpage->lock, flags); + spin_unlock_irqrestore(&bfs->lock, flags); }
/* @@ -776,7 +771,7 @@ bool btrfs_meta_folio_clear_and_test_dirty(struct folio *folio, const struct ext void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info, struct folio *folio, u64 start, u32 len) { - struct btrfs_subpage *subpage; + struct btrfs_folio_state *bfs; const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio); unsigned long uptodate_bitmap; unsigned long dirty_bitmap; @@ -788,18 +783,18 @@ void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
ASSERT(folio_test_private(folio) && folio_get_private(folio)); ASSERT(blocks_per_folio > 1); - subpage = folio_get_private(folio); + bfs = folio_get_private(folio);
- spin_lock_irqsave(&subpage->lock, flags); + spin_lock_irqsave(&bfs->lock, flags); GET_SUBPAGE_BITMAP(fs_info, folio, uptodate, &uptodate_bitmap); GET_SUBPAGE_BITMAP(fs_info, folio, dirty, &dirty_bitmap); GET_SUBPAGE_BITMAP(fs_info, folio, writeback, &writeback_bitmap); GET_SUBPAGE_BITMAP(fs_info, folio, ordered, &ordered_bitmap); GET_SUBPAGE_BITMAP(fs_info, folio, checked, &checked_bitmap); GET_SUBPAGE_BITMAP(fs_info, folio, locked, &locked_bitmap); - spin_unlock_irqrestore(&subpage->lock, flags); + spin_unlock_irqrestore(&bfs->lock, flags);
- dump_page(folio_page(folio, 0), "btrfs subpage dump"); + dump_page(folio_page(folio, 0), "btrfs folio state dump"); btrfs_warn(fs_info, "start=%llu len=%u page=%llu, bitmaps uptodate=%*pbl dirty=%*pbl locked=%*pbl writeback=%*pbl ordered=%*pbl checked=%*pbl", start, len, folio_pos(folio), @@ -815,14 +810,14 @@ void btrfs_get_subpage_dirty_bitmap(struct btrfs_fs_info *fs_info, struct folio *folio, unsigned long *ret_bitmap) { - struct btrfs_subpage *subpage; + struct btrfs_folio_state *bfs; unsigned long flags;
ASSERT(folio_test_private(folio) && folio_get_private(folio)); ASSERT(btrfs_blocks_per_folio(fs_info, folio) > 1); - subpage = folio_get_private(folio); + bfs = folio_get_private(folio);
- spin_lock_irqsave(&subpage->lock, flags); + spin_lock_irqsave(&bfs->lock, flags); GET_SUBPAGE_BITMAP(fs_info, folio, dirty, ret_bitmap); - spin_unlock_irqrestore(&subpage->lock, flags); + spin_unlock_irqrestore(&bfs->lock, flags); } diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h index 52546e0e97ce..b6e40a678d73 100644 --- a/fs/btrfs/subpage.h +++ b/fs/btrfs/subpage.h @@ -32,7 +32,15 @@ struct folio; enum { btrfs_bitmap_nr_uptodate = 0, btrfs_bitmap_nr_dirty, + + /* + * This can be changed to atomic eventually. But this change will rely + * on the async delalloc range rework for locked bitmap. As async + * delalloc can unlock its range and mark blocks writeback at random + * timing. + */ btrfs_bitmap_nr_writeback, + /* * The ordered and checked flags are for COW fixup, already marked * deprecated, and will be removed eventually. @@ -57,7 +65,7 @@ enum { * Structure to trace status of each sector inside a page, attached to * page::private for both data and metadata inodes. */ -struct btrfs_subpage { +struct btrfs_folio_state { /* Common members for both data and metadata pages */ spinlock_t lock; union { @@ -65,7 +73,7 @@ struct btrfs_subpage { * Structures only used by metadata * * @eb_refs should only be operated under private_lock, as it - * manages whether the subpage can be detached. + * manages whether the btrfs_folio_state can be detached. */ atomic_t eb_refs;
@@ -79,7 +87,7 @@ struct btrfs_subpage { unsigned long bitmaps[]; };
-enum btrfs_subpage_type { +enum btrfs_folio_type { BTRFS_SUBPAGE_METADATA, BTRFS_SUBPAGE_DATA, }; @@ -119,15 +127,18 @@ static inline bool btrfs_is_subpage(const struct btrfs_fs_info *fs_info, } #endif
-int btrfs_attach_subpage(const struct btrfs_fs_info *fs_info, - struct folio *folio, enum btrfs_subpage_type type); -void btrfs_detach_subpage(const struct btrfs_fs_info *fs_info, struct folio *folio, - enum btrfs_subpage_type type); +int btrfs_attach_folio_state(const struct btrfs_fs_info *fs_info, + struct folio *folio, enum btrfs_folio_type type); +void btrfs_detach_folio_state(const struct btrfs_fs_info *fs_info, struct folio *folio, + enum btrfs_folio_type type);
/* Allocate additional data where page represents more than one sector */ -struct btrfs_subpage *btrfs_alloc_subpage(const struct btrfs_fs_info *fs_info, - size_t fsize, enum btrfs_subpage_type type); -void btrfs_free_subpage(struct btrfs_subpage *subpage); +struct btrfs_folio_state *btrfs_alloc_folio_state(const struct btrfs_fs_info *fs_info, + size_t fsize, enum btrfs_folio_type type); +static inline void btrfs_free_folio_state(struct btrfs_folio_state *bfs) +{ + kfree(bfs); +}
void btrfs_folio_inc_eb_refs(const struct btrfs_fs_info *fs_info, struct folio *folio); void btrfs_folio_dec_eb_refs(const struct btrfs_fs_info *fs_info, struct folio *folio);
From: Naohiro Aota <naohiro.aota@wdc.com>
[ Upstream commit b1511360c8ac882b0c52caa263620538e8d73220 ]
btrfs_subpage_set_writeback() calls folio_start_writeback() the first time a folio is written back, and it also clears the PAGECACHE_TAG_TOWRITE tag even if there are still dirty blocks in the folio. This can break ordering guarantees, such as those required by btrfs_wait_ordered_extents().
That ordering breakage leads to a real failure. For example, running generic/464 on a zoned setup will hit the following ASSERT. This happens because the broken ordering fails to flush existing dirty pages before the file size is truncated.
  assertion failed: !list_empty(&ordered->list) :: 0, in fs/btrfs/zoned.c:1899
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/zoned.c:1899!
  Oops: invalid opcode: 0000 [#1] SMP NOPTI
  CPU: 2 UID: 0 PID: 1906169 Comm: kworker/u130:2 Kdump: loaded Not tainted 6.16.0-rc6-BTRFS-ZNS+ #554 PREEMPT(voluntary)
  Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
  Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
  RIP: 0010:btrfs_finish_ordered_zoned.cold+0x50/0x52 [btrfs]
  RSP: 0018:ffffc9002efdbd60 EFLAGS: 00010246
  RAX: 000000000000004c RBX: ffff88811923c4e0 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffffffff827e38b1 RDI: 00000000ffffffff
  RBP: ffff88810005d000 R08: 00000000ffffdfff R09: ffffffff831051c8
  R10: ffffffff83055220 R11: 0000000000000000 R12: ffff8881c2458c00
  R13: ffff88811923c540 R14: ffff88811923c5e8 R15: ffff8881c1bd9680
  FS:  0000000000000000(0000) GS:ffff88a04acd0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f907c7a918c CR3: 0000000004024000 CR4: 0000000000350ef0
  Call Trace:
   <TASK>
   ? srso_return_thunk+0x5/0x5f
   btrfs_finish_ordered_io+0x4a/0x60 [btrfs]
   btrfs_work_helper+0xf9/0x490 [btrfs]
   process_one_work+0x204/0x590
   ? srso_return_thunk+0x5/0x5f
   worker_thread+0x1d6/0x3d0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x118/0x230
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x205/0x260
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
Consider process A calling writepages() with WB_SYNC_NONE. In zoned mode or for compressed writes, it locks several folios for delalloc and starts writing them out. Let's call the last locked folio folio X. Suppose the write range only partially covers folio X, leaving some pages dirty. Process A calls btrfs_subpage_set_writeback() when building a bio. This call clears the TOWRITE tag of folio X, whose size is 8K with a block size of 4K. The folio is in the following state:
  0     4K    8K
  |/////|/////|      (flag: DIRTY, tag: DIRTY)
  <----->             Process A will write this range.
Now suppose process B concurrently calls writepages() with WB_SYNC_ALL. It calls tag_pages_for_writeback() to tag dirty folios with PAGECACHE_TAG_TOWRITE. Since folio X is still dirty, it gets tagged. Then, B collects tagged folios using filemap_get_folios_tag() and must wait for folio X to be written before returning from writepages().
  0     4K    8K
  |/////|/////|      (flag: DIRTY, tag: DIRTY|TOWRITE)
However, between tagging and collecting, process A may call btrfs_subpage_set_writeback() and clear folio X's TOWRITE tag.

  0     4K    8K
  |     |/////|      (flag: DIRTY|WRITEBACK, tag: DIRTY)
As a result, process B won't see folio X in its batch, and returns without waiting for it. This breaks the WB_SYNC_ALL ordering requirement.
Fix this by calling __folio_start_writeback() with keep_write set to true, which retains the TOWRITE tag, and clear the tag manually via an xas operation only once the folio has no more dirty blocks.
Fixes: 3470da3b7d87 ("btrfs: subpage: introduce helpers for writeback status")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/subpage.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 9b63a4d1c989..2951fdc5db4e 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -448,8 +448,25 @@ void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info,
 	spin_lock_irqsave(&bfs->lock, flags);
 	bitmap_set(bfs->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
+
+	/*
+	 * Don't clear the TOWRITE tag when starting writeback on a still-dirty
+	 * folio. Doing so can cause WB_SYNC_ALL writepages() to overlook it,
+	 * assume writeback is complete, and exit too early, violating sync
+	 * ordering guarantees.
+	 */
 	if (!folio_test_writeback(folio))
-		folio_start_writeback(folio);
+		__folio_start_writeback(folio, true);
+	if (!folio_test_dirty(folio)) {
+		struct address_space *mapping = folio_mapping(folio);
+		XA_STATE(xas, &mapping->i_pages, folio->index);
+		unsigned long flags;
+
+		xas_lock_irqsave(&xas, flags);
+		xas_load(&xas);
+		xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
+		xas_unlock_irqrestore(&xas, flags);
+	}
 	spin_unlock_irqrestore(&bfs->lock, flags);
 }