The patch below does not apply to the 6.16-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id, to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.16.y
git checkout FETCH_HEAD
git cherry-pick -x ad580dfa388fabb52af033e3f8cc5d04be985e54
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2025081814-monsoon-supermom-44bb@gregkh' --subject-prefix 'PATCH 6.16.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ad580dfa388fabb52af033e3f8cc5d04be985e54 Mon Sep 17 00:00:00 2001
From: Leo Martins <loemra.dev@gmail.com>
Date: Mon, 21 Jul 2025 10:49:16 -0700
Subject: [PATCH] btrfs: fix subpage deadlock in try_release_subpage_extent_buffer()
There is a potential deadlock that can happen in try_release_subpage_extent_buffer() because the irq-safe xarray spin lock fs_info->buffer_tree is being acquired before the irq-unsafe eb->refs_lock.
This leads to the potential race:

// T1 (random eb->refs user)                  // T2 (release folio)

spin_lock(&eb->refs_lock);
                                              btree_release_folio()
                                                folio_test_writeback() //false
                                                try_release_extent_buffer()
                                                  try_release_subpage_extent_buffer()
                                                    xa_lock_irq(&fs_info->buffer_tree)
                                                    spin_lock(&eb->refs_lock); // blocked; held by T1
// interrupt
end_bbio_meta_write()
  btrfs_meta_folio_clear_writeback()
    buffer_tree_clear_mark()
      xas_lock_irqsave() // blocked; held by T2
I believe that the spin lock can safely be replaced by an rcu_read_lock. The xa_for_each loop does not need the spin lock as it's already internally protected by the rcu_read_lock. The extent buffer is also protected by the rcu_read_lock so it won't be freed before we take the eb->refs_lock and check the ref count.
The rcu_read_lock is taken and released every iteration, just like the spin lock, which means we're not protected against concurrent insertions into the xarray. This is fine because we rely on folio->private to detect if there are any ebs remaining in the folio.
There is already some precedent for this with find_extent_buffer_nolock, which loads an extent buffer from the xarray with only rcu_read_lock.
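Condensed from the diff below (a sketch only; the hunks that follow are
authoritative), the resulting loop shape is:

    rcu_read_lock();
    xa_for_each_range(&fs_info->buffer_tree, index, eb, start, end) {
            spin_lock(&eb->refs_lock);      /* pins the eb */
            rcu_read_unlock();              /* eb can't vanish while refs_lock is held */

            if (refcount_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
                    spin_unlock(&eb->refs_lock);
                    rcu_read_lock();        /* re-enter RCU for the next iteration */
                    continue;
            }
            /* ... release_extent_buffer(eb) drops refs_lock ... */
            rcu_read_lock();
    }
    rcu_read_unlock();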
lockdep warning:
=====================================================
WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
6.16.0-0_fbk701_debug_rc0_123_g4c06e63b9203 #1 Tainted: G E N
-----------------------------------------------------
kswapd0/66 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
ffff000011ffd600 (&eb->refs_lock){+.+.}-{3:3}, at: try_release_extent_buffer+0x18c/0x560

and this task is already holding:
ffff0000c1d91b88 (&buffer_xa_class){-.-.}-{3:3}, at: try_release_extent_buffer+0x13c/0x560
which would create a new lock dependency:
 (&buffer_xa_class){-.-.}-{3:3} -> (&eb->refs_lock){+.+.}-{3:3}

but this new dependency connects a HARDIRQ-irq-safe lock:
 (&buffer_xa_class){-.-.}-{3:3}

... which became HARDIRQ-irq-safe at:
  lock_acquire+0x178/0x358
  _raw_spin_lock_irqsave+0x60/0x88
  buffer_tree_clear_mark+0xc4/0x160
  end_bbio_meta_write+0x238/0x398
  btrfs_bio_end_io+0x1f8/0x330
  btrfs_orig_write_end_io+0x1c4/0x2c0
  bio_endio+0x63c/0x678
  blk_update_request+0x1c4/0xa00
  blk_mq_end_request+0x54/0x88
  virtblk_request_done+0x124/0x1d0
  blk_mq_complete_request+0x84/0xa0
  virtblk_done+0x130/0x238
  vring_interrupt+0x130/0x288
  __handle_irq_event_percpu+0x1e8/0x708
  handle_irq_event+0x98/0x1b0
  handle_fasteoi_irq+0x264/0x7c0
  generic_handle_domain_irq+0xa4/0x108
  gic_handle_irq+0x7c/0x1a0
  do_interrupt_handler+0xe4/0x148
  el1_interrupt+0x30/0x50
  el1h_64_irq_handler+0x14/0x20
  el1h_64_irq+0x6c/0x70
  _raw_spin_unlock_irq+0x38/0x70
  __run_timer_base+0xdc/0x5e0
  run_timer_softirq+0xa0/0x138
  handle_softirqs.llvm.13542289750107964195+0x32c/0xbd0
  ____do_softirq.llvm.17674514681856217165+0x18/0x28
  call_on_irq_stack+0x24/0x30
  __irq_exit_rcu+0x164/0x430
  irq_exit_rcu+0x18/0x88
  el1_interrupt+0x34/0x50
  el1h_64_irq_handler+0x14/0x20
  el1h_64_irq+0x6c/0x70
  arch_local_irq_enable+0x4/0x8
  do_idle+0x1a0/0x3b8
  cpu_startup_entry+0x60/0x80
  rest_init+0x204/0x228
  start_kernel+0x394/0x3f0
  __primary_switched+0x8c/0x8958

to a HARDIRQ-irq-unsafe lock:
 (&eb->refs_lock){+.+.}-{3:3}

... which became HARDIRQ-irq-unsafe at:
...
  lock_acquire+0x178/0x358
  _raw_spin_lock+0x4c/0x68
  free_extent_buffer_stale+0x2c/0x170
  btrfs_read_sys_array+0x1b0/0x338
  open_ctree+0xeb0/0x1df8
  btrfs_get_tree+0xb60/0x1110
  vfs_get_tree+0x8c/0x250
  fc_mount+0x20/0x98
  btrfs_get_tree+0x4a4/0x1110
  vfs_get_tree+0x8c/0x250
  do_new_mount+0x1e0/0x6c0
  path_mount+0x4ec/0xa58
  __arm64_sys_mount+0x370/0x490
  invoke_syscall+0x6c/0x208
  el0_svc_common+0x14c/0x1b8
  do_el0_svc+0x4c/0x60
  el0_svc+0x4c/0x160
  el0t_64_sync_handler+0x70/0x100
  el0t_64_sync+0x168/0x170

other info that might help us debug this:
 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&eb->refs_lock);
                               local_irq_disable();
                               lock(&buffer_xa_class);
                               lock(&eb->refs_lock);
  <Interrupt>
    lock(&buffer_xa_class);

 *** DEADLOCK ***

2 locks held by kswapd0/66:
 #0: ffff800085506e40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xe8/0xe50
 #1: ffff0000c1d91b88 (&buffer_xa_class){-.-.}-{3:3}, at: try_release_extent_buffer+0x13c/0x560
Link: https://www.kernel.org/doc/Documentation/locking/lockdep-design.rst#:~:text=...
Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray")
CC: stable@vger.kernel.org # 6.16+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 835b0deef9bb..f23d75986947 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4331,15 +4331,18 @@ static int try_release_subpage_extent_buffer(struct folio *folio)
     unsigned long end = index + (PAGE_SIZE >> fs_info->nodesize_bits) - 1;
     int ret;
 
-    xa_lock_irq(&fs_info->buffer_tree);
+    rcu_read_lock();
     xa_for_each_range(&fs_info->buffer_tree, index, eb, start, end) {
         /*
          * The same as try_release_extent_buffer(), to ensure the eb
          * won't disappear out from under us.
          */
         spin_lock(&eb->refs_lock);
+        rcu_read_unlock();
+
         if (refcount_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
             spin_unlock(&eb->refs_lock);
+            rcu_read_lock();
             continue;
         }
 
@@ -4358,11 +4361,10 @@ static int try_release_subpage_extent_buffer(struct folio *folio)
          * check the folio private at the end. And
          * release_extent_buffer() will release the refs_lock.
          */
-        xa_unlock_irq(&fs_info->buffer_tree);
         release_extent_buffer(eb);
-        xa_lock_irq(&fs_info->buffer_tree);
+        rcu_read_lock();
     }
-    xa_unlock_irq(&fs_info->buffer_tree);
+    rcu_read_unlock();
 
     /*
      * Finally to check if we have cleared folio private, as if we have
@@ -4375,7 +4377,6 @@ static int try_release_subpage_extent_buffer(struct folio *folio)
         ret = 0;
     spin_unlock(&folio->mapping->i_private_lock);
     return ret;
-
 }
 
 int try_release_extent_buffer(struct folio *folio)
From: Filipe Manana <fdmanana@suse.com>
[ Upstream commit 71c086b30d4373a01bd5627f54516a72891a026a ]
It's hard to read the logic to break out of the while loop, since it's a very long expression consisting of a logical or of two composite expressions, each composed of a logical and. Further, each one also tests the EXTENT_BUFFER_UNMAPPED bit, making it more verbose than necessary.
So change from this:
    if ((!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) && refs <= 3)
        || (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) &&
            refs == 1))
        break;
To this:
    if (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags)) {
        if (refs == 1)
            break;
    } else if (refs <= 3) {
        break;
    }
At least on x86_64 using gcc 9.3.0, this doesn't change the object size.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: ad580dfa388f ("btrfs: fix subpage deadlock in try_release_subpage_extent_buffer()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/extent_io.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1dc931c4937f..8590f8a4a139 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3486,10 +3486,13 @@ void free_extent_buffer(struct extent_buffer *eb)
 
     refs = atomic_read(&eb->refs);
     while (1) {
-        if ((!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) && refs <= 3)
-            || (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) &&
-                refs == 1))
+        if (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags)) {
+            if (refs == 1)
+                break;
+        } else if (refs <= 3) {
             break;
+        }
+
         if (atomic_try_cmpxchg(&eb->refs, &refs, refs - 1))
             return;
     }
From: Filipe Manana <fdmanana@suse.com>
[ Upstream commit 2697b6159744e5afae0f7715da9f830ba6f9e45a ]
There's this special atomic compare and exchange logic which serves to avoid locking the extent buffer's refs_lock spinlock, and therefore reduces lock contention, so add a comment to make it more obvious.
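As a rough user-space analogue of that fast path (an illustration only,
using C11 atomics in place of the kernel's atomic_try_cmpxchg();
put_ref_fast() is a hypothetical name, not kernel code):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Drop one reference without taking a lock, unless doing so
     * could reach the final reference. */
    static bool put_ref_fast(atomic_int *refs)
    {
        int old = atomic_load(refs);

        while (old > 1) {
            /* On failure, 'old' is refreshed with the current value,
             * just as atomic_try_cmpxchg() refreshes its argument. */
            if (atomic_compare_exchange_weak(refs, &old, old - 1))
                return true;    /* fast path, no spinlock taken */
        }
        return false;   /* final ref: caller must lock and re-check */
    }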
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: ad580dfa388f ("btrfs: fix subpage deadlock in try_release_subpage_extent_buffer()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/extent_io.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8590f8a4a139..9cf37693d609 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3493,6 +3493,7 @@ void free_extent_buffer(struct extent_buffer *eb)
             break;
         }
 
+        /* Optimization to avoid locking eb->refs_lock. */
         if (atomic_try_cmpxchg(&eb->refs, &refs, refs - 1))
             return;
     }
From: Filipe Manana <fdmanana@suse.com>
[ Upstream commit b769777d927af168b1389388392bfd7dc4e38399 ]
Instead of using a bare atomic, use the refcount_t type, which, despite being a structure that contains only an atomic, has an API that checks for underflows and other hazards. This doesn't change the size of the extent_buffer structure.
This removes the need to do things like this:
    WARN_ON(atomic_read(&eb->refs) == 0);
    if (atomic_dec_and_test(&eb->refs)) {
        (...)
    }
And do just:
    if (refcount_dec_and_test(&eb->refs)) {
        (...)
    }
Since refcount_dec_and_test() already triggers a warning when we decrement a ref count that has a value of 0 (or below zero).
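For reference, the conversion is mechanical; the mapping below is an
editor's summary distilled from the hunks that follow, not part of the
original message:

    /* atomic_t                          refcount_t
     * ----------------------------------------------------------------
     * atomic_set(&eb->refs, 1)          refcount_set(&eb->refs, 1)
     * atomic_read(&eb->refs)            refcount_read(&eb->refs)
     * atomic_inc(&eb->refs)             refcount_inc(&eb->refs)
     * atomic_inc_not_zero(&eb->refs)    refcount_inc_not_zero(&eb->refs)
     * atomic_dec(&eb->refs)             refcount_dec(&eb->refs)
     * atomic_dec_and_test(&eb->refs)    refcount_dec_and_test(&eb->refs)
     *
     * The one exception is the try_cmpxchg fast path in
     * free_extent_buffer(), which reaches into the underlying atomic
     * via &eb->refs.refs, since refcount_t exposes no try_cmpxchg.
     */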
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: ad580dfa388f ("btrfs: fix subpage deadlock in try_release_subpage_extent_buffer()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/ctree.c             | 14 +++++------
 fs/btrfs/extent-tree.c       |  2 +-
 fs/btrfs/extent_io.c         | 45 ++++++++++++++++++------------------
 fs/btrfs/extent_io.h         |  2 +-
 fs/btrfs/fiemap.c            |  2 +-
 fs/btrfs/print-tree.c        |  2 +-
 fs/btrfs/qgroup.c            |  6 ++---
 fs/btrfs/relocation.c        |  4 ++--
 fs/btrfs/tree-log.c          |  4 ++--
 fs/btrfs/zoned.c             |  2 +-
 include/trace/events/btrfs.h |  2 +-
 11 files changed, 42 insertions(+), 43 deletions(-)
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 648531fe0900..d13be2cd34c3 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -198,7 +198,7 @@ struct extent_buffer *btrfs_root_node(struct btrfs_root *root)
          * the inc_not_zero dance and if it doesn't work then
          * synchronize_rcu and try again.
          */
-        if (atomic_inc_not_zero(&eb->refs)) {
+        if (refcount_inc_not_zero(&eb->refs)) {
             rcu_read_unlock();
             break;
         }
@@ -549,7 +549,7 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
             btrfs_abort_transaction(trans, ret);
             goto error_unlock_cow;
         }
-        atomic_inc(&cow->refs);
+        refcount_inc(&cow->refs);
         rcu_assign_pointer(root->node, cow);
 
         ret = btrfs_free_tree_block(trans, btrfs_root_id(root), buf,
@@ -1081,7 +1081,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
     /* update the path */
     if (left) {
         if (btrfs_header_nritems(left) > orig_slot) {
-            atomic_inc(&left->refs);
+            refcount_inc(&left->refs);
             /* left was locked after cow */
             path->nodes[level] = left;
             path->slots[level + 1] -= 1;
@@ -1685,7 +1685,7 @@ static struct extent_buffer *btrfs_search_slot_get_root(struct btrfs_root *root,
 
     if (p->search_commit_root) {
         b = root->commit_root;
-        atomic_inc(&b->refs);
+        refcount_inc(&b->refs);
         level = btrfs_header_level(b);
         /*
          * Ensure that all callers have set skip_locking when
@@ -2885,7 +2885,7 @@ static noinline int insert_new_root(struct btrfs_trans_handle *trans,
     free_extent_buffer(old);
 
     add_root_to_dirty_list(root);
-    atomic_inc(&c->refs);
+    refcount_inc(&c->refs);
     path->nodes[level] = c;
     path->locks[level] = BTRFS_WRITE_LOCK;
     path->slots[level] = 0;
@@ -4442,7 +4442,7 @@ static noinline int btrfs_del_leaf(struct btrfs_trans_handle *trans,
 
     root_sub_used_bytes(root);
 
-    atomic_inc(&leaf->refs);
+    refcount_inc(&leaf->refs);
     ret = btrfs_free_tree_block(trans, btrfs_root_id(root), leaf, 0, 1);
     free_extent_buffer_stale(leaf);
     if (ret < 0)
@@ -4527,7 +4527,7 @@ int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
              * for possible call to btrfs_del_ptr below
              */
             slot = path->slots[1];
-            atomic_inc(&leaf->refs);
+            refcount_inc(&leaf->refs);
             /*
              * We want to be able to at least push one item to the
              * left neighbour leaf, and that's the first item.
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cb6128778a83..0e8b5aaa78cb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6341,7 +6341,7 @@ int btrfs_drop_subtree(struct btrfs_trans_handle *trans,
 
     btrfs_assert_tree_write_locked(parent);
     parent_level = btrfs_header_level(parent);
-    atomic_inc(&parent->refs);
+    refcount_inc(&parent->refs);
     path->nodes[parent_level] = parent;
     path->slots[parent_level] = btrfs_header_nritems(parent);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9cf37693d609..8549fb4f6dfd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -77,7 +77,7 @@ void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info)
                        struct extent_buffer, leak_list);
         pr_err(
     "BTRFS: buffer leak start %llu len %u refs %d bflags %lu owner %llu\n",
-               eb->start, eb->len, atomic_read(&eb->refs), eb->bflags,
+               eb->start, eb->len, refcount_read(&eb->refs), eb->bflags,
                btrfs_header_owner(eb));
         list_del(&eb->leak_list);
         WARN_ON_ONCE(1);
@@ -1961,7 +1961,7 @@ static inline struct extent_buffer *find_get_eb(struct xa_state *xas, unsigned l
     if (!eb)
         return NULL;
 
-    if (!atomic_inc_not_zero(&eb->refs)) {
+    if (!refcount_inc_not_zero(&eb->refs)) {
         xas_reset(xas);
         goto retry;
     }
@@ -2012,7 +2012,7 @@ static struct extent_buffer *find_extent_buffer_nolock(
 
     rcu_read_lock();
     eb = xa_load(&fs_info->buffer_tree, index);
-    if (eb && !atomic_inc_not_zero(&eb->refs))
+    if (eb && !refcount_inc_not_zero(&eb->refs))
         eb = NULL;
     rcu_read_unlock();
     return eb;
@@ -2842,7 +2842,7 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
     btrfs_leak_debug_add_eb(eb);
 
     spin_lock_init(&eb->refs_lock);
-    atomic_set(&eb->refs, 1);
+    refcount_set(&eb->refs, 1);
 
     ASSERT(eb->len <= BTRFS_MAX_METADATA_BLOCKSIZE);
 
@@ -2975,13 +2975,13 @@ static void check_buffer_tree_ref(struct extent_buffer *eb)
      * once io is initiated, TREE_REF can no longer be cleared, so that is
      * the moment at which any such race is best fixed.
      */
-    refs = atomic_read(&eb->refs);
+    refs = refcount_read(&eb->refs);
     if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
         return;
 
     spin_lock(&eb->refs_lock);
     if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
-        atomic_inc(&eb->refs);
+        refcount_inc(&eb->refs);
     spin_unlock(&eb->refs_lock);
 }
 
@@ -3047,7 +3047,7 @@ struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
         return ERR_PTR(ret);
     }
     if (exists) {
-        if (!atomic_inc_not_zero(&exists->refs)) {
+        if (!refcount_inc_not_zero(&exists->refs)) {
             /* The extent buffer is being freed, retry. */
             xa_unlock_irq(&fs_info->buffer_tree);
             goto again;
@@ -3092,7 +3092,7 @@ static struct extent_buffer *grab_extent_buffer(struct btrfs_fs_info *fs_info,
      * just overwrite folio private.
      */
     exists = folio_get_private(folio);
-    if (atomic_inc_not_zero(&exists->refs))
+    if (refcount_inc_not_zero(&exists->refs))
         return exists;
 
     WARN_ON(folio_test_dirty(folio));
@@ -3362,7 +3362,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
         goto out;
     }
     if (existing_eb) {
-        if (!atomic_inc_not_zero(&existing_eb->refs)) {
+        if (!refcount_inc_not_zero(&existing_eb->refs)) {
             xa_unlock_irq(&fs_info->buffer_tree);
             goto again;
         }
@@ -3391,7 +3391,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
     return eb;
 
 out:
-    WARN_ON(!atomic_dec_and_test(&eb->refs));
+    WARN_ON(!refcount_dec_and_test(&eb->refs));
 
     /*
      * Any attached folios need to be detached before we unlock them. This
@@ -3437,8 +3437,7 @@ static int release_extent_buffer(struct extent_buffer *eb)
 {
     lockdep_assert_held(&eb->refs_lock);
 
-    WARN_ON(atomic_read(&eb->refs) == 0);
-    if (atomic_dec_and_test(&eb->refs)) {
+    if (refcount_dec_and_test(&eb->refs)) {
         struct btrfs_fs_info *fs_info = eb->fs_info;
 
         spin_unlock(&eb->refs_lock);
@@ -3484,7 +3483,7 @@ void free_extent_buffer(struct extent_buffer *eb)
     if (!eb)
         return;
 
-    refs = atomic_read(&eb->refs);
+    refs = refcount_read(&eb->refs);
     while (1) {
         if (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags)) {
             if (refs == 1)
@@ -3494,16 +3493,16 @@ void free_extent_buffer(struct extent_buffer *eb)
         }
 
         /* Optimization to avoid locking eb->refs_lock. */
-        if (atomic_try_cmpxchg(&eb->refs, &refs, refs - 1))
+        if (atomic_try_cmpxchg(&eb->refs.refs, &refs, refs - 1))
             return;
     }
     spin_lock(&eb->refs_lock);
-    if (atomic_read(&eb->refs) == 2 &&
+    if (refcount_read(&eb->refs) == 2 &&
         test_bit(EXTENT_BUFFER_STALE, &eb->bflags) &&
         !extent_buffer_under_io(eb) &&
         test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
-        atomic_dec(&eb->refs);
+        refcount_dec(&eb->refs);
 
     /*
      * I know this is terrible, but it's temporary until we stop tracking
@@ -3520,9 +3519,9 @@ void free_extent_buffer_stale(struct extent_buffer *eb)
     spin_lock(&eb->refs_lock);
     set_bit(EXTENT_BUFFER_STALE, &eb->bflags);
 
-    if (atomic_read(&eb->refs) == 2 && !extent_buffer_under_io(eb) &&
+    if (refcount_read(&eb->refs) == 2 && !extent_buffer_under_io(eb) &&
         test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
-        atomic_dec(&eb->refs);
+        refcount_dec(&eb->refs);
     release_extent_buffer(eb);
 }
 
@@ -3580,7 +3579,7 @@ void btrfs_clear_buffer_dirty(struct btrfs_trans_handle *trans,
         btree_clear_folio_dirty_tag(folio);
         folio_unlock(folio);
     }
-    WARN_ON(atomic_read(&eb->refs) == 0);
+    WARN_ON(refcount_read(&eb->refs) == 0);
 }
 
 void set_extent_buffer_dirty(struct extent_buffer *eb)
@@ -3591,7 +3590,7 @@ void set_extent_buffer_dirty(struct extent_buffer *eb)
 
     was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
 
-    WARN_ON(atomic_read(&eb->refs) == 0);
+    WARN_ON(refcount_read(&eb->refs) == 0);
     WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
     WARN_ON(test_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags));
 
@@ -3717,7 +3716,7 @@ int read_extent_buffer_pages_nowait(struct extent_buffer *eb, int mirror_num,
 
     eb->read_mirror = 0;
     check_buffer_tree_ref(eb);
-    atomic_inc(&eb->refs);
+    refcount_inc(&eb->refs);
 
     bbio = btrfs_bio_alloc(INLINE_EXTENT_BUFFER_PAGES,
                    REQ_OP_READ | REQ_META, eb->fs_info,
@@ -4312,7 +4311,7 @@ static int try_release_subpage_extent_buffer(struct folio *folio)
          * won't disappear out from under us.
          */
         spin_lock(&eb->refs_lock);
-        if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
+        if (refcount_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
             spin_unlock(&eb->refs_lock);
             continue;
         }
@@ -4378,7 +4377,7 @@ int try_release_extent_buffer(struct folio *folio)
      * this page.
      */
     spin_lock(&eb->refs_lock);
-    if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
+    if (refcount_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
         spin_unlock(&eb->refs_lock);
         spin_unlock(&folio->mapping->i_private_lock);
         return 0;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index e36e8d6a00bc..65bb87f1dce6 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -98,7 +98,7 @@ struct extent_buffer {
     void *addr;
     spinlock_t refs_lock;
-    atomic_t refs;
+    refcount_t refs;
     int read_mirror;
     /* >= 0 if eb belongs to a log tree, -1 otherwise */
     s8 log_index;
diff --git a/fs/btrfs/fiemap.c b/fs/btrfs/fiemap.c
index 43bf0979fd53..7935586a9dbd 100644
--- a/fs/btrfs/fiemap.c
+++ b/fs/btrfs/fiemap.c
@@ -320,7 +320,7 @@ static int fiemap_next_leaf_item(struct btrfs_inode *inode, struct btrfs_path *p
      * the cost of allocating a new one.
      */
     ASSERT(test_bit(EXTENT_BUFFER_UNMAPPED, &clone->bflags));
-    atomic_inc(&clone->refs);
+    refcount_inc(&clone->refs);
 
     ret = btrfs_next_leaf(inode->root, path);
     if (ret != 0)
diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index fc821aa446f0..21605b03f511 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -223,7 +223,7 @@ static void print_eb_refs_lock(const struct extent_buffer *eb)
 {
 #ifdef CONFIG_BTRFS_DEBUG
     btrfs_info(eb->fs_info, "refs %u lock_owner %u current %u",
-           atomic_read(&eb->refs), eb->lock_owner, current->pid);
+           refcount_read(&eb->refs), eb->lock_owner, current->pid);
 #endif
 }
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index b3176edbde82..fef74f76ce58 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2341,7 +2341,7 @@ static int qgroup_trace_extent_swap(struct btrfs_trans_handle* trans,
     btrfs_item_key_to_cpu(dst_path->nodes[dst_level], &key, 0);
 
     /* For src_path */
-    atomic_inc(&src_eb->refs);
+    refcount_inc(&src_eb->refs);
     src_path->nodes[root_level] = src_eb;
     src_path->slots[root_level] = dst_path->slots[root_level];
     src_path->locks[root_level] = 0;
@@ -2574,7 +2574,7 @@ static int qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
         goto out;
     }
     /* For dst_path */
-    atomic_inc(&dst_eb->refs);
+    refcount_inc(&dst_eb->refs);
     dst_path->nodes[level] = dst_eb;
     dst_path->slots[level] = 0;
     dst_path->locks[level] = 0;
@@ -2666,7 +2666,7 @@ int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
      * walk back up the tree (adjusting slot pointers as we go)
      * and restart the search process.
      */
-    atomic_inc(&root_eb->refs);    /* For path */
+    refcount_inc(&root_eb->refs);    /* For path */
     path->nodes[root_level] = root_eb;
     path->slots[root_level] = 0;
     path->locks[root_level] = 0; /* so release_path doesn't try to unlock */
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 02086191630d..c9290fc6f15f 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1516,7 +1516,7 @@ static noinline_for_stack int merge_reloc_root(struct reloc_control *rc,
 
     if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
         level = btrfs_root_level(root_item);
-        atomic_inc(&reloc_root->node->refs);
+        refcount_inc(&reloc_root->node->refs);
         path->nodes[level] = reloc_root->node;
         path->slots[level] = 0;
     } else {
@@ -4339,7 +4339,7 @@ int btrfs_reloc_cow_block(struct btrfs_trans_handle *trans,
     }
 
         btrfs_backref_drop_node_buffer(node);
-        atomic_inc(&cow->refs);
+        refcount_inc(&cow->refs);
         node->eb = cow;
         node->new_bytenr = cow->start;
 
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index cea8a7e9d6d3..f930a5eaaea6 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2720,7 +2720,7 @@ static int walk_log_tree(struct btrfs_trans_handle *trans,
     level = btrfs_header_level(log->node);
     orig_level = level;
     path->nodes[level] = log->node;
-    atomic_inc(&log->node->refs);
+    refcount_inc(&log->node->refs);
     path->slots[level] = 0;
 
     while (1) {
@@ -3684,7 +3684,7 @@ static int clone_leaf(struct btrfs_path *path, struct btrfs_log_ctx *ctx)
      * Add extra ref to scratch eb so that it is not freed when callers
      * release the path, so we can reuse it later if needed.
      */
-    atomic_inc(&ctx->scratch_eb->refs);
+    refcount_inc(&ctx->scratch_eb->refs);
 
     return 0;
 }
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 9430b34d3cbb..8079c6e1d2b5 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2485,7 +2485,7 @@ void btrfs_schedule_zone_finish_bg(struct btrfs_block_group *bg,
 
     /* For the work */
     btrfs_get_block_group(bg);
-    atomic_inc(&eb->refs);
+    refcount_inc(&eb->refs);
     bg->last_eb = eb;
     INIT_WORK(&bg->zone_finish_work, btrfs_zone_finish_endio_workfn);
     queue_work(system_unbound_wq, &bg->zone_finish_work);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index bebc252db865..a32305044371 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1095,7 +1095,7 @@ TRACE_EVENT(btrfs_cow_block,
     TP_fast_assign_btrfs(root->fs_info,
         __entry->root_objectid = btrfs_root_id(root);
         __entry->buf_start = buf->start;
-        __entry->refs = atomic_read(&buf->refs);
+        __entry->refs = refcount_read(&buf->refs);
         __entry->cow_start = cow->start;
         __entry->buf_level = btrfs_header_level(buf);
         __entry->cow_level = btrfs_header_level(cow);
From: Leo Martins <loemra.dev@gmail.com>
[ Upstream commit ad580dfa388fabb52af033e3f8cc5d04be985e54 ]
There is a potential deadlock that can happen in try_release_subpage_extent_buffer() because the irq-safe xarray spin lock fs_info->buffer_tree is being acquired before the irq-unsafe eb->refs_lock.
This leads to the potential race:

// T1 (random eb->refs user)                  // T2 (release folio)

spin_lock(&eb->refs_lock);
                                              btree_release_folio()
                                                folio_test_writeback() //false
                                                try_release_extent_buffer()
                                                  try_release_subpage_extent_buffer()
                                                    xa_lock_irq(&fs_info->buffer_tree)
                                                    spin_lock(&eb->refs_lock); // blocked; held by T1
// interrupt
end_bbio_meta_write()
  btrfs_meta_folio_clear_writeback()
    buffer_tree_clear_mark()
      xas_lock_irqsave() // blocked; held by T2
I believe that the spin lock can safely be replaced by an rcu_read_lock. The xa_for_each loop does not need the spin lock as it's already internally protected by the rcu_read_lock. The extent buffer is also protected by the rcu_read_lock so it won't be freed before we take the eb->refs_lock and check the ref count.
The rcu_read_lock is taken and released every iteration, just like the spin lock, which means we're not protected against concurrent insertions into the xarray. This is fine because we rely on folio->private to detect if there are any ebs remaining in the folio.
There is already some precedent for this with find_extent_buffer_nolock, which loads an extent buffer from the xarray with only rcu_read_lock.
lockdep warning:
=====================================================
WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
6.16.0-0_fbk701_debug_rc0_123_g4c06e63b9203 #1 Tainted: G E N
-----------------------------------------------------
kswapd0/66 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
ffff000011ffd600 (&eb->refs_lock){+.+.}-{3:3}, at: try_release_extent_buffer+0x18c/0x560

and this task is already holding:
ffff0000c1d91b88 (&buffer_xa_class){-.-.}-{3:3}, at: try_release_extent_buffer+0x13c/0x560
which would create a new lock dependency:
 (&buffer_xa_class){-.-.}-{3:3} -> (&eb->refs_lock){+.+.}-{3:3}

but this new dependency connects a HARDIRQ-irq-safe lock:
 (&buffer_xa_class){-.-.}-{3:3}

... which became HARDIRQ-irq-safe at:
  lock_acquire+0x178/0x358
  _raw_spin_lock_irqsave+0x60/0x88
  buffer_tree_clear_mark+0xc4/0x160
  end_bbio_meta_write+0x238/0x398
  btrfs_bio_end_io+0x1f8/0x330
  btrfs_orig_write_end_io+0x1c4/0x2c0
  bio_endio+0x63c/0x678
  blk_update_request+0x1c4/0xa00
  blk_mq_end_request+0x54/0x88
  virtblk_request_done+0x124/0x1d0
  blk_mq_complete_request+0x84/0xa0
  virtblk_done+0x130/0x238
  vring_interrupt+0x130/0x288
  __handle_irq_event_percpu+0x1e8/0x708
  handle_irq_event+0x98/0x1b0
  handle_fasteoi_irq+0x264/0x7c0
  generic_handle_domain_irq+0xa4/0x108
  gic_handle_irq+0x7c/0x1a0
  do_interrupt_handler+0xe4/0x148
  el1_interrupt+0x30/0x50
  el1h_64_irq_handler+0x14/0x20
  el1h_64_irq+0x6c/0x70
  _raw_spin_unlock_irq+0x38/0x70
  __run_timer_base+0xdc/0x5e0
  run_timer_softirq+0xa0/0x138
  handle_softirqs.llvm.13542289750107964195+0x32c/0xbd0
  ____do_softirq.llvm.17674514681856217165+0x18/0x28
  call_on_irq_stack+0x24/0x30
  __irq_exit_rcu+0x164/0x430
  irq_exit_rcu+0x18/0x88
  el1_interrupt+0x34/0x50
  el1h_64_irq_handler+0x14/0x20
  el1h_64_irq+0x6c/0x70
  arch_local_irq_enable+0x4/0x8
  do_idle+0x1a0/0x3b8
  cpu_startup_entry+0x60/0x80
  rest_init+0x204/0x228
  start_kernel+0x394/0x3f0
  __primary_switched+0x8c/0x8958

to a HARDIRQ-irq-unsafe lock:
 (&eb->refs_lock){+.+.}-{3:3}

... which became HARDIRQ-irq-unsafe at:
...
  lock_acquire+0x178/0x358
  _raw_spin_lock+0x4c/0x68
  free_extent_buffer_stale+0x2c/0x170
  btrfs_read_sys_array+0x1b0/0x338
  open_ctree+0xeb0/0x1df8
  btrfs_get_tree+0xb60/0x1110
  vfs_get_tree+0x8c/0x250
  fc_mount+0x20/0x98
  btrfs_get_tree+0x4a4/0x1110
  vfs_get_tree+0x8c/0x250
  do_new_mount+0x1e0/0x6c0
  path_mount+0x4ec/0xa58
  __arm64_sys_mount+0x370/0x490
  invoke_syscall+0x6c/0x208
  el0_svc_common+0x14c/0x1b8
  do_el0_svc+0x4c/0x60
  el0_svc+0x4c/0x160
  el0t_64_sync_handler+0x70/0x100
  el0t_64_sync+0x168/0x170

other info that might help us debug this:
 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&eb->refs_lock);
                               local_irq_disable();
                               lock(&buffer_xa_class);
                               lock(&eb->refs_lock);
  <Interrupt>
    lock(&buffer_xa_class);

 *** DEADLOCK ***

2 locks held by kswapd0/66:
 #0: ffff800085506e40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xe8/0xe50
 #1: ffff0000c1d91b88 (&buffer_xa_class){-.-.}-{3:3}, at: try_release_extent_buffer+0x13c/0x560
Link: https://www.kernel.org/doc/Documentation/locking/lockdep-design.rst#:~:text=...
Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray")
CC: stable@vger.kernel.org # 6.16+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/extent_io.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8549fb4f6dfd..1b3844756efb 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4304,15 +4304,18 @@ static int try_release_subpage_extent_buffer(struct folio *folio)
     unsigned long end = index + (PAGE_SIZE >> fs_info->sectorsize_bits) - 1;
     int ret;
 
-    xa_lock_irq(&fs_info->buffer_tree);
+    rcu_read_lock();
     xa_for_each_range(&fs_info->buffer_tree, index, eb, start, end) {
         /*
          * The same as try_release_extent_buffer(), to ensure the eb
          * won't disappear out from under us.
          */
         spin_lock(&eb->refs_lock);
+        rcu_read_unlock();
+
         if (refcount_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
             spin_unlock(&eb->refs_lock);
+            rcu_read_lock();
             continue;
         }
 
@@ -4331,11 +4334,10 @@ static int try_release_subpage_extent_buffer(struct folio *folio)
          * check the folio private at the end. And
          * release_extent_buffer() will release the refs_lock.
          */
-        xa_unlock_irq(&fs_info->buffer_tree);
         release_extent_buffer(eb);
-        xa_lock_irq(&fs_info->buffer_tree);
+        rcu_read_lock();
     }
-    xa_unlock_irq(&fs_info->buffer_tree);
+    rcu_read_unlock();
 
     /*
      * Finally to check if we have cleared folio private, as if we have
@@ -4348,7 +4350,6 @@ static int try_release_subpage_extent_buffer(struct folio *folio)
         ret = 0;
     spin_unlock(&folio->mapping->i_private_lock);
     return ret;
-
 }
 
 int try_release_extent_buffer(struct folio *folio)