On 06.03.24 13:39, Filipe Manana wrote:
On Mon, Mar 4, 2024 at 9:26 PM Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
6.7-stable review patch. If anyone has any objections, please let me know.
It would be better to delay the backport of this patch (and the followup fix) to any stable release, because it introduced another regression for which there is a reviewed fix but it's not yet in Linus' tree:
https://lore.kernel.org/linux-btrfs/cover.1709202499.git.fdmanana@suse.com/
Those two missed 6.8 afaics. Will those be heading to mainline any time soon? And how fast afterwards will it be wise to backport them to 6.8? Will anyone ask Greg for that when the time has come? Same for backporting "btrfs: fix deadlock with fiemap and extent locking", the followup fix and the two fixed mentioned above for 6.6 and 6.7?
Cioa, Thorsten
From: Josef Bacik josef@toxicpanda.com
[ Upstream commit b0ad381fa7690244802aed119b478b4bdafc31dd ]
While working on the patchset to remove extent locking I got a lockdep splat with fiemap and pagefaulting with my new extent lock replacement lock.
This deadlock exists with our normal code, we just don't have lockdep annotations with the extent locking so we've never noticed it.
Since we're copying the fiemap extent to user space on every iteration we have the chance of pagefaulting. Because we hold the extent lock for the entire range we could mkwrite into a range in the file that we have mmap'ed. This would deadlock with the following stack trace
[<0>] lock_extent+0x28d/0x2f0 [<0>] btrfs_page_mkwrite+0x273/0x8a0 [<0>] do_page_mkwrite+0x50/0xb0 [<0>] do_fault+0xc1/0x7b0 [<0>] __handle_mm_fault+0x2fa/0x460 [<0>] handle_mm_fault+0xa4/0x330 [<0>] do_user_addr_fault+0x1f4/0x800 [<0>] exc_page_fault+0x7c/0x1e0 [<0>] asm_exc_page_fault+0x26/0x30 [<0>] rep_movs_alternative+0x33/0x70 [<0>] _copy_to_user+0x49/0x70 [<0>] fiemap_fill_next_extent+0xc8/0x120 [<0>] emit_fiemap_extent+0x4d/0xa0 [<0>] extent_fiemap+0x7f8/0xad0 [<0>] btrfs_fiemap+0x49/0x80 [<0>] __x64_sys_ioctl+0x3e1/0xb50 [<0>] do_syscall_64+0x94/0x1a0 [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
I wrote an fstest to reproduce this deadlock without my replacement lock and verified that the deadlock exists with our existing locking.
To fix this simply don't take the extent lock for the entire duration of the fiemap. This is safe in general because we keep track of where we are when we're searching the tree, so if an ordered extent updates in the middle of our fiemap call we'll still emit the correct extents because we know what offset we were on before.
The only place we maintain the lock is searching delalloc. Since the delalloc stuff can change during writeback we want to lock the extent range so we have a consistent view of delalloc at the time we're checking to see if we need to set the delalloc flag.
With this patch applied we no longer deadlock with my testcase.
CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana fdmanana@suse.com Signed-off-by: Josef Bacik josef@toxicpanda.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Sasha Levin sashal@kernel.org
fs/btrfs/extent_io.c | 62 ++++++++++++++++++++++++++++++++------------ 1 file changed, 45 insertions(+), 17 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 8f724c54fc8e9..197b41d02735b 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2645,16 +2645,34 @@ static int fiemap_process_hole(struct btrfs_inode *inode, * it beyond i_size. */ while (cur_offset < end && cur_offset < i_size) {
struct extent_state *cached_state = NULL; u64 delalloc_start; u64 delalloc_end; u64 prealloc_start;
u64 lockstart;
u64 lockend; u64 prealloc_len = 0; bool delalloc;
lockstart = round_down(cur_offset, inode->root->fs_info->sectorsize);
lockend = round_up(end, inode->root->fs_info->sectorsize);
/*
* We are only locking for the delalloc range because that's the
* only thing that can change here. With fiemap we have a lock
* on the inode, so no buffered or direct writes can happen.
*
* However mmaps and normal page writeback will cause this to
* change arbitrarily. We have to lock the extent lock here to
* make sure that nobody messes with the tree while we're doing
* btrfs_find_delalloc_in_range.
*/
lock_extent(&inode->io_tree, lockstart, lockend, &cached_state); delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end, delalloc_cached_state, &delalloc_start, &delalloc_end);
unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state); if (!delalloc) break;
@@ -2822,15 +2840,15 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo, u64 start, u64 len) { const u64 ino = btrfs_ino(inode);
struct extent_state *cached_state = NULL; struct extent_state *delalloc_cached_state = NULL; struct btrfs_path *path; struct fiemap_cache cache = { 0 }; struct btrfs_backref_share_check_ctx *backref_ctx; u64 last_extent_end; u64 prev_extent_end;
u64 lockstart;
u64 lockend;
u64 range_start;
u64 range_end;
const u64 sectorsize = inode->root->fs_info->sectorsize; bool stopped = false; int ret;
@@ -2841,12 +2859,11 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo, goto out; }
lockstart = round_down(start, inode->root->fs_info->sectorsize);
lockend = round_up(start + len, inode->root->fs_info->sectorsize);
prev_extent_end = lockstart;
range_start = round_down(start, sectorsize);
range_end = round_up(start + len, sectorsize);
prev_extent_end = range_start; btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
lock_extent(&inode->io_tree, lockstart, lockend, &cached_state); ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end); if (ret < 0)
@@ -2854,7 +2871,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo, btrfs_release_path(path);
path->reada = READA_FORWARD;
ret = fiemap_search_slot(inode, path, lockstart);
ret = fiemap_search_slot(inode, path, range_start); if (ret < 0) { goto out_unlock; } else if (ret > 0) {
@@ -2866,7 +2883,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo, goto check_eof_delalloc; }
while (prev_extent_end < lockend) {
while (prev_extent_end < range_end) { struct extent_buffer *leaf = path->nodes[0]; struct btrfs_file_extent_item *ei; struct btrfs_key key;
@@ -2889,19 +2906,19 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo, * The first iteration can leave us at an extent item that ends * before our range's start. Move to the next item. */
if (extent_end <= lockstart)
if (extent_end <= range_start) goto next_item; backref_ctx->curr_leaf_bytenr = leaf->start; /* We have in implicit hole (NO_HOLES feature enabled). */ if (prev_extent_end < key.offset) {
const u64 range_end = min(key.offset, lockend) - 1;
const u64 hole_end = min(key.offset, range_end) - 1; ret = fiemap_process_hole(inode, fieinfo, &cache, &delalloc_cached_state, backref_ctx, 0, 0, 0,
prev_extent_end, range_end);
prev_extent_end, hole_end); if (ret < 0) { goto out_unlock; } else if (ret > 0) {
@@ -2911,7 +2928,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo, }
/* We've reached the end of the fiemap range, stop. */
if (key.offset >= lockend) {
if (key.offset >= range_end) { stopped = true; break; }
@@ -3005,29 +3022,41 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo, btrfs_free_path(path); path = NULL;
if (!stopped && prev_extent_end < lockend) {
if (!stopped && prev_extent_end < range_end) { ret = fiemap_process_hole(inode, fieinfo, &cache, &delalloc_cached_state, backref_ctx,
0, 0, 0, prev_extent_end, lockend - 1);
0, 0, 0, prev_extent_end, range_end - 1); if (ret < 0) goto out_unlock;
prev_extent_end = lockend;
prev_extent_end = range_end; } if (cache.cached && cache.offset + cache.len >= last_extent_end) { const u64 i_size = i_size_read(&inode->vfs_inode); if (prev_extent_end < i_size) {
struct extent_state *cached_state = NULL; u64 delalloc_start; u64 delalloc_end;
u64 lockstart;
u64 lockend; bool delalloc;
lockstart = round_down(prev_extent_end, sectorsize);
lockend = round_up(i_size, sectorsize);
/*
* See the comment in fiemap_process_hole as to why
* we're doing the locking here.
*/
lock_extent(&inode->io_tree, lockstart, lockend, &cached_state); delalloc = btrfs_find_delalloc_in_range(inode, prev_extent_end, i_size - 1, &delalloc_cached_state, &delalloc_start, &delalloc_end);
unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state); if (!delalloc) cache.flags |= FIEMAP_EXTENT_LAST; } else {
@@ -3038,7 +3067,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo, ret = emit_last_fiemap_cache(fieinfo, &cache);
out_unlock:
unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state); btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
out: free_extent_state(delalloc_cached_state); -- 2.43.0