June 2020 - Linux-stable-mirror

FAILED: patch "[PATCH] btrfs: fix corrupt log due to concurrent fsync of inodes with" failed to apply to 4.9-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 4.9-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable(a)vger.kernel.org>. thanks, greg k-h ------------------ original commit in Linus's tree ------------------ >From e289f03ea79bbc6574b78ac25682555423a91cbb Mon Sep 17 00:00:00 2001 From: Filipe Manana <fdmanana(a)suse.com> Date: Mon, 18 May 2020 12:14:50 +0100 Subject: [PATCH] btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents When we have extents shared amongst different inodes in the same subvolume, if we fsync them in parallel we can end up with checksum items in the log tree that represent ranges which overlap. For example, consider we have inodes A and B, both sharing an extent that covers the logical range from X to X + 64KiB: 1) Task A starts an fsync on inode A; 2) Task B starts an fsync on inode B; 3) Task A calls btrfs_csum_file_blocks(), and the first search in the log tree, through btrfs_lookup_csum(), returns -EFBIG because it finds an existing checksum item that covers the range from X - 64KiB to X; 4) Task A checks that the checksum item has not reached the maximum possible size (MAX_CSUM_ITEMS) and then releases the search path before it does another path search for insertion (through a direct call to btrfs_search_slot()); 5) As soon as task A releases the path and before it does the search for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG too, because there is an existing checksum item that has an end offset that matches the start offset (X) of the checksum range we want to log; 6) Task B releases the path; 7) Task A does the path search for insertion (through btrfs_search_slot()) and then verifies that the checksum item that ends at offset X still exists and extends its size to insert the checksums for the range from X to X + 64KiB; 8) Task A releases the path and returns from btrfs_csum_file_blocks(), having inserted the checksums into an existing checksum item that got its size extended. At this point we have one checksum item in the log tree that covers the logical range from X - 64KiB to X + 64KiB; 9) Task B now does a search for insertion using btrfs_search_slot() too, but it finds that the previous checksum item no longer ends at the offset X, it now ends at an of offset X + 64KiB, so it leaves that item untouched. Then it releases the path and calls btrfs_insert_empty_item() that inserts a checksum item with a key offset corresponding to X and a size for inserting a single checksum (4 bytes in case of crc32c). Subsequent iterations end up extending this new checksum item so that it contains the checksums for the range from X to X + 64KiB. So after task B returns from btrfs_csum_file_blocks() we end up with two checksum items in the log tree that have overlapping ranges, one for the range from X - 64KiB to X + 64KiB, and another for the range from X to X + 64KiB. Having checksum items that represent ranges which overlap, regardless of being in the log tree or in the chekcsums tree, can lead to problems where checksums for a file range end up not being found. This type of problem has happened a few times in the past and the following commits fixed them and explain in detail why having checksum items with overlapping ranges is problematic: 27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums" b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync" 40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree" Since this specific instance of the problem can only happen when logging inodes, because it is the only case where concurrent attempts to insert checksums for the same range can happen, fix the issue by using an extent io tree as a range lock to serialize checksum insertion during inode logging. This issue could often be reproduced by the test case generic/457 from fstests. When it happens it produces the following trace: BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610 BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884 item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4 item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4 item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4 item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4 item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4 item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4 (...) BTRFS error (device dm-0): block=30625792 write time tree block corruption detected ------------[ cut here ]------------ WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs] Modules linked in: btrfs dm_thin_pool ... CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs] Code: c7 c7 ... RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296 RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001 RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0 R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000 FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btree_submit_bio_hook+0x67/0xc0 [btrfs] submit_one_bio+0x31/0x50 [btrfs] btree_write_cache_pages+0x2db/0x4b0 [btrfs] ? __filemap_fdatawrite_range+0xb1/0x110 do_writepages+0x23/0x80 __filemap_fdatawrite_range+0xd2/0x110 btrfs_write_marked_extents+0x15e/0x180 [btrfs] btrfs_sync_log+0x206/0x10a0 [btrfs] ? kmem_cache_free+0x315/0x3b0 ? btrfs_log_inode+0x1e8/0xf90 [btrfs] ? __mutex_unlock_slowpath+0x45/0x2a0 ? lockref_put_or_lock+0x9/0x30 ? dput+0x2d/0x580 ? dput+0xb5/0x580 ? btrfs_sync_file+0x464/0x4d0 [btrfs] btrfs_sync_file+0x464/0x4d0 [btrfs] do_fsync+0x38/0x60 __x64_sys_fsync+0x10/0x20 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fb41953a6d0 Code: 48 3d ... RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0 RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003 RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009 R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060 R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020 softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace d543fc76f5ad7fd8 ]--- In that trace the tree checker detected the overlapping checksum items at the time when we triggered writeback for the log tree when syncing the log. Another trace that can happen is due to BUG_ON() when deleting checksum items while logging an inode: BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584) BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610 BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473 item 0 key (257 1 0) itemoff 16123 itemsize 160 inode generation 7 size 262144 mode 100600 item 1 key (257 12 256) itemoff 16103 itemsize 20 item 2 key (257 108 0) itemoff 16050 itemsize 53 extent data disk bytenr 13631488 nr 4096 extent data offset 0 nr 131072 ram 131072 (...) ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:3153! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs] Code: 0f b6 ... RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001 RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001 R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08 R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_del_csums+0x2f4/0x540 [btrfs] copy_items+0x4b5/0x560 [btrfs] btrfs_log_inode+0x910/0xf90 [btrfs] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs] ? dget_parent+0x5/0x370 btrfs_log_dentry_safe+0x4a/0x70 [btrfs] btrfs_sync_file+0x42b/0x4d0 [btrfs] __x64_sys_msync+0x199/0x200 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fe586c65760 Code: 00 f7 ... RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760 RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000 RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61 R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1 R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420 Modules linked in: dm_log_writes ... ---[ end trace c92a7f447a8515f5 ]--- CC: stable(a)vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana(a)suse.com> Signed-off-by: David Sterba <dsterba(a)suse.com> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 5afeb17a3f1a..30ce7039bc27 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1167,6 +1167,9 @@ struct btrfs_root { /* Record pairs of swapped blocks for qgroup */ struct btrfs_qgroup_swapped_blocks swapped_blocks; + /* Used only by log trees, when logging csum items */ + struct extent_io_tree log_csum_range; + #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS u64 alloc_bytenr; #endif diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index f2f2864f5978..f8ec2d8606fd 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1133,9 +1133,12 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, root->log_transid = 0; root->log_transid_committed = -1; root->last_log_commit = 0; - if (!dummy) + if (!dummy) { extent_io_tree_init(fs_info, &root->dirty_log_pages, IO_TREE_ROOT_DIRTY_LOG_PAGES, NULL); + extent_io_tree_init(fs_info, &root->log_csum_range, + IO_TREE_LOG_CSUM_RANGE, NULL); + } memset(&root->root_key, 0, sizeof(root->root_key)); memset(&root->root_item, 0, sizeof(root->root_item)); diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h index b4a7bad3e82e..b6561455b3c4 100644 --- a/fs/btrfs/extent-io-tree.h +++ b/fs/btrfs/extent-io-tree.h @@ -44,6 +44,7 @@ enum { IO_TREE_TRANS_DIRTY_PAGES, IO_TREE_ROOT_DIRTY_LOG_PAGES, IO_TREE_INODE_FILE_EXTENT, + IO_TREE_LOG_CSUM_RANGE, IO_TREE_SELFTEST, }; diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 67fa7087f707..920cee312f4e 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3290,6 +3290,7 @@ static void free_log_tree(struct btrfs_trans_handle *trans, clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1, EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT); + extent_io_tree_release(&log->log_csum_range); btrfs_put_root(log); } @@ -3903,8 +3904,20 @@ static int log_csums(struct btrfs_trans_handle *trans, struct btrfs_root *log_root, struct btrfs_ordered_sum *sums) { + const u64 lock_end = sums->bytenr + sums->len - 1; + struct extent_state *cached_state = NULL; int ret; + /* + * Serialize logging for checksums. This is to avoid racing with the + * same checksum being logged by another task that is logging another + * file which happens to refer to the same extent as well. Such races + * can leave checksum items in the log with overlapping ranges. + */ + ret = lock_extent_bits(&log_root->log_csum_range, sums->bytenr, + lock_end, &cached_state); + if (ret) + return ret; /* * Due to extent cloning, we might have logged a csum item that covers a * subrange of a cloned extent, and later we can end up logging a csum @@ -3915,10 +3928,13 @@ static int log_csums(struct btrfs_trans_handle *trans, * trim and adjust) any existing csum items in the log for this range. */ ret = btrfs_del_csums(trans, log_root, sums->bytenr, sums->len); - if (ret) - return ret; + if (!ret) + ret = btrfs_csum_file_blocks(trans, log_root, sums); - return btrfs_csum_file_blocks(trans, log_root, sums); + unlock_extent_cached(&log_root->log_csum_range, sums->bytenr, lock_end, + &cached_state); + + return ret; } static noinline int copy_items(struct btrfs_trans_handle *trans, diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index bcbc763b8814..360b0f9d2220 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -89,6 +89,7 @@ TRACE_DEFINE_ENUM(COMMIT_TRANS); { IO_TREE_TRANS_DIRTY_PAGES, "TRANS_DIRTY_PAGES" }, \ { IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES" }, \ { IO_TREE_INODE_FILE_EXTENT, "INODE_FILE_EXTENT" }, \ + { IO_TREE_LOG_CSUM_RANGE, "LOG_CSUM_RANGE" }, \ { IO_TREE_SELFTEST, "SELFTEST" }) #define BTRFS_GROUP_FLAGS \

5 years, 6 months

1
0
0 0

FAILED: patch "[PATCH] btrfs: fix corrupt log due to concurrent fsync of inodes with" failed to apply to 4.4-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 4.4-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable(a)vger.kernel.org>. thanks, greg k-h ------------------ original commit in Linus's tree ------------------ >From e289f03ea79bbc6574b78ac25682555423a91cbb Mon Sep 17 00:00:00 2001 From: Filipe Manana <fdmanana(a)suse.com> Date: Mon, 18 May 2020 12:14:50 +0100 Subject: [PATCH] btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents When we have extents shared amongst different inodes in the same subvolume, if we fsync them in parallel we can end up with checksum items in the log tree that represent ranges which overlap. For example, consider we have inodes A and B, both sharing an extent that covers the logical range from X to X + 64KiB: 1) Task A starts an fsync on inode A; 2) Task B starts an fsync on inode B; 3) Task A calls btrfs_csum_file_blocks(), and the first search in the log tree, through btrfs_lookup_csum(), returns -EFBIG because it finds an existing checksum item that covers the range from X - 64KiB to X; 4) Task A checks that the checksum item has not reached the maximum possible size (MAX_CSUM_ITEMS) and then releases the search path before it does another path search for insertion (through a direct call to btrfs_search_slot()); 5) As soon as task A releases the path and before it does the search for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG too, because there is an existing checksum item that has an end offset that matches the start offset (X) of the checksum range we want to log; 6) Task B releases the path; 7) Task A does the path search for insertion (through btrfs_search_slot()) and then verifies that the checksum item that ends at offset X still exists and extends its size to insert the checksums for the range from X to X + 64KiB; 8) Task A releases the path and returns from btrfs_csum_file_blocks(), having inserted the checksums into an existing checksum item that got its size extended. At this point we have one checksum item in the log tree that covers the logical range from X - 64KiB to X + 64KiB; 9) Task B now does a search for insertion using btrfs_search_slot() too, but it finds that the previous checksum item no longer ends at the offset X, it now ends at an of offset X + 64KiB, so it leaves that item untouched. Then it releases the path and calls btrfs_insert_empty_item() that inserts a checksum item with a key offset corresponding to X and a size for inserting a single checksum (4 bytes in case of crc32c). Subsequent iterations end up extending this new checksum item so that it contains the checksums for the range from X to X + 64KiB. So after task B returns from btrfs_csum_file_blocks() we end up with two checksum items in the log tree that have overlapping ranges, one for the range from X - 64KiB to X + 64KiB, and another for the range from X to X + 64KiB. Having checksum items that represent ranges which overlap, regardless of being in the log tree or in the chekcsums tree, can lead to problems where checksums for a file range end up not being found. This type of problem has happened a few times in the past and the following commits fixed them and explain in detail why having checksum items with overlapping ranges is problematic: 27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums" b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync" 40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree" Since this specific instance of the problem can only happen when logging inodes, because it is the only case where concurrent attempts to insert checksums for the same range can happen, fix the issue by using an extent io tree as a range lock to serialize checksum insertion during inode logging. This issue could often be reproduced by the test case generic/457 from fstests. When it happens it produces the following trace: BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610 BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884 item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4 item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4 item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4 item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4 item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4 item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4 (...) BTRFS error (device dm-0): block=30625792 write time tree block corruption detected ------------[ cut here ]------------ WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs] Modules linked in: btrfs dm_thin_pool ... CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs] Code: c7 c7 ... RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296 RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001 RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0 R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000 FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btree_submit_bio_hook+0x67/0xc0 [btrfs] submit_one_bio+0x31/0x50 [btrfs] btree_write_cache_pages+0x2db/0x4b0 [btrfs] ? __filemap_fdatawrite_range+0xb1/0x110 do_writepages+0x23/0x80 __filemap_fdatawrite_range+0xd2/0x110 btrfs_write_marked_extents+0x15e/0x180 [btrfs] btrfs_sync_log+0x206/0x10a0 [btrfs] ? kmem_cache_free+0x315/0x3b0 ? btrfs_log_inode+0x1e8/0xf90 [btrfs] ? __mutex_unlock_slowpath+0x45/0x2a0 ? lockref_put_or_lock+0x9/0x30 ? dput+0x2d/0x580 ? dput+0xb5/0x580 ? btrfs_sync_file+0x464/0x4d0 [btrfs] btrfs_sync_file+0x464/0x4d0 [btrfs] do_fsync+0x38/0x60 __x64_sys_fsync+0x10/0x20 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fb41953a6d0 Code: 48 3d ... RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0 RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003 RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009 R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060 R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020 softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace d543fc76f5ad7fd8 ]--- In that trace the tree checker detected the overlapping checksum items at the time when we triggered writeback for the log tree when syncing the log. Another trace that can happen is due to BUG_ON() when deleting checksum items while logging an inode: BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584) BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610 BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473 item 0 key (257 1 0) itemoff 16123 itemsize 160 inode generation 7 size 262144 mode 100600 item 1 key (257 12 256) itemoff 16103 itemsize 20 item 2 key (257 108 0) itemoff 16050 itemsize 53 extent data disk bytenr 13631488 nr 4096 extent data offset 0 nr 131072 ram 131072 (...) ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:3153! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs] Code: 0f b6 ... RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001 RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001 R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08 R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_del_csums+0x2f4/0x540 [btrfs] copy_items+0x4b5/0x560 [btrfs] btrfs_log_inode+0x910/0xf90 [btrfs] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs] ? dget_parent+0x5/0x370 btrfs_log_dentry_safe+0x4a/0x70 [btrfs] btrfs_sync_file+0x42b/0x4d0 [btrfs] __x64_sys_msync+0x199/0x200 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fe586c65760 Code: 00 f7 ... RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760 RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000 RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61 R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1 R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420 Modules linked in: dm_log_writes ... ---[ end trace c92a7f447a8515f5 ]--- CC: stable(a)vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana(a)suse.com> Signed-off-by: David Sterba <dsterba(a)suse.com> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 5afeb17a3f1a..30ce7039bc27 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1167,6 +1167,9 @@ struct btrfs_root { /* Record pairs of swapped blocks for qgroup */ struct btrfs_qgroup_swapped_blocks swapped_blocks; + /* Used only by log trees, when logging csum items */ + struct extent_io_tree log_csum_range; + #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS u64 alloc_bytenr; #endif diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index f2f2864f5978..f8ec2d8606fd 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1133,9 +1133,12 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, root->log_transid = 0; root->log_transid_committed = -1; root->last_log_commit = 0; - if (!dummy) + if (!dummy) { extent_io_tree_init(fs_info, &root->dirty_log_pages, IO_TREE_ROOT_DIRTY_LOG_PAGES, NULL); + extent_io_tree_init(fs_info, &root->log_csum_range, + IO_TREE_LOG_CSUM_RANGE, NULL); + } memset(&root->root_key, 0, sizeof(root->root_key)); memset(&root->root_item, 0, sizeof(root->root_item)); diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h index b4a7bad3e82e..b6561455b3c4 100644 --- a/fs/btrfs/extent-io-tree.h +++ b/fs/btrfs/extent-io-tree.h @@ -44,6 +44,7 @@ enum { IO_TREE_TRANS_DIRTY_PAGES, IO_TREE_ROOT_DIRTY_LOG_PAGES, IO_TREE_INODE_FILE_EXTENT, + IO_TREE_LOG_CSUM_RANGE, IO_TREE_SELFTEST, }; diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 67fa7087f707..920cee312f4e 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3290,6 +3290,7 @@ static void free_log_tree(struct btrfs_trans_handle *trans, clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1, EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT); + extent_io_tree_release(&log->log_csum_range); btrfs_put_root(log); } @@ -3903,8 +3904,20 @@ static int log_csums(struct btrfs_trans_handle *trans, struct btrfs_root *log_root, struct btrfs_ordered_sum *sums) { + const u64 lock_end = sums->bytenr + sums->len - 1; + struct extent_state *cached_state = NULL; int ret; + /* + * Serialize logging for checksums. This is to avoid racing with the + * same checksum being logged by another task that is logging another + * file which happens to refer to the same extent as well. Such races + * can leave checksum items in the log with overlapping ranges. + */ + ret = lock_extent_bits(&log_root->log_csum_range, sums->bytenr, + lock_end, &cached_state); + if (ret) + return ret; /* * Due to extent cloning, we might have logged a csum item that covers a * subrange of a cloned extent, and later we can end up logging a csum @@ -3915,10 +3928,13 @@ static int log_csums(struct btrfs_trans_handle *trans, * trim and adjust) any existing csum items in the log for this range. */ ret = btrfs_del_csums(trans, log_root, sums->bytenr, sums->len); - if (ret) - return ret; + if (!ret) + ret = btrfs_csum_file_blocks(trans, log_root, sums); - return btrfs_csum_file_blocks(trans, log_root, sums); + unlock_extent_cached(&log_root->log_csum_range, sums->bytenr, lock_end, + &cached_state); + + return ret; } static noinline int copy_items(struct btrfs_trans_handle *trans, diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index bcbc763b8814..360b0f9d2220 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -89,6 +89,7 @@ TRACE_DEFINE_ENUM(COMMIT_TRANS); { IO_TREE_TRANS_DIRTY_PAGES, "TRANS_DIRTY_PAGES" }, \ { IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES" }, \ { IO_TREE_INODE_FILE_EXTENT, "INODE_FILE_EXTENT" }, \ + { IO_TREE_LOG_CSUM_RANGE, "LOG_CSUM_RANGE" }, \ { IO_TREE_SELFTEST, "SELFTEST" }) #define BTRFS_GROUP_FLAGS \

5 years, 6 months

1
0
0 0

FAILED: patch "[PATCH] btrfs: fix corrupt log due to concurrent fsync of inodes with" failed to apply to 5.4-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 5.4-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable(a)vger.kernel.org>. thanks, greg k-h ------------------ original commit in Linus's tree ------------------ >From e289f03ea79bbc6574b78ac25682555423a91cbb Mon Sep 17 00:00:00 2001 From: Filipe Manana <fdmanana(a)suse.com> Date: Mon, 18 May 2020 12:14:50 +0100 Subject: [PATCH] btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents When we have extents shared amongst different inodes in the same subvolume, if we fsync them in parallel we can end up with checksum items in the log tree that represent ranges which overlap. For example, consider we have inodes A and B, both sharing an extent that covers the logical range from X to X + 64KiB: 1) Task A starts an fsync on inode A; 2) Task B starts an fsync on inode B; 3) Task A calls btrfs_csum_file_blocks(), and the first search in the log tree, through btrfs_lookup_csum(), returns -EFBIG because it finds an existing checksum item that covers the range from X - 64KiB to X; 4) Task A checks that the checksum item has not reached the maximum possible size (MAX_CSUM_ITEMS) and then releases the search path before it does another path search for insertion (through a direct call to btrfs_search_slot()); 5) As soon as task A releases the path and before it does the search for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG too, because there is an existing checksum item that has an end offset that matches the start offset (X) of the checksum range we want to log; 6) Task B releases the path; 7) Task A does the path search for insertion (through btrfs_search_slot()) and then verifies that the checksum item that ends at offset X still exists and extends its size to insert the checksums for the range from X to X + 64KiB; 8) Task A releases the path and returns from btrfs_csum_file_blocks(), having inserted the checksums into an existing checksum item that got its size extended. At this point we have one checksum item in the log tree that covers the logical range from X - 64KiB to X + 64KiB; 9) Task B now does a search for insertion using btrfs_search_slot() too, but it finds that the previous checksum item no longer ends at the offset X, it now ends at an of offset X + 64KiB, so it leaves that item untouched. Then it releases the path and calls btrfs_insert_empty_item() that inserts a checksum item with a key offset corresponding to X and a size for inserting a single checksum (4 bytes in case of crc32c). Subsequent iterations end up extending this new checksum item so that it contains the checksums for the range from X to X + 64KiB. So after task B returns from btrfs_csum_file_blocks() we end up with two checksum items in the log tree that have overlapping ranges, one for the range from X - 64KiB to X + 64KiB, and another for the range from X to X + 64KiB. Having checksum items that represent ranges which overlap, regardless of being in the log tree or in the chekcsums tree, can lead to problems where checksums for a file range end up not being found. This type of problem has happened a few times in the past and the following commits fixed them and explain in detail why having checksum items with overlapping ranges is problematic: 27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums" b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync" 40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree" Since this specific instance of the problem can only happen when logging inodes, because it is the only case where concurrent attempts to insert checksums for the same range can happen, fix the issue by using an extent io tree as a range lock to serialize checksum insertion during inode logging. This issue could often be reproduced by the test case generic/457 from fstests. When it happens it produces the following trace: BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610 BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884 item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4 item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4 item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4 item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4 item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4 item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4 (...) BTRFS error (device dm-0): block=30625792 write time tree block corruption detected ------------[ cut here ]------------ WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs] Modules linked in: btrfs dm_thin_pool ... CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs] Code: c7 c7 ... RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296 RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001 RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0 R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000 FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btree_submit_bio_hook+0x67/0xc0 [btrfs] submit_one_bio+0x31/0x50 [btrfs] btree_write_cache_pages+0x2db/0x4b0 [btrfs] ? __filemap_fdatawrite_range+0xb1/0x110 do_writepages+0x23/0x80 __filemap_fdatawrite_range+0xd2/0x110 btrfs_write_marked_extents+0x15e/0x180 [btrfs] btrfs_sync_log+0x206/0x10a0 [btrfs] ? kmem_cache_free+0x315/0x3b0 ? btrfs_log_inode+0x1e8/0xf90 [btrfs] ? __mutex_unlock_slowpath+0x45/0x2a0 ? lockref_put_or_lock+0x9/0x30 ? dput+0x2d/0x580 ? dput+0xb5/0x580 ? btrfs_sync_file+0x464/0x4d0 [btrfs] btrfs_sync_file+0x464/0x4d0 [btrfs] do_fsync+0x38/0x60 __x64_sys_fsync+0x10/0x20 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fb41953a6d0 Code: 48 3d ... RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0 RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003 RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009 R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060 R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020 softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace d543fc76f5ad7fd8 ]--- In that trace the tree checker detected the overlapping checksum items at the time when we triggered writeback for the log tree when syncing the log. Another trace that can happen is due to BUG_ON() when deleting checksum items while logging an inode: BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584) BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610 BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473 item 0 key (257 1 0) itemoff 16123 itemsize 160 inode generation 7 size 262144 mode 100600 item 1 key (257 12 256) itemoff 16103 itemsize 20 item 2 key (257 108 0) itemoff 16050 itemsize 53 extent data disk bytenr 13631488 nr 4096 extent data offset 0 nr 131072 ram 131072 (...) ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:3153! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs] Code: 0f b6 ... RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001 RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001 R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08 R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_del_csums+0x2f4/0x540 [btrfs] copy_items+0x4b5/0x560 [btrfs] btrfs_log_inode+0x910/0xf90 [btrfs] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs] ? dget_parent+0x5/0x370 btrfs_log_dentry_safe+0x4a/0x70 [btrfs] btrfs_sync_file+0x42b/0x4d0 [btrfs] __x64_sys_msync+0x199/0x200 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fe586c65760 Code: 00 f7 ... RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760 RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000 RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61 R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1 R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420 Modules linked in: dm_log_writes ... ---[ end trace c92a7f447a8515f5 ]--- CC: stable(a)vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana(a)suse.com> Signed-off-by: David Sterba <dsterba(a)suse.com> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 5afeb17a3f1a..30ce7039bc27 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1167,6 +1167,9 @@ struct btrfs_root { /* Record pairs of swapped blocks for qgroup */ struct btrfs_qgroup_swapped_blocks swapped_blocks; + /* Used only by log trees, when logging csum items */ + struct extent_io_tree log_csum_range; + #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS u64 alloc_bytenr; #endif diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index f2f2864f5978..f8ec2d8606fd 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1133,9 +1133,12 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, root->log_transid = 0; root->log_transid_committed = -1; root->last_log_commit = 0; - if (!dummy) + if (!dummy) { extent_io_tree_init(fs_info, &root->dirty_log_pages, IO_TREE_ROOT_DIRTY_LOG_PAGES, NULL); + extent_io_tree_init(fs_info, &root->log_csum_range, + IO_TREE_LOG_CSUM_RANGE, NULL); + } memset(&root->root_key, 0, sizeof(root->root_key)); memset(&root->root_item, 0, sizeof(root->root_item)); diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h index b4a7bad3e82e..b6561455b3c4 100644 --- a/fs/btrfs/extent-io-tree.h +++ b/fs/btrfs/extent-io-tree.h @@ -44,6 +44,7 @@ enum { IO_TREE_TRANS_DIRTY_PAGES, IO_TREE_ROOT_DIRTY_LOG_PAGES, IO_TREE_INODE_FILE_EXTENT, + IO_TREE_LOG_CSUM_RANGE, IO_TREE_SELFTEST, }; diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 67fa7087f707..920cee312f4e 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3290,6 +3290,7 @@ static void free_log_tree(struct btrfs_trans_handle *trans, clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1, EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT); + extent_io_tree_release(&log->log_csum_range); btrfs_put_root(log); } @@ -3903,8 +3904,20 @@ static int log_csums(struct btrfs_trans_handle *trans, struct btrfs_root *log_root, struct btrfs_ordered_sum *sums) { + const u64 lock_end = sums->bytenr + sums->len - 1; + struct extent_state *cached_state = NULL; int ret; + /* + * Serialize logging for checksums. This is to avoid racing with the + * same checksum being logged by another task that is logging another + * file which happens to refer to the same extent as well. Such races + * can leave checksum items in the log with overlapping ranges. + */ + ret = lock_extent_bits(&log_root->log_csum_range, sums->bytenr, + lock_end, &cached_state); + if (ret) + return ret; /* * Due to extent cloning, we might have logged a csum item that covers a * subrange of a cloned extent, and later we can end up logging a csum @@ -3915,10 +3928,13 @@ static int log_csums(struct btrfs_trans_handle *trans, * trim and adjust) any existing csum items in the log for this range. */ ret = btrfs_del_csums(trans, log_root, sums->bytenr, sums->len); - if (ret) - return ret; + if (!ret) + ret = btrfs_csum_file_blocks(trans, log_root, sums); - return btrfs_csum_file_blocks(trans, log_root, sums); + unlock_extent_cached(&log_root->log_csum_range, sums->bytenr, lock_end, + &cached_state); + + return ret; } static noinline int copy_items(struct btrfs_trans_handle *trans, diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index bcbc763b8814..360b0f9d2220 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -89,6 +89,7 @@ TRACE_DEFINE_ENUM(COMMIT_TRANS); { IO_TREE_TRANS_DIRTY_PAGES, "TRANS_DIRTY_PAGES" }, \ { IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES" }, \ { IO_TREE_INODE_FILE_EXTENT, "INODE_FILE_EXTENT" }, \ + { IO_TREE_LOG_CSUM_RANGE, "LOG_CSUM_RANGE" }, \ { IO_TREE_SELFTEST, "SELFTEST" }) #define BTRFS_GROUP_FLAGS \

5 years, 6 months

1
0
0 0

FAILED: patch "[PATCH] btrfs: fix error handling when submitting direct I/O bio" failed to apply to 4.4-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 4.4-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable(a)vger.kernel.org>. thanks, greg k-h ------------------ original commit in Linus's tree ------------------ >From 6d3113a193e3385c72240096fe397618ecab6e43 Mon Sep 17 00:00:00 2001 From: Omar Sandoval <osandov(a)fb.com> Date: Thu, 16 Apr 2020 14:46:12 -0700 Subject: [PATCH] btrfs: fix error handling when submitting direct I/O bio In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID stripe or chunk, we submit orig_bio without cloning it. In this case, we don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we decrement pending_bios to -1, and we never complete orig_bio. Fix it by initializing pending_bios to 1 instead of incrementing later. Fixing this exposes another bug: we put orig_bio prematurely and then put it again from end_io. Fix it by not putting orig_bio. After this change, pending_bios is really more of a reference count, but I'll leave that cleanup separate to keep the fix small. Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO") CC: stable(a)vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov(a)suse.com> Reviewed-by: Josef Bacik <josef(a)toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com> Signed-off-by: Omar Sandoval <osandov(a)fb.com> Signed-off-by: David Sterba <dsterba(a)suse.com> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 259239b33370..b628c319a5b6 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7939,7 +7939,6 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip) /* bio split */ ASSERT(geom.len <= INT_MAX); - atomic_inc(&dip->pending_bios); do { clone_len = min_t(int, submit_len, geom.len); @@ -7989,7 +7988,8 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip) if (!status) return 0; - bio_put(bio); + if (bio != orig_bio) + bio_put(bio); out_err: dip->errors = 1; /* @@ -8030,7 +8030,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, bio->bi_private = dip; dip->orig_bio = bio; dip->dio_bio = dio_bio; - atomic_set(&dip->pending_bios, 0); + atomic_set(&dip->pending_bios, 1); io_bio = btrfs_io_bio(bio); io_bio->logical = file_offset;

5 years, 6 months

1
0
0 0

FAILED: patch "[PATCH] btrfs: fix error handling when submitting direct I/O bio" failed to apply to 4.9-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 4.9-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable(a)vger.kernel.org>. thanks, greg k-h ------------------ original commit in Linus's tree ------------------ >From 6d3113a193e3385c72240096fe397618ecab6e43 Mon Sep 17 00:00:00 2001 From: Omar Sandoval <osandov(a)fb.com> Date: Thu, 16 Apr 2020 14:46:12 -0700 Subject: [PATCH] btrfs: fix error handling when submitting direct I/O bio In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID stripe or chunk, we submit orig_bio without cloning it. In this case, we don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we decrement pending_bios to -1, and we never complete orig_bio. Fix it by initializing pending_bios to 1 instead of incrementing later. Fixing this exposes another bug: we put orig_bio prematurely and then put it again from end_io. Fix it by not putting orig_bio. After this change, pending_bios is really more of a reference count, but I'll leave that cleanup separate to keep the fix small. Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO") CC: stable(a)vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov(a)suse.com> Reviewed-by: Josef Bacik <josef(a)toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com> Signed-off-by: Omar Sandoval <osandov(a)fb.com> Signed-off-by: David Sterba <dsterba(a)suse.com> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 259239b33370..b628c319a5b6 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7939,7 +7939,6 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip) /* bio split */ ASSERT(geom.len <= INT_MAX); - atomic_inc(&dip->pending_bios); do { clone_len = min_t(int, submit_len, geom.len); @@ -7989,7 +7988,8 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip) if (!status) return 0; - bio_put(bio); + if (bio != orig_bio) + bio_put(bio); out_err: dip->errors = 1; /* @@ -8030,7 +8030,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, bio->bi_private = dip; dip->orig_bio = bio; dip->dio_bio = dio_bio; - atomic_set(&dip->pending_bios, 0); + atomic_set(&dip->pending_bios, 1); io_bio = btrfs_io_bio(bio); io_bio->logical = file_offset;

5 years, 6 months

1
0
0 0

FAILED: patch "[PATCH] btrfs: fix error handling when submitting direct I/O bio" failed to apply to 4.14-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 4.14-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable(a)vger.kernel.org>. thanks, greg k-h ------------------ original commit in Linus's tree ------------------ >From 6d3113a193e3385c72240096fe397618ecab6e43 Mon Sep 17 00:00:00 2001 From: Omar Sandoval <osandov(a)fb.com> Date: Thu, 16 Apr 2020 14:46:12 -0700 Subject: [PATCH] btrfs: fix error handling when submitting direct I/O bio In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID stripe or chunk, we submit orig_bio without cloning it. In this case, we don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we decrement pending_bios to -1, and we never complete orig_bio. Fix it by initializing pending_bios to 1 instead of incrementing later. Fixing this exposes another bug: we put orig_bio prematurely and then put it again from end_io. Fix it by not putting orig_bio. After this change, pending_bios is really more of a reference count, but I'll leave that cleanup separate to keep the fix small. Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO") CC: stable(a)vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov(a)suse.com> Reviewed-by: Josef Bacik <josef(a)toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com> Signed-off-by: Omar Sandoval <osandov(a)fb.com> Signed-off-by: David Sterba <dsterba(a)suse.com> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 259239b33370..b628c319a5b6 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7939,7 +7939,6 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip) /* bio split */ ASSERT(geom.len <= INT_MAX); - atomic_inc(&dip->pending_bios); do { clone_len = min_t(int, submit_len, geom.len); @@ -7989,7 +7988,8 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip) if (!status) return 0; - bio_put(bio); + if (bio != orig_bio) + bio_put(bio); out_err: dip->errors = 1; /* @@ -8030,7 +8030,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, bio->bi_private = dip; dip->orig_bio = bio; dip->dio_bio = dio_bio; - atomic_set(&dip->pending_bios, 0); + atomic_set(&dip->pending_bios, 1); io_bio = btrfs_io_bio(bio); io_bio->logical = file_offset;

5 years, 6 months

1
0
0 0

FAILED: patch "[PATCH] btrfs: fix error handling when submitting direct I/O bio" failed to apply to 4.19-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 4.19-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable(a)vger.kernel.org>. thanks, greg k-h ------------------ original commit in Linus's tree ------------------ >From 6d3113a193e3385c72240096fe397618ecab6e43 Mon Sep 17 00:00:00 2001 From: Omar Sandoval <osandov(a)fb.com> Date: Thu, 16 Apr 2020 14:46:12 -0700 Subject: [PATCH] btrfs: fix error handling when submitting direct I/O bio In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID stripe or chunk, we submit orig_bio without cloning it. In this case, we don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we decrement pending_bios to -1, and we never complete orig_bio. Fix it by initializing pending_bios to 1 instead of incrementing later. Fixing this exposes another bug: we put orig_bio prematurely and then put it again from end_io. Fix it by not putting orig_bio. After this change, pending_bios is really more of a reference count, but I'll leave that cleanup separate to keep the fix small. Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO") CC: stable(a)vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov(a)suse.com> Reviewed-by: Josef Bacik <josef(a)toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com> Signed-off-by: Omar Sandoval <osandov(a)fb.com> Signed-off-by: David Sterba <dsterba(a)suse.com> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 259239b33370..b628c319a5b6 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7939,7 +7939,6 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip) /* bio split */ ASSERT(geom.len <= INT_MAX); - atomic_inc(&dip->pending_bios); do { clone_len = min_t(int, submit_len, geom.len); @@ -7989,7 +7988,8 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip) if (!status) return 0; - bio_put(bio); + if (bio != orig_bio) + bio_put(bio); out_err: dip->errors = 1; /* @@ -8030,7 +8030,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, bio->bi_private = dip; dip->orig_bio = bio; dip->dio_bio = dio_bio; - atomic_set(&dip->pending_bios, 0); + atomic_set(&dip->pending_bios, 1); io_bio = btrfs_io_bio(bio); io_bio->logical = file_offset;

5 years, 6 months

1
0
0 0

FAILED: patch "[PATCH] btrfs: fix a race between scrub and block group" failed to apply to 4.4-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 4.4-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable(a)vger.kernel.org>. thanks, greg k-h ------------------ original commit in Linus's tree ------------------ >From 2473d24f2b77da0ffabcbb916793e58e7f57440b Mon Sep 17 00:00:00 2001 From: Filipe Manana <fdmanana(a)suse.com> Date: Fri, 8 May 2020 11:01:10 +0100 Subject: [PATCH] btrfs: fix a race between scrub and block group removal/allocation When scrub is verifying the extents of a block group for a device, it is possible that the corresponding block group gets removed and its logical address and device extents get used for a new block group allocation. When this happens scrub incorrectly reports that errors were detected and, if the the new block group has a different profile then the old one, deleted block group, we can crash due to a null pointer dereference. Possibly other unexpected and weird consequences can happen as well. Consider the following sequence of actions that leads to the null pointer dereference crash when scrub is running in parallel with balance: 1) Balance sets block group X to read-only mode and starts relocating it. Block group X is a metadata block group, has a raid1 profile (two device extents, each one in a different device) and a logical address of 19424870400; 2) Scrub is running and finds device extent E, which belongs to block group X. It enters scrub_stripe() to find all extents allocated to block group X, the search is done using the extent tree; 3) Balance finishes relocating block group X and removes block group X; 4) Balance starts relocating another block group and when trying to commit the current transaction as part of the preparation step (prepare_to_relocate()), it blocks because scrub is running; 5) The scrub task finds the metadata extent at the logical address 19425001472 and marks the pages of the extent to be read by a bio (struct scrub_bio). The extent item's flags, which have the bit BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct scrub_page). It is these flags in the scrub pages that tells the bio's end io function (scrub_bio_end_io_worker) which type of extent it is dealing with. At this point we end up with 4 pages in a bio which is ready for submission (the metadata extent has a size of 16Kb, so that gives 4 pages on x86); 6) At the next iteration of scrub_stripe(), scrub checks that there is a pause request from the relocation task trying to commit a transaction, therefore it submits the pending bio and pauses, waiting for the transaction commit to complete before resuming; 7) The relocation task commits the transaction. The device extent E, that was used by our block group X, is now available for allocation, since the commit root for the device tree was swapped by the transaction commit; 8) Another task doing a direct IO write allocates a new data block group Y which ends using device extent E. This new block group Y also ends up getting the same logical address that block group X had: 19424870400. This happens because block group X was the block group with the highest logical address and, when allocating Y, find_next_chunk() returns the end offset of the current last block group to be used as the logical address for the new block group, which is 18351128576 + 1073741824 = 19424870400 So our new block group Y has the same logical address and device extent that block group X had. However Y is a data block group, while X was a metadata one, and Y has a raid0 profile, while X had a raid1 profile; 9) After allocating block group Y, the direct IO submits a bio to write to device extent E; 10) The read bio submitted by scrub reads the 4 pages (16Kb) from device extent E, which now correspond to the data written by the task that did a direct IO write. Then at the end io function associated with the bio, scrub_bio_end_io_worker(), we call scrub_block_complete() which calls scrub_checksum(). This later function checks the flags of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK is set in the flags, so it assumes it has a metadata extent and then calls scrub_checksum_tree_block(). That functions returns an error, since interpreting data as a metadata extent causes the checksum verification to fail. So this makes scrub_checksum() call scrub_handle_errored_block(), which determines 'failed_mirror_index' to be 1, since the device extent E was allocated as the second mirror of block group X. It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at 'sblocks_for_recheck', and all the memory is initialized to zeroes by kcalloc(). After that it calls scrub_setup_recheck_block(), which is responsible for filling each of those structures. However, when that function calls btrfs_map_sblock() against the logical address of the metadata extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches the current block group Y. However block group Y has a raid0 profile and not a raid1 profile like X had, so the following call returns 1: scrub_nr_raid_mirrors(bbio) And as a result scrub_setup_recheck_block() only initializes the first (index 0) scrub_block structure in 'sblocks_for_recheck'. Then scrub_recheck_block() is called by scrub_handle_errored_block() with the second (index 1) scrub_block structure as the argument, because 'failed_mirror_index' was previously set to 1. This scrub_block was not initialized by scrub_setup_recheck_block(), so it has zero pages, its 'page_count' member is 0 and its 'pagev' page array has all members pointing to NULL. Finally when scrub_recheck_block() calls scrub_recheck_block_checksum() we have a NULL pointer dereference when accessing the flags of the first page, as pavev[0] is NULL: static void scrub_recheck_block_checksum(struct scrub_block *sblock) { (...) if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA) scrub_checksum_data(sblock); (...) } Producing a stack trace like the following: [542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028 [542998.010238] #PF: supervisor read access in kernel mode [542998.010878] #PF: error_code(0x0000) - not-present page [542998.011516] PGD 0 P4D 0 [542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1 [542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs] [542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs] [542998.018474] Code: 4c 89 e6 ... [542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202 [542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000 [542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120 [542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001 [542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800 [542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0 [542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000 [542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0 [542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [542998.032380] Call Trace: [542998.032752] scrub_recheck_block+0x162/0x400 [btrfs] [542998.033500] ? __alloc_pages_nodemask+0x31e/0x460 [542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs] [542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs] [542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs] [542998.036735] process_one_work+0x26d/0x6a0 [542998.037275] worker_thread+0x4f/0x3e0 [542998.037740] ? process_one_work+0x6a0/0x6a0 [542998.038378] kthread+0x103/0x140 [542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70 [542998.039419] ret_from_fork+0x3a/0x50 [542998.039875] Modules linked in: dm_snapshot dm_thin_pool ... [542998.047288] CR2: 0000000000000028 [542998.047724] ---[ end trace bde186e176c7f96a ]--- This issue has been around for a long time, possibly since scrub exists. The last time I ran into it was over 2 years ago. After recently fixing fstests to pass the "--full-balance" command line option to btrfs-progs when doing balance, several tests started to more heavily exercise balance with fsstress, scrub and other operations in parallel, and therefore started to hit this issue again (with btrfs/061 for example). Fix this by having scrub increment the 'trimming' counter of the block group, which pins the block group in such a way that it guarantees neither its logical address nor device extents can be reused by future block group allocations until we decrement the 'trimming' counter. Also make sure that on each iteration of scrub_stripe() we stop scrubbing the block group if it was removed already. A later patch in the series will rename the block group's 'trimming' counter and its helpers to a more generic name, since now it is not used exclusively for pinning while trimming anymore. CC: stable(a)vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana(a)suse.com> Signed-off-by: David Sterba <dsterba(a)suse.com> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index adaf8ab694d5..7c50ac5b6876 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -3046,7 +3046,8 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx, static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, struct map_lookup *map, struct btrfs_device *scrub_dev, - int num, u64 base, u64 length) + int num, u64 base, u64 length, + struct btrfs_block_group *cache) { struct btrfs_path *path, *ppath; struct btrfs_fs_info *fs_info = sctx->fs_info; @@ -3284,6 +3285,20 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, break; } + /* + * If our block group was removed in the meanwhile, just + * stop scrubbing since there is no point in continuing. + * Continuing would prevent reusing its device extents + * for new block groups for a long time. + */ + spin_lock(&cache->lock); + if (cache->removed) { + spin_unlock(&cache->lock); + ret = 0; + goto out; + } + spin_unlock(&cache->lock); + extent = btrfs_item_ptr(l, slot, struct btrfs_extent_item); flags = btrfs_extent_flags(l, extent); @@ -3457,7 +3472,7 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx, if (map->stripes[i].dev->bdev == scrub_dev->bdev && map->stripes[i].physical == dev_offset) { ret = scrub_stripe(sctx, map, scrub_dev, i, - chunk_offset, length); + chunk_offset, length, cache); if (ret) goto out; } @@ -3554,6 +3569,23 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, if (!cache) goto skip; + /* + * Make sure that while we are scrubbing the corresponding block + * group doesn't get its logical address and its device extents + * reused for another block group, which can possibly be of a + * different type and different profile. We do this to prevent + * false error detections and crashes due to bogus attempts to + * repair extents. + */ + spin_lock(&cache->lock); + if (cache->removed) { + spin_unlock(&cache->lock); + btrfs_put_block_group(cache); + goto skip; + } + btrfs_get_block_group_trimming(cache); + spin_unlock(&cache->lock); + /* * we need call btrfs_inc_block_group_ro() with scrubs_paused, * to avoid deadlock caused by: @@ -3609,6 +3641,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, } else { btrfs_warn(fs_info, "failed setting block group ro: %d", ret); + btrfs_put_block_group_trimming(cache); btrfs_put_block_group(cache); scrub_pause_off(fs_info); break; @@ -3695,6 +3728,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, spin_unlock(&cache->lock); } + btrfs_put_block_group_trimming(cache); btrfs_put_block_group(cache); if (ret) break;

5 years, 6 months

1
0
0 0

FAILED: patch "[PATCH] btrfs: fix a race between scrub and block group" failed to apply to 4.9-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 4.9-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable(a)vger.kernel.org>. thanks, greg k-h ------------------ original commit in Linus's tree ------------------ >From 2473d24f2b77da0ffabcbb916793e58e7f57440b Mon Sep 17 00:00:00 2001 From: Filipe Manana <fdmanana(a)suse.com> Date: Fri, 8 May 2020 11:01:10 +0100 Subject: [PATCH] btrfs: fix a race between scrub and block group removal/allocation When scrub is verifying the extents of a block group for a device, it is possible that the corresponding block group gets removed and its logical address and device extents get used for a new block group allocation. When this happens scrub incorrectly reports that errors were detected and, if the the new block group has a different profile then the old one, deleted block group, we can crash due to a null pointer dereference. Possibly other unexpected and weird consequences can happen as well. Consider the following sequence of actions that leads to the null pointer dereference crash when scrub is running in parallel with balance: 1) Balance sets block group X to read-only mode and starts relocating it. Block group X is a metadata block group, has a raid1 profile (two device extents, each one in a different device) and a logical address of 19424870400; 2) Scrub is running and finds device extent E, which belongs to block group X. It enters scrub_stripe() to find all extents allocated to block group X, the search is done using the extent tree; 3) Balance finishes relocating block group X and removes block group X; 4) Balance starts relocating another block group and when trying to commit the current transaction as part of the preparation step (prepare_to_relocate()), it blocks because scrub is running; 5) The scrub task finds the metadata extent at the logical address 19425001472 and marks the pages of the extent to be read by a bio (struct scrub_bio). The extent item's flags, which have the bit BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct scrub_page). It is these flags in the scrub pages that tells the bio's end io function (scrub_bio_end_io_worker) which type of extent it is dealing with. At this point we end up with 4 pages in a bio which is ready for submission (the metadata extent has a size of 16Kb, so that gives 4 pages on x86); 6) At the next iteration of scrub_stripe(), scrub checks that there is a pause request from the relocation task trying to commit a transaction, therefore it submits the pending bio and pauses, waiting for the transaction commit to complete before resuming; 7) The relocation task commits the transaction. The device extent E, that was used by our block group X, is now available for allocation, since the commit root for the device tree was swapped by the transaction commit; 8) Another task doing a direct IO write allocates a new data block group Y which ends using device extent E. This new block group Y also ends up getting the same logical address that block group X had: 19424870400. This happens because block group X was the block group with the highest logical address and, when allocating Y, find_next_chunk() returns the end offset of the current last block group to be used as the logical address for the new block group, which is 18351128576 + 1073741824 = 19424870400 So our new block group Y has the same logical address and device extent that block group X had. However Y is a data block group, while X was a metadata one, and Y has a raid0 profile, while X had a raid1 profile; 9) After allocating block group Y, the direct IO submits a bio to write to device extent E; 10) The read bio submitted by scrub reads the 4 pages (16Kb) from device extent E, which now correspond to the data written by the task that did a direct IO write. Then at the end io function associated with the bio, scrub_bio_end_io_worker(), we call scrub_block_complete() which calls scrub_checksum(). This later function checks the flags of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK is set in the flags, so it assumes it has a metadata extent and then calls scrub_checksum_tree_block(). That functions returns an error, since interpreting data as a metadata extent causes the checksum verification to fail. So this makes scrub_checksum() call scrub_handle_errored_block(), which determines 'failed_mirror_index' to be 1, since the device extent E was allocated as the second mirror of block group X. It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at 'sblocks_for_recheck', and all the memory is initialized to zeroes by kcalloc(). After that it calls scrub_setup_recheck_block(), which is responsible for filling each of those structures. However, when that function calls btrfs_map_sblock() against the logical address of the metadata extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches the current block group Y. However block group Y has a raid0 profile and not a raid1 profile like X had, so the following call returns 1: scrub_nr_raid_mirrors(bbio) And as a result scrub_setup_recheck_block() only initializes the first (index 0) scrub_block structure in 'sblocks_for_recheck'. Then scrub_recheck_block() is called by scrub_handle_errored_block() with the second (index 1) scrub_block structure as the argument, because 'failed_mirror_index' was previously set to 1. This scrub_block was not initialized by scrub_setup_recheck_block(), so it has zero pages, its 'page_count' member is 0 and its 'pagev' page array has all members pointing to NULL. Finally when scrub_recheck_block() calls scrub_recheck_block_checksum() we have a NULL pointer dereference when accessing the flags of the first page, as pavev[0] is NULL: static void scrub_recheck_block_checksum(struct scrub_block *sblock) { (...) if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA) scrub_checksum_data(sblock); (...) } Producing a stack trace like the following: [542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028 [542998.010238] #PF: supervisor read access in kernel mode [542998.010878] #PF: error_code(0x0000) - not-present page [542998.011516] PGD 0 P4D 0 [542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1 [542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs] [542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs] [542998.018474] Code: 4c 89 e6 ... [542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202 [542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000 [542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120 [542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001 [542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800 [542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0 [542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000 [542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0 [542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [542998.032380] Call Trace: [542998.032752] scrub_recheck_block+0x162/0x400 [btrfs] [542998.033500] ? __alloc_pages_nodemask+0x31e/0x460 [542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs] [542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs] [542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs] [542998.036735] process_one_work+0x26d/0x6a0 [542998.037275] worker_thread+0x4f/0x3e0 [542998.037740] ? process_one_work+0x6a0/0x6a0 [542998.038378] kthread+0x103/0x140 [542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70 [542998.039419] ret_from_fork+0x3a/0x50 [542998.039875] Modules linked in: dm_snapshot dm_thin_pool ... [542998.047288] CR2: 0000000000000028 [542998.047724] ---[ end trace bde186e176c7f96a ]--- This issue has been around for a long time, possibly since scrub exists. The last time I ran into it was over 2 years ago. After recently fixing fstests to pass the "--full-balance" command line option to btrfs-progs when doing balance, several tests started to more heavily exercise balance with fsstress, scrub and other operations in parallel, and therefore started to hit this issue again (with btrfs/061 for example). Fix this by having scrub increment the 'trimming' counter of the block group, which pins the block group in such a way that it guarantees neither its logical address nor device extents can be reused by future block group allocations until we decrement the 'trimming' counter. Also make sure that on each iteration of scrub_stripe() we stop scrubbing the block group if it was removed already. A later patch in the series will rename the block group's 'trimming' counter and its helpers to a more generic name, since now it is not used exclusively for pinning while trimming anymore. CC: stable(a)vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana(a)suse.com> Signed-off-by: David Sterba <dsterba(a)suse.com> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index adaf8ab694d5..7c50ac5b6876 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -3046,7 +3046,8 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx, static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, struct map_lookup *map, struct btrfs_device *scrub_dev, - int num, u64 base, u64 length) + int num, u64 base, u64 length, + struct btrfs_block_group *cache) { struct btrfs_path *path, *ppath; struct btrfs_fs_info *fs_info = sctx->fs_info; @@ -3284,6 +3285,20 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, break; } + /* + * If our block group was removed in the meanwhile, just + * stop scrubbing since there is no point in continuing. + * Continuing would prevent reusing its device extents + * for new block groups for a long time. + */ + spin_lock(&cache->lock); + if (cache->removed) { + spin_unlock(&cache->lock); + ret = 0; + goto out; + } + spin_unlock(&cache->lock); + extent = btrfs_item_ptr(l, slot, struct btrfs_extent_item); flags = btrfs_extent_flags(l, extent); @@ -3457,7 +3472,7 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx, if (map->stripes[i].dev->bdev == scrub_dev->bdev && map->stripes[i].physical == dev_offset) { ret = scrub_stripe(sctx, map, scrub_dev, i, - chunk_offset, length); + chunk_offset, length, cache); if (ret) goto out; } @@ -3554,6 +3569,23 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, if (!cache) goto skip; + /* + * Make sure that while we are scrubbing the corresponding block + * group doesn't get its logical address and its device extents + * reused for another block group, which can possibly be of a + * different type and different profile. We do this to prevent + * false error detections and crashes due to bogus attempts to + * repair extents. + */ + spin_lock(&cache->lock); + if (cache->removed) { + spin_unlock(&cache->lock); + btrfs_put_block_group(cache); + goto skip; + } + btrfs_get_block_group_trimming(cache); + spin_unlock(&cache->lock); + /* * we need call btrfs_inc_block_group_ro() with scrubs_paused, * to avoid deadlock caused by: @@ -3609,6 +3641,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, } else { btrfs_warn(fs_info, "failed setting block group ro: %d", ret); + btrfs_put_block_group_trimming(cache); btrfs_put_block_group(cache); scrub_pause_off(fs_info); break; @@ -3695,6 +3728,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, spin_unlock(&cache->lock); } + btrfs_put_block_group_trimming(cache); btrfs_put_block_group(cache); if (ret) break;

5 years, 6 months

1
0
0 0

FAILED: patch "[PATCH] btrfs: fix a race between scrub and block group" failed to apply to 4.14-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 4.14-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable(a)vger.kernel.org>. thanks, greg k-h ------------------ original commit in Linus's tree ------------------ >From 2473d24f2b77da0ffabcbb916793e58e7f57440b Mon Sep 17 00:00:00 2001 From: Filipe Manana <fdmanana(a)suse.com> Date: Fri, 8 May 2020 11:01:10 +0100 Subject: [PATCH] btrfs: fix a race between scrub and block group removal/allocation When scrub is verifying the extents of a block group for a device, it is possible that the corresponding block group gets removed and its logical address and device extents get used for a new block group allocation. When this happens scrub incorrectly reports that errors were detected and, if the the new block group has a different profile then the old one, deleted block group, we can crash due to a null pointer dereference. Possibly other unexpected and weird consequences can happen as well. Consider the following sequence of actions that leads to the null pointer dereference crash when scrub is running in parallel with balance: 1) Balance sets block group X to read-only mode and starts relocating it. Block group X is a metadata block group, has a raid1 profile (two device extents, each one in a different device) and a logical address of 19424870400; 2) Scrub is running and finds device extent E, which belongs to block group X. It enters scrub_stripe() to find all extents allocated to block group X, the search is done using the extent tree; 3) Balance finishes relocating block group X and removes block group X; 4) Balance starts relocating another block group and when trying to commit the current transaction as part of the preparation step (prepare_to_relocate()), it blocks because scrub is running; 5) The scrub task finds the metadata extent at the logical address 19425001472 and marks the pages of the extent to be read by a bio (struct scrub_bio). The extent item's flags, which have the bit BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct scrub_page). It is these flags in the scrub pages that tells the bio's end io function (scrub_bio_end_io_worker) which type of extent it is dealing with. At this point we end up with 4 pages in a bio which is ready for submission (the metadata extent has a size of 16Kb, so that gives 4 pages on x86); 6) At the next iteration of scrub_stripe(), scrub checks that there is a pause request from the relocation task trying to commit a transaction, therefore it submits the pending bio and pauses, waiting for the transaction commit to complete before resuming; 7) The relocation task commits the transaction. The device extent E, that was used by our block group X, is now available for allocation, since the commit root for the device tree was swapped by the transaction commit; 8) Another task doing a direct IO write allocates a new data block group Y which ends using device extent E. This new block group Y also ends up getting the same logical address that block group X had: 19424870400. This happens because block group X was the block group with the highest logical address and, when allocating Y, find_next_chunk() returns the end offset of the current last block group to be used as the logical address for the new block group, which is 18351128576 + 1073741824 = 19424870400 So our new block group Y has the same logical address and device extent that block group X had. However Y is a data block group, while X was a metadata one, and Y has a raid0 profile, while X had a raid1 profile; 9) After allocating block group Y, the direct IO submits a bio to write to device extent E; 10) The read bio submitted by scrub reads the 4 pages (16Kb) from device extent E, which now correspond to the data written by the task that did a direct IO write. Then at the end io function associated with the bio, scrub_bio_end_io_worker(), we call scrub_block_complete() which calls scrub_checksum(). This later function checks the flags of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK is set in the flags, so it assumes it has a metadata extent and then calls scrub_checksum_tree_block(). That functions returns an error, since interpreting data as a metadata extent causes the checksum verification to fail. So this makes scrub_checksum() call scrub_handle_errored_block(), which determines 'failed_mirror_index' to be 1, since the device extent E was allocated as the second mirror of block group X. It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at 'sblocks_for_recheck', and all the memory is initialized to zeroes by kcalloc(). After that it calls scrub_setup_recheck_block(), which is responsible for filling each of those structures. However, when that function calls btrfs_map_sblock() against the logical address of the metadata extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches the current block group Y. However block group Y has a raid0 profile and not a raid1 profile like X had, so the following call returns 1: scrub_nr_raid_mirrors(bbio) And as a result scrub_setup_recheck_block() only initializes the first (index 0) scrub_block structure in 'sblocks_for_recheck'. Then scrub_recheck_block() is called by scrub_handle_errored_block() with the second (index 1) scrub_block structure as the argument, because 'failed_mirror_index' was previously set to 1. This scrub_block was not initialized by scrub_setup_recheck_block(), so it has zero pages, its 'page_count' member is 0 and its 'pagev' page array has all members pointing to NULL. Finally when scrub_recheck_block() calls scrub_recheck_block_checksum() we have a NULL pointer dereference when accessing the flags of the first page, as pavev[0] is NULL: static void scrub_recheck_block_checksum(struct scrub_block *sblock) { (...) if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA) scrub_checksum_data(sblock); (...) } Producing a stack trace like the following: [542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028 [542998.010238] #PF: supervisor read access in kernel mode [542998.010878] #PF: error_code(0x0000) - not-present page [542998.011516] PGD 0 P4D 0 [542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1 [542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs] [542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs] [542998.018474] Code: 4c 89 e6 ... [542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202 [542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000 [542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120 [542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001 [542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800 [542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0 [542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000 [542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0 [542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [542998.032380] Call Trace: [542998.032752] scrub_recheck_block+0x162/0x400 [btrfs] [542998.033500] ? __alloc_pages_nodemask+0x31e/0x460 [542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs] [542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs] [542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs] [542998.036735] process_one_work+0x26d/0x6a0 [542998.037275] worker_thread+0x4f/0x3e0 [542998.037740] ? process_one_work+0x6a0/0x6a0 [542998.038378] kthread+0x103/0x140 [542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70 [542998.039419] ret_from_fork+0x3a/0x50 [542998.039875] Modules linked in: dm_snapshot dm_thin_pool ... [542998.047288] CR2: 0000000000000028 [542998.047724] ---[ end trace bde186e176c7f96a ]--- This issue has been around for a long time, possibly since scrub exists. The last time I ran into it was over 2 years ago. After recently fixing fstests to pass the "--full-balance" command line option to btrfs-progs when doing balance, several tests started to more heavily exercise balance with fsstress, scrub and other operations in parallel, and therefore started to hit this issue again (with btrfs/061 for example). Fix this by having scrub increment the 'trimming' counter of the block group, which pins the block group in such a way that it guarantees neither its logical address nor device extents can be reused by future block group allocations until we decrement the 'trimming' counter. Also make sure that on each iteration of scrub_stripe() we stop scrubbing the block group if it was removed already. A later patch in the series will rename the block group's 'trimming' counter and its helpers to a more generic name, since now it is not used exclusively for pinning while trimming anymore. CC: stable(a)vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana(a)suse.com> Signed-off-by: David Sterba <dsterba(a)suse.com> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index adaf8ab694d5..7c50ac5b6876 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -3046,7 +3046,8 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx, static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, struct map_lookup *map, struct btrfs_device *scrub_dev, - int num, u64 base, u64 length) + int num, u64 base, u64 length, + struct btrfs_block_group *cache) { struct btrfs_path *path, *ppath; struct btrfs_fs_info *fs_info = sctx->fs_info; @@ -3284,6 +3285,20 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, break; } + /* + * If our block group was removed in the meanwhile, just + * stop scrubbing since there is no point in continuing. + * Continuing would prevent reusing its device extents + * for new block groups for a long time. + */ + spin_lock(&cache->lock); + if (cache->removed) { + spin_unlock(&cache->lock); + ret = 0; + goto out; + } + spin_unlock(&cache->lock); + extent = btrfs_item_ptr(l, slot, struct btrfs_extent_item); flags = btrfs_extent_flags(l, extent); @@ -3457,7 +3472,7 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx, if (map->stripes[i].dev->bdev == scrub_dev->bdev && map->stripes[i].physical == dev_offset) { ret = scrub_stripe(sctx, map, scrub_dev, i, - chunk_offset, length); + chunk_offset, length, cache); if (ret) goto out; } @@ -3554,6 +3569,23 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, if (!cache) goto skip; + /* + * Make sure that while we are scrubbing the corresponding block + * group doesn't get its logical address and its device extents + * reused for another block group, which can possibly be of a + * different type and different profile. We do this to prevent + * false error detections and crashes due to bogus attempts to + * repair extents. + */ + spin_lock(&cache->lock); + if (cache->removed) { + spin_unlock(&cache->lock); + btrfs_put_block_group(cache); + goto skip; + } + btrfs_get_block_group_trimming(cache); + spin_unlock(&cache->lock); + /* * we need call btrfs_inc_block_group_ro() with scrubs_paused, * to avoid deadlock caused by: @@ -3609,6 +3641,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, } else { btrfs_warn(fs_info, "failed setting block group ro: %d", ret); + btrfs_put_block_group_trimming(cache); btrfs_put_block_group(cache); scrub_pause_off(fs_info); break; @@ -3695,6 +3728,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, spin_unlock(&cache->lock); } + btrfs_put_block_group_trimming(cache); btrfs_put_block_group(cache); if (ret) break;

5 years, 6 months

1
0
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror June 2020