The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x edcf9725150e42beeca42d085149f4c88fa97afd
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012701-catfight-regular-bc98@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
edcf9725150e ("nfsd: fix RELEASE_LOCKOWNER")
043862b09cc0 ("NFSD: Add documenting comment for nfsd4_release_lockowner()")
bd8fdb6e545f ("NFSD: Modernize nfsd4_release_lockowner()")
ce3c4ad7f4ce ("NFSD: Fix possible sleep during nfsd4_release_lockowner()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From edcf9725150e42beeca42d085149f4c88fa97afd Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb(a)suse.de>
Date: Mon, 22 Jan 2024 14:58:16 +1100
Subject: [PATCH] nfsd: fix RELEASE_LOCKOWNER
The test on so_count in nfsd4_release_lockowner() is nonsense and
harmful. Revert to using check_for_locks(), changing that to not sleep.
First: harmful.
As is documented in the kdoc comment for nfsd4_release_lockowner(), the
test on so_count can transiently return a false positive resulting in a
return of NFS4ERR_LOCKS_HELD when in fact no locks are held. This is
clearly a protocol violation and with the Linux NFS client it can cause
incorrect behaviour.
If RELEASE_LOCKOWNER is sent while some other thread is still
processing a LOCK request which failed because, at the time that request
was received, the given owner held a conflicting lock, then the nfsd
thread processing that LOCK request can hold a reference (conflock) to
the lock owner that causes nfsd4_release_lockowner() to return an
incorrect error.
The Linux NFS client ignores that NFS4ERR_LOCKS_HELD error because it
never sends NFS4_RELEASE_LOCKOWNER without first releasing any locks, so
it knows that the error is impossible. It assumes the lock owner was in
fact released so it feels free to use the same lock owner identifier in
some later locking request.
When it does reuse a lock owner identifier for which a previous RELEASE
failed, it will naturally use a lock_seqid of zero. However the server,
which didn't release the lock owner, will expect a larger lock_seqid and
so will respond with NFS4ERR_BAD_SEQID.
So clearly it is harmful to allow a false positive, which testing
so_count allows.
The test is nonsense because ... well... it doesn't mean anything.
so_count is the sum of three different counts.
1/ the set of states listed on so_stateids
2/ the set of active vfs locks owned by any of those states
3/ various transient counts such as for conflicting locks.
When it is tested against '2' it is clear that one of these is the
transient reference obtained by find_lockowner_str_locked(). It is not
clear what the other one is expected to be.
In practice, the count is often 2 because there is precisely one state
on so_stateids. If there were more, this would fail.
In my testing I see two circumstances when RELEASE_LOCKOWNER is called.
In one case, CLOSE is called before RELEASE_LOCKOWNER. That results in
all the lock states being removed, and so the lockowner being discarded
(it is removed when there are no more references which usually happens
when the lock state is discarded). When nfsd4_release_lockowner() finds
that the lock owner doesn't exist, it returns success.
The other case shows an so_count of '2' and precisely one state listed
in so_stateid. It appears that the Linux client uses a separate lock
owner for each file resulting in one lock state per lock owner, so this
test on '2' is safe. For another client it might not be safe.
So this patch changes check_for_locks() to use the (newish)
find_any_file_locked() so that it doesn't take a reference on the
nfs4_file and so never calls nfsd_file_put(), and so never sleeps. With
this check is it safe to restore the use of check_for_locks() rather
than testing so_count against the mysterious '2'.
Fixes: ce3c4ad7f4ce ("NFSD: Fix possible sleep during nfsd4_release_lockowner()")
Signed-off-by: NeilBrown <neilb(a)suse.de>
Reviewed-by: Jeff Layton <jlayton(a)kernel.org>
Cc: stable(a)vger.kernel.org # v6.2+
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 2fa54cfd4882..6dc6340e2852 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -7911,14 +7911,16 @@ check_for_locks(struct nfs4_file *fp, struct nfs4_lockowner *lowner)
{
struct file_lock *fl;
int status = false;
- struct nfsd_file *nf = find_any_file(fp);
+ struct nfsd_file *nf;
struct inode *inode;
struct file_lock_context *flctx;
+ spin_lock(&fp->fi_lock);
+ nf = find_any_file_locked(fp);
if (!nf) {
/* Any valid lock stateid should have some sort of access */
WARN_ON_ONCE(1);
- return status;
+ goto out;
}
inode = file_inode(nf->nf_file);
@@ -7934,7 +7936,8 @@ check_for_locks(struct nfs4_file *fp, struct nfs4_lockowner *lowner)
}
spin_unlock(&flctx->flc_lock);
}
- nfsd_file_put(nf);
+out:
+ spin_unlock(&fp->fi_lock);
return status;
}
@@ -7944,10 +7947,8 @@ check_for_locks(struct nfs4_file *fp, struct nfs4_lockowner *lowner)
* @cstate: NFSv4 COMPOUND state
* @u: RELEASE_LOCKOWNER arguments
*
- * The lockowner's so_count is bumped when a lock record is added
- * or when copying a conflicting lock. The latter case is brief,
- * but can lead to fleeting false positives when looking for
- * locks-in-use.
+ * Check if theree are any locks still held and if not - free the lockowner
+ * and any lock state that is owned.
*
* Return values:
* %nfs_ok: lockowner released or not found
@@ -7983,10 +7984,13 @@ nfsd4_release_lockowner(struct svc_rqst *rqstp,
spin_unlock(&clp->cl_lock);
return nfs_ok;
}
- if (atomic_read(&lo->lo_owner.so_count) != 2) {
- spin_unlock(&clp->cl_lock);
- nfs4_put_stateowner(&lo->lo_owner);
- return nfserr_locks_held;
+
+ list_for_each_entry(stp, &lo->lo_owner.so_stateids, st_perstateowner) {
+ if (check_for_locks(stp->st_stid.sc_file, lo)) {
+ spin_unlock(&clp->cl_lock);
+ nfs4_put_stateowner(&lo->lo_owner);
+ return nfserr_locks_held;
+ }
}
unhash_lockowner_locked(lo);
while (!list_empty(&lo->lo_owner.so_stateids)) {
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 02444f2ac26eae6385a65fcd66915084d15dffba
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012707-pushover-sherry-f45f@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
02444f2ac26e ("btrfs: zoned: optimize hint byte for zoned allocator")
b271fee9a41c ("btrfs: zoned: factor out prepare_allocation_zoned()")
c2707a255623 ("btrfs: zoned: add a dedicated data relocation block group")
be1a1d7a5d24 ("btrfs: zoned: finish fully written block group")
a85f05e59bc1 ("btrfs: zoned: avoid chunk allocation if active block group has enough space")
a12b0dc0aa4d ("btrfs: move ffe_ctl one level up")
2e654e4bb9ac ("btrfs: zoned: activate block group on allocation")
afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
dafc340dbd10 ("btrfs: zoned: introduce physical_map to btrfs_block_group")
98173255bddd ("btrfs: zoned: calculate free space from zone capacity")
8eae532be753 ("btrfs: zoned: load zone capacity information from devices")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 02444f2ac26eae6385a65fcd66915084d15dffba Mon Sep 17 00:00:00 2001
From: Naohiro Aota <naohiro.aota(a)wdc.com>
Date: Tue, 19 Dec 2023 01:02:29 +0900
Subject: [PATCH] btrfs: zoned: optimize hint byte for zoned allocator
Writing sequentially to a huge file on btrfs on a SMR HDD revealed a
decline of the performance (220 MiB/s to 30 MiB/s after 500 minutes).
The performance goes down because of increased latency of the extent
allocation, which is induced by a traversing of a lot of full block groups.
So, this patch optimizes the ffe_ctl->hint_byte by choosing a block group
with sufficient size from the active block group list, which does not
contain full block groups.
After applying the patch, the performance is maintained well.
Fixes: 2eda57089ea3 ("btrfs: zoned: implement sequential extent allocation")
CC: stable(a)vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota(a)wdc.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d260b970bec7..6d680031211a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4311,6 +4311,24 @@ static int prepare_allocation_zoned(struct btrfs_fs_info *fs_info,
if (fs_info->data_reloc_bg)
ffe_ctl->hint_byte = fs_info->data_reloc_bg;
spin_unlock(&fs_info->relocation_bg_lock);
+ } else if (ffe_ctl->flags & BTRFS_BLOCK_GROUP_DATA) {
+ struct btrfs_block_group *block_group;
+
+ spin_lock(&fs_info->zone_active_bgs_lock);
+ list_for_each_entry(block_group, &fs_info->zone_active_bgs, active_bg_list) {
+ /*
+ * No lock is OK here because avail is monotinically
+ * decreasing, and this is just a hint.
+ */
+ u64 avail = block_group->zone_capacity - block_group->alloc_offset;
+
+ if (block_group_bits(block_group, ffe_ctl->flags) &&
+ avail >= ffe_ctl->num_bytes) {
+ ffe_ctl->hint_byte = block_group->start;
+ break;
+ }
+ }
+ spin_unlock(&fs_info->zone_active_bgs_lock);
}
return 0;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 02444f2ac26eae6385a65fcd66915084d15dffba
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012705-replace-mumble-78cc@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
02444f2ac26e ("btrfs: zoned: optimize hint byte for zoned allocator")
b271fee9a41c ("btrfs: zoned: factor out prepare_allocation_zoned()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 02444f2ac26eae6385a65fcd66915084d15dffba Mon Sep 17 00:00:00 2001
From: Naohiro Aota <naohiro.aota(a)wdc.com>
Date: Tue, 19 Dec 2023 01:02:29 +0900
Subject: [PATCH] btrfs: zoned: optimize hint byte for zoned allocator
Writing sequentially to a huge file on btrfs on a SMR HDD revealed a
decline of the performance (220 MiB/s to 30 MiB/s after 500 minutes).
The performance goes down because of increased latency of the extent
allocation, which is induced by a traversing of a lot of full block groups.
So, this patch optimizes the ffe_ctl->hint_byte by choosing a block group
with sufficient size from the active block group list, which does not
contain full block groups.
After applying the patch, the performance is maintained well.
Fixes: 2eda57089ea3 ("btrfs: zoned: implement sequential extent allocation")
CC: stable(a)vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota(a)wdc.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d260b970bec7..6d680031211a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4311,6 +4311,24 @@ static int prepare_allocation_zoned(struct btrfs_fs_info *fs_info,
if (fs_info->data_reloc_bg)
ffe_ctl->hint_byte = fs_info->data_reloc_bg;
spin_unlock(&fs_info->relocation_bg_lock);
+ } else if (ffe_ctl->flags & BTRFS_BLOCK_GROUP_DATA) {
+ struct btrfs_block_group *block_group;
+
+ spin_lock(&fs_info->zone_active_bgs_lock);
+ list_for_each_entry(block_group, &fs_info->zone_active_bgs, active_bg_list) {
+ /*
+ * No lock is OK here because avail is monotinically
+ * decreasing, and this is just a hint.
+ */
+ u64 avail = block_group->zone_capacity - block_group->alloc_offset;
+
+ if (block_group_bits(block_group, ffe_ctl->flags) &&
+ avail >= ffe_ctl->num_bytes) {
+ ffe_ctl->hint_byte = block_group->start;
+ break;
+ }
+ }
+ spin_unlock(&fs_info->zone_active_bgs_lock);
}
return 0;
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 02444f2ac26eae6385a65fcd66915084d15dffba
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012704-outclass-seventy-a7bb@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
02444f2ac26e ("btrfs: zoned: optimize hint byte for zoned allocator")
b271fee9a41c ("btrfs: zoned: factor out prepare_allocation_zoned()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 02444f2ac26eae6385a65fcd66915084d15dffba Mon Sep 17 00:00:00 2001
From: Naohiro Aota <naohiro.aota(a)wdc.com>
Date: Tue, 19 Dec 2023 01:02:29 +0900
Subject: [PATCH] btrfs: zoned: optimize hint byte for zoned allocator
Writing sequentially to a huge file on btrfs on a SMR HDD revealed a
decline of the performance (220 MiB/s to 30 MiB/s after 500 minutes).
The performance goes down because of increased latency of the extent
allocation, which is induced by a traversing of a lot of full block groups.
So, this patch optimizes the ffe_ctl->hint_byte by choosing a block group
with sufficient size from the active block group list, which does not
contain full block groups.
After applying the patch, the performance is maintained well.
Fixes: 2eda57089ea3 ("btrfs: zoned: implement sequential extent allocation")
CC: stable(a)vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota(a)wdc.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d260b970bec7..6d680031211a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4311,6 +4311,24 @@ static int prepare_allocation_zoned(struct btrfs_fs_info *fs_info,
if (fs_info->data_reloc_bg)
ffe_ctl->hint_byte = fs_info->data_reloc_bg;
spin_unlock(&fs_info->relocation_bg_lock);
+ } else if (ffe_ctl->flags & BTRFS_BLOCK_GROUP_DATA) {
+ struct btrfs_block_group *block_group;
+
+ spin_lock(&fs_info->zone_active_bgs_lock);
+ list_for_each_entry(block_group, &fs_info->zone_active_bgs, active_bg_list) {
+ /*
+ * No lock is OK here because avail is monotinically
+ * decreasing, and this is just a hint.
+ */
+ u64 avail = block_group->zone_capacity - block_group->alloc_offset;
+
+ if (block_group_bits(block_group, ffe_ctl->flags) &&
+ avail >= ffe_ctl->num_bytes) {
+ ffe_ctl->hint_byte = block_group->start;
+ break;
+ }
+ }
+ spin_unlock(&fs_info->zone_active_bgs_lock);
}
return 0;
The patch below does not apply to the 6.7-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.7.y
git checkout FETCH_HEAD
git cherry-pick -x 02444f2ac26eae6385a65fcd66915084d15dffba
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012702-algorithm-mongoose-277c@gregkh' --subject-prefix 'PATCH 6.7.y' HEAD^..
Possible dependencies:
02444f2ac26e ("btrfs: zoned: optimize hint byte for zoned allocator")
b271fee9a41c ("btrfs: zoned: factor out prepare_allocation_zoned()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 02444f2ac26eae6385a65fcd66915084d15dffba Mon Sep 17 00:00:00 2001
From: Naohiro Aota <naohiro.aota(a)wdc.com>
Date: Tue, 19 Dec 2023 01:02:29 +0900
Subject: [PATCH] btrfs: zoned: optimize hint byte for zoned allocator
Writing sequentially to a huge file on btrfs on a SMR HDD revealed a
decline of the performance (220 MiB/s to 30 MiB/s after 500 minutes).
The performance goes down because of increased latency of the extent
allocation, which is induced by a traversing of a lot of full block groups.
So, this patch optimizes the ffe_ctl->hint_byte by choosing a block group
with sufficient size from the active block group list, which does not
contain full block groups.
After applying the patch, the performance is maintained well.
Fixes: 2eda57089ea3 ("btrfs: zoned: implement sequential extent allocation")
CC: stable(a)vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota(a)wdc.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d260b970bec7..6d680031211a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4311,6 +4311,24 @@ static int prepare_allocation_zoned(struct btrfs_fs_info *fs_info,
if (fs_info->data_reloc_bg)
ffe_ctl->hint_byte = fs_info->data_reloc_bg;
spin_unlock(&fs_info->relocation_bg_lock);
+ } else if (ffe_ctl->flags & BTRFS_BLOCK_GROUP_DATA) {
+ struct btrfs_block_group *block_group;
+
+ spin_lock(&fs_info->zone_active_bgs_lock);
+ list_for_each_entry(block_group, &fs_info->zone_active_bgs, active_bg_list) {
+ /*
+ * No lock is OK here because avail is monotinically
+ * decreasing, and this is just a hint.
+ */
+ u64 avail = block_group->zone_capacity - block_group->alloc_offset;
+
+ if (block_group_bits(block_group, ffe_ctl->flags) &&
+ avail >= ffe_ctl->num_bytes) {
+ ffe_ctl->hint_byte = block_group->start;
+ break;
+ }
+ }
+ spin_unlock(&fs_info->zone_active_bgs_lock);
}
return 0;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 7081929ab2572920e94d70be3d332e5c9f97095a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012750-shore-gradually-d4bb@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
7081929ab257 ("btrfs: don't abort filesystem when attempting to snapshot deleted subvolume")
813febdbe6c9 ("btrfs: disable snapshot creation/deletion for extent tree v2")
4467af880929 ("btrfs: remove root argument from btrfs_unlink_inode()")
bd54f381a12a ("btrfs: do not pin logs too early during renames")
9a56fcd15a9c ("btrfs: make btrfs_update_inode take btrfs_inode")
76aea5379678 ("btrfs: make btrfs_inode_safe_disk_i_size_write take btrfs_inode")
2766ff61762c ("btrfs: update the number of bytes used by an inode atomically")
5893dfb98f25 ("btrfs: refactor btrfs_drop_extents() to make it easier to extend")
ac5887c8e013 ("btrfs: locking: remove all the blocking helpers")
a14b78ad06ab ("btrfs: introduce btrfs_inode_lock()/unlock()")
b8d8e1fd570a ("btrfs: introduce btrfs_write_check()")
c86537a42f86 ("btrfs: check FS error state bit early during write")
5e8b9ef30392 ("btrfs: move pos increment and pagecache extension to btrfs_buffered_write")
4e4cabece9f9 ("btrfs: split btrfs_direct_IO to read and write")
196d59ab9ccc ("btrfs: switch extent buffer tree lock to rw_semaphore")
0425e7badbdc ("btrfs: don't fallback to buffered read if we don't need to")
3c38c877fcb9 ("btrfs: sink inode argument in insert_ordered_extent_file_extent")
fc0d82e103c7 ("btrfs: sink total_data parameter in setup_items_for_insert")
3dc9dc8969dc ("btrfs: eliminate total_size parameter from setup_items_for_insert")
0cbb5bdfea26 ("btrfs: rename btrfs_insert_clone_extent() to a more generic name")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7081929ab2572920e94d70be3d332e5c9f97095a Mon Sep 17 00:00:00 2001
From: Omar Sandoval <osandov(a)fb.com>
Date: Thu, 4 Jan 2024 11:48:46 -0800
Subject: [PATCH] btrfs: don't abort filesystem when attempting to snapshot
deleted subvolume
If the source file descriptor to the snapshot ioctl refers to a deleted
subvolume, we get the following abort:
BTRFS: Transaction aborted (error -2)
WARNING: CPU: 0 PID: 833 at fs/btrfs/transaction.c:1875 create_pending_snapshot+0x1040/0x1190 [btrfs]
Modules linked in: pata_acpi btrfs ata_piix libata scsi_mod virtio_net blake2b_generic xor net_failover virtio_rng failover scsi_common rng_core raid6_pq libcrc32c
CPU: 0 PID: 833 Comm: t_snapshot_dele Not tainted 6.7.0-rc6 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
RIP: 0010:create_pending_snapshot+0x1040/0x1190 [btrfs]
RSP: 0018:ffffa09c01337af8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff9982053e7c78 RCX: 0000000000000027
RDX: ffff99827dc20848 RSI: 0000000000000001 RDI: ffff99827dc20840
RBP: ffffa09c01337c00 R08: 0000000000000000 R09: ffffa09c01337998
R10: 0000000000000003 R11: ffffffffb96da248 R12: fffffffffffffffe
R13: ffff99820535bb28 R14: ffff99820b7bd000 R15: ffff99820381ea80
FS: 00007fe20aadabc0(0000) GS:ffff99827dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000559a120b502f CR3: 00000000055b6000 CR4: 00000000000006f0
Call Trace:
<TASK>
? create_pending_snapshot+0x1040/0x1190 [btrfs]
? __warn+0x81/0x130
? create_pending_snapshot+0x1040/0x1190 [btrfs]
? report_bug+0x171/0x1a0
? handle_bug+0x3a/0x70
? exc_invalid_op+0x17/0x70
? asm_exc_invalid_op+0x1a/0x20
? create_pending_snapshot+0x1040/0x1190 [btrfs]
? create_pending_snapshot+0x1040/0x1190 [btrfs]
create_pending_snapshots+0x92/0xc0 [btrfs]
btrfs_commit_transaction+0x66b/0xf40 [btrfs]
btrfs_mksubvol+0x301/0x4d0 [btrfs]
btrfs_mksnapshot+0x80/0xb0 [btrfs]
__btrfs_ioctl_snap_create+0x1c2/0x1d0 [btrfs]
btrfs_ioctl_snap_create_v2+0xc4/0x150 [btrfs]
btrfs_ioctl+0x8a6/0x2650 [btrfs]
? kmem_cache_free+0x22/0x340
? do_sys_openat2+0x97/0xe0
__x64_sys_ioctl+0x97/0xd0
do_syscall_64+0x46/0xf0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7fe20abe83af
RSP: 002b:00007ffe6eff1360 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fe20abe83af
RDX: 00007ffe6eff23c0 RSI: 0000000050009417 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000000 R09: 00007fe20ad16cd0
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffe6eff13c0 R14: 00007fe20ad45000 R15: 0000559a120b6d58
</TASK>
---[ end trace 0000000000000000 ]---
BTRFS: error (device vdc: state A) in create_pending_snapshot:1875: errno=-2 No such entry
BTRFS info (device vdc: state EA): forced readonly
BTRFS warning (device vdc: state EA): Skipping commit of aborted transaction.
BTRFS: error (device vdc: state EA) in cleanup_transaction:2055: errno=-2 No such entry
This happens because create_pending_snapshot() initializes the new root
item as a copy of the source root item. This includes the refs field,
which is 0 for a deleted subvolume. The call to btrfs_insert_root()
therefore inserts a root with refs == 0. btrfs_get_new_fs_root() then
finds the root and returns -ENOENT if refs == 0, which causes
create_pending_snapshot() to abort.
Fix it by checking the source root's refs before attempting the
snapshot, but after locking subvol_sem to avoid racing with deletion.
CC: stable(a)vger.kernel.org # 4.14+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel(a)dorminy.me>
Reviewed-by: Anand Jain <anand.jain(a)oracle.com>
Signed-off-by: Omar Sandoval <osandov(a)fb.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4e50b62db2a8..fea5d37528b8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -790,6 +790,9 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
return -EOPNOTSUPP;
}
+ if (btrfs_root_refs(&root->root_item) == 0)
+ return -ENOENT;
+
if (!test_bit(BTRFS_ROOT_SHAREABLE, &root->state))
return -EINVAL;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 7081929ab2572920e94d70be3d332e5c9f97095a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012747-marshland-overcoat-1e01@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
7081929ab257 ("btrfs: don't abort filesystem when attempting to snapshot deleted subvolume")
813febdbe6c9 ("btrfs: disable snapshot creation/deletion for extent tree v2")
4467af880929 ("btrfs: remove root argument from btrfs_unlink_inode()")
bd54f381a12a ("btrfs: do not pin logs too early during renames")
9a56fcd15a9c ("btrfs: make btrfs_update_inode take btrfs_inode")
76aea5379678 ("btrfs: make btrfs_inode_safe_disk_i_size_write take btrfs_inode")
2766ff61762c ("btrfs: update the number of bytes used by an inode atomically")
5893dfb98f25 ("btrfs: refactor btrfs_drop_extents() to make it easier to extend")
ac5887c8e013 ("btrfs: locking: remove all the blocking helpers")
a14b78ad06ab ("btrfs: introduce btrfs_inode_lock()/unlock()")
b8d8e1fd570a ("btrfs: introduce btrfs_write_check()")
c86537a42f86 ("btrfs: check FS error state bit early during write")
5e8b9ef30392 ("btrfs: move pos increment and pagecache extension to btrfs_buffered_write")
4e4cabece9f9 ("btrfs: split btrfs_direct_IO to read and write")
196d59ab9ccc ("btrfs: switch extent buffer tree lock to rw_semaphore")
0425e7badbdc ("btrfs: don't fallback to buffered read if we don't need to")
3c38c877fcb9 ("btrfs: sink inode argument in insert_ordered_extent_file_extent")
fc0d82e103c7 ("btrfs: sink total_data parameter in setup_items_for_insert")
3dc9dc8969dc ("btrfs: eliminate total_size parameter from setup_items_for_insert")
0cbb5bdfea26 ("btrfs: rename btrfs_insert_clone_extent() to a more generic name")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7081929ab2572920e94d70be3d332e5c9f97095a Mon Sep 17 00:00:00 2001
From: Omar Sandoval <osandov(a)fb.com>
Date: Thu, 4 Jan 2024 11:48:46 -0800
Subject: [PATCH] btrfs: don't abort filesystem when attempting to snapshot
deleted subvolume
If the source file descriptor to the snapshot ioctl refers to a deleted
subvolume, we get the following abort:
BTRFS: Transaction aborted (error -2)
WARNING: CPU: 0 PID: 833 at fs/btrfs/transaction.c:1875 create_pending_snapshot+0x1040/0x1190 [btrfs]
Modules linked in: pata_acpi btrfs ata_piix libata scsi_mod virtio_net blake2b_generic xor net_failover virtio_rng failover scsi_common rng_core raid6_pq libcrc32c
CPU: 0 PID: 833 Comm: t_snapshot_dele Not tainted 6.7.0-rc6 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
RIP: 0010:create_pending_snapshot+0x1040/0x1190 [btrfs]
RSP: 0018:ffffa09c01337af8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff9982053e7c78 RCX: 0000000000000027
RDX: ffff99827dc20848 RSI: 0000000000000001 RDI: ffff99827dc20840
RBP: ffffa09c01337c00 R08: 0000000000000000 R09: ffffa09c01337998
R10: 0000000000000003 R11: ffffffffb96da248 R12: fffffffffffffffe
R13: ffff99820535bb28 R14: ffff99820b7bd000 R15: ffff99820381ea80
FS: 00007fe20aadabc0(0000) GS:ffff99827dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000559a120b502f CR3: 00000000055b6000 CR4: 00000000000006f0
Call Trace:
<TASK>
? create_pending_snapshot+0x1040/0x1190 [btrfs]
? __warn+0x81/0x130
? create_pending_snapshot+0x1040/0x1190 [btrfs]
? report_bug+0x171/0x1a0
? handle_bug+0x3a/0x70
? exc_invalid_op+0x17/0x70
? asm_exc_invalid_op+0x1a/0x20
? create_pending_snapshot+0x1040/0x1190 [btrfs]
? create_pending_snapshot+0x1040/0x1190 [btrfs]
create_pending_snapshots+0x92/0xc0 [btrfs]
btrfs_commit_transaction+0x66b/0xf40 [btrfs]
btrfs_mksubvol+0x301/0x4d0 [btrfs]
btrfs_mksnapshot+0x80/0xb0 [btrfs]
__btrfs_ioctl_snap_create+0x1c2/0x1d0 [btrfs]
btrfs_ioctl_snap_create_v2+0xc4/0x150 [btrfs]
btrfs_ioctl+0x8a6/0x2650 [btrfs]
? kmem_cache_free+0x22/0x340
? do_sys_openat2+0x97/0xe0
__x64_sys_ioctl+0x97/0xd0
do_syscall_64+0x46/0xf0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7fe20abe83af
RSP: 002b:00007ffe6eff1360 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fe20abe83af
RDX: 00007ffe6eff23c0 RSI: 0000000050009417 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000000 R09: 00007fe20ad16cd0
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffe6eff13c0 R14: 00007fe20ad45000 R15: 0000559a120b6d58
</TASK>
---[ end trace 0000000000000000 ]---
BTRFS: error (device vdc: state A) in create_pending_snapshot:1875: errno=-2 No such entry
BTRFS info (device vdc: state EA): forced readonly
BTRFS warning (device vdc: state EA): Skipping commit of aborted transaction.
BTRFS: error (device vdc: state EA) in cleanup_transaction:2055: errno=-2 No such entry
This happens because create_pending_snapshot() initializes the new root
item as a copy of the source root item. This includes the refs field,
which is 0 for a deleted subvolume. The call to btrfs_insert_root()
therefore inserts a root with refs == 0. btrfs_get_new_fs_root() then
finds the root and returns -ENOENT if refs == 0, which causes
create_pending_snapshot() to abort.
Fix it by checking the source root's refs before attempting the
snapshot, but after locking subvol_sem to avoid racing with deletion.
CC: stable(a)vger.kernel.org # 4.14+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel(a)dorminy.me>
Reviewed-by: Anand Jain <anand.jain(a)oracle.com>
Signed-off-by: Omar Sandoval <osandov(a)fb.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4e50b62db2a8..fea5d37528b8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -790,6 +790,9 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
return -EOPNOTSUPP;
}
+ if (btrfs_root_refs(&root->root_item) == 0)
+ return -ENOENT;
+
if (!test_bit(BTRFS_ROOT_SHAREABLE, &root->state))
return -EINVAL;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 3324d0547861b16cf436d54abba7052e0c8aa9de
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012738-abdomen-crimson-3018@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
3324d0547861 ("btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted")
60021bd754c6 ("btrfs: prevent subvol with swapfile from being deleted")
dd0734f2a866 ("btrfs: fix race between swap file activation and snapshot creation")
ee0d904fd9c5 ("btrfs: remove err variable from btrfs_delete_subvolume")
c3e1f96c37d0 ("btrfs: enumerate the type of exclusive operation in progress")
e85fde5162bf ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations")
adca4d945c8d ("btrfs: qgroup: remove ASYNC_COMMIT mechanism in favor of reserve retry-after-EDQUOT")
c11fbb6ed0dd ("btrfs: reduce lock contention when creating snapshot")
63f018be577f ("btrfs: Remove __ prefix from btrfs_block_rsv_release")
dcc3eb9638c3 ("btrfs: convert snapshot/nocow exlcusion to drew lock")
0024652895e3 ("btrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root")
bd647ce385ec ("btrfs: add a leak check for roots")
8260edba67a2 ("btrfs: make the init of static elements in fs_info separate")
ae18c37ad5a1 ("btrfs: move fs_info init work into it's own helper function")
141386e1a5d6 ("btrfs: free more things in btrfs_free_fs_info")
bc44d7c4b2b1 ("btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root")
81f096edf047 ("btrfs: use btrfs_put_fs_root to free roots always")
0d4b0463011d ("btrfs: export and rename free_fs_info")
fbb0ce40d606 ("btrfs: hold a ref on the root in btrfs_check_uuid_tree_entry")
ca2037fba6af ("btrfs: hold a ref on the root in btrfs_recover_log_trees")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3324d0547861b16cf436d54abba7052e0c8aa9de Mon Sep 17 00:00:00 2001
From: Omar Sandoval <osandov(a)fb.com>
Date: Thu, 4 Jan 2024 11:48:47 -0800
Subject: [PATCH] btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot
of subvolume being deleted
Sweet Tea spotted a race between subvolume deletion and snapshotting
that can result in the root item for the snapshot having the
BTRFS_ROOT_SUBVOL_DEAD flag set. The race is:
Thread 1 | Thread 2
----------------------------------------------|----------
btrfs_delete_subvolume |
btrfs_set_root_flags(BTRFS_ROOT_SUBVOL_DEAD)|
|btrfs_mksubvol
| down_read(subvol_sem)
| create_snapshot
| ...
| create_pending_snapshot
| copy root item from source
down_write(subvol_sem) |
This flag is only checked in send and swap activate, which this would
cause to fail mysteriously.
create_snapshot() now checks the root refs to reject a deleted
subvolume, so we can fix this by locking subvol_sem earlier so that the
BTRFS_ROOT_SUBVOL_DEAD flag and the root refs are updated atomically.
CC: stable(a)vger.kernel.org # 4.14+
Reported-by: Sweet Tea Dorminy <sweettea-kernel(a)dorminy.me>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel(a)dorminy.me>
Reviewed-by: Anand Jain <anand.jain(a)oracle.com>
Signed-off-by: Omar Sandoval <osandov(a)fb.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b3e39610cc95..7bcc1c03437a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4458,6 +4458,8 @@ int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry)
u64 root_flags;
int ret;
+ down_write(&fs_info->subvol_sem);
+
/*
* Don't allow to delete a subvolume with send in progress. This is
* inside the inode lock so the error handling that has to drop the bit
@@ -4469,25 +4471,25 @@ int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry)
btrfs_warn(fs_info,
"attempt to delete subvolume %llu during send",
dest->root_key.objectid);
- return -EPERM;
+ ret = -EPERM;
+ goto out_up_write;
}
if (atomic_read(&dest->nr_swapfiles)) {
spin_unlock(&dest->root_item_lock);
btrfs_warn(fs_info,
"attempt to delete subvolume %llu with active swapfile",
root->root_key.objectid);
- return -EPERM;
+ ret = -EPERM;
+ goto out_up_write;
}
root_flags = btrfs_root_flags(&dest->root_item);
btrfs_set_root_flags(&dest->root_item,
root_flags | BTRFS_ROOT_SUBVOL_DEAD);
spin_unlock(&dest->root_item_lock);
- down_write(&fs_info->subvol_sem);
-
ret = may_destroy_subvol(dest);
if (ret)
- goto out_up_write;
+ goto out_undead;
btrfs_init_block_rsv(&block_rsv, BTRFS_BLOCK_RSV_TEMP);
/*
@@ -4497,7 +4499,7 @@ int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry)
*/
ret = btrfs_subvolume_reserve_metadata(root, &block_rsv, 5, true);
if (ret)
- goto out_up_write;
+ goto out_undead;
trans = btrfs_start_transaction(root, 0);
if (IS_ERR(trans)) {
@@ -4563,15 +4565,17 @@ int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry)
inode->i_flags |= S_DEAD;
out_release:
btrfs_subvolume_release_metadata(root, &block_rsv);
-out_up_write:
- up_write(&fs_info->subvol_sem);
+out_undead:
if (ret) {
spin_lock(&dest->root_item_lock);
root_flags = btrfs_root_flags(&dest->root_item);
btrfs_set_root_flags(&dest->root_item,
root_flags & ~BTRFS_ROOT_SUBVOL_DEAD);
spin_unlock(&dest->root_item_lock);
- } else {
+ }
+out_up_write:
+ up_write(&fs_info->subvol_sem);
+ if (!ret) {
d_invalidate(dentry);
btrfs_prune_dentries(dest);
ASSERT(dest->send_in_progress == 0);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 3324d0547861b16cf436d54abba7052e0c8aa9de
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012736-narrow-thievish-cc02@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
3324d0547861 ("btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted")
60021bd754c6 ("btrfs: prevent subvol with swapfile from being deleted")
dd0734f2a866 ("btrfs: fix race between swap file activation and snapshot creation")
ee0d904fd9c5 ("btrfs: remove err variable from btrfs_delete_subvolume")
c3e1f96c37d0 ("btrfs: enumerate the type of exclusive operation in progress")
e85fde5162bf ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations")
adca4d945c8d ("btrfs: qgroup: remove ASYNC_COMMIT mechanism in favor of reserve retry-after-EDQUOT")
c11fbb6ed0dd ("btrfs: reduce lock contention when creating snapshot")
63f018be577f ("btrfs: Remove __ prefix from btrfs_block_rsv_release")
dcc3eb9638c3 ("btrfs: convert snapshot/nocow exlcusion to drew lock")
0024652895e3 ("btrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root")
bd647ce385ec ("btrfs: add a leak check for roots")
8260edba67a2 ("btrfs: make the init of static elements in fs_info separate")
ae18c37ad5a1 ("btrfs: move fs_info init work into it's own helper function")
141386e1a5d6 ("btrfs: free more things in btrfs_free_fs_info")
bc44d7c4b2b1 ("btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root")
81f096edf047 ("btrfs: use btrfs_put_fs_root to free roots always")
0d4b0463011d ("btrfs: export and rename free_fs_info")
fbb0ce40d606 ("btrfs: hold a ref on the root in btrfs_check_uuid_tree_entry")
ca2037fba6af ("btrfs: hold a ref on the root in btrfs_recover_log_trees")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3324d0547861b16cf436d54abba7052e0c8aa9de Mon Sep 17 00:00:00 2001
From: Omar Sandoval <osandov(a)fb.com>
Date: Thu, 4 Jan 2024 11:48:47 -0800
Subject: [PATCH] btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot
of subvolume being deleted
Sweet Tea spotted a race between subvolume deletion and snapshotting
that can result in the root item for the snapshot having the
BTRFS_ROOT_SUBVOL_DEAD flag set. The race is:
Thread 1 | Thread 2
----------------------------------------------|----------
btrfs_delete_subvolume |
btrfs_set_root_flags(BTRFS_ROOT_SUBVOL_DEAD)|
|btrfs_mksubvol
| down_read(subvol_sem)
| create_snapshot
| ...
| create_pending_snapshot
| copy root item from source
down_write(subvol_sem) |
This flag is only checked in send and swap activate, which this would
cause to fail mysteriously.
create_snapshot() now checks the root refs to reject a deleted
subvolume, so we can fix this by locking subvol_sem earlier so that the
BTRFS_ROOT_SUBVOL_DEAD flag and the root refs are updated atomically.
CC: stable(a)vger.kernel.org # 4.14+
Reported-by: Sweet Tea Dorminy <sweettea-kernel(a)dorminy.me>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel(a)dorminy.me>
Reviewed-by: Anand Jain <anand.jain(a)oracle.com>
Signed-off-by: Omar Sandoval <osandov(a)fb.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b3e39610cc95..7bcc1c03437a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4458,6 +4458,8 @@ int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry)
u64 root_flags;
int ret;
+ down_write(&fs_info->subvol_sem);
+
/*
* Don't allow to delete a subvolume with send in progress. This is
* inside the inode lock so the error handling that has to drop the bit
@@ -4469,25 +4471,25 @@ int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry)
btrfs_warn(fs_info,
"attempt to delete subvolume %llu during send",
dest->root_key.objectid);
- return -EPERM;
+ ret = -EPERM;
+ goto out_up_write;
}
if (atomic_read(&dest->nr_swapfiles)) {
spin_unlock(&dest->root_item_lock);
btrfs_warn(fs_info,
"attempt to delete subvolume %llu with active swapfile",
root->root_key.objectid);
- return -EPERM;
+ ret = -EPERM;
+ goto out_up_write;
}
root_flags = btrfs_root_flags(&dest->root_item);
btrfs_set_root_flags(&dest->root_item,
root_flags | BTRFS_ROOT_SUBVOL_DEAD);
spin_unlock(&dest->root_item_lock);
- down_write(&fs_info->subvol_sem);
-
ret = may_destroy_subvol(dest);
if (ret)
- goto out_up_write;
+ goto out_undead;
btrfs_init_block_rsv(&block_rsv, BTRFS_BLOCK_RSV_TEMP);
/*
@@ -4497,7 +4499,7 @@ int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry)
*/
ret = btrfs_subvolume_reserve_metadata(root, &block_rsv, 5, true);
if (ret)
- goto out_up_write;
+ goto out_undead;
trans = btrfs_start_transaction(root, 0);
if (IS_ERR(trans)) {
@@ -4563,15 +4565,17 @@ int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry)
inode->i_flags |= S_DEAD;
out_release:
btrfs_subvolume_release_metadata(root, &block_rsv);
-out_up_write:
- up_write(&fs_info->subvol_sem);
+out_undead:
if (ret) {
spin_lock(&dest->root_item_lock);
root_flags = btrfs_root_flags(&dest->root_item);
btrfs_set_root_flags(&dest->root_item,
root_flags & ~BTRFS_ROOT_SUBVOL_DEAD);
spin_unlock(&dest->root_item_lock);
- } else {
+ }
+out_up_write:
+ up_write(&fs_info->subvol_sem);
+ if (!ret) {
d_invalidate(dentry);
btrfs_prune_dentries(dest);
ASSERT(dest->send_in_progress == 0);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 3324d0547861b16cf436d54abba7052e0c8aa9de
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012735-gentile-overture-2afa@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
3324d0547861 ("btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted")
60021bd754c6 ("btrfs: prevent subvol with swapfile from being deleted")
dd0734f2a866 ("btrfs: fix race between swap file activation and snapshot creation")
ee0d904fd9c5 ("btrfs: remove err variable from btrfs_delete_subvolume")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3324d0547861b16cf436d54abba7052e0c8aa9de Mon Sep 17 00:00:00 2001
From: Omar Sandoval <osandov(a)fb.com>
Date: Thu, 4 Jan 2024 11:48:47 -0800
Subject: [PATCH] btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot
of subvolume being deleted
Sweet Tea spotted a race between subvolume deletion and snapshotting
that can result in the root item for the snapshot having the
BTRFS_ROOT_SUBVOL_DEAD flag set. The race is:
Thread 1 | Thread 2
----------------------------------------------|----------
btrfs_delete_subvolume |
btrfs_set_root_flags(BTRFS_ROOT_SUBVOL_DEAD)|
|btrfs_mksubvol
| down_read(subvol_sem)
| create_snapshot
| ...
| create_pending_snapshot
| copy root item from source
down_write(subvol_sem) |
This flag is only checked in send and swap activate, which this would
cause to fail mysteriously.
create_snapshot() now checks the root refs to reject a deleted
subvolume, so we can fix this by locking subvol_sem earlier so that the
BTRFS_ROOT_SUBVOL_DEAD flag and the root refs are updated atomically.
CC: stable(a)vger.kernel.org # 4.14+
Reported-by: Sweet Tea Dorminy <sweettea-kernel(a)dorminy.me>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel(a)dorminy.me>
Reviewed-by: Anand Jain <anand.jain(a)oracle.com>
Signed-off-by: Omar Sandoval <osandov(a)fb.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b3e39610cc95..7bcc1c03437a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4458,6 +4458,8 @@ int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry)
u64 root_flags;
int ret;
+ down_write(&fs_info->subvol_sem);
+
/*
* Don't allow to delete a subvolume with send in progress. This is
* inside the inode lock so the error handling that has to drop the bit
@@ -4469,25 +4471,25 @@ int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry)
btrfs_warn(fs_info,
"attempt to delete subvolume %llu during send",
dest->root_key.objectid);
- return -EPERM;
+ ret = -EPERM;
+ goto out_up_write;
}
if (atomic_read(&dest->nr_swapfiles)) {
spin_unlock(&dest->root_item_lock);
btrfs_warn(fs_info,
"attempt to delete subvolume %llu with active swapfile",
root->root_key.objectid);
- return -EPERM;
+ ret = -EPERM;
+ goto out_up_write;
}
root_flags = btrfs_root_flags(&dest->root_item);
btrfs_set_root_flags(&dest->root_item,
root_flags | BTRFS_ROOT_SUBVOL_DEAD);
spin_unlock(&dest->root_item_lock);
- down_write(&fs_info->subvol_sem);
-
ret = may_destroy_subvol(dest);
if (ret)
- goto out_up_write;
+ goto out_undead;
btrfs_init_block_rsv(&block_rsv, BTRFS_BLOCK_RSV_TEMP);
/*
@@ -4497,7 +4499,7 @@ int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry)
*/
ret = btrfs_subvolume_reserve_metadata(root, &block_rsv, 5, true);
if (ret)
- goto out_up_write;
+ goto out_undead;
trans = btrfs_start_transaction(root, 0);
if (IS_ERR(trans)) {
@@ -4563,15 +4565,17 @@ int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry)
inode->i_flags |= S_DEAD;
out_release:
btrfs_subvolume_release_metadata(root, &block_rsv);
-out_up_write:
- up_write(&fs_info->subvol_sem);
+out_undead:
if (ret) {
spin_lock(&dest->root_item_lock);
root_flags = btrfs_root_flags(&dest->root_item);
btrfs_set_root_flags(&dest->root_item,
root_flags & ~BTRFS_ROOT_SUBVOL_DEAD);
spin_unlock(&dest->root_item_lock);
- } else {
+ }
+out_up_write:
+ up_write(&fs_info->subvol_sem);
+ if (!ret) {
d_invalidate(dentry);
btrfs_prune_dentries(dest);
ASSERT(dest->send_in_progress == 0);