The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 232796df8c1437c41d308d161007f0715bac0a54 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Wed, 27 Oct 2021 18:30:25 +0100
Subject: [PATCH] btrfs: fix deadlock between quota enable and other quota
operations
When enabling quotas, we attempt to commit a transaction while holding the
mutex fs_info->qgroup_ioctl_lock. This can result on a deadlock with other
quota operations such as:
- qgroup creation and deletion, ioctl BTRFS_IOC_QGROUP_CREATE;
- adding and removing qgroup relations, ioctl BTRFS_IOC_QGROUP_ASSIGN.
This is because these operations join a transaction and after that they
attempt to lock the mutex fs_info->qgroup_ioctl_lock. Acquiring that mutex
after joining or starting a transaction is a pattern followed everywhere
in qgroups, so the quota enablement operation is the one at fault here,
and should not commit a transaction while holding that mutex.
Fix this by making the transaction commit while not holding the mutex.
We are safe from two concurrent tasks trying to enable quotas because
we are serialized by the rw semaphore fs_info->subvol_sem at
btrfs_ioctl_quota_ctl(), which is the only call site for enabling
quotas.
When this deadlock happens, it produces a trace like the following:
INFO: task syz-executor:25604 blocked for more than 143 seconds.
Not tainted 5.15.0-rc6 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:24800 pid:25604 ppid: 24873 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
btrfs_commit_transaction+0x994/0x2e90 fs/btrfs/transaction.c:2201
btrfs_quota_enable+0x95c/0x1790 fs/btrfs/qgroup.c:1120
btrfs_ioctl_quota_ctl fs/btrfs/ioctl.c:4229 [inline]
btrfs_ioctl+0x637e/0x7b70 fs/btrfs/ioctl.c:5010
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f86920b2c4d
RSP: 002b:00007f868f61ac58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f86921d90a0 RCX: 00007f86920b2c4d
RDX: 0000000020005e40 RSI: 00000000c0109428 RDI: 0000000000000008
RBP: 00007f869212bd80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f86921d90a0
R13: 00007fff6d233e4f R14: 00007fff6d233ff0 R15: 00007f868f61adc0
INFO: task syz-executor:25628 blocked for more than 143 seconds.
Not tainted 5.15.0-rc6 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:29080 pid:25628 ppid: 24873 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_remove_qgroup+0xb7/0x7d0 fs/btrfs/qgroup.c:1548
btrfs_ioctl_qgroup_create fs/btrfs/ioctl.c:4333 [inline]
btrfs_ioctl+0x683c/0x7b70 fs/btrfs/ioctl.c:5014
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Reported-by: Hao Sun <sunhao.th(a)gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsZQF19bQ1C6=yetF3BvL10OSORpFUcWXT…
Fixes: 340f1aa27f36 ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable")
CC: stable(a)vger.kernel.org # 4.19
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 6c037f1252b7..071f7334f818 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -940,6 +940,14 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
int ret = 0;
int slot;
+ /*
+ * We need to have subvol_sem write locked, to prevent races between
+ * concurrent tasks trying to enable quotas, because we will unlock
+ * and relock qgroup_ioctl_lock before setting fs_info->quota_root
+ * and before setting BTRFS_FS_QUOTA_ENABLED.
+ */
+ lockdep_assert_held_write(&fs_info->subvol_sem);
+
mutex_lock(&fs_info->qgroup_ioctl_lock);
if (fs_info->quota_root)
goto out;
@@ -1117,8 +1125,19 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
goto out_free_path;
}
+ mutex_unlock(&fs_info->qgroup_ioctl_lock);
+ /*
+ * Commit the transaction while not holding qgroup_ioctl_lock, to avoid
+ * a deadlock with tasks concurrently doing other qgroup operations, such
+ * adding/removing qgroups or adding/deleting qgroup relations for example,
+ * because all qgroup operations first start or join a transaction and then
+ * lock the qgroup_ioctl_lock mutex.
+ * We are safe from a concurrent task trying to enable quotas, by calling
+ * this function, since we are serialized by fs_info->subvol_sem.
+ */
ret = btrfs_commit_transaction(trans);
trans = NULL;
+ mutex_lock(&fs_info->qgroup_ioctl_lock);
if (ret)
goto out_free_path;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 16beac87e95e2fb278b552397c8260637f8a63f7 Mon Sep 17 00:00:00 2001
From: Naohiro Aota <naohiro.aota(a)wdc.com>
Date: Thu, 11 Nov 2021 14:14:38 +0900
Subject: [PATCH] btrfs: zoned: cache reported zone during mount
When mounting a device, we are reporting the zones twice: once for
checking the zone attributes in btrfs_get_dev_zone_info and once for
loading block groups' zone info in
btrfs_load_block_group_zone_info(). With a lot of block groups, that
leads to a lot of REPORT ZONE commands and slows down the mount
process.
This patch introduces a zone info cache in struct
btrfs_zoned_device_info. The cache is populated while in
btrfs_get_dev_zone_info() and used for
btrfs_load_block_group_zone_info() to reduce the number of REPORT ZONE
commands. The zone cache is then released after loading the block
groups, as it will not be much effective during the run time.
Benchmark: Mount an HDD with 57,007 block groups
Before patch: 171.368 seconds
After patch: 64.064 seconds
While it still takes a minute due to the slowness of loading all the
block groups, the patch reduces the mount time by 1/3.
Link: https://lore.kernel.org/linux-btrfs/CAHQ7scUiLtcTqZOMMY5kbWUBOhGRwKo6J6wYPT…
Fixes: 5b316468983d ("btrfs: get zone information of zoned block devices")
CC: stable(a)vger.kernel.org
Signed-off-by: Naohiro Aota <naohiro.aota(a)wdc.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 82769f1c17ee..66fa61cb3f23 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -322,7 +322,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
device->fs_devices = fs_info->fs_devices;
- ret = btrfs_get_dev_zone_info(device);
+ ret = btrfs_get_dev_zone_info(device, false);
if (ret)
goto error;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6408948b3e2c..67533b13e1eb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3570,6 +3570,8 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
goto fail_sysfs;
}
+ btrfs_free_zone_cache(fs_info);
+
if (!sb_rdonly(sb) && fs_info->fs_devices->missing_devices &&
!btrfs_check_rw_degradable(fs_info, NULL)) {
btrfs_warn(fs_info,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 53753e04af14..cafd490da072 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2669,7 +2669,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
device->fs_info = fs_info;
device->bdev = bdev;
- ret = btrfs_get_dev_zone_info(device);
+ ret = btrfs_get_dev_zone_info(device, false);
if (ret)
goto error_free_device;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 678a29469511..b06059a5db2a 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -5,6 +5,7 @@
#include <linux/blkdev.h>
#include <linux/sched/mm.h>
#include <linux/atomic.h>
+#include <linux/vmalloc.h>
#include "ctree.h"
#include "volumes.h"
#include "zoned.h"
@@ -213,6 +214,8 @@ static int emulate_report_zones(struct btrfs_device *device, u64 pos,
static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
struct blk_zone *zones, unsigned int *nr_zones)
{
+ struct btrfs_zoned_device_info *zinfo = device->zone_info;
+ u32 zno;
int ret;
if (!*nr_zones)
@@ -224,6 +227,34 @@ static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
return 0;
}
+ /* Check cache */
+ if (zinfo->zone_cache) {
+ unsigned int i;
+
+ ASSERT(IS_ALIGNED(pos, zinfo->zone_size));
+ zno = pos >> zinfo->zone_size_shift;
+ /*
+ * We cannot report zones beyond the zone end. So, it is OK to
+ * cap *nr_zones to at the end.
+ */
+ *nr_zones = min_t(u32, *nr_zones, zinfo->nr_zones - zno);
+
+ for (i = 0; i < *nr_zones; i++) {
+ struct blk_zone *zone_info;
+
+ zone_info = &zinfo->zone_cache[zno + i];
+ if (!zone_info->len)
+ break;
+ }
+
+ if (i == *nr_zones) {
+ /* Cache hit on all the zones */
+ memcpy(zones, zinfo->zone_cache + zno,
+ sizeof(*zinfo->zone_cache) * *nr_zones);
+ return 0;
+ }
+ }
+
ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT, *nr_zones,
copy_zone_info_cb, zones);
if (ret < 0) {
@@ -237,6 +268,11 @@ static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
if (!ret)
return -EIO;
+ /* Populate cache */
+ if (zinfo->zone_cache)
+ memcpy(zinfo->zone_cache + zno, zones,
+ sizeof(*zinfo->zone_cache) * *nr_zones);
+
return 0;
}
@@ -300,7 +336,7 @@ int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info)
if (!device->bdev)
continue;
- ret = btrfs_get_dev_zone_info(device);
+ ret = btrfs_get_dev_zone_info(device, true);
if (ret)
break;
}
@@ -309,7 +345,7 @@ int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info)
return ret;
}
-int btrfs_get_dev_zone_info(struct btrfs_device *device)
+int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
{
struct btrfs_fs_info *fs_info = device->fs_info;
struct btrfs_zoned_device_info *zone_info = NULL;
@@ -339,6 +375,8 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
if (!zone_info)
return -ENOMEM;
+ device->zone_info = zone_info;
+
if (!bdev_is_zoned(bdev)) {
if (!fs_info->zone_size) {
ret = calculate_emulated_zone_size(fs_info);
@@ -407,6 +445,23 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
goto out;
}
+ /*
+ * Enable zone cache only for a zoned device. On a non-zoned device, we
+ * fill the zone info with emulated CONVENTIONAL zones, so no need to
+ * use the cache.
+ */
+ if (populate_cache && bdev_is_zoned(device->bdev)) {
+ zone_info->zone_cache = vzalloc(sizeof(struct blk_zone) *
+ zone_info->nr_zones);
+ if (!zone_info->zone_cache) {
+ btrfs_err_in_rcu(device->fs_info,
+ "zoned: failed to allocate zone cache for %s",
+ rcu_str_deref(device->name));
+ ret = -ENOMEM;
+ goto out;
+ }
+ }
+
/* Get zones type */
nactive = 0;
while (sector < nr_sectors) {
@@ -505,8 +560,6 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
kfree(zones);
- device->zone_info = zone_info;
-
switch (bdev_zoned_model(bdev)) {
case BLK_ZONED_HM:
model = "host-managed zoned";
@@ -539,11 +592,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
out:
kfree(zones);
out_free_zone_info:
- bitmap_free(zone_info->active_zones);
- bitmap_free(zone_info->empty_zones);
- bitmap_free(zone_info->seq_zones);
- kfree(zone_info);
- device->zone_info = NULL;
+ btrfs_destroy_dev_zone_info(device);
return ret;
}
@@ -558,6 +607,7 @@ void btrfs_destroy_dev_zone_info(struct btrfs_device *device)
bitmap_free(zone_info->active_zones);
bitmap_free(zone_info->seq_zones);
bitmap_free(zone_info->empty_zones);
+ vfree(zone_info->zone_cache);
kfree(zone_info);
device->zone_info = NULL;
}
@@ -1975,3 +2025,21 @@ void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg)
fs_info->data_reloc_bg = 0;
spin_unlock(&fs_info->relocation_bg_lock);
}
+
+void btrfs_free_zone_cache(struct btrfs_fs_info *fs_info)
+{
+ struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+ struct btrfs_device *device;
+
+ if (!btrfs_is_zoned(fs_info))
+ return;
+
+ mutex_lock(&fs_devices->device_list_mutex);
+ list_for_each_entry(device, &fs_devices->devices, dev_list) {
+ if (device->zone_info) {
+ vfree(device->zone_info->zone_cache);
+ device->zone_info->zone_cache = NULL;
+ }
+ }
+ mutex_unlock(&fs_devices->device_list_mutex);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index e53ab7b96437..4344f4818389 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -28,6 +28,7 @@ struct btrfs_zoned_device_info {
unsigned long *seq_zones;
unsigned long *empty_zones;
unsigned long *active_zones;
+ struct blk_zone *zone_cache;
struct blk_zone sb_zones[2 * BTRFS_SUPER_MIRROR_MAX];
};
@@ -35,7 +36,7 @@ struct btrfs_zoned_device_info {
int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
struct blk_zone *zone);
int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info);
-int btrfs_get_dev_zone_info(struct btrfs_device *device);
+int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache);
void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info);
int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info);
@@ -76,6 +77,7 @@ bool btrfs_can_activate_zone(struct btrfs_fs_devices *fs_devices,
void btrfs_zone_finish_endio(struct btrfs_fs_info *fs_info, u64 logical,
u64 length);
void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg);
+void btrfs_free_zone_cache(struct btrfs_fs_info *fs_info);
#else /* CONFIG_BLK_DEV_ZONED */
static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
struct blk_zone *zone)
@@ -88,7 +90,8 @@ static inline int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_i
return 0;
}
-static inline int btrfs_get_dev_zone_info(struct btrfs_device *device)
+static inline int btrfs_get_dev_zone_info(struct btrfs_device *device,
+ bool populate_cache)
{
return 0;
}
@@ -232,6 +235,7 @@ static inline void btrfs_zone_finish_endio(struct btrfs_fs_info *fs_info,
static inline void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg) { }
+static inline void btrfs_free_zone_cache(struct btrfs_fs_info *fs_info) { }
#endif
static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
Due to commit 8cfb9015280d ("NFS: Always provide aligned buffers to the
RPC read layers") on the client, a read of 0xfff is aligned up to server
rsize of 0x1000.
As a result, in a test where the server has a file of size
0x7fffffffffffffff, and the client tries to read from the offset
0x7ffffffffffff000, the read causes loff_t overflow in the server and it
returns an NFS code of EINVAL to the client. The client as a result
indefinitely retries the request.
This fixes the issue at server side by trimming reads past
NFS_OFFSET_MAX. It also adds a missing check for out of bound offset in
NFSv3, copying a similar check from NFSv4.x.
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Dan Aloni <dan.aloni(a)vastdata.com>
---
fs/nfsd/nfs4proc.c | 3 +++
fs/nfsd/vfs.c | 6 ++++++
2 files changed, 9 insertions(+)
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 486c5dba4b65..816bdf212559 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -785,6 +785,9 @@ nfsd4_read(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
if (read->rd_offset >= OFFSET_MAX)
return nfserr_inval;
+ if (unlikely(read->rd_offset + read->rd_length > OFFSET_MAX))
+ read->rd_length = OFFSET_MAX - read->rd_offset;
+
trace_nfsd_read_start(rqstp, &cstate->current_fh,
read->rd_offset, read->rd_length);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 738d564ca4ce..ad4df374433e 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1045,6 +1045,12 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct file *file;
__be32 err;
+ if (unlikely(offset >= NFS_OFFSET_MAX))
+ return nfserr_inval;
+
+ if (unlikely(offset + *count > NFS_OFFSET_MAX))
+ *count = NFS_OFFSET_MAX - offset;
+
trace_nfsd_read_start(rqstp, fhp, offset, *count);
err = nfsd_file_acquire(rqstp, fhp, NFSD_MAY_READ, &nf);
if (err)
--
2.23.0