The patch below does not apply to the 5.20-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.20.y
git checkout FETCH_HEAD
git cherry-pick -x 50d281fc434cb8e2497f5e70a309ccca6b1a09f0
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023040334-reproduce-granola-78a1@gregkh' --subject-prefix 'PATCH 5.20.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 50d281fc434cb8e2497f5e70a309ccca6b1a09f0 Mon Sep 17 00:00:00 2001
From: Anand Jain <anand.jain(a)oracle.com>
Date: Thu, 23 Mar 2023 15:56:48 +0800
Subject: [PATCH] btrfs: scan device in non-exclusive mode
This fixes mkfs/mount/check failures due to race with systemd-udevd
scan.
During the device scan initiated by systemd-udevd, other user space
EXCL operations such as mkfs, mount, or check may get blocked and result
in a "Device or resource busy" error. This is because the device
scan process opens the device with the EXCL flag in the kernel.
Two reports were received:
- btrfs/179 test case, where the fsck command failed with the -EBUSY
error
- LTP pwritev03 test case, where mkfs.vfs failed with
the -EBUSY error, when mkfs.vfs tried to overwrite old btrfs filesystem
on the device.
In both cases, fsck and mkfs (respectively) were racing with a
systemd-udevd device scan, and systemd-udevd won, resulting in the
-EBUSY error for fsck and mkfs.
Reproducing the problem has been difficult because there is a very
small window during which these userspace threads can race to
acquire the exclusive device open. Even on the system where the problem
was observed, the problem occurrences were anywhere between 10 to 400
iterations and chances of reproducing decreases with debug printk()s.
However, an exclusive device open is unnecessary for the scan process,
as there are no write operations on the device during scan. Furthermore,
during the mount process, the superblock is re-read in the below
function call chain:
btrfs_mount_root
btrfs_open_devices
open_fs_devices
btrfs_open_one_device
btrfs_get_bdev_and_sb
So, to fix this issue, removes the FMODE_EXCL flag from the scan
operation, and add a comment.
The case where mkfs may still write to the device and a scan is running,
the btrfs signature is not written at that time so scan will not
recognize such device.
Reported-by: Sherry Yang <sherry.yang(a)oracle.com>
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Link: https://lore.kernel.org/oe-lkp/202303170839.fdf23068-oliver.sang@intel.com
CC: stable(a)vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain(a)oracle.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6d0124b6e79e..ac0e8fb92fc8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1366,8 +1366,17 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
* So, we need to add a special mount option to scan for
* later supers, using BTRFS_SUPER_MIRROR_MAX instead
*/
- flags |= FMODE_EXCL;
+ /*
+ * Avoid using flag |= FMODE_EXCL here, as the systemd-udev may
+ * initiate the device scan which may race with the user's mount
+ * or mkfs command, resulting in failure.
+ * Since the device scan is solely for reading purposes, there is
+ * no need for FMODE_EXCL. Additionally, the devices are read again
+ * during the mount process. It is ok to get some inconsistent
+ * values temporarily, as the device paths of the fsid are the only
+ * required information for assembling the volume.
+ */
bdev = blkdev_get_by_path(path, flags, holder);
if (IS_ERR(bdev))
return ERR_CAST(bdev);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 50d281fc434cb8e2497f5e70a309ccca6b1a09f0
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023040333-vanquish-wriggle-e007@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
50d281fc434c ("btrfs: scan device in non-exclusive mode")
12659251ca5d ("btrfs: implement log-structured superblock for ZONED mode")
5d1ab66c56fe ("btrfs: disallow space_cache in ZONED mode")
b70f509774ad ("btrfs: check and enable ZONED mode")
5b316468983d ("btrfs: get zone information of zoned block devices")
bacce86ae8a7 ("btrfs: drop unused argument step from btrfs_free_extra_devids")
96c2e067ed3e ("btrfs: skip devices without magic signature when mounting")
c3e1f96c37d0 ("btrfs: enumerate the type of exclusive operation in progress")
944d3f9fac61 ("btrfs: switch seed device to list api")
c4989c2fd0eb ("btrfs: simplify setting/clearing fs_info to btrfs_fs_devices")
54eed6ae8d8e ("btrfs: make close_fs_devices return void")
3712ccb7f1cc ("btrfs: factor out loop logic from btrfs_free_extra_devids")
dc0ab488d2cb ("btrfs: factor out reada loop in __reada_start_machine")
adca4d945c8d ("btrfs: qgroup: remove ASYNC_COMMIT mechanism in favor of reserve retry-after-EDQUOT")
3092c68fc58c ("btrfs: sysfs: add bdi link to the fsid directory")
998a0671961f ("btrfs: include non-missing as a qualifier for the latest_bdev")
1ed802c972c6 ("btrfs: drop useless goto in open_fs_devices")
b335eab890ed ("btrfs: make btrfs_read_disk_super return struct btrfs_disk_super")
c4a816c67c39 ("btrfs: introduce chunk allocation policy")
9a8658e33d8f ("btrfs: open code trivial helper btrfs_header_fsid")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 50d281fc434cb8e2497f5e70a309ccca6b1a09f0 Mon Sep 17 00:00:00 2001
From: Anand Jain <anand.jain(a)oracle.com>
Date: Thu, 23 Mar 2023 15:56:48 +0800
Subject: [PATCH] btrfs: scan device in non-exclusive mode
This fixes mkfs/mount/check failures due to race with systemd-udevd
scan.
During the device scan initiated by systemd-udevd, other user space
EXCL operations such as mkfs, mount, or check may get blocked and result
in a "Device or resource busy" error. This is because the device
scan process opens the device with the EXCL flag in the kernel.
Two reports were received:
- btrfs/179 test case, where the fsck command failed with the -EBUSY
error
- LTP pwritev03 test case, where mkfs.vfs failed with
the -EBUSY error, when mkfs.vfs tried to overwrite old btrfs filesystem
on the device.
In both cases, fsck and mkfs (respectively) were racing with a
systemd-udevd device scan, and systemd-udevd won, resulting in the
-EBUSY error for fsck and mkfs.
Reproducing the problem has been difficult because there is a very
small window during which these userspace threads can race to
acquire the exclusive device open. Even on the system where the problem
was observed, the problem occurrences were anywhere between 10 to 400
iterations and chances of reproducing decreases with debug printk()s.
However, an exclusive device open is unnecessary for the scan process,
as there are no write operations on the device during scan. Furthermore,
during the mount process, the superblock is re-read in the below
function call chain:
btrfs_mount_root
btrfs_open_devices
open_fs_devices
btrfs_open_one_device
btrfs_get_bdev_and_sb
So, to fix this issue, removes the FMODE_EXCL flag from the scan
operation, and add a comment.
The case where mkfs may still write to the device and a scan is running,
the btrfs signature is not written at that time so scan will not
recognize such device.
Reported-by: Sherry Yang <sherry.yang(a)oracle.com>
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Link: https://lore.kernel.org/oe-lkp/202303170839.fdf23068-oliver.sang@intel.com
CC: stable(a)vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain(a)oracle.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6d0124b6e79e..ac0e8fb92fc8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1366,8 +1366,17 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
* So, we need to add a special mount option to scan for
* later supers, using BTRFS_SUPER_MIRROR_MAX instead
*/
- flags |= FMODE_EXCL;
+ /*
+ * Avoid using flag |= FMODE_EXCL here, as the systemd-udev may
+ * initiate the device scan which may race with the user's mount
+ * or mkfs command, resulting in failure.
+ * Since the device scan is solely for reading purposes, there is
+ * no need for FMODE_EXCL. Additionally, the devices are read again
+ * during the mount process. It is ok to get some inconsistent
+ * values temporarily, as the device paths of the fsid are the only
+ * required information for assembling the volume.
+ */
bdev = blkdev_get_by_path(path, flags, holder);
if (IS_ERR(bdev))
return ERR_CAST(bdev);
On 4/3/23 12:10 PM, Matthew Wilcox wrote:
> On Sun, Apr 02, 2023 at 06:19:20AM +0800, Rongwei Wang wrote:
>> Without this modification, a core will wait (mostly)
>> 'swap_info_struct->lock' when completing
>> 'del_from_avail_list(p)'. Immediately, other cores
>> soon calling 'add_to_avail_list()' to add the same
>> object again when acquiring the lock that released
>> by former. It's not the desired result but exists
>> indeed. This case can be described as below:
> This feels like a very verbose way of saying
>
> "The si->lock must be held when deleting the si from the
> available list. Otherwise, another thread can re-add the
> si to the available list, which can lead to memory corruption.
> The only place we have found where this happens is in the
> swapoff path."
It looks better than mine. Sorry for my confusing description, it will
be fixed in the next version.
>
>> +++ b/mm/swapfile.c
>> @@ -2610,8 +2610,12 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>> spin_unlock(&swap_lock);
>> goto out_dput;
>> }
>> - del_from_avail_list(p);
>> + /*
>> + * Here lock is used to protect deleting and SWP_WRITEOK clearing
>> + * can be seen concurrently.
>> + */
> This comment isn't necessary. But I would add a lockdep assert inside
> __del_from_avail_list() that p->lock is held.
Thanks. Actually, I have this line in previous test version, but delete
for saving one line of code.
I will update here as you said.
Thanks for your time.
>
>> spin_lock(&p->lock);
>> + del_from_avail_list(p);
>> if (p->prio < 0) {
>> struct swap_info_struct *si = p;
>> int nid;
>> --
>> 2.27.0
>>
>>
This is a proposal to revert commit 914eedcb9ba0ff53c33808.
I found this when writting a simple UFFDIO_API test to be the first unit
test in this set. Two things breaks with the commit:
- UFFDIO_API check was lost and missing. According to man page, the
kernel should reject ioctl(UFFDIO_API) if uffdio_api.api != 0xaa. This
check is needed if the api version will be extended in the future, or
user app won't be able to identify which is a new kernel.
- Feature flags checks were removed, which means UFFDIO_API with a
feature that does not exist will also succeed. According to the man
page, we should (and it makes sense) to reject ioctl(UFFDIO_API) if
unknown features passed in.
Link: https://lore.kernel.org/r/20220722201513.1624158-1-axelrasmussen@google.com
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: linux-stable <stable(a)vger.kernel.org>
Signed-off-by: Peter Xu <peterx(a)redhat.com>
---
fs/userfaultfd.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 8395605790f6..3b2a41c330e6 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1977,8 +1977,10 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
ret = -EFAULT;
if (copy_from_user(&uffdio_api, buf, sizeof(uffdio_api)))
goto out;
- /* Ignore unsupported features (userspace built against newer kernel) */
- features = uffdio_api.features & UFFD_API_FEATURES;
+ features = uffdio_api.features;
+ ret = -EINVAL;
+ if (uffdio_api.api != UFFD_API || (features & ~UFFD_API_FEATURES))
+ goto err_out;
ret = -EPERM;
if ((features & UFFD_FEATURE_EVENT_FORK) && !capable(CAP_SYS_PTRACE))
goto err_out;
--
2.39.1
Hello my beloved, good morning from here, how are you doing today? My
name is Mrs. Rabi Juanni Marcus, I have something very important that
i want to discuss with you.