Hello,
We ran automated tests on a recent commit from this kernel tree:
Kernel repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
Commit: 08d4da79178d - Linux 5.14.3-rc1
The results of these automated tests are provided below.
Overall result: PASSED
Merge: OK
Compile: OK
Tests: OK
Targeted tests: NO
All kernel binaries, config files, and logs are available for download here:
https://arr-cki-prod-datawarehouse-public.s3.amazonaws.com/index.html?prefi…
Please reply to this email if you have any questions about the tests that we
ran or if you have any suggestions on how to make future tests more effective.
,-. ,-.
( C ) ( K ) Continuous
`-',-.`-' Kernel
( I ) Integration
`-'
______________________________________________________________________________
Compile testing
---------------
We compiled the kernel for 4 architectures:
aarch64:
make options: make -j24 INSTALL_MOD_STRIP=1 targz-pkg
ppc64le:
make options: make -j24 INSTALL_MOD_STRIP=1 targz-pkg
s390x:
make options: make -j24 INSTALL_MOD_STRIP=1 targz-pkg
x86_64:
make options: make -j24 INSTALL_MOD_STRIP=1 targz-pkg
Hardware testing
----------------
We booted each kernel and ran the following tests:
aarch64:
Host 1:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
✅ Boot test
✅ Reboot test
✅ ACPI table test
✅ ACPI enabled test
✅ LTP - cve
✅ LTP - sched
⚡⚡⚡ LTP - syscalls
⚡⚡⚡ LTP - can
⚡⚡⚡ LTP - commands
⚡⚡⚡ LTP - containers
⚡⚡⚡ LTP - dio
⚡⚡⚡ LTP - fs
⚡⚡⚡ LTP - fsx
⚡⚡⚡ LTP - math
⚡⚡⚡ LTP - hugetlb
⚡⚡⚡ LTP - mm
⚡⚡⚡ LTP - nptl
⚡⚡⚡ LTP - pty
⚡⚡⚡ LTP - ipc
⚡⚡⚡ LTP - tracing
⚡⚡⚡ CIFS Connectathon
⚡⚡⚡ POSIX pjd-fstest suites
⚡⚡⚡ NFS Connectathon
⚡⚡⚡ Loopdev Sanity
⚡⚡⚡ jvm - jcstress tests
⚡⚡⚡ Memory: fork_mem
⚡⚡⚡ Memory function: memfd_create
⚡⚡⚡ AMTU (Abstract Machine Test Utility)
⚡⚡⚡ Networking bridge: sanity
⚡⚡⚡ Ethernet drivers sanity
⚡⚡⚡ Networking socket: fuzz
⚡⚡⚡ Networking route: pmtu
⚡⚡⚡ Networking route_func - local
⚡⚡⚡ Networking route_func - forward
⚡⚡⚡ Networking TCP: keepalive test
⚡⚡⚡ Networking UDP: socket
⚡⚡⚡ Networking cki netfilter test
⚡⚡⚡ Networking tunnel: geneve basic test
⚡⚡⚡ Networking tunnel: gre basic
⚡⚡⚡ L2TP basic test
⚡⚡⚡ Networking tunnel: vxlan basic
⚡⚡⚡ Networking ipsec: basic netns - transport
⚡⚡⚡ Networking ipsec: basic netns - tunnel
⚡⚡⚡ Libkcapi AF_ALG test
⚡⚡⚡ pciutils: update pci ids test
⚡⚡⚡ ALSA PCM loopback test
⚡⚡⚡ ALSA Control (mixer) Userspace Element test
⚡⚡⚡ storage: dm/common
⚡⚡⚡ storage: SCSI VPD
⚡⚡⚡ trace: ftrace/tracer
🚧 ⚡⚡⚡ xarray-idr-radixtree-test
🚧 ⚡⚡⚡ i2c: i2cdetect sanity
🚧 ⚡⚡⚡ Firmware test suite
🚧 ⚡⚡⚡ Memory function: kaslr
🚧 ⚡⚡⚡ Networking: igmp conformance test
🚧 ⚡⚡⚡ audit: audit testsuite test
🚧 ⚡⚡⚡ lvm cache test
🚧 ⚡⚡⚡ lvm snapper test
Host 2:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
⚡⚡⚡ Boot test
⚡⚡⚡ Reboot test
⚡⚡⚡ xfstests - ext4
⚡⚡⚡ xfstests - xfs
⚡⚡⚡ IPMI driver test
⚡⚡⚡ IPMItool loop stress test
⚡⚡⚡ selinux-policy: serge-testsuite
⚡⚡⚡ Storage blktests - blk
⚡⚡⚡ Storage block - filesystem fio test
⚡⚡⚡ Storage block - queue scheduler test
⚡⚡⚡ storage: software RAID testing
⚡⚡⚡ Storage: swraid mdadm raid_module test
🚧 ⚡⚡⚡ Podman system integration test - as root
🚧 ⚡⚡⚡ Podman system integration test - as user
🚧 ⚡⚡⚡ xfstests - btrfs
🚧 ⚡⚡⚡ Storage blktests - nvme-tcp
🚧 ⚡⚡⚡ Storage blktests - nvmeof-mp
🚧 ⚡⚡⚡ Storage blktests - srp
🚧 ⚡⚡⚡ stress: stress-ng
ppc64le:
Host 1:
✅ Boot test
✅ Reboot test
✅ LTP - cve
✅ LTP - sched
✅ LTP - syscalls
✅ LTP - can
✅ LTP - commands
✅ LTP - containers
✅ LTP - dio
✅ LTP - fs
✅ LTP - fsx
✅ LTP - math
✅ LTP - hugetlb
✅ LTP - mm
✅ LTP - nptl
✅ LTP - pty
✅ LTP - ipc
✅ LTP - tracing
✅ CIFS Connectathon
✅ POSIX pjd-fstest suites
✅ NFS Connectathon
✅ Loopdev Sanity
✅ jvm - jcstress tests
✅ Memory: fork_mem
✅ Memory function: memfd_create
✅ AMTU (Abstract Machine Test Utility)
✅ Networking bridge: sanity
✅ Ethernet drivers sanity
✅ Networking socket: fuzz
✅ Networking route: pmtu
✅ Networking route_func - local
✅ Networking route_func - forward
✅ Networking TCP: keepalive test
✅ Networking UDP: socket
✅ Networking cki netfilter test
✅ Networking tunnel: geneve basic test
✅ Networking tunnel: gre basic
✅ L2TP basic test
✅ Networking tunnel: vxlan basic
✅ Networking ipsec: basic netns - tunnel
✅ Libkcapi AF_ALG test
✅ pciutils: update pci ids test
✅ ALSA PCM loopback test
✅ ALSA Control (mixer) Userspace Element test
✅ storage: dm/common
✅ trace: ftrace/tracer
🚧 ✅ xarray-idr-radixtree-test
🚧 ✅ Memory function: kaslr
🚧 ✅ audit: audit testsuite test
🚧 ✅ lvm cache test
🚧 ✅ lvm snapper test
Host 2:
✅ Boot test
✅ Reboot test
✅ xfstests - ext4
✅ xfstests - xfs
✅ IPMI driver test
✅ IPMItool loop stress test
✅ selinux-policy: serge-testsuite
✅ Storage blktests - blk
✅ Storage block - filesystem fio test
✅ Storage block - queue scheduler test
✅ storage: software RAID testing
✅ Storage: swraid mdadm raid_module test
🚧 ✅ Podman system integration test - as root
🚧 ✅ Podman system integration test - as user
🚧 ✅ xfstests - btrfs
🚧 ✅ Storage blktests - nvme-tcp
🚧 ❌ Storage blktests - nvmeof-mp
🚧 ❌ Storage blktests - srp
🚧 ✅ Storage: lvm device-mapper test - upstream
s390x:
Host 1:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
⚡⚡⚡ Boot test
⚡⚡⚡ Reboot test
⚡⚡⚡ selinux-policy: serge-testsuite
⚡⚡⚡ Storage blktests - blk
⚡⚡⚡ Storage: swraid mdadm raid_module test
🚧 ⚡⚡⚡ Podman system integration test - as root
🚧 ⚡⚡⚡ Podman system integration test - as user
🚧 ⚡⚡⚡ Storage blktests - nvme-tcp
🚧 ⚡⚡⚡ Storage blktests - nvmeof-mp
🚧 ⚡⚡⚡ Storage blktests - srp
🚧 ⚡⚡⚡ stress: stress-ng
Host 2:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
✅ Boot test
✅ Reboot test
✅ LTP - cve
✅ LTP - sched
✅ LTP - syscalls
✅ LTP - can
✅ LTP - commands
✅ LTP - containers
✅ LTP - dio
✅ LTP - fs
✅ LTP - fsx
✅ LTP - math
✅ LTP - hugetlb
✅ LTP - mm
✅ LTP - nptl
✅ LTP - pty
✅ LTP - ipc
✅ LTP - tracing
✅ CIFS Connectathon
✅ POSIX pjd-fstest suites
✅ NFS Connectathon
✅ Loopdev Sanity
✅ jvm - jcstress tests
✅ Memory: fork_mem
✅ Memory function: memfd_create
✅ AMTU (Abstract Machine Test Utility)
✅ Networking bridge: sanity
✅ Ethernet drivers sanity
✅ Networking route: pmtu
✅ Networking route_func - local
✅ Networking route_func - forward
✅ Networking TCP: keepalive test
✅ Networking UDP: socket
✅ Networking cki netfilter test
✅ Networking tunnel: geneve basic test
✅ Networking tunnel: gre basic
✅ L2TP basic test
✅ Networking tunnel: vxlan basic
✅ Networking ipsec: basic netns - transport
✅ Networking ipsec: basic netns - tunnel
⚡⚡⚡ Libkcapi AF_ALG test
✅ storage: dm/common
✅ trace: ftrace/tracer
🚧 ❌ xarray-idr-radixtree-test
🚧 ✅ Memory function: kaslr
🚧 ✅ audit: audit testsuite test
🚧 ✅ lvm cache test
🚧 ✅ lvm snapper test
Host 3:
✅ Boot test
✅ Reboot test
✅ selinux-policy: serge-testsuite
✅ Storage blktests - blk
✅ Storage: swraid mdadm raid_module test
🚧 ✅ Podman system integration test - as root
🚧 ✅ Podman system integration test - as user
🚧 ✅ Storage blktests - nvme-tcp
🚧 ❌ Storage blktests - nvmeof-mp
🚧 ❌ Storage blktests - srp
🚧 ✅ stress: stress-ng
x86_64:
Host 1:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
⚡⚡⚡ Boot test
⚡⚡⚡ Reboot test
⚡⚡⚡ xfstests - ext4
⚡⚡⚡ xfstests - xfs
⚡⚡⚡ xfstests - nfsv4.2
⚡⚡⚡ xfstests - cifsv3.11
⚡⚡⚡ IPMI driver test
⚡⚡⚡ IPMItool loop stress test
⚡⚡⚡ selinux-policy: serge-testsuite
⚡⚡⚡ power-management: cpupower/sanity test
⚡⚡⚡ Storage blktests - blk
⚡⚡⚡ Storage block - filesystem fio test
⚡⚡⚡ Storage block - queue scheduler test
⚡⚡⚡ storage: software RAID testing
⚡⚡⚡ Storage: swraid mdadm raid_module test
🚧 ⚡⚡⚡ Podman system integration test - as root
🚧 ⚡⚡⚡ Podman system integration test - as user
🚧 ⚡⚡⚡ CPU: Idle Test
🚧 ⚡⚡⚡ xfstests - btrfs
🚧 ⚡⚡⚡ Storage blktests - nvme-tcp
🚧 ⚡⚡⚡ Storage blktests - nvmeof-mp
🚧 ⚡⚡⚡ Storage blktests - srp
🚧 ⚡⚡⚡ Storage: lvm device-mapper test - upstream
🚧 ⚡⚡⚡ stress: stress-ng
Host 2:
✅ Boot test
✅ Reboot test
✅ ACPI table test
✅ LTP - cve
✅ LTP - sched
✅ LTP - syscalls
✅ LTP - can
✅ LTP - commands
✅ LTP - containers
✅ LTP - dio
✅ LTP - fs
✅ LTP - fsx
✅ LTP - math
✅ LTP - hugetlb
✅ LTP - mm
✅ LTP - nptl
✅ LTP - pty
✅ LTP - ipc
✅ LTP - tracing
✅ CIFS Connectathon
✅ POSIX pjd-fstest suites
✅ NFS Connectathon
✅ Loopdev Sanity
✅ jvm - jcstress tests
✅ Memory: fork_mem
✅ Memory function: memfd_create
✅ AMTU (Abstract Machine Test Utility)
✅ Networking bridge: sanity
✅ Ethernet drivers sanity
✅ Networking socket: fuzz
✅ Networking route: pmtu
✅ Networking route_func - local
✅ Networking route_func - forward
✅ Networking TCP: keepalive test
✅ Networking UDP: socket
✅ Networking cki netfilter test
✅ Networking tunnel: geneve basic test
✅ Networking tunnel: gre basic
✅ L2TP basic test
✅ Networking tunnel: vxlan basic
✅ Networking ipsec: basic netns - transport
✅ Networking ipsec: basic netns - tunnel
✅ Libkcapi AF_ALG test
✅ pciutils: sanity smoke test
✅ pciutils: update pci ids test
✅ ALSA PCM loopback test
✅ ALSA Control (mixer) Userspace Element test
✅ storage: dm/common
✅ storage: SCSI VPD
✅ trace: ftrace/tracer
🚧 ❌ xarray-idr-radixtree-test
🚧 ✅ i2c: i2cdetect sanity
🚧 ✅ Firmware test suite
🚧 ✅ Memory function: kaslr
🚧 ✅ Networking: igmp conformance test
🚧 ✅ audit: audit testsuite test
🚧 ✅ lvm cache test
🚧 ✅ lvm snapper test
Host 3:
✅ Boot test
✅ Reboot test
✅ xfstests - ext4
✅ xfstests - xfs
✅ xfstests - nfsv4.2
✅ xfstests - cifsv3.11
✅ IPMI driver test
✅ IPMItool loop stress test
✅ selinux-policy: serge-testsuite
✅ power-management: cpupower/sanity test
✅ Storage blktests - blk
✅ Storage block - filesystem fio test
✅ Storage block - queue scheduler test
✅ storage: software RAID testing
✅ Storage: swraid mdadm raid_module test
🚧 ✅ Podman system integration test - as root
🚧 ✅ Podman system integration test - as user
🚧 ✅ CPU: Idle Test
🚧 ✅ xfstests - btrfs
🚧 ✅ Storage blktests - nvme-tcp
🚧 ❌ Storage blktests - nvmeof-mp
🚧 ❌ Storage blktests - srp
🚧 ✅ Storage: lvm device-mapper test - upstream
🚧 ✅ stress: stress-ng
Test sources: https://gitlab.com/cki-project/kernel-tests
💚 Pull requests are welcome for new tests or improvements to existing tests!
Aborted tests
-------------
Tests that were aborted before they could complete are marked with ⚡⚡⚡.
If this was caused by an infrastructure issue, we try to mark that
explicitly in the report.
Waived tests
------------
If the test run included waived tests, they are marked with 🚧. Such tests are
executed but their results are not taken into account. Tests are waived when
their results are not reliable enough, e.g. when they're just introduced or are
being fixed.
Testing timeout
---------------
We aim to provide a report within a reasonable timeframe. Tests that haven't
finished running yet are marked with ⏱.
Targeted tests
--------------
Test runs for patches always include a set of base tests, plus some
tests chosen based on the file paths modified by the patch. The latter
are called "targeted tests". If no targeted tests are run, that means
no patch-specific tests are available. Please consider contributing a
targeted test for related patches to increase test coverage. See
https://docs.engineering.redhat.com/x/_wEZB for more details.
From: Joerg Roedel <jroedel(a)suse.de>
For now, kexec is not supported when running as an SEV-ES guest. Doing
so requires additional hypervisor support and special code to hand
over the CPUs to the new kernel in a safe way.
Until this is implemented, do not support kexec in SEV-ES guests.
Cc: stable(a)vger.kernel.org # v5.10+
Signed-off-by: Joerg Roedel <jroedel(a)suse.de>
---
arch/x86/kernel/machine_kexec_64.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 131f30fdcfbd..a8e16a411b40 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -591,3 +591,11 @@ void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages)
*/
set_memory_encrypted((unsigned long)vaddr, pages);
}
+
+/*
+ * Kexec is not supported in SEV-ES guests yet
+ */
+bool arch_kexec_supported(void)
+{
+ return !sev_es_active();
+}
--
2.33.0
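
[Editor's note: the following is an illustrative sketch only, not part of the
patch above. It shows one way generic kexec code could consult the
arch_kexec_supported() hook that the patch introduces; the weak default and
the caller below are assumptions for illustration — the patch itself only
adds the x86 override that returns false when SEV-ES is active.]

/*
 * Sketch (assumed, not from the patch): generic side of an
 * arch_kexec_supported() hook.
 */
bool __weak arch_kexec_supported(void)
{
	/* Architectures without special constraints allow kexec. */
	return true;
}

static int kexec_check_supported(void)
{
	if (!arch_kexec_supported())
		return -EOPNOTSUPP;	/* e.g. running as an SEV-ES guest */
	return 0;
}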
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76224355db7570cbe6b6f75c8929a1558828dd55 Mon Sep 17 00:00:00 2001
From: Miklos Szeredi <mszeredi(a)redhat.com>
Date: Tue, 17 Aug 2021 21:05:16 +0200
Subject: [PATCH] fuse: truncate pagecache on atomic_o_trunc
fuse_finish_open() will be called with FUSE_NOWRITE in case of atomic
O_TRUNC. This can deadlock with fuse_wait_on_page_writeback() in
fuse_launder_page() triggered by invalidate_inode_pages2().
Fix by replacing invalidate_inode_pages2() in fuse_finish_open() with a
truncate_pagecache() call. This makes sense regardless of FOPEN_KEEP_CACHE
or fc->writeback cache, so do it unconditionally.
Reported-by: Xie Yongji <xieyongji(a)bytedance.com>
Reported-and-tested-by: syzbot+bea44a5189836d956894(a)syzkaller.appspotmail.com
Fixes: e4648309b85a ("fuse: truncate pending writes on O_TRUNC")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi(a)redhat.com>
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 97f860cfc195..5e5efb66c7d7 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -198,12 +198,11 @@ void fuse_finish_open(struct inode *inode, struct file *file)
struct fuse_file *ff = file->private_data;
struct fuse_conn *fc = get_fuse_conn(inode);
- if (!(ff->open_flags & FOPEN_KEEP_CACHE))
- invalidate_inode_pages2(inode->i_mapping);
if (ff->open_flags & FOPEN_STREAM)
stream_open(inode, file);
else if (ff->open_flags & FOPEN_NONSEEKABLE)
nonseekable_open(inode, file);
+
if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
struct fuse_inode *fi = get_fuse_inode(inode);
@@ -211,10 +210,14 @@ void fuse_finish_open(struct inode *inode, struct file *file)
fi->attr_version = atomic64_inc_return(&fc->attr_version);
i_size_write(inode, 0);
spin_unlock(&fi->lock);
+ truncate_pagecache(inode, 0);
fuse_invalidate_attr(inode);
if (fc->writeback_cache)
file_update_time(file);
+ } else if (!(ff->open_flags & FOPEN_KEEP_CACHE)) {
+ invalidate_inode_pages2(inode->i_mapping);
}
+
if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
fuse_link_write_file(file);
}
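
[Editor's note: for reference, piecing the hunks above together,
fuse_finish_open() reads roughly as follows after the change. This is
reconstructed from the diff; lines the hunks do not show are elided.]

void fuse_finish_open(struct inode *inode, struct file *file)
{
	struct fuse_file *ff = file->private_data;
	struct fuse_conn *fc = get_fuse_conn(inode);

	if (ff->open_flags & FOPEN_STREAM)
		stream_open(inode, file);
	else if (ff->open_flags & FOPEN_NONSEEKABLE)
		nonseekable_open(inode, file);

	if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
		struct fuse_inode *fi = get_fuse_inode(inode);

		/* ... lines between the two hunks elided ... */
		fi->attr_version = atomic64_inc_return(&fc->attr_version);
		i_size_write(inode, 0);
		spin_unlock(&fi->lock);
		/* new: drop the page cache instead of only invalidating it */
		truncate_pagecache(inode, 0);
		fuse_invalidate_attr(inode);
		if (fc->writeback_cache)
			file_update_time(file);
	} else if (!(ff->open_flags & FOPEN_KEEP_CACHE)) {
		invalidate_inode_pages2(inode->i_mapping);
	}

	if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
		fuse_link_write_file(file);
}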
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76224355db7570cbe6b6f75c8929a1558828dd55 Mon Sep 17 00:00:00 2001
From: Miklos Szeredi <mszeredi(a)redhat.com>
Date: Tue, 17 Aug 2021 21:05:16 +0200
Subject: [PATCH] fuse: truncate pagecache on atomic_o_trunc
fuse_finish_open() will be called with FUSE_NOWRITE in case of atomic
O_TRUNC. This can deadlock with fuse_wait_on_page_writeback() in
fuse_launder_page() triggered by invalidate_inode_pages2().
Fix by replacing invalidate_inode_pages2() in fuse_finish_open() with a
truncate_pagecache() call. This makes sense regardless of FOPEN_KEEP_CACHE
or fc->writeback cache, so do it unconditionally.
Reported-by: Xie Yongji <xieyongji(a)bytedance.com>
Reported-and-tested-by: syzbot+bea44a5189836d956894(a)syzkaller.appspotmail.com
Fixes: e4648309b85a ("fuse: truncate pending writes on O_TRUNC")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi(a)redhat.com>
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 97f860cfc195..5e5efb66c7d7 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -198,12 +198,11 @@ void fuse_finish_open(struct inode *inode, struct file *file)
struct fuse_file *ff = file->private_data;
struct fuse_conn *fc = get_fuse_conn(inode);
- if (!(ff->open_flags & FOPEN_KEEP_CACHE))
- invalidate_inode_pages2(inode->i_mapping);
if (ff->open_flags & FOPEN_STREAM)
stream_open(inode, file);
else if (ff->open_flags & FOPEN_NONSEEKABLE)
nonseekable_open(inode, file);
+
if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
struct fuse_inode *fi = get_fuse_inode(inode);
@@ -211,10 +210,14 @@ void fuse_finish_open(struct inode *inode, struct file *file)
fi->attr_version = atomic64_inc_return(&fc->attr_version);
i_size_write(inode, 0);
spin_unlock(&fi->lock);
+ truncate_pagecache(inode, 0);
fuse_invalidate_attr(inode);
if (fc->writeback_cache)
file_update_time(file);
+ } else if (!(ff->open_flags & FOPEN_KEEP_CACHE)) {
+ invalidate_inode_pages2(inode->i_mapping);
}
+
if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
fuse_link_write_file(file);
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76224355db7570cbe6b6f75c8929a1558828dd55 Mon Sep 17 00:00:00 2001
From: Miklos Szeredi <mszeredi(a)redhat.com>
Date: Tue, 17 Aug 2021 21:05:16 +0200
Subject: [PATCH] fuse: truncate pagecache on atomic_o_trunc
fuse_finish_open() will be called with FUSE_NOWRITE in case of atomic
O_TRUNC. This can deadlock with fuse_wait_on_page_writeback() in
fuse_launder_page() triggered by invalidate_inode_pages2().
Fix by replacing invalidate_inode_pages2() in fuse_finish_open() with a
truncate_pagecache() call. This makes sense regardless of FOPEN_KEEP_CACHE
or fc->writeback cache, so do it unconditionally.
Reported-by: Xie Yongji <xieyongji(a)bytedance.com>
Reported-and-tested-by: syzbot+bea44a5189836d956894(a)syzkaller.appspotmail.com
Fixes: e4648309b85a ("fuse: truncate pending writes on O_TRUNC")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi(a)redhat.com>
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 97f860cfc195..5e5efb66c7d7 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -198,12 +198,11 @@ void fuse_finish_open(struct inode *inode, struct file *file)
struct fuse_file *ff = file->private_data;
struct fuse_conn *fc = get_fuse_conn(inode);
- if (!(ff->open_flags & FOPEN_KEEP_CACHE))
- invalidate_inode_pages2(inode->i_mapping);
if (ff->open_flags & FOPEN_STREAM)
stream_open(inode, file);
else if (ff->open_flags & FOPEN_NONSEEKABLE)
nonseekable_open(inode, file);
+
if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
struct fuse_inode *fi = get_fuse_inode(inode);
@@ -211,10 +210,14 @@ void fuse_finish_open(struct inode *inode, struct file *file)
fi->attr_version = atomic64_inc_return(&fc->attr_version);
i_size_write(inode, 0);
spin_unlock(&fi->lock);
+ truncate_pagecache(inode, 0);
fuse_invalidate_attr(inode);
if (fc->writeback_cache)
file_update_time(file);
+ } else if (!(ff->open_flags & FOPEN_KEEP_CACHE)) {
+ invalidate_inode_pages2(inode->i_mapping);
}
+
if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
fuse_link_write_file(file);
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ecc53c48c13d995e6fe5559e30ffee48d92784fd Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Sun, 29 Aug 2021 16:13:03 -0600
Subject: [PATCH] io-wq: check max_worker limits if a worker transitions bound
state
For the two places where new workers are created, we diligently check if
we are allowed to create a new worker. If we're currently at the limit
of how many workers of a given type we can have, then we don't create
any new ones.
If you have a mixed workload with various types of bound and unbounded
work, then it can happen that a worker finishes one type of work and
is then transitioned to the other type. For this case, we don't check
if we are actually allowed to do so. This can cause io-wq to temporarily
exceed the allowed number of workers for a given type.
When retrieving work, check that the types match. If they don't, check
if we are allowed to transition to the other type. If not, then don't
handle the new work.
Cc: stable(a)vger.kernel.org
Reported-by: Johannes Lundberg <johalun0(a)gmail.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io-wq.c b/fs/io-wq.c
index 4b5fc621ab39..da3ad45028f9 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -424,7 +424,28 @@ static void io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
spin_unlock(&wq->hash->wait.lock);
}
-static struct io_wq_work *io_get_next_work(struct io_wqe *wqe)
+/*
+ * We can always run the work if the worker is currently the same type as
+ * the work (eg both are bound, or both are unbound). If they are not the
+ * same, only allow it if incrementing the worker count would be allowed.
+ */
+static bool io_worker_can_run_work(struct io_worker *worker,
+ struct io_wq_work *work)
+{
+ struct io_wqe_acct *acct;
+
+ if (!(worker->flags & IO_WORKER_F_BOUND) !=
+ !(work->flags & IO_WQ_WORK_UNBOUND))
+ return true;
+
+ /* not the same type, check if we'd go over the limit */
+ acct = io_work_get_acct(worker->wqe, work);
+ return acct->nr_workers < acct->max_workers;
+}
+
+static struct io_wq_work *io_get_next_work(struct io_wqe *wqe,
+ struct io_worker *worker,
+ bool *stalled)
__must_hold(wqe->lock)
{
struct io_wq_work_node *node, *prev;
@@ -436,6 +457,9 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe)
work = container_of(node, struct io_wq_work, list);
+ if (!io_worker_can_run_work(worker, work))
+ break;
+
/* not hashed, can run anytime */
if (!io_wq_is_hashed(work)) {
wq_list_del(&wqe->work_list, node, prev);
@@ -462,6 +486,7 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe)
raw_spin_unlock(&wqe->lock);
io_wait_on_hash(wqe, stall_hash);
raw_spin_lock(&wqe->lock);
+ *stalled = true;
}
return NULL;
@@ -501,6 +526,7 @@ static void io_worker_handle_work(struct io_worker *worker)
do {
struct io_wq_work *work;
+ bool stalled;
get_next:
/*
* If we got some work, mark us as busy. If we didn't, but
@@ -509,10 +535,11 @@ static void io_worker_handle_work(struct io_worker *worker)
* can't make progress, any work completion or insertion will
* clear the stalled flag.
*/
- work = io_get_next_work(wqe);
+ stalled = false;
+ work = io_get_next_work(wqe, worker, &stalled);
if (work)
__io_worker_busy(wqe, worker, work);
- else if (!wq_list_empty(&wqe->work_list))
+ else if (stalled)
wqe->flags |= IO_WQE_FLAG_STALLED;
raw_spin_unlock_irq(&wqe->lock);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 94ffb0a282872c2f4b14f757fa1aef2302aeaabb Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Mon, 30 Aug 2021 11:55:22 -0600
Subject: [PATCH] io-wq: fix race between adding work and activating a free
worker
The attempt to find and activate a free worker for new work is currently
combined with creating a new one if we don't find one, but that opens
io-wq up to a race where the worker that is found and activated can
put itself to sleep without knowing that it has been selected to perform
this new work.
Fix this by moving the activation into where we add the new work item,
then we can retain it within the wqe->lock scope and eliminate the race
with the worker itself checking inside the lock, but sleeping outside of
it.
Cc: stable(a)vger.kernel.org
Reported-by: Andres Freund <andres(a)anarazel.de>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io-wq.c b/fs/io-wq.c
index cd9bd095fb1b..94f8f2ecb8e5 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -236,9 +236,9 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
* We need a worker. If we find a free one, we're good. If not, and we're
* below the max number of workers, create one.
*/
-static void io_wqe_wake_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
+static void io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
{
- bool ret;
+ bool do_create = false, first = false;
/*
* Most likely an attempt to queue unbounded work on an io_wq that
@@ -247,26 +247,18 @@ static void io_wqe_wake_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
if (unlikely(!acct->max_workers))
pr_warn_once("io-wq is not configured for unbound workers");
- rcu_read_lock();
- ret = io_wqe_activate_free_worker(wqe);
- rcu_read_unlock();
-
- if (!ret) {
- bool do_create = false, first = false;
-
- raw_spin_lock(&wqe->lock);
- if (acct->nr_workers < acct->max_workers) {
- if (!acct->nr_workers)
- first = true;
- acct->nr_workers++;
- do_create = true;
- }
- raw_spin_unlock(&wqe->lock);
- if (do_create) {
- atomic_inc(&acct->nr_running);
- atomic_inc(&wqe->wq->worker_refs);
- create_io_worker(wqe->wq, wqe, acct->index, first);
- }
+ raw_spin_lock(&wqe->lock);
+ if (acct->nr_workers < acct->max_workers) {
+ if (!acct->nr_workers)
+ first = true;
+ acct->nr_workers++;
+ do_create = true;
+ }
+ raw_spin_unlock(&wqe->lock);
+ if (do_create) {
+ atomic_inc(&acct->nr_running);
+ atomic_inc(&wqe->wq->worker_refs);
+ create_io_worker(wqe->wq, wqe, acct->index, first);
}
}
@@ -794,7 +786,8 @@ static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work)
static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
{
struct io_wqe_acct *acct = io_work_get_acct(wqe, work);
- bool do_wake;
+ unsigned work_flags = work->flags;
+ bool do_create;
/*
* If io-wq is exiting for this task, or if the request has explicitly
@@ -809,12 +802,16 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
raw_spin_lock(&wqe->lock);
io_wqe_insert_work(wqe, work);
wqe->flags &= ~IO_WQE_FLAG_STALLED;
- do_wake = (work->flags & IO_WQ_WORK_CONCURRENT) ||
- !atomic_read(&acct->nr_running);
+
+ rcu_read_lock();
+ do_create = !io_wqe_activate_free_worker(wqe);
+ rcu_read_unlock();
+
raw_spin_unlock(&wqe->lock);
- if (do_wake)
- io_wqe_wake_worker(wqe, acct);
+ if (do_create && ((work_flags & IO_WQ_WORK_CONCURRENT) ||
+ !atomic_read(&acct->nr_running)))
+ io_wqe_create_worker(wqe, acct);
}
void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
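
[Editor's note: for reference, reconstructing io_wqe_enqueue() from the hunks
above, the post-patch flow is roughly as follows; lines the diff does not
show are elided.]

static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
{
	struct io_wqe_acct *acct = io_work_get_acct(wqe, work);
	unsigned work_flags = work->flags;
	bool do_create;

	/* ... early-return path for an exiting io-wq elided ... */

	raw_spin_lock(&wqe->lock);
	io_wqe_insert_work(wqe, work);
	wqe->flags &= ~IO_WQE_FLAG_STALLED;

	/* activating a free worker now happens under wqe->lock */
	rcu_read_lock();
	do_create = !io_wqe_activate_free_worker(wqe);
	rcu_read_unlock();

	raw_spin_unlock(&wqe->lock);

	/* only create a new worker if none could be activated */
	if (do_create && ((work_flags & IO_WQ_WORK_CONCURRENT) ||
	    !atomic_read(&acct->nr_running)))
		io_wqe_create_worker(wqe, acct);
}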
The patch below does not apply to the 5.13-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 94ffb0a282872c2f4b14f757fa1aef2302aeaabb Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Mon, 30 Aug 2021 11:55:22 -0600
Subject: [PATCH] io-wq: fix race between adding work and activating a free
worker
The attempt to find and activate a free worker for new work is currently
combined with creating a new one if we don't find one, but that opens
io-wq up to a race where the worker that is found and activated can
put itself to sleep without knowing that it has been selected to perform
this new work.
Fix this by moving the activation into where we add the new work item,
then we can retain it within the wqe->lock scope and eliminate the race
with the worker itself checking inside the lock, but sleeping outside of
it.
Cc: stable(a)vger.kernel.org
Reported-by: Andres Freund <andres(a)anarazel.de>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io-wq.c b/fs/io-wq.c
index cd9bd095fb1b..94f8f2ecb8e5 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -236,9 +236,9 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
* We need a worker. If we find a free one, we're good. If not, and we're
* below the max number of workers, create one.
*/
-static void io_wqe_wake_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
+static void io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
{
- bool ret;
+ bool do_create = false, first = false;
/*
* Most likely an attempt to queue unbounded work on an io_wq that
@@ -247,26 +247,18 @@ static void io_wqe_wake_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
if (unlikely(!acct->max_workers))
pr_warn_once("io-wq is not configured for unbound workers");
- rcu_read_lock();
- ret = io_wqe_activate_free_worker(wqe);
- rcu_read_unlock();
-
- if (!ret) {
- bool do_create = false, first = false;
-
- raw_spin_lock(&wqe->lock);
- if (acct->nr_workers < acct->max_workers) {
- if (!acct->nr_workers)
- first = true;
- acct->nr_workers++;
- do_create = true;
- }
- raw_spin_unlock(&wqe->lock);
- if (do_create) {
- atomic_inc(&acct->nr_running);
- atomic_inc(&wqe->wq->worker_refs);
- create_io_worker(wqe->wq, wqe, acct->index, first);
- }
+ raw_spin_lock(&wqe->lock);
+ if (acct->nr_workers < acct->max_workers) {
+ if (!acct->nr_workers)
+ first = true;
+ acct->nr_workers++;
+ do_create = true;
+ }
+ raw_spin_unlock(&wqe->lock);
+ if (do_create) {
+ atomic_inc(&acct->nr_running);
+ atomic_inc(&wqe->wq->worker_refs);
+ create_io_worker(wqe->wq, wqe, acct->index, first);
}
}
@@ -794,7 +786,8 @@ static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work)
static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
{
struct io_wqe_acct *acct = io_work_get_acct(wqe, work);
- bool do_wake;
+ unsigned work_flags = work->flags;
+ bool do_create;
/*
* If io-wq is exiting for this task, or if the request has explicitly
@@ -809,12 +802,16 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
raw_spin_lock(&wqe->lock);
io_wqe_insert_work(wqe, work);
wqe->flags &= ~IO_WQE_FLAG_STALLED;
- do_wake = (work->flags & IO_WQ_WORK_CONCURRENT) ||
- !atomic_read(&acct->nr_running);
+
+ rcu_read_lock();
+ do_create = !io_wqe_activate_free_worker(wqe);
+ rcu_read_unlock();
+
raw_spin_unlock(&wqe->lock);
- if (do_wake)
- io_wqe_wake_worker(wqe, acct);
+ if (do_create && ((work_flags & IO_WQ_WORK_CONCURRENT) ||
+ !atomic_read(&acct->nr_running)))
+ io_wqe_create_worker(wqe, acct);
}
void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
The patch below does not apply to the 5.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 94ffb0a282872c2f4b14f757fa1aef2302aeaabb Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Mon, 30 Aug 2021 11:55:22 -0600
Subject: [PATCH] io-wq: fix race between adding work and activating a free
worker
The attempt to find and activate a free worker for new work is currently
combined with creating a new one if we don't find one, but that opens
io-wq up to a race where the worker that is found and activated can
put itself to sleep without knowing that it has been selected to perform
this new work.
Fix this by moving the activation into where we add the new work item,
then we can retain it within the wqe->lock scope and eliminate the race
with the worker itself checking inside the lock, but sleeping outside of
it.
Cc: stable(a)vger.kernel.org
Reported-by: Andres Freund <andres(a)anarazel.de>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io-wq.c b/fs/io-wq.c
index cd9bd095fb1b..94f8f2ecb8e5 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -236,9 +236,9 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
* We need a worker. If we find a free one, we're good. If not, and we're
* below the max number of workers, create one.
*/
-static void io_wqe_wake_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
+static void io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
{
- bool ret;
+ bool do_create = false, first = false;
/*
* Most likely an attempt to queue unbounded work on an io_wq that
@@ -247,26 +247,18 @@ static void io_wqe_wake_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
if (unlikely(!acct->max_workers))
pr_warn_once("io-wq is not configured for unbound workers");
- rcu_read_lock();
- ret = io_wqe_activate_free_worker(wqe);
- rcu_read_unlock();
-
- if (!ret) {
- bool do_create = false, first = false;
-
- raw_spin_lock(&wqe->lock);
- if (acct->nr_workers < acct->max_workers) {
- if (!acct->nr_workers)
- first = true;
- acct->nr_workers++;
- do_create = true;
- }
- raw_spin_unlock(&wqe->lock);
- if (do_create) {
- atomic_inc(&acct->nr_running);
- atomic_inc(&wqe->wq->worker_refs);
- create_io_worker(wqe->wq, wqe, acct->index, first);
- }
+ raw_spin_lock(&wqe->lock);
+ if (acct->nr_workers < acct->max_workers) {
+ if (!acct->nr_workers)
+ first = true;
+ acct->nr_workers++;
+ do_create = true;
+ }
+ raw_spin_unlock(&wqe->lock);
+ if (do_create) {
+ atomic_inc(&acct->nr_running);
+ atomic_inc(&wqe->wq->worker_refs);
+ create_io_worker(wqe->wq, wqe, acct->index, first);
}
}
@@ -794,7 +786,8 @@ static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work)
static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
{
struct io_wqe_acct *acct = io_work_get_acct(wqe, work);
- bool do_wake;
+ unsigned work_flags = work->flags;
+ bool do_create;
/*
* If io-wq is exiting for this task, or if the request has explicitly
@@ -809,12 +802,16 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
raw_spin_lock(&wqe->lock);
io_wqe_insert_work(wqe, work);
wqe->flags &= ~IO_WQE_FLAG_STALLED;
- do_wake = (work->flags & IO_WQ_WORK_CONCURRENT) ||
- !atomic_read(&acct->nr_running);
+
+ rcu_read_lock();
+ do_create = !io_wqe_activate_free_worker(wqe);
+ rcu_read_unlock();
+
raw_spin_unlock(&wqe->lock);
- if (do_wake)
- io_wqe_wake_worker(wqe, acct);
+ if (do_create && ((work_flags & IO_WQ_WORK_CONCURRENT) ||
+ !atomic_read(&acct->nr_running)))
+ io_wqe_create_worker(wqe, acct);
}
void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 87df7fb922d18e96992aa5e824aa34b2065fef59 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Mon, 30 Aug 2021 07:45:47 -0600
Subject: [PATCH] io-wq: fix wakeup race when adding new work
When new work is added, io_wqe_enqueue() checks if we need to wake or
create a new worker. But that check is done outside the lock that
otherwise synchronizes us with a worker going to sleep, so we can end
up in the following situation:
CPU0                                CPU1
lock
insert work
unlock
atomic_read(nr_running) != 0
                                    lock
                                    atomic_dec(nr_running)
no wakeup needed
Hold the wqe lock around the "need to wakeup" check. Then we can also get
rid of the temporary work_flags variable, as we know the work will remain
valid as long as we hold the lock.
Cc: stable(a)vger.kernel.org
Reported-by: Andres Freund <andres(a)anarazel.de>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io-wq.c b/fs/io-wq.c
index 13aeb48a0964..cd9bd095fb1b 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -794,7 +794,7 @@ static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work)
static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
{
struct io_wqe_acct *acct = io_work_get_acct(wqe, work);
- int work_flags;
+ bool do_wake;
/*
* If io-wq is exiting for this task, or if the request has explicitly
@@ -806,14 +806,14 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
return;
}
- work_flags = work->flags;
raw_spin_lock(&wqe->lock);
io_wqe_insert_work(wqe, work);
wqe->flags &= ~IO_WQE_FLAG_STALLED;
+ do_wake = (work->flags & IO_WQ_WORK_CONCURRENT) ||
+ !atomic_read(&acct->nr_running);
raw_spin_unlock(&wqe->lock);
- if ((work_flags & IO_WQ_WORK_CONCURRENT) ||
- !atomic_read(&acct->nr_running))
+ if (do_wake)
io_wqe_wake_worker(wqe, acct);
}
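
[Editor's note: for reference, after this change io_wqe_enqueue() evaluates
the wakeup condition while still holding wqe->lock. Reconstructed from the
hunks above; elided lines are omitted.]

static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
{
	struct io_wqe_acct *acct = io_work_get_acct(wqe, work);
	bool do_wake;

	/* ... early-return path for an exiting io-wq elided ... */

	raw_spin_lock(&wqe->lock);
	io_wqe_insert_work(wqe, work);
	wqe->flags &= ~IO_WQE_FLAG_STALLED;
	/* decide on the wakeup while the lock still excludes sleeping workers */
	do_wake = (work->flags & IO_WQ_WORK_CONCURRENT) ||
		  !atomic_read(&acct->nr_running);
	raw_spin_unlock(&wqe->lock);

	if (do_wake)
		io_wqe_wake_worker(wqe, acct);
}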
The patch below does not apply to the 5.13-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 87df7fb922d18e96992aa5e824aa34b2065fef59 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Mon, 30 Aug 2021 07:45:47 -0600
Subject: [PATCH] io-wq: fix wakeup race when adding new work
When new work is added, io_wqe_enqueue() checks if we need to wake or
create a new worker. But that check is done outside the lock that
otherwise synchronizes us with a worker going to sleep, so we can end
up in the following situation:
CPU0                                CPU1
lock
insert work
unlock
atomic_read(nr_running) != 0
                                    lock
                                    atomic_dec(nr_running)
no wakeup needed
Hold the wqe lock around the "need to wakeup" check. Then we can also get
rid of the temporary work_flags variable, as we know the work will remain
valid as long as we hold the lock.
Cc: stable(a)vger.kernel.org
Reported-by: Andres Freund <andres(a)anarazel.de>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io-wq.c b/fs/io-wq.c
index 13aeb48a0964..cd9bd095fb1b 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -794,7 +794,7 @@ static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work)
static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
{
struct io_wqe_acct *acct = io_work_get_acct(wqe, work);
- int work_flags;
+ bool do_wake;
/*
* If io-wq is exiting for this task, or if the request has explicitly
@@ -806,14 +806,14 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
return;
}
- work_flags = work->flags;
raw_spin_lock(&wqe->lock);
io_wqe_insert_work(wqe, work);
wqe->flags &= ~IO_WQE_FLAG_STALLED;
+ do_wake = (work->flags & IO_WQ_WORK_CONCURRENT) ||
+ !atomic_read(&acct->nr_running);
raw_spin_unlock(&wqe->lock);
- if ((work_flags & IO_WQ_WORK_CONCURRENT) ||
- !atomic_read(&acct->nr_running))
+ if (do_wake)
io_wqe_wake_worker(wqe, acct);
}
The patch below does not apply to the 5.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 87df7fb922d18e96992aa5e824aa34b2065fef59 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Mon, 30 Aug 2021 07:45:47 -0600
Subject: [PATCH] io-wq: fix wakeup race when adding new work
When new work is added, io_wqe_enqueue() checks if we need to wake or
create a new worker. But that check is done outside the lock that
otherwise synchronizes us with a worker going to sleep, so we can end
up in the following situation:
CPU0                                CPU1
lock
insert work
unlock
atomic_read(nr_running) != 0
                                    lock
                                    atomic_dec(nr_running)
no wakeup needed
Hold the wqe lock around the "need to wakeup" check. Then we can also get
rid of the temporary work_flags variable, as we know the work will remain
valid as long as we hold the lock.
Cc: stable(a)vger.kernel.org
Reported-by: Andres Freund <andres(a)anarazel.de>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io-wq.c b/fs/io-wq.c
index 13aeb48a0964..cd9bd095fb1b 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -794,7 +794,7 @@ static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work)
static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
{
struct io_wqe_acct *acct = io_work_get_acct(wqe, work);
- int work_flags;
+ bool do_wake;
/*
* If io-wq is exiting for this task, or if the request has explicitly
@@ -806,14 +806,14 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
return;
}
- work_flags = work->flags;
raw_spin_lock(&wqe->lock);
io_wqe_insert_work(wqe, work);
wqe->flags &= ~IO_WQE_FLAG_STALLED;
+ do_wake = (work->flags & IO_WQ_WORK_CONCURRENT) ||
+ !atomic_read(&acct->nr_running);
raw_spin_unlock(&wqe->lock);
- if ((work_flags & IO_WQ_WORK_CONCURRENT) ||
- !atomic_read(&acct->nr_running))
+ if (do_wake)
io_wqe_wake_worker(wqe, acct);
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3134cc8beb69d0db9de651081707c4651c011621 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Thu, 19 Aug 2021 19:03:05 +0100
Subject: [PATCH] KVM: arm64: vgic: Resample HW pending state on deactivation
When a mapped level interrupt (a timer, for example) is deactivated
by the guest, the corresponding host interrupt is equally deactivated.
However, the fate of the pending state still needs to be dealt
with in SW.
This is especially true when the interrupt was in the active+pending
state in the virtual distributor at the point where the guest
was entered. On exit, the pending state is potentially stale
(the guest may have put the interrupt in a non-pending state).
If we don't do anything, the interrupt will be spuriously injected
in the guest. Although this shouldn't have any ill effect (spurious
interrupts are always possible), we can improve the emulation by
detecting the deactivation-while-pending case and resample the
interrupt.
While we're at it, move the logic into a common helper that can
be shared between the two GIC implementations.
Fixes: e40cc57bac79 ("KVM: arm/arm64: vgic: Support level-triggered mapped interrupts")
Reported-by: Raghavendra Rao Ananta <rananta(a)google.com>
Tested-by: Raghavendra Rao Ananta <rananta(a)google.com>
Reviewed-by: Oliver Upton <oupton(a)google.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20210819180305.1670525-1-maz@kernel.org
diff --git a/arch/arm64/kvm/vgic/vgic-v2.c b/arch/arm64/kvm/vgic/vgic-v2.c
index 2c580204f1dc..95a18cec14a3 100644
--- a/arch/arm64/kvm/vgic/vgic-v2.c
+++ b/arch/arm64/kvm/vgic/vgic-v2.c
@@ -60,6 +60,7 @@ void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu)
u32 val = cpuif->vgic_lr[lr];
u32 cpuid, intid = val & GICH_LR_VIRTUALID;
struct vgic_irq *irq;
+ bool deactivated;
/* Extract the source vCPU id from the LR */
cpuid = val & GICH_LR_PHYSID_CPUID;
@@ -75,7 +76,8 @@ void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu)
raw_spin_lock(&irq->irq_lock);
- /* Always preserve the active bit */
+ /* Always preserve the active bit, note deactivation */
+ deactivated = irq->active && !(val & GICH_LR_ACTIVE_BIT);
irq->active = !!(val & GICH_LR_ACTIVE_BIT);
if (irq->active && vgic_irq_is_sgi(intid))
@@ -96,36 +98,8 @@ void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu)
if (irq->config == VGIC_CONFIG_LEVEL && !(val & GICH_LR_STATE))
irq->pending_latch = false;
- /*
- * Level-triggered mapped IRQs are special because we only
- * observe rising edges as input to the VGIC.
- *
- * If the guest never acked the interrupt we have to sample
- * the physical line and set the line level, because the
- * device state could have changed or we simply need to
- * process the still pending interrupt later.
- *
- * If this causes us to lower the level, we have to also clear
- * the physical active state, since we will otherwise never be
- * told when the interrupt becomes asserted again.
- *
- * Another case is when the interrupt requires a helping hand
- * on deactivation (no HW deactivation, for example).
- */
- if (vgic_irq_is_mapped_level(irq)) {
- bool resample = false;
-
- if (val & GICH_LR_PENDING_BIT) {
- irq->line_level = vgic_get_phys_line_level(irq);
- resample = !irq->line_level;
- } else if (vgic_irq_needs_resampling(irq) &&
- !(irq->active || irq->pending_latch)) {
- resample = true;
- }
-
- if (resample)
- vgic_irq_set_phys_active(irq, false);
- }
+ /* Handle resampling for mapped interrupts if required */
+ vgic_irq_handle_resampling(irq, deactivated, val & GICH_LR_PENDING_BIT);
raw_spin_unlock(&irq->irq_lock);
vgic_put_irq(vcpu->kvm, irq);
diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
index 66004f61cd83..21a6207fb2ee 100644
--- a/arch/arm64/kvm/vgic/vgic-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-v3.c
@@ -46,6 +46,7 @@ void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu)
u32 intid, cpuid;
struct vgic_irq *irq;
bool is_v2_sgi = false;
+ bool deactivated;
cpuid = val & GICH_LR_PHYSID_CPUID;
cpuid >>= GICH_LR_PHYSID_CPUID_SHIFT;
@@ -68,7 +69,8 @@ void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu)
raw_spin_lock(&irq->irq_lock);
- /* Always preserve the active bit */
+ /* Always preserve the active bit, note deactivation */
+ deactivated = irq->active && !(val & ICH_LR_ACTIVE_BIT);
irq->active = !!(val & ICH_LR_ACTIVE_BIT);
if (irq->active && is_v2_sgi)
@@ -89,36 +91,8 @@ void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu)
if (irq->config == VGIC_CONFIG_LEVEL && !(val & ICH_LR_STATE))
irq->pending_latch = false;
- /*
- * Level-triggered mapped IRQs are special because we only
- * observe rising edges as input to the VGIC.
- *
- * If the guest never acked the interrupt we have to sample
- * the physical line and set the line level, because the
- * device state could have changed or we simply need to
- * process the still pending interrupt later.
- *
- * If this causes us to lower the level, we have to also clear
- * the physical active state, since we will otherwise never be
- * told when the interrupt becomes asserted again.
- *
- * Another case is when the interrupt requires a helping hand
- * on deactivation (no HW deactivation, for example).
- */
- if (vgic_irq_is_mapped_level(irq)) {
- bool resample = false;
-
- if (val & ICH_LR_PENDING_BIT) {
- irq->line_level = vgic_get_phys_line_level(irq);
- resample = !irq->line_level;
- } else if (vgic_irq_needs_resampling(irq) &&
- !(irq->active || irq->pending_latch)) {
- resample = true;
- }
-
- if (resample)
- vgic_irq_set_phys_active(irq, false);
- }
+ /* Handle resampling for mapped interrupts if required */
+ vgic_irq_handle_resampling(irq, deactivated, val & ICH_LR_PENDING_BIT);
raw_spin_unlock(&irq->irq_lock);
vgic_put_irq(vcpu->kvm, irq);
diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c
index 81cec508d413..5dad4996cfb2 100644
--- a/arch/arm64/kvm/vgic/vgic.c
+++ b/arch/arm64/kvm/vgic/vgic.c
@@ -1021,3 +1021,41 @@ bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, unsigned int vintid)
return map_is_active;
}
+
+/*
+ * Level-triggered mapped IRQs are special because we only observe rising
+ * edges as input to the VGIC.
+ *
+ * If the guest never acked the interrupt we have to sample the physical
+ * line and set the line level, because the device state could have changed
+ * or we simply need to process the still pending interrupt later.
+ *
+ * We could also have entered the guest with the interrupt active+pending.
+ * On the next exit, we need to re-evaluate the pending state, as it could
+ * otherwise result in a spurious interrupt by injecting a now potentially
+ * stale pending state.
+ *
+ * If this causes us to lower the level, we have to also clear the physical
+ * active state, since we will otherwise never be told when the interrupt
+ * becomes asserted again.
+ *
+ * Another case is when the interrupt requires a helping hand on
+ * deactivation (no HW deactivation, for example).
+ */
+void vgic_irq_handle_resampling(struct vgic_irq *irq,
+ bool lr_deactivated, bool lr_pending)
+{
+ if (vgic_irq_is_mapped_level(irq)) {
+ bool resample = false;
+
+ if (unlikely(vgic_irq_needs_resampling(irq))) {
+ resample = !(irq->active || irq->pending_latch);
+ } else if (lr_pending || (lr_deactivated && irq->line_level)) {
+ irq->line_level = vgic_get_phys_line_level(irq);
+ resample = !irq->line_level;
+ }
+
+ if (resample)
+ vgic_irq_set_phys_active(irq, false);
+ }
+}
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index dc1f3d1657ee..14a9218641f5 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -169,6 +169,8 @@ void vgic_irq_set_phys_active(struct vgic_irq *irq, bool active);
bool vgic_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq,
unsigned long flags);
void vgic_kick_vcpus(struct kvm *kvm);
+void vgic_irq_handle_resampling(struct vgic_irq *irq,
+ bool lr_deactivated, bool lr_pending);
int vgic_check_ioaddr(struct kvm *kvm, phys_addr_t *ioaddr,
phys_addr_t addr, phys_addr_t alignment);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3134cc8beb69d0db9de651081707c4651c011621 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Thu, 19 Aug 2021 19:03:05 +0100
Subject: [PATCH] KVM: arm64: vgic: Resample HW pending state on deactivation
When a mapped level interrupt (a timer, for example) is deactivated
by the guest, the corresponding host interrupt is equally deactivated.
However, the fate of the pending state still needs to be dealt
with in SW.
This is especially true when the interrupt was in the active+pending
state in the virtual distributor at the point where the guest
was entered. On exit, the pending state is potentially stale
(the guest may have put the interrupt in a non-pending state).
If we don't do anything, the interrupt will be spuriously injected
in the guest. Although this shouldn't have any ill effect (spurious
interrupts are always possible), we can improve the emulation by
detecting the deactivation-while-pending case and resample the
interrupt.
While we're at it, move the logic into a common helper that can
be shared between the two GIC implementations.
Fixes: e40cc57bac79 ("KVM: arm/arm64: vgic: Support level-triggered mapped interrupts")
Reported-by: Raghavendra Rao Ananta <rananta(a)google.com>
Tested-by: Raghavendra Rao Ananta <rananta(a)google.com>
Reviewed-by: Oliver Upton <oupton(a)google.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20210819180305.1670525-1-maz@kernel.org
diff --git a/arch/arm64/kvm/vgic/vgic-v2.c b/arch/arm64/kvm/vgic/vgic-v2.c
index 2c580204f1dc..95a18cec14a3 100644
--- a/arch/arm64/kvm/vgic/vgic-v2.c
+++ b/arch/arm64/kvm/vgic/vgic-v2.c
@@ -60,6 +60,7 @@ void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu)
u32 val = cpuif->vgic_lr[lr];
u32 cpuid, intid = val & GICH_LR_VIRTUALID;
struct vgic_irq *irq;
+ bool deactivated;
/* Extract the source vCPU id from the LR */
cpuid = val & GICH_LR_PHYSID_CPUID;
@@ -75,7 +76,8 @@ void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu)
raw_spin_lock(&irq->irq_lock);
- /* Always preserve the active bit */
+ /* Always preserve the active bit, note deactivation */
+ deactivated = irq->active && !(val & GICH_LR_ACTIVE_BIT);
irq->active = !!(val & GICH_LR_ACTIVE_BIT);
if (irq->active && vgic_irq_is_sgi(intid))
@@ -96,36 +98,8 @@ void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu)
if (irq->config == VGIC_CONFIG_LEVEL && !(val & GICH_LR_STATE))
irq->pending_latch = false;
- /*
- * Level-triggered mapped IRQs are special because we only
- * observe rising edges as input to the VGIC.
- *
- * If the guest never acked the interrupt we have to sample
- * the physical line and set the line level, because the
- * device state could have changed or we simply need to
- * process the still pending interrupt later.
- *
- * If this causes us to lower the level, we have to also clear
- * the physical active state, since we will otherwise never be
- * told when the interrupt becomes asserted again.
- *
- * Another case is when the interrupt requires a helping hand
- * on deactivation (no HW deactivation, for example).
- */
- if (vgic_irq_is_mapped_level(irq)) {
- bool resample = false;
-
- if (val & GICH_LR_PENDING_BIT) {
- irq->line_level = vgic_get_phys_line_level(irq);
- resample = !irq->line_level;
- } else if (vgic_irq_needs_resampling(irq) &&
- !(irq->active || irq->pending_latch)) {
- resample = true;
- }
-
- if (resample)
- vgic_irq_set_phys_active(irq, false);
- }
+ /* Handle resampling for mapped interrupts if required */
+ vgic_irq_handle_resampling(irq, deactivated, val & GICH_LR_PENDING_BIT);
raw_spin_unlock(&irq->irq_lock);
vgic_put_irq(vcpu->kvm, irq);
diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
index 66004f61cd83..21a6207fb2ee 100644
--- a/arch/arm64/kvm/vgic/vgic-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-v3.c
@@ -46,6 +46,7 @@ void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu)
u32 intid, cpuid;
struct vgic_irq *irq;
bool is_v2_sgi = false;
+ bool deactivated;
cpuid = val & GICH_LR_PHYSID_CPUID;
cpuid >>= GICH_LR_PHYSID_CPUID_SHIFT;
@@ -68,7 +69,8 @@ void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu)
raw_spin_lock(&irq->irq_lock);
- /* Always preserve the active bit */
+ /* Always preserve the active bit, note deactivation */
+ deactivated = irq->active && !(val & ICH_LR_ACTIVE_BIT);
irq->active = !!(val & ICH_LR_ACTIVE_BIT);
if (irq->active && is_v2_sgi)
@@ -89,36 +91,8 @@ void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu)
if (irq->config == VGIC_CONFIG_LEVEL && !(val & ICH_LR_STATE))
irq->pending_latch = false;
- /*
- * Level-triggered mapped IRQs are special because we only
- * observe rising edges as input to the VGIC.
- *
- * If the guest never acked the interrupt we have to sample
- * the physical line and set the line level, because the
- * device state could have changed or we simply need to
- * process the still pending interrupt later.
- *
- * If this causes us to lower the level, we have to also clear
- * the physical active state, since we will otherwise never be
- * told when the interrupt becomes asserted again.
- *
- * Another case is when the interrupt requires a helping hand
- * on deactivation (no HW deactivation, for example).
- */
- if (vgic_irq_is_mapped_level(irq)) {
- bool resample = false;
-
- if (val & ICH_LR_PENDING_BIT) {
- irq->line_level = vgic_get_phys_line_level(irq);
- resample = !irq->line_level;
- } else if (vgic_irq_needs_resampling(irq) &&
- !(irq->active || irq->pending_latch)) {
- resample = true;
- }
-
- if (resample)
- vgic_irq_set_phys_active(irq, false);
- }
+ /* Handle resampling for mapped interrupts if required */
+ vgic_irq_handle_resampling(irq, deactivated, val & ICH_LR_PENDING_BIT);
raw_spin_unlock(&irq->irq_lock);
vgic_put_irq(vcpu->kvm, irq);
diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c
index 81cec508d413..5dad4996cfb2 100644
--- a/arch/arm64/kvm/vgic/vgic.c
+++ b/arch/arm64/kvm/vgic/vgic.c
@@ -1021,3 +1021,41 @@ bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, unsigned int vintid)
return map_is_active;
}
+
+/*
+ * Level-triggered mapped IRQs are special because we only observe rising
+ * edges as input to the VGIC.
+ *
+ * If the guest never acked the interrupt we have to sample the physical
+ * line and set the line level, because the device state could have changed
+ * or we simply need to process the still pending interrupt later.
+ *
+ * We could also have entered the guest with the interrupt active+pending.
+ * On the next exit, we need to re-evaluate the pending state, as it could
+ * otherwise result in a spurious interrupt by injecting a now potentially
+ * stale pending state.
+ *
+ * If this causes us to lower the level, we have to also clear the physical
+ * active state, since we will otherwise never be told when the interrupt
+ * becomes asserted again.
+ *
+ * Another case is when the interrupt requires a helping hand on
+ * deactivation (no HW deactivation, for example).
+ */
+void vgic_irq_handle_resampling(struct vgic_irq *irq,
+ bool lr_deactivated, bool lr_pending)
+{
+ if (vgic_irq_is_mapped_level(irq)) {
+ bool resample = false;
+
+ if (unlikely(vgic_irq_needs_resampling(irq))) {
+ resample = !(irq->active || irq->pending_latch);
+ } else if (lr_pending || (lr_deactivated && irq->line_level)) {
+ irq->line_level = vgic_get_phys_line_level(irq);
+ resample = !irq->line_level;
+ }
+
+ if (resample)
+ vgic_irq_set_phys_active(irq, false);
+ }
+}
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index dc1f3d1657ee..14a9218641f5 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -169,6 +169,8 @@ void vgic_irq_set_phys_active(struct vgic_irq *irq, bool active);
bool vgic_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq,
unsigned long flags);
void vgic_kick_vcpus(struct kvm *kvm);
+void vgic_irq_handle_resampling(struct vgic_irq *irq,
+ bool lr_deactivated, bool lr_pending);
int vgic_check_ioaddr(struct kvm *kvm, phys_addr_t *ioaddr,
phys_addr_t addr, phys_addr_t alignment);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3134cc8beb69d0db9de651081707c4651c011621 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Thu, 19 Aug 2021 19:03:05 +0100
Subject: [PATCH] KVM: arm64: vgic: Resample HW pending state on deactivation
When a mapped level interrupt (a timer, for example) is deactivated
by the guest, the corresponding host interrupt is equally deactivated.
However, the fate of the pending state still needs to be dealt
with in SW.
This is especially true when the interrupt was in the active+pending
state in the virtual distributor at the point where the guest
was entered. On exit, the pending state is potentially stale
(the guest may have put the interrupt in a non-pending state).
If we don't do anything, the interrupt will be spuriously injected
in the guest. Although this shouldn't have any ill effect (spurious
interrupts are always possible), we can improve the emulation by
detecting the deactivation-while-pending case and resampling the
interrupt.
While we're at it, move the logic into a common helper that can
be shared between the two GIC implementations.
Fixes: e40cc57bac79 ("KVM: arm/arm64: vgic: Support level-triggered mapped interrupts")
Reported-by: Raghavendra Rao Ananta <rananta(a)google.com>
Tested-by: Raghavendra Rao Ananta <rananta(a)google.com>
Reviewed-by: Oliver Upton <oupton(a)google.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20210819180305.1670525-1-maz@kernel.org
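To make the deactivation-while-pending case above concrete, here is a minimal,
self-contained sketch of the decision the patch adds; the struct, field and
function names are invented for illustration only, and it omits the
no-HW-deactivation case handled by vgic_irq_needs_resampling():
#include <stdbool.h>
#include <stdio.h>
struct fake_irq {
	bool active;        /* software view of the active state */
	bool line_level;    /* last sampled physical line level  */
	bool phys_line;     /* what the device currently drives  */
};
static bool fold_lr(struct fake_irq *irq, bool lr_active, bool lr_pending)
{
	bool deactivated = irq->active && !lr_active;  /* falling edge */
	bool resample = false;
	irq->active = lr_active;
	if (lr_pending || (deactivated && irq->line_level)) {
		irq->line_level = irq->phys_line;  /* re-sample the line */
		resample = !irq->line_level;
	}
	return resample;  /* caller would clear the physical active state */
}
int main(void)
{
	/* Entered the guest active+pending, the device then dropped the
	 * line, and the guest deactivated without leaving it pending. */
	struct fake_irq irq = { .active = true, .line_level = true,
				.phys_line = false };
	printf("resample = %d\n", fold_lr(&irq, false, false));
	return 0;
}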
diff --git a/arch/arm64/kvm/vgic/vgic-v2.c b/arch/arm64/kvm/vgic/vgic-v2.c
index 2c580204f1dc..95a18cec14a3 100644
--- a/arch/arm64/kvm/vgic/vgic-v2.c
+++ b/arch/arm64/kvm/vgic/vgic-v2.c
@@ -60,6 +60,7 @@ void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu)
u32 val = cpuif->vgic_lr[lr];
u32 cpuid, intid = val & GICH_LR_VIRTUALID;
struct vgic_irq *irq;
+ bool deactivated;
/* Extract the source vCPU id from the LR */
cpuid = val & GICH_LR_PHYSID_CPUID;
@@ -75,7 +76,8 @@ void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu)
raw_spin_lock(&irq->irq_lock);
- /* Always preserve the active bit */
+ /* Always preserve the active bit, note deactivation */
+ deactivated = irq->active && !(val & GICH_LR_ACTIVE_BIT);
irq->active = !!(val & GICH_LR_ACTIVE_BIT);
if (irq->active && vgic_irq_is_sgi(intid))
@@ -96,36 +98,8 @@ void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu)
if (irq->config == VGIC_CONFIG_LEVEL && !(val & GICH_LR_STATE))
irq->pending_latch = false;
- /*
- * Level-triggered mapped IRQs are special because we only
- * observe rising edges as input to the VGIC.
- *
- * If the guest never acked the interrupt we have to sample
- * the physical line and set the line level, because the
- * device state could have changed or we simply need to
- * process the still pending interrupt later.
- *
- * If this causes us to lower the level, we have to also clear
- * the physical active state, since we will otherwise never be
- * told when the interrupt becomes asserted again.
- *
- * Another case is when the interrupt requires a helping hand
- * on deactivation (no HW deactivation, for example).
- */
- if (vgic_irq_is_mapped_level(irq)) {
- bool resample = false;
-
- if (val & GICH_LR_PENDING_BIT) {
- irq->line_level = vgic_get_phys_line_level(irq);
- resample = !irq->line_level;
- } else if (vgic_irq_needs_resampling(irq) &&
- !(irq->active || irq->pending_latch)) {
- resample = true;
- }
-
- if (resample)
- vgic_irq_set_phys_active(irq, false);
- }
+ /* Handle resampling for mapped interrupts if required */
+ vgic_irq_handle_resampling(irq, deactivated, val & GICH_LR_PENDING_BIT);
raw_spin_unlock(&irq->irq_lock);
vgic_put_irq(vcpu->kvm, irq);
diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
index 66004f61cd83..21a6207fb2ee 100644
--- a/arch/arm64/kvm/vgic/vgic-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-v3.c
@@ -46,6 +46,7 @@ void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu)
u32 intid, cpuid;
struct vgic_irq *irq;
bool is_v2_sgi = false;
+ bool deactivated;
cpuid = val & GICH_LR_PHYSID_CPUID;
cpuid >>= GICH_LR_PHYSID_CPUID_SHIFT;
@@ -68,7 +69,8 @@ void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu)
raw_spin_lock(&irq->irq_lock);
- /* Always preserve the active bit */
+ /* Always preserve the active bit, note deactivation */
+ deactivated = irq->active && !(val & ICH_LR_ACTIVE_BIT);
irq->active = !!(val & ICH_LR_ACTIVE_BIT);
if (irq->active && is_v2_sgi)
@@ -89,36 +91,8 @@ void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu)
if (irq->config == VGIC_CONFIG_LEVEL && !(val & ICH_LR_STATE))
irq->pending_latch = false;
- /*
- * Level-triggered mapped IRQs are special because we only
- * observe rising edges as input to the VGIC.
- *
- * If the guest never acked the interrupt we have to sample
- * the physical line and set the line level, because the
- * device state could have changed or we simply need to
- * process the still pending interrupt later.
- *
- * If this causes us to lower the level, we have to also clear
- * the physical active state, since we will otherwise never be
- * told when the interrupt becomes asserted again.
- *
- * Another case is when the interrupt requires a helping hand
- * on deactivation (no HW deactivation, for example).
- */
- if (vgic_irq_is_mapped_level(irq)) {
- bool resample = false;
-
- if (val & ICH_LR_PENDING_BIT) {
- irq->line_level = vgic_get_phys_line_level(irq);
- resample = !irq->line_level;
- } else if (vgic_irq_needs_resampling(irq) &&
- !(irq->active || irq->pending_latch)) {
- resample = true;
- }
-
- if (resample)
- vgic_irq_set_phys_active(irq, false);
- }
+ /* Handle resampling for mapped interrupts if required */
+ vgic_irq_handle_resampling(irq, deactivated, val & ICH_LR_PENDING_BIT);
raw_spin_unlock(&irq->irq_lock);
vgic_put_irq(vcpu->kvm, irq);
diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c
index 81cec508d413..5dad4996cfb2 100644
--- a/arch/arm64/kvm/vgic/vgic.c
+++ b/arch/arm64/kvm/vgic/vgic.c
@@ -1021,3 +1021,41 @@ bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, unsigned int vintid)
return map_is_active;
}
+
+/*
+ * Level-triggered mapped IRQs are special because we only observe rising
+ * edges as input to the VGIC.
+ *
+ * If the guest never acked the interrupt we have to sample the physical
+ * line and set the line level, because the device state could have changed
+ * or we simply need to process the still pending interrupt later.
+ *
+ * We could also have entered the guest with the interrupt active+pending.
+ * On the next exit, we need to re-evaluate the pending state, as it could
+ * otherwise result in a spurious interrupt by injecting a now potentially
+ * stale pending state.
+ *
+ * If this causes us to lower the level, we have to also clear the physical
+ * active state, since we will otherwise never be told when the interrupt
+ * becomes asserted again.
+ *
+ * Another case is when the interrupt requires a helping hand on
+ * deactivation (no HW deactivation, for example).
+ */
+void vgic_irq_handle_resampling(struct vgic_irq *irq,
+ bool lr_deactivated, bool lr_pending)
+{
+ if (vgic_irq_is_mapped_level(irq)) {
+ bool resample = false;
+
+ if (unlikely(vgic_irq_needs_resampling(irq))) {
+ resample = !(irq->active || irq->pending_latch);
+ } else if (lr_pending || (lr_deactivated && irq->line_level)) {
+ irq->line_level = vgic_get_phys_line_level(irq);
+ resample = !irq->line_level;
+ }
+
+ if (resample)
+ vgic_irq_set_phys_active(irq, false);
+ }
+}
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index dc1f3d1657ee..14a9218641f5 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -169,6 +169,8 @@ void vgic_irq_set_phys_active(struct vgic_irq *irq, bool active);
bool vgic_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq,
unsigned long flags);
void vgic_kick_vcpus(struct kvm *kvm);
+void vgic_irq_handle_resampling(struct vgic_irq *irq,
+ bool lr_deactivated, bool lr_pending);
int vgic_check_ioaddr(struct kvm *kvm, phys_addr_t *ioaddr,
phys_addr_t addr, phys_addr_t alignment);
The patch below does not apply to the 5.13-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3134cc8beb69d0db9de651081707c4651c011621 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Thu, 19 Aug 2021 19:03:05 +0100
Subject: [PATCH] KVM: arm64: vgic: Resample HW pending state on deactivation
When a mapped level interrupt (a timer, for example) is deactivated
by the guest, the corresponding host interrupt is equally deactivated.
However, the fate of the pending state still needs to be dealt
with in SW.
This is especially true when the interrupt was in the active+pending
state in the virtual distributor at the point where the guest
was entered. On exit, the pending state is potentially stale
(the guest may have put the interrupt in a non-pending state).
If we don't do anything, the interrupt will be spuriously injected
in the guest. Although this shouldn't have any ill effect (spurious
interrupts are always possible), we can improve the emulation by
detecting the deactivation-while-pending case and resampling the
interrupt.
While we're at it, move the logic into a common helper that can
be shared between the two GIC implementations.
Fixes: e40cc57bac79 ("KVM: arm/arm64: vgic: Support level-triggered mapped interrupts")
Reported-by: Raghavendra Rao Ananta <rananta(a)google.com>
Tested-by: Raghavendra Rao Ananta <rananta(a)google.com>
Reviewed-by: Oliver Upton <oupton(a)google.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20210819180305.1670525-1-maz@kernel.org
diff --git a/arch/arm64/kvm/vgic/vgic-v2.c b/arch/arm64/kvm/vgic/vgic-v2.c
index 2c580204f1dc..95a18cec14a3 100644
--- a/arch/arm64/kvm/vgic/vgic-v2.c
+++ b/arch/arm64/kvm/vgic/vgic-v2.c
@@ -60,6 +60,7 @@ void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu)
u32 val = cpuif->vgic_lr[lr];
u32 cpuid, intid = val & GICH_LR_VIRTUALID;
struct vgic_irq *irq;
+ bool deactivated;
/* Extract the source vCPU id from the LR */
cpuid = val & GICH_LR_PHYSID_CPUID;
@@ -75,7 +76,8 @@ void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu)
raw_spin_lock(&irq->irq_lock);
- /* Always preserve the active bit */
+ /* Always preserve the active bit, note deactivation */
+ deactivated = irq->active && !(val & GICH_LR_ACTIVE_BIT);
irq->active = !!(val & GICH_LR_ACTIVE_BIT);
if (irq->active && vgic_irq_is_sgi(intid))
@@ -96,36 +98,8 @@ void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu)
if (irq->config == VGIC_CONFIG_LEVEL && !(val & GICH_LR_STATE))
irq->pending_latch = false;
- /*
- * Level-triggered mapped IRQs are special because we only
- * observe rising edges as input to the VGIC.
- *
- * If the guest never acked the interrupt we have to sample
- * the physical line and set the line level, because the
- * device state could have changed or we simply need to
- * process the still pending interrupt later.
- *
- * If this causes us to lower the level, we have to also clear
- * the physical active state, since we will otherwise never be
- * told when the interrupt becomes asserted again.
- *
- * Another case is when the interrupt requires a helping hand
- * on deactivation (no HW deactivation, for example).
- */
- if (vgic_irq_is_mapped_level(irq)) {
- bool resample = false;
-
- if (val & GICH_LR_PENDING_BIT) {
- irq->line_level = vgic_get_phys_line_level(irq);
- resample = !irq->line_level;
- } else if (vgic_irq_needs_resampling(irq) &&
- !(irq->active || irq->pending_latch)) {
- resample = true;
- }
-
- if (resample)
- vgic_irq_set_phys_active(irq, false);
- }
+ /* Handle resampling for mapped interrupts if required */
+ vgic_irq_handle_resampling(irq, deactivated, val & GICH_LR_PENDING_BIT);
raw_spin_unlock(&irq->irq_lock);
vgic_put_irq(vcpu->kvm, irq);
diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
index 66004f61cd83..21a6207fb2ee 100644
--- a/arch/arm64/kvm/vgic/vgic-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-v3.c
@@ -46,6 +46,7 @@ void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu)
u32 intid, cpuid;
struct vgic_irq *irq;
bool is_v2_sgi = false;
+ bool deactivated;
cpuid = val & GICH_LR_PHYSID_CPUID;
cpuid >>= GICH_LR_PHYSID_CPUID_SHIFT;
@@ -68,7 +69,8 @@ void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu)
raw_spin_lock(&irq->irq_lock);
- /* Always preserve the active bit */
+ /* Always preserve the active bit, note deactivation */
+ deactivated = irq->active && !(val & ICH_LR_ACTIVE_BIT);
irq->active = !!(val & ICH_LR_ACTIVE_BIT);
if (irq->active && is_v2_sgi)
@@ -89,36 +91,8 @@ void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu)
if (irq->config == VGIC_CONFIG_LEVEL && !(val & ICH_LR_STATE))
irq->pending_latch = false;
- /*
- * Level-triggered mapped IRQs are special because we only
- * observe rising edges as input to the VGIC.
- *
- * If the guest never acked the interrupt we have to sample
- * the physical line and set the line level, because the
- * device state could have changed or we simply need to
- * process the still pending interrupt later.
- *
- * If this causes us to lower the level, we have to also clear
- * the physical active state, since we will otherwise never be
- * told when the interrupt becomes asserted again.
- *
- * Another case is when the interrupt requires a helping hand
- * on deactivation (no HW deactivation, for example).
- */
- if (vgic_irq_is_mapped_level(irq)) {
- bool resample = false;
-
- if (val & ICH_LR_PENDING_BIT) {
- irq->line_level = vgic_get_phys_line_level(irq);
- resample = !irq->line_level;
- } else if (vgic_irq_needs_resampling(irq) &&
- !(irq->active || irq->pending_latch)) {
- resample = true;
- }
-
- if (resample)
- vgic_irq_set_phys_active(irq, false);
- }
+ /* Handle resampling for mapped interrupts if required */
+ vgic_irq_handle_resampling(irq, deactivated, val & ICH_LR_PENDING_BIT);
raw_spin_unlock(&irq->irq_lock);
vgic_put_irq(vcpu->kvm, irq);
diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c
index 81cec508d413..5dad4996cfb2 100644
--- a/arch/arm64/kvm/vgic/vgic.c
+++ b/arch/arm64/kvm/vgic/vgic.c
@@ -1021,3 +1021,41 @@ bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, unsigned int vintid)
return map_is_active;
}
+
+/*
+ * Level-triggered mapped IRQs are special because we only observe rising
+ * edges as input to the VGIC.
+ *
+ * If the guest never acked the interrupt we have to sample the physical
+ * line and set the line level, because the device state could have changed
+ * or we simply need to process the still pending interrupt later.
+ *
+ * We could also have entered the guest with the interrupt active+pending.
+ * On the next exit, we need to re-evaluate the pending state, as it could
+ * otherwise result in a spurious interrupt by injecting a now potentially
+ * stale pending state.
+ *
+ * If this causes us to lower the level, we have to also clear the physical
+ * active state, since we will otherwise never be told when the interrupt
+ * becomes asserted again.
+ *
+ * Another case is when the interrupt requires a helping hand on
+ * deactivation (no HW deactivation, for example).
+ */
+void vgic_irq_handle_resampling(struct vgic_irq *irq,
+ bool lr_deactivated, bool lr_pending)
+{
+ if (vgic_irq_is_mapped_level(irq)) {
+ bool resample = false;
+
+ if (unlikely(vgic_irq_needs_resampling(irq))) {
+ resample = !(irq->active || irq->pending_latch);
+ } else if (lr_pending || (lr_deactivated && irq->line_level)) {
+ irq->line_level = vgic_get_phys_line_level(irq);
+ resample = !irq->line_level;
+ }
+
+ if (resample)
+ vgic_irq_set_phys_active(irq, false);
+ }
+}
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index dc1f3d1657ee..14a9218641f5 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -169,6 +169,8 @@ void vgic_irq_set_phys_active(struct vgic_irq *irq, bool active);
bool vgic_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq,
unsigned long flags);
void vgic_kick_vcpus(struct kvm *kvm);
+void vgic_irq_handle_resampling(struct vgic_irq *irq,
+ bool lr_deactivated, bool lr_pending);
int vgic_check_ioaddr(struct kvm *kvm, phys_addr_t *ioaddr,
phys_addr_t addr, phys_addr_t alignment);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f7782bb8d818d8f47c26b22079db10599922787a Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Tue, 10 Aug 2021 07:45:26 -0700
Subject: [PATCH] KVM: nVMX: Unconditionally clear nested.pi_pending on nested
VM-Enter
Clear nested.pi_pending on nested VM-Enter even if L2 will run without
posted interrupts enabled. If nested.pi_pending is left set from a
previous L2, vmx_complete_nested_posted_interrupt() will pick up the
stale flag and exit to userspace with an "internal emulation error" due to
the new L2 not having a valid nested.pi_desc.
Arguably, vmx_complete_nested_posted_interrupt() should first check for
posted interrupts being enabled, but it's also completely reasonable that
KVM wouldn't screw up a fundamental flag. Not to mention that the mere
existence of nested.pi_pending is a long-standing bug as KVM shouldn't
move the posted interrupt out of the IRR until it's actually processed,
e.g. KVM effectively drops an interrupt when it performs a nested VM-Exit
with a "pending" posted interrupt. Fixing the mess is a future problem.
Prior to vmx_complete_nested_posted_interrupt() interpreting a null PI
descriptor as an error, this was a benign bug as the null PI descriptor
effectively served as a check on PI not being enabled. Even then, the
new flow did not become problematic until KVM started checking the result
of kvm_check_nested_events().
Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing")
Fixes: 966eefb89657 ("KVM: nVMX: Disable vmcs02 posted interrupts if vmcs12 PID isn't mappable")
Fixes: 47d3530f86c0 ("KVM: x86: Exit to userspace when kvm_check_nested_events fails")
Cc: stable(a)vger.kernel.org
Cc: Jim Mattson <jmattson(a)google.com>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-Id: <20210810144526.2662272-1-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
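The shape of the fix is simply hoisting the reset out of the conditional so it
happens on every nested VM-Enter. A stand-alone sketch with placeholder names
(not KVM's actual structures), assuming only the control flow described above:
#include <stdbool.h>
#include <stdio.h>
struct nested_state {
	bool pi_pending;            /* may be stale from a previous L2 */
	unsigned int posted_intr_nv;
};
/* Buggy shape: the flag is only reset when posted interrupts are
 * enabled for the new L2, so a stale value survives otherwise. */
static void prepare_buggy(struct nested_state *n, bool has_pi, unsigned int nv)
{
	if (has_pi) {
		n->posted_intr_nv = nv;
		n->pi_pending = false;
	}
}
/* Fixed shape: clear unconditionally, then apply the vmcs12 setting. */
static void prepare_fixed(struct nested_state *n, bool has_pi, unsigned int nv)
{
	n->pi_pending = false;
	if (has_pi)
		n->posted_intr_nv = nv;
}
int main(void)
{
	struct nested_state n = { .pi_pending = true };  /* left over from an old L2 */
	prepare_buggy(&n, false, 0);
	printf("buggy: pi_pending = %d\n", n.pi_pending);   /* 1: stale   */
	prepare_fixed(&n, false, 0);
	printf("fixed: pi_pending = %d\n", n.pi_pending);   /* 0: cleared */
	return 0;
}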
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 264a9f4c9179..bc6327950657 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2187,12 +2187,11 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
~PIN_BASED_VMX_PREEMPTION_TIMER);
/* Posted interrupts setting is only taken from vmcs12. */
- if (nested_cpu_has_posted_intr(vmcs12)) {
+ vmx->nested.pi_pending = false;
+ if (nested_cpu_has_posted_intr(vmcs12))
vmx->nested.posted_intr_nv = vmcs12->posted_intr_nv;
- vmx->nested.pi_pending = false;
- } else {
+ else
exec_control &= ~PIN_BASED_POSTED_INTR;
- }
pin_controls_set(vmx, exec_control);
/*
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f7782bb8d818d8f47c26b22079db10599922787a Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Tue, 10 Aug 2021 07:45:26 -0700
Subject: [PATCH] KVM: nVMX: Unconditionally clear nested.pi_pending on nested
VM-Enter
Clear nested.pi_pending on nested VM-Enter even if L2 will run without
posted interrupts enabled. If nested.pi_pending is left set from a
previous L2, vmx_complete_nested_posted_interrupt() will pick up the
stale flag and exit to userspace with an "internal emulation error" due to
the new L2 not having a valid nested.pi_desc.
Arguably, vmx_complete_nested_posted_interrupt() should first check for
posted interrupts being enabled, but it's also completely reasonable that
KVM wouldn't screw up a fundamental flag. Not to mention that the mere
existence of nested.pi_pending is a long-standing bug as KVM shouldn't
move the posted interrupt out of the IRR until it's actually processed,
e.g. KVM effectively drops an interrupt when it performs a nested VM-Exit
with a "pending" posted interrupt. Fixing the mess is a future problem.
Prior to vmx_complete_nested_posted_interrupt() interpreting a null PI
descriptor as an error, this was a benign bug as the null PI descriptor
effectively served as a check on PI not being enabled. Even then, the
new flow did not become problematic until KVM started checking the result
of kvm_check_nested_events().
Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing")
Fixes: 966eefb89657 ("KVM: nVMX: Disable vmcs02 posted interrupts if vmcs12 PID isn't mappable")
Fixes: 47d3530f86c0 ("KVM: x86: Exit to userspace when kvm_check_nested_events fails")
Cc: stable(a)vger.kernel.org
Cc: Jim Mattson <jmattson(a)google.com>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-Id: <20210810144526.2662272-1-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 264a9f4c9179..bc6327950657 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2187,12 +2187,11 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
~PIN_BASED_VMX_PREEMPTION_TIMER);
/* Posted interrupts setting is only taken from vmcs12. */
- if (nested_cpu_has_posted_intr(vmcs12)) {
+ vmx->nested.pi_pending = false;
+ if (nested_cpu_has_posted_intr(vmcs12))
vmx->nested.posted_intr_nv = vmcs12->posted_intr_nv;
- vmx->nested.pi_pending = false;
- } else {
+ else
exec_control &= ~PIN_BASED_POSTED_INTR;
- }
pin_controls_set(vmx, exec_control);
/*
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f7782bb8d818d8f47c26b22079db10599922787a Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Tue, 10 Aug 2021 07:45:26 -0700
Subject: [PATCH] KVM: nVMX: Unconditionally clear nested.pi_pending on nested
VM-Enter
Clear nested.pi_pending on nested VM-Enter even if L2 will run without
posted interrupts enabled. If nested.pi_pending is left set from a
previous L2, vmx_complete_nested_posted_interrupt() will pick up the
stale flag and exit to userspace with an "internal emulation error" due to
the new L2 not having a valid nested.pi_desc.
Arguably, vmx_complete_nested_posted_interrupt() should first check for
posted interrupts being enabled, but it's also completely reasonable that
KVM wouldn't screw up a fundamental flag. Not to mention that the mere
existence of nested.pi_pending is a long-standing bug as KVM shouldn't
move the posted interrupt out of the IRR until it's actually processed,
e.g. KVM effectively drops an interrupt when it performs a nested VM-Exit
with a "pending" posted interrupt. Fixing the mess is a future problem.
Prior to vmx_complete_nested_posted_interrupt() interpreting a null PI
descriptor as an error, this was a benign bug as the null PI descriptor
effectively served as a check on PI not being enabled. Even then, the
new flow did not become problematic until KVM started checking the result
of kvm_check_nested_events().
Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing")
Fixes: 966eefb89657 ("KVM: nVMX: Disable vmcs02 posted interrupts if vmcs12 PID isn't mappable")
Fixes: 47d3530f86c0 ("KVM: x86: Exit to userspace when kvm_check_nested_events fails")
Cc: stable(a)vger.kernel.org
Cc: Jim Mattson <jmattson(a)google.com>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-Id: <20210810144526.2662272-1-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 264a9f4c9179..bc6327950657 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2187,12 +2187,11 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
~PIN_BASED_VMX_PREEMPTION_TIMER);
/* Posted interrupts setting is only taken from vmcs12. */
- if (nested_cpu_has_posted_intr(vmcs12)) {
+ vmx->nested.pi_pending = false;
+ if (nested_cpu_has_posted_intr(vmcs12))
vmx->nested.posted_intr_nv = vmcs12->posted_intr_nv;
- vmx->nested.pi_pending = false;
- } else {
+ else
exec_control &= ~PIN_BASED_POSTED_INTR;
- }
pin_controls_set(vmx, exec_control);
/*
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f7782bb8d818d8f47c26b22079db10599922787a Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Tue, 10 Aug 2021 07:45:26 -0700
Subject: [PATCH] KVM: nVMX: Unconditionally clear nested.pi_pending on nested
VM-Enter
Clear nested.pi_pending on nested VM-Enter even if L2 will run without
posted interrupts enabled. If nested.pi_pending is left set from a
previous L2, vmx_complete_nested_posted_interrupt() will pick up the
stale flag and exit to userspace with an "internal emulation error" due to
the new L2 not having a valid nested.pi_desc.
Arguably, vmx_complete_nested_posted_interrupt() should first check for
posted interrupts being enabled, but it's also completely reasonable that
KVM wouldn't screw up a fundamental flag. Not to mention that the mere
existence of nested.pi_pending is a long-standing bug as KVM shouldn't
move the posted interrupt out of the IRR until it's actually processed,
e.g. KVM effectively drops an interrupt when it performs a nested VM-Exit
with a "pending" posted interrupt. Fixing the mess is a future problem.
Prior to vmx_complete_nested_posted_interrupt() interpreting a null PI
descriptor as an error, this was a benign bug as the null PI descriptor
effectively served as a check on PI not being enabled. Even then, the
new flow did not become problematic until KVM started checking the result
of kvm_check_nested_events().
Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing")
Fixes: 966eefb89657 ("KVM: nVMX: Disable vmcs02 posted interrupts if vmcs12 PID isn't mappable")
Fixes: 47d3530f86c0 ("KVM: x86: Exit to userspace when kvm_check_nested_events fails")
Cc: stable(a)vger.kernel.org
Cc: Jim Mattson <jmattson(a)google.com>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-Id: <20210810144526.2662272-1-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 264a9f4c9179..bc6327950657 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2187,12 +2187,11 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
~PIN_BASED_VMX_PREEMPTION_TIMER);
/* Posted interrupts setting is only taken from vmcs12. */
- if (nested_cpu_has_posted_intr(vmcs12)) {
+ vmx->nested.pi_pending = false;
+ if (nested_cpu_has_posted_intr(vmcs12))
vmx->nested.posted_intr_nv = vmcs12->posted_intr_nv;
- vmx->nested.pi_pending = false;
- } else {
+ else
exec_control &= ~PIN_BASED_POSTED_INTR;
- }
pin_controls_set(vmx, exec_control);
/*
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 81b4b56d4f8130bbb99cf4e2b48082e5b4cfccb9 Mon Sep 17 00:00:00 2001
From: Maxim Levitsky <mlevitsk(a)redhat.com>
Date: Thu, 26 Aug 2021 12:57:49 +0300
Subject: [PATCH] KVM: VMX: avoid running vmx_handle_exit_irqoff in case of
emulation
If we are emulating an invalid guest state, we don't have a correct
exit reason, and thus we shouldn't do anything in this function.
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
Message-Id: <20210826095750.1650467-2-mlevitsk(a)redhat.com>
Cc: stable(a)vger.kernel.org
Fixes: 95b5a48c4f2b ("KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fn", 2019-06-18)
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fada1055f325..0c2c0d5ae873 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6382,6 +6382,9 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
+ if (vmx->emulation_required)
+ return;
+
if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
handle_external_interrupt_irqoff(vcpu);
else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ec607a564f70519b340f7eb4cfc0f4a6b55285ac Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Fri, 6 Aug 2021 07:05:58 -0400
Subject: [PATCH] KVM: x86: clamp host mapping level to max_level in
kvm_mmu_max_mapping_level
This change started as a way to make kvm_mmu_hugepage_adjust a bit simpler,
but it does fix two bugs as well.
One bug is in zapping collapsible PTEs. If a large page size is
disallowed but not all of them, kvm_mmu_max_mapping_level will return the
host mapping level and the small PTEs will be zapped up to that level.
However, if e.g. 1GB pages are prohibited, we can still zap 4KB mappings and
preserve the 2MB ones. This can happen for example when NX huge pages
are in use.
The second bug would happen when userspace backs guest memory
with a 1GB hugepage but only assigns a subset of the page to
the guest. 1GB pages would be disallowed by the memslot, but
not 2MB. kvm_mmu_max_mapping_level() would fall through to the
host_pfn_mapping_level() logic, see the 1gb hugepage, and map the whole
thing into the guest.
Fixes: 2f57b7051fe8 ("KVM: x86/mmu: Persist gfn_lpage_is_disallowed() to max_level")
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
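As a worked example of the clamp this patch adds: a gfn backed by a 1GB page on
the host while the memslot only permits 2MB would, unclamped, come back as the
1GB host level; min() of the two gives the intended 2MB. A small stand-alone
sketch (the enum values and helper are simplified stand-ins, not KVM's real
definitions):
#include <stdio.h>
enum { PG_LEVEL_4K = 1, PG_LEVEL_2M = 2, PG_LEVEL_1G = 3 };
/* Hypothetical stand-in for the decision in kvm_mmu_max_mapping_level():
 * max_level is what the memslot (and e.g. the NX workaround) allows,
 * host_level is how the backing memory is mapped on the host. */
static int max_mapping_level(int max_level, int host_level)
{
	if (max_level == PG_LEVEL_4K)
		return PG_LEVEL_4K;
	return host_level < max_level ? host_level : max_level;  /* the clamp */
}
int main(void)
{
	/* Host backs the gfn with a 1GB page, but the memslot only allows
	 * 2MB: without the clamp the caller would see 1GB and map too much. */
	printf("level = %d (expect %d)\n",
	       max_mapping_level(PG_LEVEL_2M, PG_LEVEL_1G), PG_LEVEL_2M);
	return 0;
}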
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 54cb15e4b550..bfd2705a7291 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2910,6 +2910,7 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
kvm_pfn_t pfn, int max_level)
{
struct kvm_lpage_info *linfo;
+ int host_level;
max_level = min(max_level, max_huge_page_level);
for ( ; max_level > PG_LEVEL_4K; max_level--) {
@@ -2921,7 +2922,8 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;
- return host_pfn_mapping_level(kvm, gfn, pfn, slot);
+ host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
+ return min(host_level, max_level);
}
int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
@@ -2945,17 +2947,12 @@ int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
if (!slot)
return PG_LEVEL_4K;
- level = kvm_mmu_max_mapping_level(vcpu->kvm, slot, gfn, pfn, max_level);
- if (level == PG_LEVEL_4K)
- return level;
-
- *req_level = level = min(level, max_level);
-
/*
* Enforce the iTLB multihit workaround after capturing the requested
* level, which will be used to do precise, accurate accounting.
*/
- if (huge_page_disallowed)
+ *req_level = level = kvm_mmu_max_mapping_level(vcpu->kvm, slot, gfn, pfn, max_level);
+ if (level == PG_LEVEL_4K || huge_page_disallowed)
return PG_LEVEL_4K;
/*
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a3e03bc1368c1bc16e19b001fc96dc7430573cc8 Mon Sep 17 00:00:00 2001
From: Halil Pasic <pasic(a)linux.ibm.com>
Date: Fri, 27 Aug 2021 14:54:29 +0200
Subject: [PATCH] KVM: s390: index kvm->arch.idle_mask by vcpu_idx
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
While in practice vcpu->vcpu_idx == vcpu->vcpu_id is often true, it may
not always be, and we must not rely on this. Reason is that KVM decides
the vcpu_idx, userspace decides the vcpu_id, thus the two might not
match.
Currently kvm->arch.idle_mask is indexed by vcpu_id, which implies
that code like
for_each_set_bit(vcpu_id, kvm->arch.idle_mask, online_vcpus) {
vcpu = kvm_get_vcpu(kvm, vcpu_id);
do_stuff(vcpu);
}
is not legit. Reason is that kvm_get_vcpu expects a vcpu_idx, not a
vcpu_id. The trouble is, we do actually use kvm->arch.idle_mask like
this. To fix this problem we have two options. Either use
kvm_get_vcpu_by_id(vcpu_id), which would loop to find the right vcpu_id,
or switch to indexing via vcpu_idx. The latter is preferable for obvious
reasons.
Let us switch from indexing kvm->arch.idle_mask by vcpu_id to
indexing it by vcpu_idx. To keep gisa_int.kicked_mask indexed by the
same index as idle_mask, let us make the same change for it as well.
Fixes: 1ee0bc559dc3 ("KVM: s390: get rid of local_int array")
Signed-off-by: Halil Pasic <pasic(a)linux.ibm.com>
Reviewed-by: Christian Bornträger <borntraeger(a)de.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org> # 3.15+
Link: https://lore.kernel.org/r/20210827125429.1912577-1-pasic@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
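The underlying problem is an indexing mismatch: a bitmap keyed by one
identifier must not be fed into a lookup keyed by another. A minimal,
self-contained illustration with invented names (not the kernel's types):
#include <stdio.h>
struct fake_vcpu { int id; int idx; };
/* kvm_get_vcpu()-style lookup: the key is a position in the array,
 * i.e. an idx, not the userspace-chosen id. */
static struct fake_vcpu *get_by_idx(struct fake_vcpu *v, int n, int idx)
{
	return (idx >= 0 && idx < n) ? &v[idx] : NULL;
}
int main(void)
{
	/* Userspace chose ids 7 and 3; KVM assigned idx 0 and 1. */
	struct fake_vcpu vcpus[2] = { { .id = 7, .idx = 0 },
				      { .id = 3, .idx = 1 } };
	/* A bitmap keyed by id feeds the wrong key into an idx lookup. */
	struct fake_vcpu *by_id  = get_by_idx(vcpus, 2, vcpus[1].id);  /* key 3 */
	struct fake_vcpu *by_idx = get_by_idx(vcpus, 2, vcpus[1].idx); /* key 1 */
	printf("lookup by id : %s\n", by_id ? "some vcpu" : "out of range");
	printf("lookup by idx: %s\n",
	       by_idx && by_idx->id == 3 ? "correct vcpu" : "wrong vcpu");
	return 0;
}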
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 118d5450c523..611f18ecde91 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -963,6 +963,7 @@ struct kvm_arch{
atomic64_t cmma_dirty_pages;
/* subset of available cpu features enabled by user space */
DECLARE_BITMAP(cpu_feat, KVM_S390_VM_CPU_FEAT_NR_BITS);
+ /* indexed by vcpu_idx */
DECLARE_BITMAP(idle_mask, KVM_MAX_VCPUS);
struct kvm_s390_gisa_interrupt gisa_int;
struct kvm_s390_pv pv;
diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index d548d60caed2..16256e17a544 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -419,13 +419,13 @@ static unsigned long deliverable_irqs(struct kvm_vcpu *vcpu)
static void __set_cpu_idle(struct kvm_vcpu *vcpu)
{
kvm_s390_set_cpuflags(vcpu, CPUSTAT_WAIT);
- set_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask);
+ set_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.idle_mask);
}
static void __unset_cpu_idle(struct kvm_vcpu *vcpu)
{
kvm_s390_clear_cpuflags(vcpu, CPUSTAT_WAIT);
- clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask);
+ clear_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.idle_mask);
}
static void __reset_intercept_indicators(struct kvm_vcpu *vcpu)
@@ -3050,18 +3050,18 @@ int kvm_s390_get_irq_state(struct kvm_vcpu *vcpu, __u8 __user *buf, int len)
static void __airqs_kick_single_vcpu(struct kvm *kvm, u8 deliverable_mask)
{
- int vcpu_id, online_vcpus = atomic_read(&kvm->online_vcpus);
+ int vcpu_idx, online_vcpus = atomic_read(&kvm->online_vcpus);
struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int;
struct kvm_vcpu *vcpu;
- for_each_set_bit(vcpu_id, kvm->arch.idle_mask, online_vcpus) {
- vcpu = kvm_get_vcpu(kvm, vcpu_id);
+ for_each_set_bit(vcpu_idx, kvm->arch.idle_mask, online_vcpus) {
+ vcpu = kvm_get_vcpu(kvm, vcpu_idx);
if (psw_ioint_disabled(vcpu))
continue;
deliverable_mask &= (u8)(vcpu->arch.sie_block->gcr[6] >> 24);
if (deliverable_mask) {
/* lately kicked but not yet running */
- if (test_and_set_bit(vcpu_id, gi->kicked_mask))
+ if (test_and_set_bit(vcpu_idx, gi->kicked_mask))
return;
kvm_s390_vcpu_wakeup(vcpu);
return;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 5b45c83ced21..e144c8046ceb 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -4026,7 +4026,7 @@ static int vcpu_pre_run(struct kvm_vcpu *vcpu)
kvm_s390_patch_guest_per_regs(vcpu);
}
- clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.gisa_int.kicked_mask);
+ clear_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.gisa_int.kicked_mask);
vcpu->arch.sie_block->icptcode = 0;
cpuflags = atomic_read(&vcpu->arch.sie_block->cpuflags);
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 9fad25109b0d..ecd741ee3276 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -79,7 +79,7 @@ static inline int is_vcpu_stopped(struct kvm_vcpu *vcpu)
static inline int is_vcpu_idle(struct kvm_vcpu *vcpu)
{
- return test_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask);
+ return test_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.idle_mask);
}
static inline int kvm_is_ucontrol(struct kvm *kvm)
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a3e03bc1368c1bc16e19b001fc96dc7430573cc8 Mon Sep 17 00:00:00 2001
From: Halil Pasic <pasic(a)linux.ibm.com>
Date: Fri, 27 Aug 2021 14:54:29 +0200
Subject: [PATCH] KVM: s390: index kvm->arch.idle_mask by vcpu_idx
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
While in practice vcpu->vcpu_idx == vcpu->vcpu_id is often true, it may
not always be, and we must not rely on this. Reason is that KVM decides
the vcpu_idx, userspace decides the vcpu_id, thus the two might not
match.
Currently kvm->arch.idle_mask is indexed by vcpu_id, which implies
that code like
for_each_set_bit(vcpu_id, kvm->arch.idle_mask, online_vcpus) {
vcpu = kvm_get_vcpu(kvm, vcpu_id);
do_stuff(vcpu);
}
is not legit. Reason is that kvm_get_vcpu expects a vcpu_idx, not a
vcpu_id. The trouble is, we do actually use kvm->arch.idle_mask like
this. To fix this problem we have two options. Either use
kvm_get_vcpu_by_id(vcpu_id), which would loop to find the right vcpu_id,
or switch to indexing via vcpu_idx. The latter is preferable for obvious
reasons.
Let us switch from indexing kvm->arch.idle_mask by vcpu_id to
indexing it by vcpu_idx. To keep gisa_int.kicked_mask indexed by the
same index as idle_mask, let us make the same change for it as well.
Fixes: 1ee0bc559dc3 ("KVM: s390: get rid of local_int array")
Signed-off-by: Halil Pasic <pasic(a)linux.ibm.com>
Reviewed-by: Christian Bornträger <borntraeger(a)de.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org> # 3.15+
Link: https://lore.kernel.org/r/20210827125429.1912577-1-pasic@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 118d5450c523..611f18ecde91 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -963,6 +963,7 @@ struct kvm_arch{
atomic64_t cmma_dirty_pages;
/* subset of available cpu features enabled by user space */
DECLARE_BITMAP(cpu_feat, KVM_S390_VM_CPU_FEAT_NR_BITS);
+ /* indexed by vcpu_idx */
DECLARE_BITMAP(idle_mask, KVM_MAX_VCPUS);
struct kvm_s390_gisa_interrupt gisa_int;
struct kvm_s390_pv pv;
diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index d548d60caed2..16256e17a544 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -419,13 +419,13 @@ static unsigned long deliverable_irqs(struct kvm_vcpu *vcpu)
static void __set_cpu_idle(struct kvm_vcpu *vcpu)
{
kvm_s390_set_cpuflags(vcpu, CPUSTAT_WAIT);
- set_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask);
+ set_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.idle_mask);
}
static void __unset_cpu_idle(struct kvm_vcpu *vcpu)
{
kvm_s390_clear_cpuflags(vcpu, CPUSTAT_WAIT);
- clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask);
+ clear_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.idle_mask);
}
static void __reset_intercept_indicators(struct kvm_vcpu *vcpu)
@@ -3050,18 +3050,18 @@ int kvm_s390_get_irq_state(struct kvm_vcpu *vcpu, __u8 __user *buf, int len)
static void __airqs_kick_single_vcpu(struct kvm *kvm, u8 deliverable_mask)
{
- int vcpu_id, online_vcpus = atomic_read(&kvm->online_vcpus);
+ int vcpu_idx, online_vcpus = atomic_read(&kvm->online_vcpus);
struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int;
struct kvm_vcpu *vcpu;
- for_each_set_bit(vcpu_id, kvm->arch.idle_mask, online_vcpus) {
- vcpu = kvm_get_vcpu(kvm, vcpu_id);
+ for_each_set_bit(vcpu_idx, kvm->arch.idle_mask, online_vcpus) {
+ vcpu = kvm_get_vcpu(kvm, vcpu_idx);
if (psw_ioint_disabled(vcpu))
continue;
deliverable_mask &= (u8)(vcpu->arch.sie_block->gcr[6] >> 24);
if (deliverable_mask) {
/* lately kicked but not yet running */
- if (test_and_set_bit(vcpu_id, gi->kicked_mask))
+ if (test_and_set_bit(vcpu_idx, gi->kicked_mask))
return;
kvm_s390_vcpu_wakeup(vcpu);
return;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 5b45c83ced21..e144c8046ceb 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -4026,7 +4026,7 @@ static int vcpu_pre_run(struct kvm_vcpu *vcpu)
kvm_s390_patch_guest_per_regs(vcpu);
}
- clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.gisa_int.kicked_mask);
+ clear_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.gisa_int.kicked_mask);
vcpu->arch.sie_block->icptcode = 0;
cpuflags = atomic_read(&vcpu->arch.sie_block->cpuflags);
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 9fad25109b0d..ecd741ee3276 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -79,7 +79,7 @@ static inline int is_vcpu_stopped(struct kvm_vcpu *vcpu)
static inline int is_vcpu_idle(struct kvm_vcpu *vcpu)
{
- return test_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask);
+ return test_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.idle_mask);
}
static inline int kvm_is_ucontrol(struct kvm *kvm)
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a3e03bc1368c1bc16e19b001fc96dc7430573cc8 Mon Sep 17 00:00:00 2001
From: Halil Pasic <pasic(a)linux.ibm.com>
Date: Fri, 27 Aug 2021 14:54:29 +0200
Subject: [PATCH] KVM: s390: index kvm->arch.idle_mask by vcpu_idx
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
While in practice vcpu->vcpu_idx == vcpu->vcpu_id is often true, it may
not always be, and we must not rely on this. Reason is that KVM decides
the vcpu_idx, userspace decides the vcpu_id, thus the two might not
match.
Currently kvm->arch.idle_mask is indexed by vcpu_id, which implies
that code like
for_each_set_bit(vcpu_id, kvm->arch.idle_mask, online_vcpus) {
vcpu = kvm_get_vcpu(kvm, vcpu_id);
do_stuff(vcpu);
}
is not legit. Reason is that kvm_get_vcpu expects a vcpu_idx, not a
vcpu_id. The trouble is, we do actually use kvm->arch.idle_mask like
this. To fix this problem we have two options. Either use
kvm_get_vcpu_by_id(vcpu_id), which would loop to find the right vcpu_id,
or switch to indexing via vcpu_idx. The latter is preferable for obvious
reasons.
Let us switch from indexing kvm->arch.idle_mask by vcpu_id to
indexing it by vcpu_idx. To keep gisa_int.kicked_mask indexed by the
same index as idle_mask, let us make the same change for it as well.
Fixes: 1ee0bc559dc3 ("KVM: s390: get rid of local_int array")
Signed-off-by: Halil Pasic <pasic(a)linux.ibm.com>
Reviewed-by: Christian Bornträger <borntraeger(a)de.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org> # 3.15+
Link: https://lore.kernel.org/r/20210827125429.1912577-1-pasic@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 118d5450c523..611f18ecde91 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -963,6 +963,7 @@ struct kvm_arch{
atomic64_t cmma_dirty_pages;
/* subset of available cpu features enabled by user space */
DECLARE_BITMAP(cpu_feat, KVM_S390_VM_CPU_FEAT_NR_BITS);
+ /* indexed by vcpu_idx */
DECLARE_BITMAP(idle_mask, KVM_MAX_VCPUS);
struct kvm_s390_gisa_interrupt gisa_int;
struct kvm_s390_pv pv;
diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index d548d60caed2..16256e17a544 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -419,13 +419,13 @@ static unsigned long deliverable_irqs(struct kvm_vcpu *vcpu)
static void __set_cpu_idle(struct kvm_vcpu *vcpu)
{
kvm_s390_set_cpuflags(vcpu, CPUSTAT_WAIT);
- set_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask);
+ set_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.idle_mask);
}
static void __unset_cpu_idle(struct kvm_vcpu *vcpu)
{
kvm_s390_clear_cpuflags(vcpu, CPUSTAT_WAIT);
- clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask);
+ clear_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.idle_mask);
}
static void __reset_intercept_indicators(struct kvm_vcpu *vcpu)
@@ -3050,18 +3050,18 @@ int kvm_s390_get_irq_state(struct kvm_vcpu *vcpu, __u8 __user *buf, int len)
static void __airqs_kick_single_vcpu(struct kvm *kvm, u8 deliverable_mask)
{
- int vcpu_id, online_vcpus = atomic_read(&kvm->online_vcpus);
+ int vcpu_idx, online_vcpus = atomic_read(&kvm->online_vcpus);
struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int;
struct kvm_vcpu *vcpu;
- for_each_set_bit(vcpu_id, kvm->arch.idle_mask, online_vcpus) {
- vcpu = kvm_get_vcpu(kvm, vcpu_id);
+ for_each_set_bit(vcpu_idx, kvm->arch.idle_mask, online_vcpus) {
+ vcpu = kvm_get_vcpu(kvm, vcpu_idx);
if (psw_ioint_disabled(vcpu))
continue;
deliverable_mask &= (u8)(vcpu->arch.sie_block->gcr[6] >> 24);
if (deliverable_mask) {
/* lately kicked but not yet running */
- if (test_and_set_bit(vcpu_id, gi->kicked_mask))
+ if (test_and_set_bit(vcpu_idx, gi->kicked_mask))
return;
kvm_s390_vcpu_wakeup(vcpu);
return;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 5b45c83ced21..e144c8046ceb 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -4026,7 +4026,7 @@ static int vcpu_pre_run(struct kvm_vcpu *vcpu)
kvm_s390_patch_guest_per_regs(vcpu);
}
- clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.gisa_int.kicked_mask);
+ clear_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.gisa_int.kicked_mask);
vcpu->arch.sie_block->icptcode = 0;
cpuflags = atomic_read(&vcpu->arch.sie_block->cpuflags);
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 9fad25109b0d..ecd741ee3276 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -79,7 +79,7 @@ static inline int is_vcpu_stopped(struct kvm_vcpu *vcpu)
static inline int is_vcpu_idle(struct kvm_vcpu *vcpu)
{
- return test_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask);
+ return test_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.idle_mask);
}
static inline int kvm_is_ucontrol(struct kvm *kvm)
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a3e03bc1368c1bc16e19b001fc96dc7430573cc8 Mon Sep 17 00:00:00 2001
From: Halil Pasic <pasic(a)linux.ibm.com>
Date: Fri, 27 Aug 2021 14:54:29 +0200
Subject: [PATCH] KVM: s390: index kvm->arch.idle_mask by vcpu_idx
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
While in practice vcpu->vcpu_idx == vcpu->vcpu_id is often true, it may
not always be, and we must not rely on this. Reason is that KVM decides
the vcpu_idx, userspace decides the vcpu_id, thus the two might not
match.
Currently kvm->arch.idle_mask is indexed by vcpu_id, which implies
that code like
for_each_set_bit(vcpu_id, kvm->arch.idle_mask, online_vcpus) {
vcpu = kvm_get_vcpu(kvm, vcpu_id);
do_stuff(vcpu);
}
is not legit. Reason is that kvm_get_vcpu expects a vcpu_idx, not a
vcpu_id. The trouble is, we do actually use kvm->arch.idle_mask like
this. To fix this problem we have two options. Either use
kvm_get_vcpu_by_id(vcpu_id), which would loop to find the right vcpu_id,
or switch to indexing via vcpu_idx. The latter is preferable for obvious
reasons.
Let us switch from indexing kvm->arch.idle_mask by vcpu_id to
indexing it by vcpu_idx. To keep gisa_int.kicked_mask indexed by the
same index as idle_mask, let us make the same change for it as well.
Fixes: 1ee0bc559dc3 ("KVM: s390: get rid of local_int array")
Signed-off-by: Halil Pasic <pasic(a)linux.ibm.com>
Reviewed-by: Christian Bornträger <borntraeger(a)de.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org> # 3.15+
Link: https://lore.kernel.org/r/20210827125429.1912577-1-pasic@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 118d5450c523..611f18ecde91 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -963,6 +963,7 @@ struct kvm_arch{
atomic64_t cmma_dirty_pages;
/* subset of available cpu features enabled by user space */
DECLARE_BITMAP(cpu_feat, KVM_S390_VM_CPU_FEAT_NR_BITS);
+ /* indexed by vcpu_idx */
DECLARE_BITMAP(idle_mask, KVM_MAX_VCPUS);
struct kvm_s390_gisa_interrupt gisa_int;
struct kvm_s390_pv pv;
diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index d548d60caed2..16256e17a544 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -419,13 +419,13 @@ static unsigned long deliverable_irqs(struct kvm_vcpu *vcpu)
static void __set_cpu_idle(struct kvm_vcpu *vcpu)
{
kvm_s390_set_cpuflags(vcpu, CPUSTAT_WAIT);
- set_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask);
+ set_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.idle_mask);
}
static void __unset_cpu_idle(struct kvm_vcpu *vcpu)
{
kvm_s390_clear_cpuflags(vcpu, CPUSTAT_WAIT);
- clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask);
+ clear_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.idle_mask);
}
static void __reset_intercept_indicators(struct kvm_vcpu *vcpu)
@@ -3050,18 +3050,18 @@ int kvm_s390_get_irq_state(struct kvm_vcpu *vcpu, __u8 __user *buf, int len)
static void __airqs_kick_single_vcpu(struct kvm *kvm, u8 deliverable_mask)
{
- int vcpu_id, online_vcpus = atomic_read(&kvm->online_vcpus);
+ int vcpu_idx, online_vcpus = atomic_read(&kvm->online_vcpus);
struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int;
struct kvm_vcpu *vcpu;
- for_each_set_bit(vcpu_id, kvm->arch.idle_mask, online_vcpus) {
- vcpu = kvm_get_vcpu(kvm, vcpu_id);
+ for_each_set_bit(vcpu_idx, kvm->arch.idle_mask, online_vcpus) {
+ vcpu = kvm_get_vcpu(kvm, vcpu_idx);
if (psw_ioint_disabled(vcpu))
continue;
deliverable_mask &= (u8)(vcpu->arch.sie_block->gcr[6] >> 24);
if (deliverable_mask) {
/* lately kicked but not yet running */
- if (test_and_set_bit(vcpu_id, gi->kicked_mask))
+ if (test_and_set_bit(vcpu_idx, gi->kicked_mask))
return;
kvm_s390_vcpu_wakeup(vcpu);
return;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 5b45c83ced21..e144c8046ceb 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -4026,7 +4026,7 @@ static int vcpu_pre_run(struct kvm_vcpu *vcpu)
kvm_s390_patch_guest_per_regs(vcpu);
}
- clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.gisa_int.kicked_mask);
+ clear_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.gisa_int.kicked_mask);
vcpu->arch.sie_block->icptcode = 0;
cpuflags = atomic_read(&vcpu->arch.sie_block->cpuflags);
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 9fad25109b0d..ecd741ee3276 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -79,7 +79,7 @@ static inline int is_vcpu_stopped(struct kvm_vcpu *vcpu)
static inline int is_vcpu_idle(struct kvm_vcpu *vcpu)
{
- return test_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask);
+ return test_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.idle_mask);
}
static inline int kvm_is_ucontrol(struct kvm *kvm)
The patch below does not apply to the 5.13-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2ae2eb9dde18979b40629dd413b9adbd6c894cdf Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Thu, 9 Sep 2021 13:56:27 +0100
Subject: [PATCH] io_uring: fail links of cancelled timeouts
When we cancel a timeout we should mark it with REQ_F_FAIL, so
linked requests are cancelled as well, but not queued for further
execution.
Cc: stable(a)vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/fff625b44eeced3a5cae79f60e6acf3fbdf8f990.16311921…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index b21a423a4de8..ffd91844b2e5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1482,6 +1482,8 @@ static void io_kill_timeout(struct io_kiocb *req, int status)
struct io_timeout_data *io = req->async_data;
if (hrtimer_try_to_cancel(&io->timer) != -1) {
+ if (status)
+ req_set_fail(req);
atomic_set(&req->ctx->cq_timeouts,
atomic_read(&req->ctx->cq_timeouts) + 1);
list_del_init(&req->timeout.list);
The patch below does not apply to the 5.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 89c2b3b74918200e46699338d7bcc19b1ea12110 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Mon, 23 Aug 2021 11:18:45 +0100
Subject: [PATCH] io_uring: reexpand under-reexpanded iters
[ 74.211232] BUG: KASAN: stack-out-of-bounds in iov_iter_revert+0x809/0x900
[ 74.212778] Read of size 8 at addr ffff888025dc78b8 by task
syz-executor.0/828
[ 74.214756] CPU: 0 PID: 828 Comm: syz-executor.0 Not tainted
5.14.0-rc3-next-20210730 #1
[ 74.216525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 74.219033] Call Trace:
[ 74.219683] dump_stack_lvl+0x8b/0xb3
[ 74.220706] print_address_description.constprop.0+0x1f/0x140
[ 74.224226] kasan_report.cold+0x7f/0x11b
[ 74.226085] iov_iter_revert+0x809/0x900
[ 74.227960] io_write+0x57d/0xe40
[ 74.232647] io_issue_sqe+0x4da/0x6a80
[ 74.242578] __io_queue_sqe+0x1ac/0xe60
[ 74.245358] io_submit_sqes+0x3f6e/0x76a0
[ 74.248207] __do_sys_io_uring_enter+0x90c/0x1a20
[ 74.257167] do_syscall_64+0x3b/0x90
[ 74.257984] entry_SYSCALL_64_after_hwframe+0x44/0xae
old_size = iov_iter_count(iter);
...
iov_iter_revert(iter, old_size - iov_iter_count(iter));
If iov_iter_revert() is done based on the initial size as above, and the
iter is truncated and not reexpanded in the middle, it miscalculates
borders causing problems. This trace is due to no one reexpanding after
generic_write_checks().
Now iters store how many bytes has been truncated, so reexpand them to
the initial state right before reverting.
Cc: stable(a)vger.kernel.org
Reported-by: Palash Oswal <oswalpalash(a)gmail.com>
Reported-by: Sudip Mukherjee <sudipm.mukherjee(a)gmail.com>
Reported-and-tested-by: syzbot+9671693590ef5aad8953(a)syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index d94fb5835a20..71639d39caf9 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3281,6 +3281,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
if (req->flags & REQ_F_NOWAIT)
goto done;
/* some cases will consume bytes even on error returns */
+ iov_iter_reexpand(iter, iter->count + iter->truncated);
iov_iter_revert(iter, io_size - iov_iter_count(iter));
ret = 0;
} else if (ret == -EIOCBQUEUED) {
@@ -3420,6 +3421,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
} else {
copy_iov:
/* some cases will consume bytes even on error returns */
+ iov_iter_reexpand(iter, iter->count + iter->truncated);
iov_iter_revert(iter, io_size - iov_iter_count(iter));
ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
return ret ?: -EAGAIN;
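To make the revert arithmetic concrete, here is a worked sketch with purely illustrative byte counts; iter->truncated is the field this kernel uses to record how much iov_iter_truncate() shaved off.
/*
 * Worked example (numbers are illustrative only):
 *
 *   io_size = iov_iter_count(iter);            // 4096
 *   generic_write_checks(...);                 // iov_iter_truncate():
 *                                              //   iter->count = 1024
 *                                              //   iter->truncated = 3072
 *   ...write path consumes 512 bytes...        // iter->count = 512
 *
 * Error path before the fix:
 *   iov_iter_revert(iter, io_size - iov_iter_count(iter));
 *     // 4096 - 512 = 3584, but only 512 bytes were actually consumed,
 *     // so the revert walks past the start of the iter (KASAN splat above)
 *
 * With the fix:
 *   iov_iter_reexpand(iter, iter->count + iter->truncated);
 *     // iter->count = 512 + 3072 = 3584
 *   iov_iter_revert(iter, io_size - iov_iter_count(iter));
 *     // 4096 - 3584 = 512, exactly the bytes consumed
 */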
The patch below does not apply to the 5.13-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 89c2b3b74918200e46699338d7bcc19b1ea12110 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Mon, 23 Aug 2021 11:18:45 +0100
Subject: [PATCH] io_uring: reexpand under-reexpanded iters
[ 74.211232] BUG: KASAN: stack-out-of-bounds in iov_iter_revert+0x809/0x900
[ 74.212778] Read of size 8 at addr ffff888025dc78b8 by task
syz-executor.0/828
[ 74.214756] CPU: 0 PID: 828 Comm: syz-executor.0 Not tainted
5.14.0-rc3-next-20210730 #1
[ 74.216525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 74.219033] Call Trace:
[ 74.219683] dump_stack_lvl+0x8b/0xb3
[ 74.220706] print_address_description.constprop.0+0x1f/0x140
[ 74.224226] kasan_report.cold+0x7f/0x11b
[ 74.226085] iov_iter_revert+0x809/0x900
[ 74.227960] io_write+0x57d/0xe40
[ 74.232647] io_issue_sqe+0x4da/0x6a80
[ 74.242578] __io_queue_sqe+0x1ac/0xe60
[ 74.245358] io_submit_sqes+0x3f6e/0x76a0
[ 74.248207] __do_sys_io_uring_enter+0x90c/0x1a20
[ 74.257167] do_syscall_64+0x3b/0x90
[ 74.257984] entry_SYSCALL_64_after_hwframe+0x44/0xae
old_size = iov_iter_count();
...
iov_iter_revert(old_size - iov_iter_count());
If iov_iter_revert() is done based on the initial size as above, and the
iter is truncated and not reexpanded in the middle, it miscalculates the
boundaries and causes problems. This trace is due to nothing reexpanding the
iter after generic_write_checks().
Now iters store how many bytes have been truncated, so reexpand them to
the initial state right before reverting.
Cc: stable(a)vger.kernel.org
Reported-by: Palash Oswal <oswalpalash(a)gmail.com>
Reported-by: Sudip Mukherjee <sudipm.mukherjee(a)gmail.com>
Reported-and-tested-by: syzbot+9671693590ef5aad8953(a)syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index d94fb5835a20..71639d39caf9 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3281,6 +3281,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
if (req->flags & REQ_F_NOWAIT)
goto done;
/* some cases will consume bytes even on error returns */
+ iov_iter_reexpand(iter, iter->count + iter->truncated);
iov_iter_revert(iter, io_size - iov_iter_count(iter));
ret = 0;
} else if (ret == -EIOCBQUEUED) {
@@ -3420,6 +3421,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
} else {
copy_iov:
/* some cases will consume bytes even on error returns */
+ iov_iter_reexpand(iter, iter->count + iter->truncated);
iov_iter_revert(iter, io_size - iov_iter_count(iter));
ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
return ret ?: -EAGAIN;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2ae2eb9dde18979b40629dd413b9adbd6c894cdf Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Thu, 9 Sep 2021 13:56:27 +0100
Subject: [PATCH] io_uring: fail links of cancelled timeouts
When we cancel a timeout we should mark it with REQ_F_FAIL, so
linked requests are cancelled as well, but not queued for further
execution.
Cc: stable(a)vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/fff625b44eeced3a5cae79f60e6acf3fbdf8f990.16311921…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index b21a423a4de8..ffd91844b2e5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1482,6 +1482,8 @@ static void io_kill_timeout(struct io_kiocb *req, int status)
struct io_timeout_data *io = req->async_data;
if (hrtimer_try_to_cancel(&io->timer) != -1) {
+ if (status)
+ req_set_fail(req);
atomic_set(&req->ctx->cq_timeouts,
atomic_read(&req->ctx->cq_timeouts) + 1);
list_del_init(&req->timeout.list);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 89c2b3b74918200e46699338d7bcc19b1ea12110 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Mon, 23 Aug 2021 11:18:45 +0100
Subject: [PATCH] io_uring: reexpand under-reexpanded iters
[ 74.211232] BUG: KASAN: stack-out-of-bounds in iov_iter_revert+0x809/0x900
[ 74.212778] Read of size 8 at addr ffff888025dc78b8 by task
syz-executor.0/828
[ 74.214756] CPU: 0 PID: 828 Comm: syz-executor.0 Not tainted
5.14.0-rc3-next-20210730 #1
[ 74.216525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 74.219033] Call Trace:
[ 74.219683] dump_stack_lvl+0x8b/0xb3
[ 74.220706] print_address_description.constprop.0+0x1f/0x140
[ 74.224226] kasan_report.cold+0x7f/0x11b
[ 74.226085] iov_iter_revert+0x809/0x900
[ 74.227960] io_write+0x57d/0xe40
[ 74.232647] io_issue_sqe+0x4da/0x6a80
[ 74.242578] __io_queue_sqe+0x1ac/0xe60
[ 74.245358] io_submit_sqes+0x3f6e/0x76a0
[ 74.248207] __do_sys_io_uring_enter+0x90c/0x1a20
[ 74.257167] do_syscall_64+0x3b/0x90
[ 74.257984] entry_SYSCALL_64_after_hwframe+0x44/0xae
old_size = iov_iter_count();
...
iov_iter_revert(old_size - iov_iter_count());
If iov_iter_revert() is done based on the initial size as above, and the
iter is truncated and not reexpanded in the middle, it miscalculates the
boundaries and causes problems. This trace is due to nothing reexpanding the
iter after generic_write_checks().
Now iters store how many bytes have been truncated, so reexpand them to
the initial state right before reverting.
Cc: stable(a)vger.kernel.org
Reported-by: Palash Oswal <oswalpalash(a)gmail.com>
Reported-by: Sudip Mukherjee <sudipm.mukherjee(a)gmail.com>
Reported-and-tested-by: syzbot+9671693590ef5aad8953(a)syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index d94fb5835a20..71639d39caf9 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3281,6 +3281,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
if (req->flags & REQ_F_NOWAIT)
goto done;
/* some cases will consume bytes even on error returns */
+ iov_iter_reexpand(iter, iter->count + iter->truncated);
iov_iter_revert(iter, io_size - iov_iter_count(iter));
ret = 0;
} else if (ret == -EIOCBQUEUED) {
@@ -3420,6 +3421,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
} else {
copy_iov:
/* some cases will consume bytes even on error returns */
+ iov_iter_reexpand(iter, iter->count + iter->truncated);
iov_iter_revert(iter, io_size - iov_iter_count(iter));
ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
return ret ?: -EAGAIN;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From dadebc350da2bef62593b1df007a6e0b90baf42a Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Mon, 23 Aug 2021 13:30:44 +0100
Subject: [PATCH] io_uring: fix io_try_cancel_userdata race for iowq
WARNING: CPU: 1 PID: 5870 at fs/io_uring.c:5975 io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
CPU: 0 PID: 5870 Comm: iou-wrk-5860 Not tainted 5.14.0-rc6-next-20210820-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
Call Trace:
io_async_cancel fs/io_uring.c:6014 [inline]
io_issue_sqe+0x22d5/0x65a0 fs/io_uring.c:6407
io_wq_submit_work+0x1dc/0x300 fs/io_uring.c:6511
io_worker_handle_work+0xa45/0x1840 fs/io-wq.c:533
io_wqe_worker+0x2cc/0xbb0 fs/io-wq.c:582
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
io_try_cancel_userdata() can be called from io_async_cancel() executing
in the io-wq context, so the warning fires, which is there to alert
anyone accessing task->io_uring->io_wq in a racy way. However,
io_wq_put_and_exit() always first waits for all threads to complete,
so the only detail left is to zero tctx->io_wq after the context is
removed.
note: one small assumption is that when IO_WQ_WORK_CANCEL is set, the executor
won't touch ->io_wq, because io_wq_destroy() might cancel leftover pending
requests in such a way.
Cc: stable(a)vger.kernel.org
Reported-by: syzbot+b0c9d1588ae92866515f(a)syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/dfdd37a80cfa9ffd3e59538929c99cdd55d8699e.16297217…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 827e60ae4909..6859438c4e09 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5863,7 +5863,7 @@ static int io_try_cancel_userdata(struct io_kiocb *req, u64 sqe_addr)
struct io_ring_ctx *ctx = req->ctx;
int ret;
- WARN_ON_ONCE(req->task != current);
+ WARN_ON_ONCE(!io_wq_current_is_worker() && req->task != current);
ret = io_async_cancel_one(req->task->io_uring, sqe_addr, ctx);
if (ret != -ENOENT)
@@ -6369,6 +6369,7 @@ static void io_wq_submit_work(struct io_wq_work *work)
if (timeout)
io_queue_linked_timeout(timeout);
+ /* either cancelled or io-wq is dying, so don't touch tctx->iowq */
if (work->flags & IO_WQ_WORK_CANCEL)
ret = -ECANCELED;
@@ -9184,8 +9185,8 @@ static void io_uring_clean_tctx(struct io_uring_task *tctx)
* Must be after io_uring_del_task_file() (removes nodes under
* uring_lock) to avoid race with io_uring_try_cancel_iowq().
*/
- tctx->io_wq = NULL;
io_wq_put_and_exit(wq);
+ tctx->io_wq = NULL;
}
}
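The last hunk above is a teardown-ordering change; a generic sketch of the idea follows. All names in it are illustrative, not io_uring's: quiesce every user of a pointer before clearing it, so a late user (here, an io-wq worker running a cancel path) never observes NULL.
struct pool {
	int active;	/* stand-in for "workers still running" */
};
static void pool_shutdown_and_wait(struct pool *p)
{
	p->active = 0;	/* stand-in: real code waits for all workers to exit */
}
struct ctx {
	struct pool *pool;
};
static void ctx_teardown(struct ctx *c)
{
	struct pool *p = c->pool;
	if (!p)
		return;
	pool_shutdown_and_wait(p);	/* nothing can touch c->pool after this */
	c->pool = NULL;			/* clear only once all users are gone */
}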
The patch below does not apply to the 5.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From dadebc350da2bef62593b1df007a6e0b90baf42a Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Mon, 23 Aug 2021 13:30:44 +0100
Subject: [PATCH] io_uring: fix io_try_cancel_userdata race for iowq
WARNING: CPU: 1 PID: 5870 at fs/io_uring.c:5975 io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
CPU: 0 PID: 5870 Comm: iou-wrk-5860 Not tainted 5.14.0-rc6-next-20210820-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
Call Trace:
io_async_cancel fs/io_uring.c:6014 [inline]
io_issue_sqe+0x22d5/0x65a0 fs/io_uring.c:6407
io_wq_submit_work+0x1dc/0x300 fs/io_uring.c:6511
io_worker_handle_work+0xa45/0x1840 fs/io-wq.c:533
io_wqe_worker+0x2cc/0xbb0 fs/io-wq.c:582
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
io_try_cancel_userdata() can be called from io_async_cancel() executing
in the io-wq context, so the warning fires, which is there to alert
anyone accessing task->io_uring->io_wq in a racy way. However,
io_wq_put_and_exit() always first waits for all threads to complete,
so the only detail left is to zero tctx->io_wq after the context is
removed.
note: one small assumption is that when IO_WQ_WORK_CANCEL is set, the executor
won't touch ->io_wq, because io_wq_destroy() might cancel leftover pending
requests in such a way.
Cc: stable(a)vger.kernel.org
Reported-by: syzbot+b0c9d1588ae92866515f(a)syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/dfdd37a80cfa9ffd3e59538929c99cdd55d8699e.16297217…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 827e60ae4909..6859438c4e09 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5863,7 +5863,7 @@ static int io_try_cancel_userdata(struct io_kiocb *req, u64 sqe_addr)
struct io_ring_ctx *ctx = req->ctx;
int ret;
- WARN_ON_ONCE(req->task != current);
+ WARN_ON_ONCE(!io_wq_current_is_worker() && req->task != current);
ret = io_async_cancel_one(req->task->io_uring, sqe_addr, ctx);
if (ret != -ENOENT)
@@ -6369,6 +6369,7 @@ static void io_wq_submit_work(struct io_wq_work *work)
if (timeout)
io_queue_linked_timeout(timeout);
+ /* either cancelled or io-wq is dying, so don't touch tctx->iowq */
if (work->flags & IO_WQ_WORK_CANCEL)
ret = -ECANCELED;
@@ -9184,8 +9185,8 @@ static void io_uring_clean_tctx(struct io_uring_task *tctx)
* Must be after io_uring_del_task_file() (removes nodes under
* uring_lock) to avoid race with io_uring_try_cancel_iowq().
*/
- tctx->io_wq = NULL;
io_wq_put_and_exit(wq);
+ tctx->io_wq = NULL;
}
}
The patch below does not apply to the 5.13-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From dadebc350da2bef62593b1df007a6e0b90baf42a Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Mon, 23 Aug 2021 13:30:44 +0100
Subject: [PATCH] io_uring: fix io_try_cancel_userdata race for iowq
WARNING: CPU: 1 PID: 5870 at fs/io_uring.c:5975 io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
CPU: 0 PID: 5870 Comm: iou-wrk-5860 Not tainted 5.14.0-rc6-next-20210820-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
Call Trace:
io_async_cancel fs/io_uring.c:6014 [inline]
io_issue_sqe+0x22d5/0x65a0 fs/io_uring.c:6407
io_wq_submit_work+0x1dc/0x300 fs/io_uring.c:6511
io_worker_handle_work+0xa45/0x1840 fs/io-wq.c:533
io_wqe_worker+0x2cc/0xbb0 fs/io-wq.c:582
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
io_try_cancel_userdata() can be called from io_async_cancel() executing
in the io-wq context, so the warning fires, which is there to alert
anyone accessing task->io_uring->io_wq in a racy way. However,
io_wq_put_and_exit() always first waits for all threads to complete,
so the only detail left is to zero tctx->io_wq after the context is
removed.
note: one small assumption is that when IO_WQ_WORK_CANCEL is set, the executor
won't touch ->io_wq, because io_wq_destroy() might cancel leftover pending
requests in such a way.
Cc: stable(a)vger.kernel.org
Reported-by: syzbot+b0c9d1588ae92866515f(a)syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/dfdd37a80cfa9ffd3e59538929c99cdd55d8699e.16297217…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 827e60ae4909..6859438c4e09 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5863,7 +5863,7 @@ static int io_try_cancel_userdata(struct io_kiocb *req, u64 sqe_addr)
struct io_ring_ctx *ctx = req->ctx;
int ret;
- WARN_ON_ONCE(req->task != current);
+ WARN_ON_ONCE(!io_wq_current_is_worker() && req->task != current);
ret = io_async_cancel_one(req->task->io_uring, sqe_addr, ctx);
if (ret != -ENOENT)
@@ -6369,6 +6369,7 @@ static void io_wq_submit_work(struct io_wq_work *work)
if (timeout)
io_queue_linked_timeout(timeout);
+ /* either cancelled or io-wq is dying, so don't touch tctx->iowq */
if (work->flags & IO_WQ_WORK_CANCEL)
ret = -ECANCELED;
@@ -9184,8 +9185,8 @@ static void io_uring_clean_tctx(struct io_uring_task *tctx)
* Must be after io_uring_del_task_file() (removes nodes under
* uring_lock) to avoid race with io_uring_try_cancel_iowq().
*/
- tctx->io_wq = NULL;
io_wq_put_and_exit(wq);
+ tctx->io_wq = NULL;
}
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0bea96f59ba40e63c0ae93ad6a02417b95f22f4d Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Fri, 20 Aug 2021 10:36:36 +0100
Subject: [PATCH] io_uring: place fixed tables under memcg limits
Fixed tables may be quite large, so place all of them, together with the
allocated tags, under memcg limits.
Cc: stable(a)vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/b3ac9f5da9821bb59837b5fe25e8ef4be982218c.16294516…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9a021fe6c461..c6f1afa5ec04 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7138,14 +7138,14 @@ static void **io_alloc_page_table(size_t size)
size_t init_size = size;
void **table;
- table = kcalloc(nr_tables, sizeof(*table), GFP_KERNEL);
+ table = kcalloc(nr_tables, sizeof(*table), GFP_KERNEL_ACCOUNT);
if (!table)
return NULL;
for (i = 0; i < nr_tables; i++) {
unsigned int this_size = min_t(size_t, size, PAGE_SIZE);
- table[i] = kzalloc(this_size, GFP_KERNEL);
+ table[i] = kzalloc(this_size, GFP_KERNEL_ACCOUNT);
if (!table[i]) {
io_free_page_table(table, init_size);
return NULL;
@@ -7336,7 +7336,8 @@ static int io_rsrc_data_alloc(struct io_ring_ctx *ctx, rsrc_put_fn *do_put,
static bool io_alloc_file_tables(struct io_file_table *table, unsigned nr_files)
{
- table->files = kvcalloc(nr_files, sizeof(table->files[0]), GFP_KERNEL);
+ table->files = kvcalloc(nr_files, sizeof(table->files[0]),
+ GFP_KERNEL_ACCOUNT);
return !!table->files;
}
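For background, GFP_KERNEL_ACCOUNT is GFP_KERNEL plus __GFP_ACCOUNT, which charges the allocation to the calling task's memory cgroup. A hedged kernel-style sketch of the pattern the patch applies follows; the struct and helper names are placeholders, not io_uring code.
#include <linux/mm.h>
#include <linux/slab.h>
struct entry {
	void *data;
};
/* Sketch only: when the table size is controlled by userspace, account
 * the allocation to the caller's memcg with GFP_KERNEL_ACCOUNT. */
static struct entry *alloc_user_sized_table(unsigned int nr)
{
	/* kvcalloc() falls back to vmalloc for large tables. */
	return kvcalloc(nr, sizeof(struct entry), GFP_KERNEL_ACCOUNT);
}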
The patch below does not apply to the 5.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0bea96f59ba40e63c0ae93ad6a02417b95f22f4d Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Fri, 20 Aug 2021 10:36:36 +0100
Subject: [PATCH] io_uring: place fixed tables under memcg limits
Fixed tables may be quite large, so place all of them, together with the
allocated tags, under memcg limits.
Cc: stable(a)vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/b3ac9f5da9821bb59837b5fe25e8ef4be982218c.16294516…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9a021fe6c461..c6f1afa5ec04 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7138,14 +7138,14 @@ static void **io_alloc_page_table(size_t size)
size_t init_size = size;
void **table;
- table = kcalloc(nr_tables, sizeof(*table), GFP_KERNEL);
+ table = kcalloc(nr_tables, sizeof(*table), GFP_KERNEL_ACCOUNT);
if (!table)
return NULL;
for (i = 0; i < nr_tables; i++) {
unsigned int this_size = min_t(size_t, size, PAGE_SIZE);
- table[i] = kzalloc(this_size, GFP_KERNEL);
+ table[i] = kzalloc(this_size, GFP_KERNEL_ACCOUNT);
if (!table[i]) {
io_free_page_table(table, init_size);
return NULL;
@@ -7336,7 +7336,8 @@ static int io_rsrc_data_alloc(struct io_ring_ctx *ctx, rsrc_put_fn *do_put,
static bool io_alloc_file_tables(struct io_file_table *table, unsigned nr_files)
{
- table->files = kvcalloc(nr_files, sizeof(table->files[0]), GFP_KERNEL);
+ table->files = kvcalloc(nr_files, sizeof(table->files[0]),
+ GFP_KERNEL_ACCOUNT);
return !!table->files;
}
The patch below does not apply to the 5.13-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0bea96f59ba40e63c0ae93ad6a02417b95f22f4d Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Fri, 20 Aug 2021 10:36:36 +0100
Subject: [PATCH] io_uring: place fixed tables under memcg limits
Fixed tables may be quite large, so place all of them, together with the
allocated tags, under memcg limits.
Cc: stable(a)vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/b3ac9f5da9821bb59837b5fe25e8ef4be982218c.16294516…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9a021fe6c461..c6f1afa5ec04 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7138,14 +7138,14 @@ static void **io_alloc_page_table(size_t size)
size_t init_size = size;
void **table;
- table = kcalloc(nr_tables, sizeof(*table), GFP_KERNEL);
+ table = kcalloc(nr_tables, sizeof(*table), GFP_KERNEL_ACCOUNT);
if (!table)
return NULL;
for (i = 0; i < nr_tables; i++) {
unsigned int this_size = min_t(size_t, size, PAGE_SIZE);
- table[i] = kzalloc(this_size, GFP_KERNEL);
+ table[i] = kzalloc(this_size, GFP_KERNEL_ACCOUNT);
if (!table[i]) {
io_free_page_table(table, init_size);
return NULL;
@@ -7336,7 +7336,8 @@ static int io_rsrc_data_alloc(struct io_ring_ctx *ctx, rsrc_put_fn *do_put,
static bool io_alloc_file_tables(struct io_file_table *table, unsigned nr_files)
{
- table->files = kvcalloc(nr_files, sizeof(table->files[0]), GFP_KERNEL);
+ table->files = kvcalloc(nr_files, sizeof(table->files[0]),
+ GFP_KERNEL_ACCOUNT);
return !!table->files;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3a1b8a4e843f96b636431450d8d79061605cf74b Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Fri, 20 Aug 2021 10:36:35 +0100
Subject: [PATCH] io_uring: limit fixed table size by RLIMIT_NOFILE
Limit the number of files in io_uring fixed tables by RLIMIT_NOFILE;
that's the first and simplest restriction that we should impose.
Cc: stable(a)vger.kernel.org
Suggested-by: Jens Axboe <axboe(a)kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/b2756c340aed7d6c0b302c26dab50c6c5907f4ce.16294516…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 706ac8c03b95..9a021fe6c461 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7733,6 +7733,8 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
return -EINVAL;
if (nr_args > IORING_MAX_FIXED_FILES)
return -EMFILE;
+ if (nr_args > rlimit(RLIMIT_NOFILE))
+ return -EMFILE;
ret = io_rsrc_node_switch_start(ctx);
if (ret)
return ret;
This series backports some patches that failed to apply to 4.19-stable
due to the prototype of inode_operations::getattr having changed in
v5.12, as well as several other conflicts. Please apply to 4.19-stable.
Eric Biggers (4):
fscrypt: add fscrypt_symlink_getattr() for computing st_size
ext4: report correct st_size for encrypted symlinks
f2fs: report correct st_size for encrypted symlinks
ubifs: report correct st_size for encrypted symlinks
fs/crypto/hooks.c | 44 +++++++++++++++++++++++++++++++++
fs/ext4/symlink.c | 11 ++++++++-
fs/f2fs/namei.c | 11 ++++++++-
fs/ubifs/file.c | 12 ++++++++-
include/linux/fscrypt_notsupp.h | 6 +++++
include/linux/fscrypt_supp.h | 1 +
6 files changed, 82 insertions(+), 3 deletions(-)
--
2.33.0.153.gba50c8fa24-goog
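As a rough illustration of what the series wires up, here is a hedged sketch of an encrypted-symlink ->getattr handler calling the new fscrypt_symlink_getattr() helper. It uses the pre-5.12 prototype (no user-namespace argument) that 4.19 expects; example_encrypted_symlink_getattr() is a placeholder name and not one of the actual ext4/f2fs/ubifs hunks.
/* Sketch only: report st_size as the length of the symlink target as
 * userspace will see it, i.e. after decryption. */
static int example_encrypted_symlink_getattr(const struct path *path,
					     struct kstat *stat,
					     u32 request_mask,
					     unsigned int query_flags)
{
	generic_fillattr(d_inode(path->dentry), stat);
	return fscrypt_symlink_getattr(path, stat);
}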
From: Jaegeuk Kim <jaegeuk(a)kernel.org>
commit dddd3d65293a52c2c3850c19b1e5115712e534d8 upstream.
We must flush all dirty data when re-enabling checkpointing. Let's guarantee
that first by adding retry logic around sync_inodes_sb(). In addition to that,
this patch makes fsync flush data while checkpointing is disabled, which can
mitigate sync_inodes_sb() failures in advance.
Reviewed-by: Chao Yu <chao(a)kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk(a)kernel.org>
---
fs/f2fs/file.c | 5 ++---
fs/f2fs/super.c | 11 ++++++++++-
2 files changed, 12 insertions(+), 4 deletions(-)
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 5c74b2997197..6ee8b1e0e174 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -259,8 +259,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
};
unsigned int seq_id = 0;
- if (unlikely(f2fs_readonly(inode->i_sb) ||
- is_sbi_flag_set(sbi, SBI_CP_DISABLED)))
+ if (unlikely(f2fs_readonly(inode->i_sb)))
return 0;
trace_f2fs_sync_file_enter(inode);
@@ -274,7 +273,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
ret = file_write_and_wait_range(file, start, end);
clear_inode_flag(inode, FI_NEED_IPU);
- if (ret) {
+ if (ret || is_sbi_flag_set(sbi, SBI_CP_DISABLED)) {
trace_f2fs_sync_file_exit(inode, cp_reason, datasync, ret);
return ret;
}
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index c52988067887..476b2c497d28 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -1764,8 +1764,17 @@ static int f2fs_disable_checkpoint(struct f2fs_sb_info *sbi)
static void f2fs_enable_checkpoint(struct f2fs_sb_info *sbi)
{
+ int retry = DEFAULT_RETRY_IO_COUNT;
+
/* we should flush all the data to keep data consistency */
- sync_inodes_sb(sbi->sb);
+ do {
+ sync_inodes_sb(sbi->sb);
+ cond_resched();
+ congestion_wait(BLK_RW_ASYNC, DEFAULT_IO_TIMEOUT);
+ } while (get_pages(sbi, F2FS_DIRTY_DATA) && retry--);
+
+ if (unlikely(retry < 0))
+ f2fs_warn(sbi, "checkpoint=enable has some unwritten data.");
down_write(&sbi->gc_lock);
f2fs_dirty_to_prefree(sbi);
--
2.33.0.153.gba50c8fa24-goog
commit 2f32c147a3816d789722c0bd242a9431332ec3ed upstream.
This patch adds a missing pci-id to support the wireless device in the
Galaxy Book Flex2 Alpha, which is shipping and in the wild. It has been
tested to work on 5.13+. It has already been shipping in Fedora kernels, and
Samsung and a couple of users requested that it be added to stable.
Justin
From: Guillaume Nault <gnault(a)redhat.com>
__peernet2id() can be protected by RCU as it only calls idr_for_each(),
which is RCU-safe, and never modifies the nsid table.
rtnl_net_dumpid() can also do lockless lookups. It does two nested
idr_for_each() calls on nsid tables (one direct call and one indirect
call because of rtnl_net_dumpid_one() calling __peernet2id()). The
netnsid tables are never updated. Therefore it is safe to not take the
nsid_lock and run within an RCU-critical section instead.
Signed-off-by: Guillaume Nault <gnault(a)redhat.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
A nice side-effect of replacing spin_{lock,unlock}_bh() with
rcu_read_{lock,unlock}() in peernet2id() is that it avoids the
situation where SoftIRQs get enabled whilst IRQs are turned off.
From bugzilla.redhat.com/show_bug.cgi?id=1384179 (an ancient
4.9.0-0.rc0 kernel):
dump_stack+0x86/0xc3
__warn+0xcb/0xf0
warn_slowpath_null+0x1d/0x20
__local_bh_enable_ip+0x9d/0xc0
_raw_spin_unlock_bh+0x35/0x40
peernet2id+0x54/0x80
netlink_broadcast_filtered+0x220/0x3c0
netlink_broadcast+0x1d/0x20
audit_log+0x6a/0x90
security_set_bools+0xee/0x200
[]
Note, security_set_bools() calls write_lock_irq(). peernet2id() calls
spin_unlock_bh().
From an internal (UEK) stack trace based on the v4.14.35 kernel (LTS
4.14.231):
queued_spin_lock_slowpath+0xb/0xf
_raw_spin_lock_irqsave+0x46/0x48
send_mad+0x3d2/0x590 [ib_core]
ib_sa_path_rec_get+0x223/0x4d0 [ib_core]
path_rec_start+0xa3/0x140 [ib_ipoib]
ipoib_start_xmit+0x2b0/0x6a0 [ib_ipoib]
dev_hard_start_xmit+0xb2/0x237
sch_direct_xmit+0x114/0x1bf
__dev_queue_xmit+0x592/0x818
dev_queue_xmit+0x10/0x12
arp_xmit+0x38/0xa6
arp_send_dst.part.16+0x61/0x84
arp_process+0x825/0x889
arp_rcv+0x140/0x1c9
__netif_receive_skb_core+0x401/0xb39
__netif_receive_skb+0x18/0x59
netif_receive_skb_internal+0x45/0x119
napi_gro_receive+0xd8/0xf6
ipoib_ib_handle_rx_wc+0x1ca/0x520 [ib_ipoib]
ipoib_poll+0xcd/0x150 [ib_ipoib]
net_rx_action+0x289/0x3f4
__do_softirq+0xe1/0x2b5
do_softirq_own_stack+0x2a/0x35
</IRQ>
do_softirq+0x4d/0x6a
__local_bh_enable_ip+0x57/0x59
_raw_spin_unlock_bh+0x23/0x25
peernet2id+0x51/0x73
netlink_broadcast_filtered+0x223/0x41b
netlink_broadcast+0x1d/0x1f
rdma_nl_multicast+0x22/0x30 [ib_core]
send_mad+0x3e5/0x590 [ib_core]
ib_sa_path_rec_get+0x223/0x4d0 [ib_core]
rdma_resolve_route+0x287/0x810 [rdma_cm]
rds_rdma_cm_event_handler_cmn+0x311/0x7d0 [rds_rdma]
rds_rdma_cm_event_handler_worker+0x22/0x30 [rds_rdma]
process_one_work+0x169/0x3a6
worker_thread+0x4d/0x3e5
kthread+0x105/0x138
ret_from_fork+0x24/0x49
Here, pay attention to ib_nl_make_request() which calls
spin_lock_irqsave() on a global lock just before calling
rdma_nl_multicast(). Thereafter, peernet2id() enables SoftIRQs, and
ipoib starts and calls the same path and ends up trying to acquire the
same global lock again.
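A condensed sketch of that nesting may help; the lock and function names below are illustrative stand-ins for the ib_core/netlink path, not the real symbols.
#include <linux/spinlock.h>
static DEFINE_SPINLOCK(global_lock);	/* stand-in for the MAD request lock */
static DEFINE_SPINLOCK(nsid_lock);	/* stand-in for net->nsid_lock */
static void send_request(void)
{
	unsigned long flags;
	spin_lock_irqsave(&global_lock, flags);		/* hard IRQs off */
	/* ...deep in the call chain, the old peernet2id() did: */
	spin_lock_bh(&nsid_lock);
	/* nsid lookup */
	spin_unlock_bh(&nsid_lock);	/* re-enables softirqs while IRQs are
					 * off: fires the __local_bh_enable_ip()
					 * warning, and pending softirqs may run
					 * here and try to take global_lock
					 * again, as in the second trace */
	spin_unlock_irqrestore(&global_lock, flags);
}
Replacing that inner critical section with rcu_read_lock()/rcu_read_unlock(), as the patch below does, takes local_bh_enable() out of the picture entirely.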
(cherry picked from commit 2dce224f469f060b9998a5a869151ef83c08ce77)
Fixes: fba143c66abb ("netns: avoid disabling irq for netns id")
Signed-off-by: Håkon Bugge <haakon.bugge(a)oracle.com>
Conflicts:
net/core/net_namespace.c
* Due to context differences because v5.4 lacks commit
4905294162bd ("netns: Remove __peernet2id_alloc()").
Only comments affected.
---
net/core/net_namespace.c | 28 ++++++++++------------------
1 file changed, 10 insertions(+), 18 deletions(-)
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index c303873..9bf1551 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -211,9 +211,9 @@ static int net_eq_idr(int id, void *net, void *peer)
return 0;
}
-/* Should be called with nsid_lock held. If a new id is assigned, the bool alloc
- * is set to true, thus the caller knows that the new id must be notified via
- * rtnl.
+/* Must be called from RCU-critical section or with nsid_lock held. If
+ * a new id is assigned, the bool alloc is set to true, thus the
+ * caller knows that the new id must be notified via rtnl.
*/
static int __peernet2id_alloc(struct net *net, struct net *peer, bool *alloc)
{
@@ -237,7 +237,7 @@ static int __peernet2id_alloc(struct net *net, struct net *peer, bool *alloc)
return NETNSA_NSID_NOT_ASSIGNED;
}
-/* should be called with nsid_lock held */
+/* Must be called from RCU-critical section or with nsid_lock held */
static int __peernet2id(struct net *net, struct net *peer)
{
bool no = false;
@@ -281,9 +281,10 @@ int peernet2id(struct net *net, struct net *peer)
{
int id;
- spin_lock_bh(&net->nsid_lock);
+ rcu_read_lock();
id = __peernet2id(net, peer);
- spin_unlock_bh(&net->nsid_lock);
+ rcu_read_unlock();
+
return id;
}
EXPORT_SYMBOL(peernet2id);
@@ -962,6 +963,7 @@ struct rtnl_net_dump_cb {
int s_idx;
};
+/* Runs in RCU-critical section. */
static int rtnl_net_dumpid_one(int id, void *peer, void *data)
{
struct rtnl_net_dump_cb *net_cb = (struct rtnl_net_dump_cb *)data;
@@ -1046,19 +1048,9 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
goto end;
}
- spin_lock_bh(&net_cb.tgt_net->nsid_lock);
- if (net_cb.fillargs.add_ref &&
- !net_eq(net_cb.ref_net, net_cb.tgt_net) &&
- !spin_trylock_bh(&net_cb.ref_net->nsid_lock)) {
- spin_unlock_bh(&net_cb.tgt_net->nsid_lock);
- err = -EAGAIN;
- goto end;
- }
+ rcu_read_lock();
idr_for_each(&net_cb.tgt_net->netns_ids, rtnl_net_dumpid_one, &net_cb);
- if (net_cb.fillargs.add_ref &&
- !net_eq(net_cb.ref_net, net_cb.tgt_net))
- spin_unlock_bh(&net_cb.ref_net->nsid_lock);
- spin_unlock_bh(&net_cb.tgt_net->nsid_lock);
+ rcu_read_unlock();
cb->args[0] = net_cb.idx;
end:
--
1.8.3.1
I'm announcing the release of the 5.14.3 kernel.
All users of the 5.14 kernel series must upgrade.
The updated 5.14.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.14.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2
arch/x86/kernel/reboot.c | 3 -
drivers/bluetooth/btusb.c | 59 +++++++++++++-----------
drivers/cxl/acpi.c | 12 +++-
drivers/cxl/pci.c | 6 +-
drivers/firmware/dmi-id.c | 6 ++
drivers/net/can/c_can/c_can_ethtool.c | 4 -
drivers/net/ethernet/realtek/r8169_main.c | 1
drivers/pci/quirks.c | 12 ++--
drivers/staging/mt7621-pci/pci-mt7621.c | 13 ++++-
drivers/usb/cdns3/cdnsp-mem.c | 2
drivers/usb/gadget/udc/tegra-xudc.c | 4 -
drivers/usb/host/xhci-debugfs.c | 14 ++++-
drivers/usb/host/xhci-mtk-sch.c | 10 ++--
drivers/usb/host/xhci-rcar.c | 7 ++
drivers/usb/host/xhci-ring.c | 43 +++++++++++------
drivers/usb/host/xhci-trace.h | 26 ++++++----
drivers/usb/host/xhci.h | 73 ++++++++++++++----------------
drivers/usb/mtu3/mtu3_core.c | 4 +
drivers/usb/mtu3/mtu3_gadget.c | 6 --
net/ipv4/igmp.c | 2
sound/usb/quirks.c | 1
22 files changed, 184 insertions(+), 126 deletions(-)
Alexander Tsoy (1):
ALSA: usb-audio: Add registration quirk for JBL Quantum 800
Alison Schofield (1):
cxl/acpi: Do not add DSDT disabled ACPI0016 host bridge ports
Chunfeng Yun (6):
usb: xhci-mtk: fix issue of out-of-bounds array access
usb: cdnsp: fix the wrong mult value for HS isoc or intr
usb: gadget: tegra-xudc: fix the wrong mult value for HS isoc or intr
usb: mtu3: restore HS function when set SS/SSP
usb: mtu3: use @mult for HS isoc or intr
usb: mtu3: fix the wrong HS mult value
Dan Williams (1):
cxl/pci: Fix lockdown level
Greg Kroah-Hartman (1):
Linux 5.14.3
Hans de Goede (1):
firmware: dmi: Move product_sku info to the end of the modalias
Hayes Wang (1):
Revert "r8169: avoid link-up interrupt issue on RTL8106e if user enables ASPM"
Ismael Ferreras Morezuelas (1):
Bluetooth: btusb: Make the CSR clone chip force-suspend workaround more generic
Larry Finger (1):
Bluetooth: Add additional Bluetooth part for Realtek 8852AE
Li Qiang (Johnny Li) (1):
cxl/pci: Fix debug message in cxl_probe_regs()
Liu Jian (1):
igmp: Add ip_mc_list lock in ip_check_mc_rcu
Marek Behún (1):
PCI: Call Max Payload Size-related fixup quirks early
Mathias Nyman (3):
xhci: fix even more unsafe memory usage in xhci tracing
xhci: fix unsafe memory usage in xhci tracing
xhci: Fix failure to give back some cached cancelled URBs.
Paul Gortmaker (1):
x86/reboot: Limit Dell Optiplex 990 quirk to early BIOS versions
Sergio Paracuellos (1):
staging: mt7621-pci: fix hang when nothing is connected to pcie ports
Tong Zhang (1):
can: c_can: fix null-ptr-deref on ioctl()
Yoshihiro Shimoda (1):
usb: host: xhci-rcar: Don't reload firmware after the completion
I'm announcing the release of the 5.13.16 kernel.
All users of the 5.13 kernel series must upgrade.
The updated 5.13.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.13.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2
arch/x86/kernel/reboot.c | 3 -
block/blk-core.c | 1
block/blk-flush.c | 13 ++++
block/blk-mq.c | 37 +++++++++++++-
block/blk.h | 6 --
drivers/bluetooth/btusb.c | 59 +++++++++++++---------
drivers/firmware/dmi-id.c | 6 +-
drivers/net/ethernet/realtek/r8169_main.c | 1
drivers/net/ethernet/xilinx/ll_temac_main.c | 4 -
drivers/pci/quirks.c | 12 ++--
drivers/usb/cdns3/cdnsp-mem.c | 2
drivers/usb/gadget/udc/tegra-xudc.c | 4 -
drivers/usb/host/xhci-debugfs.c | 14 +++--
drivers/usb/host/xhci-mtk-sch.c | 10 ++-
drivers/usb/host/xhci-rcar.c | 7 ++
drivers/usb/host/xhci-ring.c | 43 ++++++++++------
drivers/usb/host/xhci-trace.h | 26 +++++----
drivers/usb/host/xhci.h | 73 +++++++++++++---------------
drivers/usb/mtu3/mtu3_core.c | 4 +
drivers/usb/mtu3/mtu3_gadget.c | 6 --
net/ipv4/igmp.c | 2
sound/usb/quirks.c | 1
23 files changed, 211 insertions(+), 125 deletions(-)
Alexander Tsoy (1):
ALSA: usb-audio: Add registration quirk for JBL Quantum 800
Chunfeng Yun (6):
usb: xhci-mtk: fix issue of out-of-bounds array access
usb: cdnsp: fix the wrong mult value for HS isoc or intr
usb: gadget: tegra-xudc: fix the wrong mult value for HS isoc or intr
usb: mtu3: restore HS function when set SS/SSP
usb: mtu3: use @mult for HS isoc or intr
usb: mtu3: fix the wrong HS mult value
Esben Haabendal (1):
net: ll_temac: Remove left-over debug message
Greg Kroah-Hartman (1):
Linux 5.13.16
Hans de Goede (1):
firmware: dmi: Move product_sku info to the end of the modalias
Hayes Wang (1):
Revert "r8169: avoid link-up interrupt issue on RTL8106e if user enables ASPM"
Ismael Ferreras Morezuelas (1):
Bluetooth: btusb: Make the CSR clone chip force-suspend workaround more generic
Larry Finger (1):
Bluetooth: Add additional Bluetooth part for Realtek 8852AE
Liu Jian (1):
igmp: Add ip_mc_list lock in ip_check_mc_rcu
Marek Behún (1):
PCI: Call Max Payload Size-related fixup quirks early
Mathias Nyman (3):
xhci: fix even more unsafe memory usage in xhci tracing
xhci: fix unsafe memory usage in xhci tracing
xhci: Fix failure to give back some cached cancelled URBs.
Ming Lei (3):
blk-mq: fix kernel panic during iterating over flush request
blk-mq: fix is_flush_rq
blk-mq: clearing flush request reference in tags->rqs[]
Paul Gortmaker (1):
x86/reboot: Limit Dell Optiplex 990 quirk to early BIOS versions
Yoshihiro Shimoda (1):
usb: host: xhci-rcar: Don't reload firmware after the completion
I'm announcing the release of the 5.10.64 kernel.
All users of the 5.10 kernel series must upgrade.
The updated 5.10.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.10.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2
arch/x86/events/amd/iommu.c | 47 ++++++++------
arch/x86/kernel/reboot.c | 3
block/blk-core.c | 1
block/blk-flush.c | 13 ++++
block/blk-mq.c | 37 +++++++++++
block/blk.h | 6 -
drivers/net/ethernet/realtek/r8169_main.c | 1
drivers/net/ethernet/xilinx/ll_temac_main.c | 4 -
drivers/pci/quirks.c | 12 +--
drivers/tty/serial/8250/8250_omap.c | 26 ++++----
drivers/usb/gadget/udc/tegra-xudc.c | 4 -
drivers/usb/host/xhci-debugfs.c | 14 +++-
drivers/usb/host/xhci-rcar.c | 7 ++
drivers/usb/host/xhci-ring.c | 3
drivers/usb/host/xhci-trace.h | 26 ++++----
drivers/usb/host/xhci.h | 73 +++++++++++-----------
drivers/usb/mtu3/mtu3_core.c | 4 -
drivers/usb/mtu3/mtu3_gadget.c | 6 -
drivers/usb/serial/mos7720.c | 4 -
include/linux/skbuff.h | 4 -
include/uapi/linux/termios.h | 15 ----
lib/Kconfig.debug | 2
mm/page_alloc.c | 8 +-
net/ipv4/igmp.c | 2
net/netfilter/nf_tables_api.c | 89 +++++++++++++++++-----------
net/netfilter/nft_set_hash.c | 10 +--
sound/usb/quirks.c | 1
28 files changed, 251 insertions(+), 173 deletions(-)
Alexander Tsoy (1):
ALSA: usb-audio: Add registration quirk for JBL Quantum 800
Chunfeng Yun (4):
usb: gadget: tegra-xudc: fix the wrong mult value for HS isoc or intr
usb: mtu3: restore HS function when set SS/SSP
usb: mtu3: use @mult for HS isoc or intr
usb: mtu3: fix the wrong HS mult value
Eric Dumazet (1):
netfilter: nftables: avoid potential overflows on 32bit arches
Esben Haabendal (1):
net: ll_temac: Remove left-over debug message
Greg Kroah-Hartman (1):
Linux 5.10.64
Hayes Wang (1):
Revert "r8169: avoid link-up interrupt issue on RTL8106e if user enables ASPM"
Jiri Slaby (1):
tty: drop termiox user definitions
Liu Jian (1):
igmp: Add ip_mc_list lock in ip_check_mc_rcu
Marek Behún (1):
PCI: Call Max Payload Size-related fixup quirks early
Mathias Nyman (2):
xhci: fix even more unsafe memory usage in xhci tracing
xhci: fix unsafe memory usage in xhci tracing
Ming Lei (3):
blk-mq: fix kernel panic during iterating over flush request
blk-mq: fix is_flush_rq
blk-mq: clearing flush request reference in tags->rqs[]
Muchun Song (1):
mm/page_alloc: speed up the iteration of max_order
Pablo Neira Ayuso (2):
netfilter: nf_tables: initialize set before expression setup
netfilter: nftables: clone set element expression template
Paul Gortmaker (1):
x86/reboot: Limit Dell Optiplex 990 quirk to early BIOS versions
Randy Dunlap (2):
net: kcov: don't select SKB_EXTENSIONS when there is no NET
net: linux/skbuff.h: combine SKB_EXTENSIONS + KCOV handling
Suravee Suthikulpanit (1):
x86/events/amd/iommu: Fix invalid Perf result due to IOMMU PMC power-gating
Tom Rix (1):
USB: serial: mos7720: improve OOM-handling in read_mos_reg()
Vignesh Raghavendra (1):
serial: 8250: 8250_omap: Fix unused variable warning
Yoshihiro Shimoda (1):
usb: host: xhci-rcar: Don't reload firmware after the completion
Some Intel chipsets disconnect the time and date RTC registers when the
clock update is in progress: during this time reads may return bogus
values and writes fail silently. This includes the RTC alarm registers.
[1]
cmos_read_alarm() and cmos_set_alarm() did not take that into account,
which caused alarm time reads to sometimes return bogus values. This can
be shown with a test patch that I am attaching to this patch series.
Setting the alarm clock probably also failed sometimes.
To make this patch suitable for inclusion in stable kernels, I'm using a
simple method for avoiding the RTC update cycle. This method is used in
mach_set_rtc_mmss() in arch/x86/kernel/rtc.c. A more elaborate algorithm
- as in mc146818_get_time() in drivers/rtc/rtc-mc146818-lib.c - would be
too complicated for stable. [2]
cmos_wait_for_uip_clear() holds the rtc_lock while waiting for the
UIP bit to become clear. This should be harmless, as the RTC cannot be
read while UIP is set anyway. mach_get_cmos_time() in arch/x86/kernel/rtc.c
does things the same way.
[1] 7th Generation Intel® Processor Family I/O for U/Y Platforms [...]
Datasheet, Volume 1 of 2 (Intel's Document Number: 334658-006)
Page 208
https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/7th…
"If a RAM read from the ten time and date bytes is attempted
during an update cycle, the value read do not necessarily
represent the true contents of those locations. Any RAM writes
under the same conditions are ignored."
[2] I'm going to submit a unification patch for a later kernel release -
I prefer to see this one in stable.
Signed-off-by: Mateusz Jończyk <mat.jonczyk(a)o2.pl>
Cc: Alessandro Zummo <a.zummo(a)towertech.it>
Cc: Alexandre Belloni <alexandre.belloni(a)bootlin.com>
Cc: stable(a)vger.kernel.org
---
drivers/rtc/rtc-cmos.c | 38 ++++++++++++++++++++++++++++++++++++--
1 file changed, 36 insertions(+), 2 deletions(-)
diff --git a/drivers/rtc/rtc-cmos.c b/drivers/rtc/rtc-cmos.c
index 2cd0fe728ab2..643433d984ab 100644
--- a/drivers/rtc/rtc-cmos.c
+++ b/drivers/rtc/rtc-cmos.c
@@ -248,6 +248,31 @@ static int cmos_set_time(struct device *dev, struct rtc_time *t)
return mc146818_set_time(t);
}
+/* Some Intel chipsets disconnect the alarm registers when the clock update is
+ * in progress - during this time reads return bogus values and writes may fail
+ * silently. See for example "7th Generation Intel® Processor Family I/O for
+ * U/Y Platforms [...] Datasheet", section 27.7.1
+ *
+ * Check the UIP bit to prevent this, waiting for max 10ms for it to become
+ * clear.
+ *
+ * This function has to be called with rtc_lock taken.
+ */
+static int cmos_wait_for_uip_clear(struct device *dev)
+{
+ int i;
+
+ for (i = 0; i < 100; i++) {
+ if ((CMOS_READ(RTC_FREQ_SELECT) & RTC_UIP) == 0)
+ return 0;
+ udelay(100);
+ }
+
+ dev_warn_ratelimited(dev, "UIP bit is stuck, cannot access RTC registers\n");
+
+ return 1;
+}
+
static int cmos_read_alarm(struct device *dev, struct rtc_wkalrm *t)
{
struct cmos_rtc *cmos = dev_get_drvdata(dev);
@@ -257,12 +282,17 @@ static int cmos_read_alarm(struct device *dev, struct rtc_wkalrm *t)
if (!is_valid_irq(cmos->irq))
return -EIO;
+ spin_lock_irq(&rtc_lock);
+
+ if (cmos_wait_for_uip_clear(dev)) {
+ spin_unlock_irq(&rtc_lock);
+ return -EIO;
+ }
+
/* Basic alarms only support hour, minute, and seconds fields.
* Some also support day and month, for alarms up to a year in
* the future.
*/
-
- spin_lock_irq(&rtc_lock);
t->time.tm_sec = CMOS_READ(RTC_SECONDS_ALARM);
t->time.tm_min = CMOS_READ(RTC_MINUTES_ALARM);
t->time.tm_hour = CMOS_READ(RTC_HOURS_ALARM);
@@ -477,6 +507,10 @@ static int cmos_set_alarm(struct device *dev, struct rtc_wkalrm *t)
}
spin_lock_irq(&rtc_lock);
+ if (cmos_wait_for_uip_clear(dev)) {
+ spin_unlock_irq(&rtc_lock);
+ return -EIO;
+ }
/* next rtc irq must not be from previous alarm setting */
cmos_irq_disable(cmos, RTC_AIE);
--
2.25.1
Reading from the CMOS involves writing to the index register and then
reading from the data register. Therefore access to the CMOS has to be
serialized with rtc_lock. This invocation of CMOS_READ was not
serialized, which could cause trouble when other code is accessing CMOS
at the same time.
Use spin_lock_irq() like the rest of the function.
Nothing in the kernel modifies the RTC_DM_BINARY bit, so a separate
pair of spin_lock_irq() / spin_unlock_irq() can be used before doing the
math.
Signed-off-by: Mateusz Jończyk <mat.jonczyk(a)o2.pl>
Reviewed-by: Nobuhiro Iwamatsu <iwamatsu(a)nigauri.org>
Cc: Alessandro Zummo <a.zummo(a)towertech.it>
Cc: Alexandre Belloni <alexandre.belloni(a)bootlin.com>
Cc: stable(a)vger.kernel.org
---
drivers/rtc/rtc-cmos.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/rtc/rtc-cmos.c b/drivers/rtc/rtc-cmos.c
index 670fd8a2970e..2cd0fe728ab2 100644
--- a/drivers/rtc/rtc-cmos.c
+++ b/drivers/rtc/rtc-cmos.c
@@ -463,7 +463,10 @@ static int cmos_set_alarm(struct device *dev, struct rtc_wkalrm *t)
min = t->time.tm_min;
sec = t->time.tm_sec;
+ spin_lock_irq(&rtc_lock);
rtc_control = CMOS_READ(RTC_CONTROL);
+ spin_unlock_irq(&rtc_lock);
+
if (!(rtc_control & RTC_DM_BINARY) || RTC_ALWAYS_BCD) {
/* Writing 0xff means "don't care" or "match all". */
mon = (mon <= 12) ? bin2bcd(mon) : 0xff;
--
2.25.1
This is the start of the stable review cycle for the 5.10.64 release.
There are 26 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun, 12 Sep 2021 12:29:07 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.10.64-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.10.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.10.64-rc1
Marek Behún <kabel(a)kernel.org>
PCI: Call Max Payload Size-related fixup quirks early
Paul Gortmaker <paul.gortmaker(a)windriver.com>
x86/reboot: Limit Dell Optiplex 990 quirk to early BIOS versions
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: fix unsafe memory usage in xhci tracing
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: fix even more unsafe memory usage in xhci tracing
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: mtu3: fix the wrong HS mult value
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: mtu3: use @mult for HS isoc or intr
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: mtu3: restore HS function when set SS/SSP
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: gadget: tegra-xudc: fix the wrong mult value for HS isoc or intr
Yoshihiro Shimoda <yoshihiro.shimoda.uh(a)renesas.com>
usb: host: xhci-rcar: Don't reload firmware after the completion
Alexander Tsoy <alexander(a)tsoy.me>
ALSA: usb-audio: Add registration quirk for JBL Quantum 800
Ming Lei <ming.lei(a)redhat.com>
blk-mq: clearing flush request reference in tags->rqs[]
Pablo Neira Ayuso <pablo(a)netfilter.org>
netfilter: nftables: clone set element expression template
Pablo Neira Ayuso <pablo(a)netfilter.org>
netfilter: nf_tables: initialize set before expression setup
Eric Dumazet <edumazet(a)google.com>
netfilter: nftables: avoid potential overflows on 32bit arches
Ming Lei <ming.lei(a)redhat.com>
blk-mq: fix is_flush_rq
Ming Lei <ming.lei(a)redhat.com>
blk-mq: fix kernel panic during iterating over flush request
Suravee Suthikulpanit <suravee.suthikulpanit(a)amd.com>
x86/events/amd/iommu: Fix invalid Perf result due to IOMMU PMC power-gating
Hayes Wang <hayeswang(a)realtek.com>
Revert "r8169: avoid link-up interrupt issue on RTL8106e if user enables ASPM"
Jiri Slaby <jirislaby(a)kernel.org>
tty: drop termiox user definitions
Randy Dunlap <rdunlap(a)infradead.org>
net: linux/skbuff.h: combine SKB_EXTENSIONS + KCOV handling
Vignesh Raghavendra <vigneshr(a)ti.com>
serial: 8250: 8250_omap: Fix unused variable warning
Randy Dunlap <rdunlap(a)infradead.org>
net: kcov: don't select SKB_EXTENSIONS when there is no NET
Muchun Song <songmuchun(a)bytedance.com>
mm/page_alloc: speed up the iteration of max_order
Esben Haabendal <esben(a)geanix.com>
net: ll_temac: Remove left-over debug message
Tom Rix <trix(a)redhat.com>
USB: serial: mos7720: improve OOM-handling in read_mos_reg()
Liu Jian <liujian56(a)huawei.com>
igmp: Add ip_mc_list lock in ip_check_mc_rcu
-------------
Diffstat:
Makefile | 4 +-
arch/x86/events/amd/iommu.c | 47 ++++++++-------
arch/x86/kernel/reboot.c | 3 +-
block/blk-core.c | 1 -
block/blk-flush.c | 13 +++++
block/blk-mq.c | 37 +++++++++++-
block/blk.h | 6 +-
drivers/net/ethernet/realtek/r8169_main.c | 1 +
drivers/net/ethernet/xilinx/ll_temac_main.c | 4 +-
drivers/pci/quirks.c | 12 ++--
drivers/tty/serial/8250/8250_omap.c | 26 ++++-----
drivers/usb/gadget/udc/tegra-xudc.c | 4 +-
drivers/usb/host/xhci-debugfs.c | 14 +++--
drivers/usb/host/xhci-rcar.c | 7 +++
drivers/usb/host/xhci-ring.c | 3 +-
drivers/usb/host/xhci-trace.h | 26 +++++----
drivers/usb/host/xhci.h | 73 ++++++++++++-----------
drivers/usb/mtu3/mtu3_core.c | 4 +-
drivers/usb/mtu3/mtu3_gadget.c | 6 +-
drivers/usb/serial/mos7720.c | 4 +-
include/linux/skbuff.h | 4 +-
include/uapi/linux/termios.h | 15 -----
lib/Kconfig.debug | 2 +-
mm/page_alloc.c | 8 +--
net/ipv4/igmp.c | 2 +
net/netfilter/nf_tables_api.c | 89 ++++++++++++++++++-----------
net/netfilter/nft_set_hash.c | 10 ++--
sound/usb/quirks.c | 1 +
28 files changed, 252 insertions(+), 174 deletions(-)
This is the start of the stable review cycle for the 5.14.3 release.
There are 23 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun, 12 Sep 2021 12:29:07 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.14.3-rc1…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.14.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.14.3-rc1
Alison Schofield <alison.schofield(a)intel.com>
cxl/acpi: Do not add DSDT disabled ACPI0016 host bridge ports
Dan Williams <dan.j.williams(a)intel.com>
cxl/pci: Fix lockdown level
Li Qiang (Johnny Li) <johnny.li(a)montage-tech.com>
cxl/pci: Fix debug message in cxl_probe_regs()
Marek Behún <kabel(a)kernel.org>
PCI: Call Max Payload Size-related fixup quirks early
Paul Gortmaker <paul.gortmaker(a)windriver.com>
x86/reboot: Limit Dell Optiplex 990 quirk to early BIOS versions
Sergio Paracuellos <sergio.paracuellos(a)gmail.com>
staging: mt7621-pci: fix hang when nothing is connected to pcie ports
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: Fix failure to give back some cached cancelled URBs.
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: fix unsafe memory usage in xhci tracing
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: fix even more unsafe memory usage in xhci tracing
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: mtu3: fix the wrong HS mult value
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: mtu3: use @mult for HS isoc or intr
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: mtu3: restore HS function when set SS/SSP
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: gadget: tegra-xudc: fix the wrong mult value for HS isoc or intr
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: cdnsp: fix the wrong mult value for HS isoc or intr
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: xhci-mtk: fix issue of out-of-bounds array access
Yoshihiro Shimoda <yoshihiro.shimoda.uh(a)renesas.com>
usb: host: xhci-rcar: Don't reload firmware after the completion
Ismael Ferreras Morezuelas <swyterzone(a)gmail.com>
Bluetooth: btusb: Make the CSR clone chip force-suspend workaround more generic
Larry Finger <Larry.Finger(a)lwfinger.net>
Bluetooth: Add additional Bluetooth part for Realtek 8852AE
Alexander Tsoy <alexander(a)tsoy.me>
ALSA: usb-audio: Add registration quirk for JBL Quantum 800
Hayes Wang <hayeswang(a)realtek.com>
Revert "r8169: avoid link-up interrupt issue on RTL8106e if user enables ASPM"
Liu Jian <liujian56(a)huawei.com>
igmp: Add ip_mc_list lock in ip_check_mc_rcu
Tong Zhang <ztong0001(a)gmail.com>
can: c_can: fix null-ptr-deref on ioctl()
Hans de Goede <hdegoede(a)redhat.com>
firmware: dmi: Move product_sku info to the end of the modalias
-------------
Diffstat:
Makefile | 4 +-
arch/x86/kernel/reboot.c | 3 +-
drivers/bluetooth/btusb.c | 59 ++++++++++++++-----------
drivers/cxl/acpi.c | 12 +++--
drivers/cxl/pci.c | 6 +--
drivers/firmware/dmi-id.c | 6 ++-
drivers/net/can/c_can/c_can_ethtool.c | 4 +-
drivers/net/ethernet/realtek/r8169_main.c | 1 +
drivers/pci/quirks.c | 12 ++---
drivers/staging/mt7621-pci/pci-mt7621.c | 13 +++++-
drivers/usb/cdns3/cdnsp-mem.c | 2 +-
drivers/usb/gadget/udc/tegra-xudc.c | 4 +-
drivers/usb/host/xhci-debugfs.c | 14 ++++--
drivers/usb/host/xhci-mtk-sch.c | 10 +++--
drivers/usb/host/xhci-rcar.c | 7 +++
drivers/usb/host/xhci-ring.c | 43 +++++++++++-------
drivers/usb/host/xhci-trace.h | 26 ++++++-----
drivers/usb/host/xhci.h | 73 +++++++++++++++----------------
drivers/usb/mtu3/mtu3_core.c | 4 +-
drivers/usb/mtu3/mtu3_gadget.c | 6 +--
net/ipv4/igmp.c | 2 +
sound/usb/quirks.c | 1 +
22 files changed, 185 insertions(+), 127 deletions(-)
This is the start of the stable review cycle for the 5.13.16 release.
There are 22 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun, 12 Sep 2021 12:29:07 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.13.16-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.13.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.13.16-rc1
Marek Behún <kabel(a)kernel.org>
PCI: Call Max Payload Size-related fixup quirks early
Paul Gortmaker <paul.gortmaker(a)windriver.com>
x86/reboot: Limit Dell Optiplex 990 quirk to early BIOS versions
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: Fix failure to give back some cached cancelled URBs.
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: fix unsafe memory usage in xhci tracing
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: fix even more unsafe memory usage in xhci tracing
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: mtu3: fix the wrong HS mult value
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: mtu3: use @mult for HS isoc or intr
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: mtu3: restore HS function when set SS/SSP
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: gadget: tegra-xudc: fix the wrong mult value for HS isoc or intr
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: cdnsp: fix the wrong mult value for HS isoc or intr
Chunfeng Yun <chunfeng.yun(a)mediatek.com>
usb: xhci-mtk: fix issue of out-of-bounds array access
Yoshihiro Shimoda <yoshihiro.shimoda.uh(a)renesas.com>
usb: host: xhci-rcar: Don't reload firmware after the completion
Ismael Ferreras Morezuelas <swyterzone(a)gmail.com>
Bluetooth: btusb: Make the CSR clone chip force-suspend workaround more generic
Larry Finger <Larry.Finger(a)lwfinger.net>
Bluetooth: Add additional Bluetooth part for Realtek 8852AE
Alexander Tsoy <alexander(a)tsoy.me>
ALSA: usb-audio: Add registration quirk for JBL Quantum 800
Ming Lei <ming.lei(a)redhat.com>
blk-mq: clearing flush request reference in tags->rqs[]
Ming Lei <ming.lei(a)redhat.com>
blk-mq: fix is_flush_rq
Ming Lei <ming.lei(a)redhat.com>
blk-mq: fix kernel panic during iterating over flush request
Hayes Wang <hayeswang(a)realtek.com>
Revert "r8169: avoid link-up interrupt issue on RTL8106e if user enables ASPM"
Esben Haabendal <esben(a)geanix.com>
net: ll_temac: Remove left-over debug message
Liu Jian <liujian56(a)huawei.com>
igmp: Add ip_mc_list lock in ip_check_mc_rcu
Hans de Goede <hdegoede(a)redhat.com>
firmware: dmi: Move product_sku info to the end of the modalias
-------------
Diffstat:
Makefile | 4 +-
arch/x86/kernel/reboot.c | 3 +-
block/blk-core.c | 1 -
block/blk-flush.c | 13 +++++
block/blk-mq.c | 37 ++++++++++++++-
block/blk.h | 6 +--
drivers/bluetooth/btusb.c | 59 +++++++++++++----------
drivers/firmware/dmi-id.c | 6 ++-
drivers/net/ethernet/realtek/r8169_main.c | 1 +
drivers/net/ethernet/xilinx/ll_temac_main.c | 4 +-
drivers/pci/quirks.c | 12 ++---
drivers/usb/cdns3/cdnsp-mem.c | 2 +-
drivers/usb/gadget/udc/tegra-xudc.c | 4 +-
drivers/usb/host/xhci-debugfs.c | 14 ++++--
drivers/usb/host/xhci-mtk-sch.c | 10 ++--
drivers/usb/host/xhci-rcar.c | 7 +++
drivers/usb/host/xhci-ring.c | 43 ++++++++++-------
drivers/usb/host/xhci-trace.h | 26 +++++-----
drivers/usb/host/xhci.h | 73 ++++++++++++++---------------
drivers/usb/mtu3/mtu3_core.c | 4 +-
drivers/usb/mtu3/mtu3_gadget.c | 6 +--
net/ipv4/igmp.c | 2 +
sound/usb/quirks.c | 1 +
23 files changed, 212 insertions(+), 126 deletions(-)
On Sun, Sep 05, 2021 at 01:52:31AM +0200, wim wrote:
>
> Hello Greg,
>
> from kernel-4.9.270 up until now (4.9.282) I experience kernel crashes upon
> loading a GPU module.
> It happens on two out of at least six different machines.
> I can't believe that I'm the only one where that happens, but since the bug
> is still there twelve versions later, I need to report this.
>
> I run Gentoo with vanilla kernels.
> Upon loading i915.ko (automatically or manually) my laptop freezes until
> power-down. (Note that other machines using i915.ko have no problems here.)
> It's an Asus laptop with Intel chipset with a peculiarity:
>
> 00:02.0 VGA compatible controller: Intel Corporation HD Graphics 620 (rev 02)
> 01:00.0 3D controller: NVIDIA Corporation GM108M [GeForce 940MX] (rev a2)
>
> (It uses Intel natively and nobody knows how to make use of that Nvidia chip)
>
>
> On an AMD desktop I get the same crash upon loading of nouveau.ko .
>
> Something ugly must have been introduced in kernel-4.9.270 .
> Strace modprobe .. only prints two lines on the screen.
> Strace modprobe .. 2>&1 > file produces only an empty file.
>
> Any ideas?
Regards,
Wim Osterholt.
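(A side note on the empty strace output mentioned above: with
"strace modprobe ... 2>&1 > file" the redirections are applied left to
right, so stderr (where strace writes its trace) still points at the
terminal while only modprobe's usually-empty stdout goes to the file.
Something like "strace -o file modprobe i915" would capture the trace;
i915 is used here only because it is the module named in the report.)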
From: Zack Rusin <zackr(a)vmware.com>
[ Upstream commit a12be0277316ed923411c9c80b2899ee74d2b033 ]
The has_dx variable was only set during initialization, which
meant that UPDATE_SUBRESOURCE was never used. We were emulating it
with UPDATE_GB_IMAGE but that's always been a stop-gap. Instead
of has_dx, which was deprecated a long time ago, we need to check
whether shader model 4.0 or newer is available to the device.
Signed-off-by: Zack Rusin <zackr(a)vmware.com>
Reviewed-by: Roland Scheidegger <sroland(a)vmware.com>
Reviewed-by: Martin Krastev <krastevm(a)vmware.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210609172307.131929-4-zackr…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/gpu/drm/vmwgfx/vmwgfx_surface.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_surface.c b/drivers/gpu/drm/vmwgfx/vmwgfx_surface.c
index beab3e19d8e2..5ff88f8c2382 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_surface.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_surface.c
@@ -1883,7 +1883,6 @@ static void vmw_surface_dirty_range_add(struct vmw_resource *res, size_t start,
static int vmw_surface_dirty_sync(struct vmw_resource *res)
{
struct vmw_private *dev_priv = res->dev_priv;
- bool has_dx = 0;
u32 i, num_dirty;
struct vmw_surface_dirty *dirty =
(struct vmw_surface_dirty *) res->dirty;
@@ -1910,7 +1909,7 @@ static int vmw_surface_dirty_sync(struct vmw_resource *res)
if (!num_dirty)
goto out;
- alloc_size = num_dirty * ((has_dx) ? sizeof(*cmd1) : sizeof(*cmd2));
+ alloc_size = num_dirty * ((has_sm4_context(dev_priv)) ? sizeof(*cmd1) : sizeof(*cmd2));
cmd = VMW_CMD_RESERVE(dev_priv, alloc_size);
if (!cmd)
return -ENOMEM;
@@ -1928,7 +1927,7 @@ static int vmw_surface_dirty_sync(struct vmw_resource *res)
* DX_UPDATE_SUBRESOURCE is aware of array surfaces.
* UPDATE_GB_IMAGE is not.
*/
- if (has_dx) {
+ if (has_sm4_context(dev_priv)) {
cmd1->header.id = SVGA_3D_CMD_DX_UPDATE_SUBRESOURCE;
cmd1->header.size = sizeof(cmd1->body);
cmd1->body.sid = res->id;
--
2.30.2
From: Guillaume Nault <gnault(a)redhat.com>
__peernet2id() can be protected by RCU as it only calls idr_for_each(),
which is RCU-safe, and never modifies the nsid table.
rtnl_net_dumpid() can also do lockless lookups. It does two nested
idr_for_each() calls on nsid tables (one direct call and one indirect
call because of rtnl_net_dumpid_one() calling __peernet2id()). The
netnsid tables are never updated. Therefore it is safe to not take the
nsid_lock and run within an RCU-critical section instead.
Signed-off-by: Guillaume Nault <gnault(a)redhat.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
A nice side-effect of replacing spin_{lock,unlock}_bh() with
rcu_read_{lock,unlock}() in peernet2id() is that it avoids the
situation where SoftIRQs get enabled whilst IRQs are turned off.
From bugzilla.redhat.com/show_bug.cgi?id=1384179 (an ancient
4.9.0-0.rc0 kernel):
dump_stack+0x86/0xc3
__warn+0xcb/0xf0
warn_slowpath_null+0x1d/0x20
__local_bh_enable_ip+0x9d/0xc0
_raw_spin_unlock_bh+0x35/0x40
peernet2id+0x54/0x80
netlink_broadcast_filtered+0x220/0x3c0
netlink_broadcast+0x1d/0x20
audit_log+0x6a/0x90
security_set_bools+0xee/0x200
[]
Note, security_set_bools() calls write_lock_irq(). peernet2id() calls
spin_unlock_bh().
From an internal (UEK) stack trace based on the v4.14.35 kernel (LTS
4.14.231):
queued_spin_lock_slowpath+0xb/0xf
_raw_spin_lock_irqsave+0x46/0x48
send_mad+0x3d2/0x590 [ib_core]
ib_sa_path_rec_get+0x223/0x4d0 [ib_core]
path_rec_start+0xa3/0x140 [ib_ipoib]
ipoib_start_xmit+0x2b0/0x6a0 [ib_ipoib]
dev_hard_start_xmit+0xb2/0x237
sch_direct_xmit+0x114/0x1bf
__dev_queue_xmit+0x592/0x818
dev_queue_xmit+0x10/0x12
arp_xmit+0x38/0xa6
arp_send_dst.part.16+0x61/0x84
arp_process+0x825/0x889
arp_rcv+0x140/0x1c9
__netif_receive_skb_core+0x401/0xb39
__netif_receive_skb+0x18/0x59
netif_receive_skb_internal+0x45/0x119
napi_gro_receive+0xd8/0xf6
ipoib_ib_handle_rx_wc+0x1ca/0x520 [ib_ipoib]
ipoib_poll+0xcd/0x150 [ib_ipoib]
net_rx_action+0x289/0x3f4
__do_softirq+0xe1/0x2b5
do_softirq_own_stack+0x2a/0x35
</IRQ>
do_softirq+0x4d/0x6a
__local_bh_enable_ip+0x57/0x59
_raw_spin_unlock_bh+0x23/0x25
peernet2id+0x51/0x73
netlink_broadcast_filtered+0x223/0x41b
netlink_broadcast+0x1d/0x1f
rdma_nl_multicast+0x22/0x30 [ib_core]
send_mad+0x3e5/0x590 [ib_core]
ib_sa_path_rec_get+0x223/0x4d0 [ib_core]
rdma_resolve_route+0x287/0x810 [rdma_cm]
rds_rdma_cm_event_handler_cmn+0x311/0x7d0 [rds_rdma]
rds_rdma_cm_event_handler_worker+0x22/0x30 [rds_rdma]
process_one_work+0x169/0x3a6
worker_thread+0x4d/0x3e5
kthread+0x105/0x138
ret_from_fork+0x24/0x49
Here, pay attention to ib_nl_make_request() which calls
spin_lock_irqsave() on a global lock just before calling
rdma_nl_multicast(). Thereafter, peernet2id() enables SoftIRQs, and
ipoib starts and calls the same path and ends up trying to acquire the
same global lock again.
(cherry picked from commit 2dce224f469f060b9998a5a869151ef83c08ce77)
Fixes: fba143c66abb ("netns: avoid disabling irq for netns id")
Signed-off-by: Håkon Bugge <haakon.bugge(a)oracle.com>
Conflicts:
net/core/net_namespace.c
* rtnl_valid_dump_net_req() has a very minimal
implementation in 4.14, hence only a simple
substitution of spin_{lock,unlock}_bh() with
rcu_read_{lock,unlock}() was required
---
net/core/net_namespace.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 1af25d5..37f4313 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -181,9 +181,9 @@ static int net_eq_idr(int id, void *net, void *peer)
return 0;
}
-/* Should be called with nsid_lock held. If a new id is assigned, the bool alloc
- * is set to true, thus the caller knows that the new id must be notified via
- * rtnl.
+/* Must be called from RCU-critical section or with nsid_lock held. If
+ * a new id is assigned, the bool alloc is set to true, thus the
+ * caller knows that the new id must be notified via rtnl.
*/
static int __peernet2id_alloc(struct net *net, struct net *peer, bool *alloc)
{
@@ -207,7 +207,7 @@ static int __peernet2id_alloc(struct net *net, struct net *peer, bool *alloc)
return NETNSA_NSID_NOT_ASSIGNED;
}
-/* should be called with nsid_lock held */
+/* Must be called from RCU-critical section or with nsid_lock held */
static int __peernet2id(struct net *net, struct net *peer)
{
bool no = false;
@@ -240,9 +240,10 @@ int peernet2id(struct net *net, struct net *peer)
{
int id;
- spin_lock_bh(&net->nsid_lock);
+ rcu_read_lock();
id = __peernet2id(net, peer);
- spin_unlock_bh(&net->nsid_lock);
+ rcu_read_unlock();
+
return id;
}
EXPORT_SYMBOL(peernet2id);
@@ -761,6 +762,7 @@ struct rtnl_net_dump_cb {
int s_idx;
};
+/* Runs in RCU-critical section. */
static int rtnl_net_dumpid_one(int id, void *peer, void *data)
{
struct rtnl_net_dump_cb *net_cb = (struct rtnl_net_dump_cb *)data;
@@ -791,9 +793,9 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
.s_idx = cb->args[0],
};
- spin_lock_bh(&net->nsid_lock);
+ rcu_read_lock();
idr_for_each(&net->netns_ids, rtnl_net_dumpid_one, &net_cb);
- spin_unlock_bh(&net->nsid_lock);
+ rcu_read_unlock();
cb->args[0] = net_cb.idx;
return skb->len;
--
1.8.3.1
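For readability, here is the read-side pattern the hunks above converge on,
condensed into one place (this only restates the diff, it is not an
additional change; __peernet2id() is the existing helper in
net/core/net_namespace.c):

    /* Lockless nsid lookup: the netns id table is only read here, so an
     * RCU read-side critical section is sufficient and, unlike
     * spin_unlock_bh(), it never re-enables SoftIRQs behind the caller's
     * back.
     */
    int peernet2id(struct net *net, struct net *peer)
    {
            int id;

            rcu_read_lock();
            id = __peernet2id(net, peer);
            rcu_read_unlock();

            return id;
    }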
From: Guillaume Nault <gnault(a)redhat.com>
__peernet2id() can be protected by RCU as it only calls idr_for_each(),
which is RCU-safe, and never modifies the nsid table.
rtnl_net_dumpid() can also do lockless lookups. It does two nested
idr_for_each() calls on nsid tables (one direct call and one indirect
call because of rtnl_net_dumpid_one() calling __peernet2id()). The
netnsid tables are never updated. Therefore it is safe to not take the
nsid_lock and run within an RCU-critical section instead.
Signed-off-by: Guillaume Nault <gnault(a)redhat.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
A nice side-effect of replacing spin_{lock,unlock}_bh() with
rcu_read_{lock,unlock}() in peernet2id() is that it avoids the
situation where SoftIRQs get enabled whilst IRQs are turned off.
From bugzilla.redhat.com/show_bug.cgi?id=1384179 (an ancient
4.9.0-0.rc0 kernel):
dump_stack+0x86/0xc3
__warn+0xcb/0xf0
warn_slowpath_null+0x1d/0x20
__local_bh_enable_ip+0x9d/0xc0
_raw_spin_unlock_bh+0x35/0x40
peernet2id+0x54/0x80
netlink_broadcast_filtered+0x220/0x3c0
netlink_broadcast+0x1d/0x20
audit_log+0x6a/0x90
security_set_bools+0xee/0x200
[]
Note, security_set_bools() calls write_lock_irq(). peernet2id() calls
spin_unlock_bh().
From an internal (UEK) stack trace based on the v4.14.35 kernel (LTS
4.14.231):
queued_spin_lock_slowpath+0xb/0xf
_raw_spin_lock_irqsave+0x46/0x48
send_mad+0x3d2/0x590 [ib_core]
ib_sa_path_rec_get+0x223/0x4d0 [ib_core]
path_rec_start+0xa3/0x140 [ib_ipoib]
ipoib_start_xmit+0x2b0/0x6a0 [ib_ipoib]
dev_hard_start_xmit+0xb2/0x237
sch_direct_xmit+0x114/0x1bf
__dev_queue_xmit+0x592/0x818
dev_queue_xmit+0x10/0x12
arp_xmit+0x38/0xa6
arp_send_dst.part.16+0x61/0x84
arp_process+0x825/0x889
arp_rcv+0x140/0x1c9
__netif_receive_skb_core+0x401/0xb39
__netif_receive_skb+0x18/0x59
netif_receive_skb_internal+0x45/0x119
napi_gro_receive+0xd8/0xf6
ipoib_ib_handle_rx_wc+0x1ca/0x520 [ib_ipoib]
ipoib_poll+0xcd/0x150 [ib_ipoib]
net_rx_action+0x289/0x3f4
__do_softirq+0xe1/0x2b5
do_softirq_own_stack+0x2a/0x35
</IRQ>
do_softirq+0x4d/0x6a
__local_bh_enable_ip+0x57/0x59
_raw_spin_unlock_bh+0x23/0x25
peernet2id+0x51/0x73
netlink_broadcast_filtered+0x223/0x41b
netlink_broadcast+0x1d/0x1f
rdma_nl_multicast+0x22/0x30 [ib_core]
send_mad+0x3e5/0x590 [ib_core]
ib_sa_path_rec_get+0x223/0x4d0 [ib_core]
rdma_resolve_route+0x287/0x810 [rdma_cm]
rds_rdma_cm_event_handler_cmn+0x311/0x7d0 [rds_rdma]
rds_rdma_cm_event_handler_worker+0x22/0x30 [rds_rdma]
process_one_work+0x169/0x3a6
worker_thread+0x4d/0x3e5
kthread+0x105/0x138
ret_from_fork+0x24/0x49
Here, pay attention to ib_nl_make_request() which calls
spin_lock_irqsave() on a global lock just before calling
rdma_nl_multicast(). Thereafter, peernet2id() enables SoftIRQs, and
ipoib starts and calls the same path and ends up trying to acquire the
same global lock again.
(cherry picked from commit 2dce224f469f060b9998a5a869151ef83c08ce77)
Fixes: fba143c66abb ("netns: avoid disabling irq for netns id")
Signed-off-by: Håkon Bugge <haakon.bugge(a)oracle.com>
Conflicts:
net/core/net_namespace.c
* rtnl_valid_dump_net_req() has a very minimal
implementation in 4.19, hence only a simple
substitution of spin_{lock,unlock}_bh() with
rcu_read_{lock,unlock}() was required
---
net/core/net_namespace.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 939d8a3..26d70c0 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -192,9 +192,9 @@ static int net_eq_idr(int id, void *net, void *peer)
return 0;
}
-/* Should be called with nsid_lock held. If a new id is assigned, the bool alloc
- * is set to true, thus the caller knows that the new id must be notified via
- * rtnl.
+/* Must be called from RCU-critical section or with nsid_lock held. If
+ * a new id is assigned, the bool alloc is set to true, thus the
+ * caller knows that the new id must be notified via rtnl.
*/
static int __peernet2id_alloc(struct net *net, struct net *peer, bool *alloc)
{
@@ -218,7 +218,7 @@ static int __peernet2id_alloc(struct net *net, struct net *peer, bool *alloc)
return NETNSA_NSID_NOT_ASSIGNED;
}
-/* should be called with nsid_lock held */
+/* Must be called from RCU-critical section or with nsid_lock held */
static int __peernet2id(struct net *net, struct net *peer)
{
bool no = false;
@@ -261,9 +261,10 @@ int peernet2id(struct net *net, struct net *peer)
{
int id;
- spin_lock_bh(&net->nsid_lock);
+ rcu_read_lock();
id = __peernet2id(net, peer);
- spin_unlock_bh(&net->nsid_lock);
+ rcu_read_unlock();
+
return id;
}
EXPORT_SYMBOL(peernet2id);
@@ -837,6 +838,7 @@ struct rtnl_net_dump_cb {
int s_idx;
};
+/* Runs in RCU-critical section. */
static int rtnl_net_dumpid_one(int id, void *peer, void *data)
{
struct rtnl_net_dump_cb *net_cb = (struct rtnl_net_dump_cb *)data;
@@ -867,9 +869,9 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
.s_idx = cb->args[0],
};
- spin_lock_bh(&net->nsid_lock);
+ rcu_read_lock();
idr_for_each(&net->netns_ids, rtnl_net_dumpid_one, &net_cb);
- spin_unlock_bh(&net->nsid_lock);
+ rcu_read_unlock();
cb->args[0] = net_cb.idx;
return skb->len;
--
1.8.3.1
From: Miquel Raynal <miquel.raynal(a)bootlin.com>
[ Upstream commit 778cb8e39f6ec252be50fc3850d66f3dcbd5dd5a ]
"PAGESIZE / 512" is the number of ECC chunks.
"ECC_BYTES" is the number of bytes needed to store a single ECC code.
"2" is the space reserved by the bad block marker.
"2 + (PAGESIZE / 512) * ECC_BYTES" should of course be lower or equal
than the total number of OOB bytes, otherwise it won't fit.
Fix the equation by substituting s/>=/<=/.
Suggested-by: Ryan J. Barnett <ryan.barnett(a)collins.com>
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
Acked-by: Rob Herring <robh(a)kernel.org>
Link: https://lore.kernel.org/linux-mtd/20210610143945.3504781-1-miquel.raynal@bo…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
Documentation/devicetree/bindings/mtd/gpmc-nand.txt | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Documentation/devicetree/bindings/mtd/gpmc-nand.txt b/Documentation/devicetree/bindings/mtd/gpmc-nand.txt
index fb733c4e1c11..3a58fdf0c566 100644
--- a/Documentation/devicetree/bindings/mtd/gpmc-nand.txt
+++ b/Documentation/devicetree/bindings/mtd/gpmc-nand.txt
@@ -112,7 +112,7 @@ on various other factors also like;
so the device should have enough free bytes available its OOB/Spare
area to accommodate ECC for entire page. In general following expression
helps in determining if given device can accommodate ECC syndrome:
- "2 + (PAGESIZE / 512) * ECC_BYTES" >= OOBSIZE"
+ "2 + (PAGESIZE / 512) * ECC_BYTES" <= OOBSIZE"
where
OOBSIZE number of bytes in OOB/spare area
PAGESIZE number of bytes in main-area of device page
--
2.30.2
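As a quick sanity check of the corrected inequality (the numbers below are
only an illustration, not taken from the patch): assuming a 4096-byte page
and, say, 14 ECC bytes per 512-byte chunk, the expression evaluates to
2 + (4096 / 512) * 14 = 2 + 8 * 14 = 114 bytes of spare area needed. With a
typical 224-byte OOB area this satisfies "... <= OOBSIZE", i.e. the ECC
syndrome fits; the old ">=" form stated the condition backwards.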
From: Miquel Raynal <miquel.raynal(a)bootlin.com>
[ Upstream commit 778cb8e39f6ec252be50fc3850d66f3dcbd5dd5a ]
"PAGESIZE / 512" is the number of ECC chunks.
"ECC_BYTES" is the number of bytes needed to store a single ECC code.
"2" is the space reserved by the bad block marker.
"2 + (PAGESIZE / 512) * ECC_BYTES" should of course be lower or equal
than the total number of OOB bytes, otherwise it won't fit.
Fix the equation by substituting s/>=/<=/.
Suggested-by: Ryan J. Barnett <ryan.barnett(a)collins.com>
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
Acked-by: Rob Herring <robh(a)kernel.org>
Link: https://lore.kernel.org/linux-mtd/20210610143945.3504781-1-miquel.raynal@bo…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
Documentation/devicetree/bindings/mtd/gpmc-nand.txt | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Documentation/devicetree/bindings/mtd/gpmc-nand.txt b/Documentation/devicetree/bindings/mtd/gpmc-nand.txt
index 174f68c26c1b..34981b98d807 100644
--- a/Documentation/devicetree/bindings/mtd/gpmc-nand.txt
+++ b/Documentation/devicetree/bindings/mtd/gpmc-nand.txt
@@ -123,7 +123,7 @@ on various other factors also like;
so the device should have enough free bytes available its OOB/Spare
area to accommodate ECC for entire page. In general following expression
helps in determining if given device can accommodate ECC syndrome:
- "2 + (PAGESIZE / 512) * ECC_BYTES" >= OOBSIZE"
+ "2 + (PAGESIZE / 512) * ECC_BYTES" <= OOBSIZE"
where
OOBSIZE number of bytes in OOB/spare area
PAGESIZE number of bytes in main-area of device page
--
2.30.2
From: Miquel Raynal <miquel.raynal(a)bootlin.com>
[ Upstream commit 778cb8e39f6ec252be50fc3850d66f3dcbd5dd5a ]
"PAGESIZE / 512" is the number of ECC chunks.
"ECC_BYTES" is the number of bytes needed to store a single ECC code.
"2" is the space reserved by the bad block marker.
"2 + (PAGESIZE / 512) * ECC_BYTES" should of course be lower or equal
than the total number of OOB bytes, otherwise it won't fit.
Fix the equation by substituting s/>=/<=/.
Suggested-by: Ryan J. Barnett <ryan.barnett(a)collins.com>
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
Acked-by: Rob Herring <robh(a)kernel.org>
Link: https://lore.kernel.org/linux-mtd/20210610143945.3504781-1-miquel.raynal@bo…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
Documentation/devicetree/bindings/mtd/gpmc-nand.txt | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Documentation/devicetree/bindings/mtd/gpmc-nand.txt b/Documentation/devicetree/bindings/mtd/gpmc-nand.txt
index dd559045593d..d2d1bae63a36 100644
--- a/Documentation/devicetree/bindings/mtd/gpmc-nand.txt
+++ b/Documentation/devicetree/bindings/mtd/gpmc-nand.txt
@@ -123,7 +123,7 @@ on various other factors also like;
so the device should have enough free bytes available its OOB/Spare
area to accommodate ECC for entire page. In general following expression
helps in determining if given device can accommodate ECC syndrome:
- "2 + (PAGESIZE / 512) * ECC_BYTES" >= OOBSIZE"
+ "2 + (PAGESIZE / 512) * ECC_BYTES" <= OOBSIZE"
where
OOBSIZE number of bytes in OOB/spare area
PAGESIZE number of bytes in main-area of device page
--
2.30.2
From: Miquel Raynal <miquel.raynal(a)bootlin.com>
[ Upstream commit 778cb8e39f6ec252be50fc3850d66f3dcbd5dd5a ]
"PAGESIZE / 512" is the number of ECC chunks.
"ECC_BYTES" is the number of bytes needed to store a single ECC code.
"2" is the space reserved by the bad block marker.
"2 + (PAGESIZE / 512) * ECC_BYTES" should of course be lower or equal
than the total number of OOB bytes, otherwise it won't fit.
Fix the equation by substituting s/>=/<=/.
Suggested-by: Ryan J. Barnett <ryan.barnett(a)collins.com>
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
Acked-by: Rob Herring <robh(a)kernel.org>
Link: https://lore.kernel.org/linux-mtd/20210610143945.3504781-1-miquel.raynal@bo…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
Documentation/devicetree/bindings/mtd/gpmc-nand.txt | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Documentation/devicetree/bindings/mtd/gpmc-nand.txt b/Documentation/devicetree/bindings/mtd/gpmc-nand.txt
index c059ab74ed88..a4a75fa79524 100644
--- a/Documentation/devicetree/bindings/mtd/gpmc-nand.txt
+++ b/Documentation/devicetree/bindings/mtd/gpmc-nand.txt
@@ -122,7 +122,7 @@ on various other factors also like;
so the device should have enough free bytes available its OOB/Spare
area to accommodate ECC for entire page. In general following expression
helps in determining if given device can accommodate ECC syndrome:
- "2 + (PAGESIZE / 512) * ECC_BYTES" >= OOBSIZE"
+ "2 + (PAGESIZE / 512) * ECC_BYTES" <= OOBSIZE"
where
OOBSIZE number of bytes in OOB/spare area
PAGESIZE number of bytes in main-area of device page
--
2.30.2
From: Miquel Raynal <miquel.raynal(a)bootlin.com>
[ Upstream commit 778cb8e39f6ec252be50fc3850d66f3dcbd5dd5a ]
"PAGESIZE / 512" is the number of ECC chunks.
"ECC_BYTES" is the number of bytes needed to store a single ECC code.
"2" is the space reserved by the bad block marker.
"2 + (PAGESIZE / 512) * ECC_BYTES" should of course be lower or equal
than the total number of OOB bytes, otherwise it won't fit.
Fix the equation by substituting s/>=/<=/.
Suggested-by: Ryan J. Barnett <ryan.barnett(a)collins.com>
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
Acked-by: Rob Herring <robh(a)kernel.org>
Link: https://lore.kernel.org/linux-mtd/20210610143945.3504781-1-miquel.raynal@bo…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
Documentation/devicetree/bindings/mtd/gpmc-nand.txt | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Documentation/devicetree/bindings/mtd/gpmc-nand.txt b/Documentation/devicetree/bindings/mtd/gpmc-nand.txt
index 44919d48d241..c459f169a904 100644
--- a/Documentation/devicetree/bindings/mtd/gpmc-nand.txt
+++ b/Documentation/devicetree/bindings/mtd/gpmc-nand.txt
@@ -122,7 +122,7 @@ on various other factors also like;
so the device should have enough free bytes available its OOB/Spare
area to accommodate ECC for entire page. In general following expression
helps in determining if given device can accommodate ECC syndrome:
- "2 + (PAGESIZE / 512) * ECC_BYTES" >= OOBSIZE"
+ "2 + (PAGESIZE / 512) * ECC_BYTES" <= OOBSIZE"
where
OOBSIZE number of bytes in OOB/spare area
PAGESIZE number of bytes in main-area of device page
--
2.30.2
Hello,
please consider applying these nf_tables fixes to the 5.10.y tree.
These patches had to be mangled to make them apply to 5.10.y.
I've done the following tests in a kasan/kmemleak enabled VM:
1. Ran the upstream nft python/shell tests.
Without patches 2 and 3, doing so results in a kernel crash.
Some tests fail, but AFAICS those are expected to
fail on 5.10 due to the lack of the feature being tested.
2. Tested the 'conncount' feature (it's affected by the last patch).
Worked as designed.
3. Ran the nftables-related kernel selftests.
No kmemleak or kasan splats were seen.
Eric Dumazet (1):
netfilter: nftables: avoid potential overflows on 32bit arches
Pablo Neira Ayuso (2):
netfilter: nf_tables: initialize set before expression setup
netfilter: nftables: clone set element expression template
net/netfilter/nf_tables_api.c | 89 ++++++++++++++++++++++-------------
net/netfilter/nft_set_hash.c | 10 ++--
2 files changed, 62 insertions(+), 37 deletions(-)
--
2.32.0
Hello! I would like to request backporting of the following ath9k commits,
which fix the CVE-2020-3702 issue.
56c5485c9e44 ("ath: Use safer key clearing with key cache entries")
73488cb2fa3b ("ath9k: Clear key cache explicitly on disabling hardware")
d2d3e36498dd ("ath: Export ath_hw_keysetmac()")
144cd24dbc36 ("ath: Modify ath_key_delete() to not need full key entry")
ca2848022c12 ("ath9k: Postpone key cache entry deletion for TXQ frames reference it")
See also:
https://lore.kernel.org/linux-wireless/87o8hvlx5g.fsf@codeaurora.org/
This CVE-2020-3702 issue affects the ath9k driver in stable kernel versions.
Due to this issue, Qualcomm suggests not using the open-source ath9k
driver and instead using their proprietary driver, which does not have
this issue.
Details about CVE-2020-3702 are described on the ESET blog post:
https://www.welivesecurity.com/2020/08/06/beyond-kr00k-even-more-wifi-chips…
Two months ago ESET tested the above-mentioned commits applied on top of
the 4.14 stable tree and confirmed that the issue can no longer be
reproduced with those patches. The commits applied cleanly on top of the
4.14 stable tree without any modification.
A prior patch inadvertently caused lpfc_sli_sum_iocb to exclude
counting of outstanding aborted I/Os and ABORT iocbs. Thus,
lpfc_reset_flush_io_context called from any TMF routine does not
properly wait to flush all outstanding FCP iocbs leading to a
block layer crash on an invalid scsi_cmnd->request pointer.
kernel BUG at ../block/blk-core.c:1489!
RIP: 0010:blk_requeue_request+0xaf/0xc0
...
Call Trace:
<IRQ>
__scsi_queue_insert+0x90/0xe0 [scsi_mod]
blk_done_softirq+0x7e/0x90
__do_softirq+0xd2/0x280
irq_exit+0xd5/0xe0
do_IRQ+0x4c/0xd0
common_interrupt+0x87/0x87
</IRQ>
Fix by separating out the LPFC_IO_FCP, LPFC_IO_ON_TXCMPLQ,
LPFC_DRIVER_ABORTED, and CMD_ABORT_XRI_CN || CMD_CLOSE_XRI_CN checks
into a new lpfc_sli_validate_fcp_iocb_for_abort routine when determining
to build an ABORT iocb.
Restore lpfc_reset_flush_io_context functionality by including
counting of outstanding aborted iocbs and ABORT iocbs in lpfc_sli_sum_iocb.
Fixes: e1364711359f ("scsi: lpfc: Fix illegal memory access on Abort IOCBs")
Cc: <stable(a)vger.kernel.org> # v5.12+
Co-developed-by: Justin Tee <justin.tee(a)broadcom.com>
Signed-off-by: Justin Tee <justin.tee(a)broadcom.com>
Signed-off-by: James Smart <jsmart2021(a)gmail.com>
---
drivers/scsi/lpfc/lpfc_sli.c | 101 +++++++++++++++++++++++++++--------
1 file changed, 78 insertions(+), 23 deletions(-)
diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c
index 546c851938bc..e8f6ad484768 100644
--- a/drivers/scsi/lpfc/lpfc_sli.c
+++ b/drivers/scsi/lpfc/lpfc_sli.c
@@ -12485,15 +12485,54 @@ lpfc_sli_hba_iocb_abort(struct lpfc_hba *phba)
}
/**
- * lpfc_sli_validate_fcp_iocb - find commands associated with a vport or LUN
+ * lpfc_sli_validate_fcp_iocb_for_abort - filter iocbs appropriate for FCP aborts
+ * @iocbq: Pointer to iocb object.
+ * @vport: Pointer to driver virtual port object.
+ *
+ * This function acts as an iocb filter for functions which abort FCP iocbs.
+ *
+ * Return values
+ * -ENODEV, if a null iocb or vport ptr is encountered
+ * -EINVAL, if the iocb is not an FCP I/O, not on the TX cmpl queue, premarked as
+ * driver already started the abort process, or is an abort iocb itself
+ * 0, passes criteria for aborting the FCP I/O iocb
+ **/
+static int
+lpfc_sli_validate_fcp_iocb_for_abort(struct lpfc_iocbq *iocbq,
+ struct lpfc_vport *vport)
+{
+ IOCB_t *icmd = NULL;
+
+ /* No null ptr vports */
+ if (!iocbq || iocbq->vport != vport)
+ return -ENODEV;
+
+ /* iocb must be for FCP IO, already exists on the TX cmpl queue,
+ * can't be premarked as driver aborted, nor be an ABORT iocb itself
+ */
+ icmd = &iocbq->iocb;
+ if (!(iocbq->iocb_flag & LPFC_IO_FCP) ||
+ !(iocbq->iocb_flag & LPFC_IO_ON_TXCMPLQ) ||
+ (iocbq->iocb_flag & LPFC_DRIVER_ABORTED) ||
+ (icmd->ulpCommand == CMD_ABORT_XRI_CN ||
+ icmd->ulpCommand == CMD_CLOSE_XRI_CN))
+ return -EINVAL;
+
+ return 0;
+}
+
+/**
+ * lpfc_sli_validate_fcp_iocb - validate commands associated with a SCSI target
* @iocbq: Pointer to driver iocb object.
* @vport: Pointer to driver virtual port object.
* @tgt_id: SCSI ID of the target.
* @lun_id: LUN ID of the scsi device.
* @ctx_cmd: LPFC_CTX_LUN/LPFC_CTX_TGT/LPFC_CTX_HOST
*
- * This function acts as an iocb filter for functions which abort or count
- * all FCP iocbs pending on a lun/SCSI target/SCSI host. It will return
+ * This function acts as an iocb filter for validating a lun/SCSI target/SCSI
+ * host.
+ *
+ * It will return
* 0 if the filtering criteria is met for the given iocb and will return
* 1 if the filtering criteria is not met.
* If ctx_cmd == LPFC_CTX_LUN, the function returns 0 only if the
@@ -12512,22 +12551,8 @@ lpfc_sli_validate_fcp_iocb(struct lpfc_iocbq *iocbq, struct lpfc_vport *vport,
lpfc_ctx_cmd ctx_cmd)
{
struct lpfc_io_buf *lpfc_cmd;
- IOCB_t *icmd = NULL;
int rc = 1;
- if (!iocbq || iocbq->vport != vport)
- return rc;
-
- if (!(iocbq->iocb_flag & LPFC_IO_FCP) ||
- !(iocbq->iocb_flag & LPFC_IO_ON_TXCMPLQ) ||
- iocbq->iocb_flag & LPFC_DRIVER_ABORTED)
- return rc;
-
- icmd = &iocbq->iocb;
- if (icmd->ulpCommand == CMD_ABORT_XRI_CN ||
- icmd->ulpCommand == CMD_CLOSE_XRI_CN)
- return rc;
-
lpfc_cmd = container_of(iocbq, struct lpfc_io_buf, cur_iocbq);
if (lpfc_cmd->pCmd == NULL)
@@ -12582,17 +12607,33 @@ lpfc_sli_sum_iocb(struct lpfc_vport *vport, uint16_t tgt_id, uint64_t lun_id,
{
struct lpfc_hba *phba = vport->phba;
struct lpfc_iocbq *iocbq;
+ IOCB_t *icmd = NULL;
int sum, i;
+ unsigned long iflags;
- spin_lock_irq(&phba->hbalock);
+ spin_lock_irqsave(&phba->hbalock, iflags);
for (i = 1, sum = 0; i <= phba->sli.last_iotag; i++) {
iocbq = phba->sli.iocbq_lookup[i];
- if (lpfc_sli_validate_fcp_iocb (iocbq, vport, tgt_id, lun_id,
- ctx_cmd) == 0)
+ if (!iocbq || iocbq->vport != vport)
+ continue;
+ if (!(iocbq->iocb_flag & LPFC_IO_FCP) ||
+ !(iocbq->iocb_flag & LPFC_IO_ON_TXCMPLQ))
+ continue;
+
+ /* Include counting outstanding aborts */
+ icmd = &iocbq->iocb;
+ if (icmd->ulpCommand == CMD_ABORT_XRI_CN ||
+ icmd->ulpCommand == CMD_CLOSE_XRI_CN) {
+ sum++;
+ continue;
+ }
+
+ if (lpfc_sli_validate_fcp_iocb(iocbq, vport, tgt_id, lun_id,
+ ctx_cmd) == 0)
sum++;
}
- spin_unlock_irq(&phba->hbalock);
+ spin_unlock_irqrestore(&phba->hbalock, iflags);
return sum;
}
@@ -12659,7 +12700,11 @@ lpfc_sli_abort_fcp_cmpl(struct lpfc_hba *phba, struct lpfc_iocbq *cmdiocb,
*
* This function sends an abort command for every SCSI command
* associated with the given virtual port pending on the ring
- * filtered by lpfc_sli_validate_fcp_iocb function.
+ * filtered by lpfc_sli_validate_fcp_iocb_for_abort and then
+ * lpfc_sli_validate_fcp_iocb function. The ordering for validation before
+ * submitting abort iocbs must be lpfc_sli_validate_fcp_iocb_for_abort
+ * followed by lpfc_sli_validate_fcp_iocb.
+ *
* When abort_cmd == LPFC_CTX_LUN, the function sends abort only to the
* FCP iocbs associated with lun specified by tgt_id and lun_id
* parameters
@@ -12691,6 +12736,9 @@ lpfc_sli_abort_iocb(struct lpfc_vport *vport, u16 tgt_id, u64 lun_id,
for (i = 1; i <= phba->sli.last_iotag; i++) {
iocbq = phba->sli.iocbq_lookup[i];
+ if (lpfc_sli_validate_fcp_iocb_for_abort(iocbq, vport))
+ continue;
+
if (lpfc_sli_validate_fcp_iocb(iocbq, vport, tgt_id, lun_id,
abort_cmd) != 0)
continue;
@@ -12723,7 +12771,11 @@ lpfc_sli_abort_iocb(struct lpfc_vport *vport, u16 tgt_id, u64 lun_id,
*
* This function sends an abort command for every SCSI command
* associated with the given virtual port pending on the ring
- * filtered by lpfc_sli_validate_fcp_iocb function.
+ * filtered by lpfc_sli_validate_fcp_iocb_for_abort and then
+ * lpfc_sli_validate_fcp_iocb function. The ordering for validation before
+ * submitting abort iocbs must be lpfc_sli_validate_fcp_iocb_for_abort
+ * followed by lpfc_sli_validate_fcp_iocb.
+ *
* When taskmgmt_cmd == LPFC_CTX_LUN, the function sends abort only to the
* FCP iocbs associated with lun specified by tgt_id and lun_id
* parameters
@@ -12761,6 +12813,9 @@ lpfc_sli_abort_taskmgmt(struct lpfc_vport *vport, struct lpfc_sli_ring *pring,
for (i = 1; i <= phba->sli.last_iotag; i++) {
iocbq = phba->sli.iocbq_lookup[i];
+ if (lpfc_sli_validate_fcp_iocb_for_abort(iocbq, vport))
+ continue;
+
if (lpfc_sli_validate_fcp_iocb(iocbq, vport, tgt_id, lun_id,
cmd) != 0)
continue;
--
2.26.2
In a rarely executed path, FLOGI failure, there is a refcounting error.
If FLOGI completed with an error, typically a timeout, the initial
completion handler would remove the job reference. However, the job
completion isn't the actual end of the job/exchange as the timeout
usually initiates an ABTS, and upon that ABTS completion, a final
completion is sent. The driver removes the reference again in the
final completion. Thus the imbalance.
In the buggy cases, if there was a link bounce while the delayed
response was outstanding, the fport node may be referenced again,
but no additional reference is taken as one is already present. The
delayed completion then occurs and removes the last reference, freeing
the node and causing issues in the link-up processing that is using the
node.
Fix this scenario by removing the snippet that dropped the reference
in the initial FLOGI completion. That snippet tried, poorly, to decide
that the drop was safe by checking that the node was not registered
with either the SCSI or NVMe transport.
Fixes: 618e2ee146d4 ("scsi: lpfc: Fix FLOGI failure due to accessing a freed node")
Cc: <stable(a)vger.kernel.org> # v5.13+
Co-developed-by: Justin Tee <justin.tee(a)broadcom.com>
Signed-off-by: Justin Tee <justin.tee(a)broadcom.com>
Signed-off-by: James Smart <jsmart2021(a)gmail.com>
---
drivers/scsi/lpfc/lpfc_els.c | 11 +++++------
drivers/scsi/lpfc/lpfc_hbadisc.c | 10 ++++++----
drivers/scsi/lpfc/lpfc_nvme.c | 5 +++--
3 files changed, 14 insertions(+), 12 deletions(-)
diff --git a/drivers/scsi/lpfc/lpfc_els.c b/drivers/scsi/lpfc/lpfc_els.c
index 1254a575fd47..df5fc223ddb2 100644
--- a/drivers/scsi/lpfc/lpfc_els.c
+++ b/drivers/scsi/lpfc/lpfc_els.c
@@ -1059,9 +1059,10 @@ lpfc_cmpl_els_flogi(struct lpfc_hba *phba, struct lpfc_iocbq *cmdiocb,
lpfc_printf_vlog(vport, KERN_WARNING, LOG_TRACE_EVENT,
"0150 FLOGI failure Status:x%x/x%x "
- "xri x%x TMO:x%x\n",
+ "xri x%x TMO:x%x refcnt %d\n",
irsp->ulpStatus, irsp->un.ulpWord[4],
- cmdiocb->sli4_xritag, irsp->ulpTimeout);
+ cmdiocb->sli4_xritag, irsp->ulpTimeout,
+ kref_read(&ndlp->kref));
/* If this is not a loop open failure, bail out */
if (!(irsp->ulpStatus == IOSTAT_LOCAL_REJECT &&
@@ -1122,12 +1123,12 @@ lpfc_cmpl_els_flogi(struct lpfc_hba *phba, struct lpfc_iocbq *cmdiocb,
/* FLOGI completes successfully */
lpfc_printf_vlog(vport, KERN_INFO, LOG_ELS,
"0101 FLOGI completes successfully, I/O tag:x%x, "
- "xri x%x Data: x%x x%x x%x x%x x%x x%x x%x\n",
+ "xri x%x Data: x%x x%x x%x x%x x%x x%x x%x %d\n",
cmdiocb->iotag, cmdiocb->sli4_xritag,
irsp->un.ulpWord[4], sp->cmn.e_d_tov,
sp->cmn.w2.r_a_tov, sp->cmn.edtovResolution,
vport->port_state, vport->fc_flag,
- sp->cmn.priority_tagging);
+ sp->cmn.priority_tagging, kref_read(&ndlp->kref));
if (sp->cmn.priority_tagging)
vport->vmid_flag |= LPFC_VMID_ISSUE_QFPA;
@@ -1205,8 +1206,6 @@ lpfc_cmpl_els_flogi(struct lpfc_hba *phba, struct lpfc_iocbq *cmdiocb,
phba->fcf.fcf_flag &= ~FCF_DISCOVERY;
spin_unlock_irq(&phba->hbalock);
- if (!(ndlp->fc4_xpt_flags & (SCSI_XPT_REGD | NVME_XPT_REGD)))
- lpfc_nlp_put(ndlp);
if (!lpfc_error_lost_link(irsp)) {
/* FLOGI failed, so just use loop map to make discovery list */
lpfc_disc_list_loopmap(vport);
diff --git a/drivers/scsi/lpfc/lpfc_hbadisc.c b/drivers/scsi/lpfc/lpfc_hbadisc.c
index 7195ca0275f9..6f2e07c30f98 100644
--- a/drivers/scsi/lpfc/lpfc_hbadisc.c
+++ b/drivers/scsi/lpfc/lpfc_hbadisc.c
@@ -4449,8 +4449,9 @@ lpfc_register_remote_port(struct lpfc_vport *vport, struct lpfc_nodelist *ndlp)
fc_remote_port_rolechg(rport, rport_ids.roles);
lpfc_printf_vlog(ndlp->vport, KERN_INFO, LOG_NODE,
- "3183 %s rport x%px DID x%x, role x%x\n",
- __func__, rport, rport->port_id, rport->roles);
+ "3183 %s rport x%px DID x%x, role x%x refcnt %d\n",
+ __func__, rport, rport->port_id, rport->roles,
+ kref_read(&ndlp->kref));
if ((rport->scsi_target_id != -1) &&
(rport->scsi_target_id < LPFC_MAX_TARGET)) {
@@ -4475,8 +4476,9 @@ lpfc_unregister_remote_port(struct lpfc_nodelist *ndlp)
lpfc_printf_vlog(vport, KERN_INFO, LOG_NODE,
"3184 rport unregister x%06x, rport x%px "
- "xptflg x%x\n",
- ndlp->nlp_DID, rport, ndlp->fc4_xpt_flags);
+ "xptflg x%x refcnt %d\n",
+ ndlp->nlp_DID, rport, ndlp->fc4_xpt_flags,
+ kref_read(&ndlp->kref));
fc_remote_port_delete(rport);
lpfc_nlp_put(ndlp);
diff --git a/drivers/scsi/lpfc/lpfc_nvme.c b/drivers/scsi/lpfc/lpfc_nvme.c
index 73a3568ff17e..bd88477f9b82 100644
--- a/drivers/scsi/lpfc/lpfc_nvme.c
+++ b/drivers/scsi/lpfc/lpfc_nvme.c
@@ -209,8 +209,9 @@ lpfc_nvme_remoteport_delete(struct nvme_fc_remote_port *remoteport)
* calling state machine to remove the node.
*/
lpfc_printf_vlog(vport, KERN_INFO, LOG_NVME_DISC,
- "6146 remoteport delete of remoteport x%px\n",
- remoteport);
+ "6146 remoteport delete of remoteport x%px, ndlp x%px "
+ "DID x%x xflags x%x\n",
+ remoteport, ndlp, ndlp->nlp_DID, ndlp->fc4_xpt_flags);
spin_lock_irq(&ndlp->lock);
/* The register rebind might have occurred before the delete
--
2.26.2
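To illustrate the reference counting imbalance described above in isolation
(a userspace analogy only, not lpfc code; the structure and numbers below
are made up for the example):

    #include <stdio.h>
    #include <stdlib.h>

    struct node {
            int refs;
    };

    static void node_put(struct node *n)
    {
            if (--n->refs == 0) {
                    printf("node freed\n");
                    free(n);
            }
    }

    int main(void)
    {
            struct node *n = calloc(1, sizeof(*n));

            if (!n)
                    return 1;

            n->refs = 1;    /* held by discovery, like the fport node */
            n->refs++;      /* reference taken when the FLOGI is issued */

            node_put(n);    /* initial completion (timeout path) drops once */
            node_put(n);    /* final completion after the ABTS drops again:
                             * the discovery reference is consumed too, so
                             * the node is gone while link-up still needs it
                             */
            return 0;
    }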
Given the decision to delay cxl_test and some of the related reworks to
the next merge window, here are the broken out fixes that will be
appended to the base-commit noted below. Changes from previous posting
include:
- "cxl/acpi: Do not add DSDT disabled ACPI0016 host bridge ports": Add a
comment about when acpi_pci_find_root() is known to not fail
(Jonathan)
- Fix lockdown reason in cxl_mem_raw_command_allowed() (Ondrej)
- Pick up, with small change log tweaks, Ben's defined, but not used patch
- Fix some 'make docs' warnings (Ben)
---
Alison Schofield (1):
cxl/acpi: Do not add DSDT disabled ACPI0016 host bridge ports
Ben Widawsky (1):
cxl/uapi: Fix defined but not used warnings
Dan Williams (3):
cxl/pci: Fix lockdown level
cxl/pmem: Fix Documentation warning
cxl/registers: Fix Documentation warning
Li Qiang (Johnny Li) (1):
cxl/pci: Fix debug message in cxl_probe_regs()
Documentation/driver-api/cxl/memory-devices.rst | 4 ++-
drivers/cxl/acpi.c | 12 ++++++---
drivers/cxl/core/pmem.c | 30 +++++++++++++++++++++--
drivers/cxl/core/regs.c | 15 +++++++++++-
drivers/cxl/pci.c | 6 ++---
include/uapi/linux/cxl_mem.h | 2 +-
6 files changed, 56 insertions(+), 13 deletions(-)
base-commit: 00ca683e618065e2375b49c91002384735c76d41
Hi Zheyu,
On Mon, Jul 26, 2021 at 12:04 PM Zheyu Ma <zheyuma97(a)gmail.com> wrote:
> The userspace program could pass any values to the driver through
> ioctl() interface. If the driver doesn't check the value of 'pixclock',
> it may cause divide error.
>
> Fix this by checking whether 'pixclock' is zero first.
>
> The following log reveals it:
>
> [ 43.861711] divide error: 0000 [#1] PREEMPT SMP KASAN PTI
> [ 43.861737] CPU: 2 PID: 11764 Comm: i740 Not tainted 5.14.0-rc2-00513-gac532c9bbcfb-dirty #224
> [ 43.861756] RIP: 0010:asiliantfb_check_var+0x4e/0x730
> [ 43.861843] Call Trace:
> [ 43.861848] ? asiliantfb_remove+0x190/0x190
> [ 43.861858] fb_set_var+0x2e4/0xeb0
> [ 43.861866] ? fb_blank+0x1a0/0x1a0
> [ 43.861873] ? lock_acquire+0x1ef/0x530
> [ 43.861884] ? lock_release+0x810/0x810
> [ 43.861892] ? lock_is_held_type+0x100/0x140
> [ 43.861903] ? ___might_sleep+0x1ee/0x2d0
> [ 43.861914] ? __mutex_lock+0x620/0x1190
> [ 43.861921] ? do_fb_ioctl+0x313/0x700
> [ 43.861929] ? mutex_lock_io_nested+0xfa0/0xfa0
> [ 43.861936] ? __this_cpu_preempt_check+0x1d/0x30
> [ 43.861944] ? _raw_spin_unlock_irqrestore+0x46/0x60
> [ 43.861952] ? lockdep_hardirqs_on+0x59/0x100
> [ 43.861959] ? _raw_spin_unlock_irqrestore+0x46/0x60
> [ 43.861967] ? trace_hardirqs_on+0x6a/0x1c0
> [ 43.861978] do_fb_ioctl+0x31e/0x700
>
> Signed-off-by: Zheyu Ma <zheyuma97(a)gmail.com>
Thanks for your patch!
> ---
> Changes in v2:
> - Make commit log more descriptive
> ---
> drivers/video/fbdev/asiliantfb.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/video/fbdev/asiliantfb.c b/drivers/video/fbdev/asiliantfb.c
> index 3e006da47752..84c56f525889 100644
> --- a/drivers/video/fbdev/asiliantfb.c
> +++ b/drivers/video/fbdev/asiliantfb.c
> @@ -227,6 +227,9 @@ static int asiliantfb_check_var(struct fb_var_screeninfo *var,
> {
> unsigned long Ftarget, ratio, remainder;
>
> + if (!var->pixclock)
> + return -EINVAL;
While this fixes the crash, it is not correct: according to the
fbdev API, invalid values must be rounded up to a supported value,
if possible. -EINVAL should only be returned if rounding up values
in fb_var_screeninfo cannot give a valid mode.
The same comment applies to the other patches in this series:
[PATCH v2 2/3] video: fbdev: kyro: Error out if 'pixclock' equals zero
[PATCH v2 3/3] video: fbdev: riva: Error out if 'pixclock' equals zero
> +
> ratio = 1000000 / var->pixclock;
> remainder = 1000000 % var->pixclock;
> Ftarget = 1000000 * ratio + (1000000 * remainder) / var->pixclock;
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert(a)linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
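A minimal sketch of the direction the review points at (illustrative only;
the minimum value below is a placeholder invented for the example, not
something defined by the asiliantfb driver):

    #define MIN_SUPPORTED_PIXCLOCK 10000    /* hypothetical floor, in ps */

    /* Round an unusable pixclock up to a supported value instead of
     * rejecting it; -EINVAL would be reserved for the case where no
     * valid mode can be derived at all.
     */
    static unsigned int sanitize_pixclock(unsigned int pixclock)
    {
            return pixclock ? pixclock : MIN_SUPPORTED_PIXCLOCK;
    }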