Existing data command CRC error handling on the kernel 4.20 stable branch
is non-standard and does not work with some Intel host controllers.
Specifically, the assumption that the host controller will continue
operating normally after the error interrupt is not valid. We therefore
suggest cherry-picking the three patches listed below into the kernel 4.20
stable branch; they change the driver to handle the error in the same
manner as a data CRC error.
4bf7809: mmc: sdhci: Fix data command CRC error handling
869f8a6: mmc: sdhci: Rename SDHCI_ACMD12_ERR and SDHCI_INT_ACMD12ERR
af849c8: mmc: sdhci: Handle auto-command errors
All of the patches above have already landed on the kernel 5.0 stable branch.
Starting from 9c225f2655 (vfs: atomic f_pos accesses as per POSIX), files
opened even via nonseekable_open gate read and write through the file
position lock and do not allow them to run simultaneously. This can create
a read vs write deadlock if a filesystem tries to implement a socket-like
file which is intended to be used simultaneously for both read and write
by the filesystem client. See 10dce8af3422 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") for details, and e.g. 581d21a2d0 ("xenbus: fix deadlock on
writes to /proc/xen/xenbus") for a similar deadlock example on /proc/xen/xenbus.
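For illustration only, here is a minimal userspace sketch of that scenario
(assuming fd refers to such a socket-like FUSE file opened with only
FOPEN_NONSEEKABLE; run_both() and the thread helpers are invented names):

/*
 * Hedged sketch, not from the patch: two threads use the same file
 * description of a stream-like FUSE file.  On kernels with 9c225f2655
 * the read() and write() below can end up serialized on the file's
 * position lock, so if the filesystem only produces data for the read
 * after it has seen the write, neither call ever completes.
 */
#include <pthread.h>
#include <unistd.h>

static void *reader(void *arg)
{
	int fd = *(int *)arg;
	char buf[64];

	/* blocks in the filesystem until it sees the "ping" below */
	read(fd, buf, sizeof(buf));
	return NULL;
}

static void *writer(void *arg)
{
	int fd = *(int *)arg;

	/* queued behind the reader on the position lock, so the "ping"
	 * never reaches the filesystem and the read never finishes */
	write(fd, "ping", 4);
	return NULL;
}

int run_both(int fd)
{
	pthread_t r, w;

	pthread_create(&r, NULL, reader, &fd);
	pthread_create(&w, NULL, writer, &fd);
	pthread_join(r, NULL);
	pthread_join(w, NULL);
	return 0;
}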
To avoid such a deadlock it was tempting to adjust fuse_finish_open to use
stream_open instead of nonseekable_open based on just the FOPEN_NONSEEKABLE
flag, but grepping through Debian Code Search shows existing users of
FOPEN_NONSEEKABLE, in particular GVFS, which actually uses the offset in
its read and write handlers
https://codesearch.debian.net/search?q=-%3Enonseekable+%3Dhttps://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfuse…https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfuse…https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfuse…
so such a change would break a real user.
-> Add another flag (FOPEN_STREAM) for filesystem servers to indicate
that the opened file handle has stream-like semantics: it does not use
the file position, and thus the kernel is free to issue simultaneous read
and write requests on the opened file handle.
This patch, together with stream_open (10dce8af3422), should be added to
stable kernels starting from v3.14 (the kernel where 9c225f2655 first
appeared). This will allow patching OSSPD and other FUSE filesystems that
provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in
their open handler and this way avoid the deadlock on all kernel versions.
This should work because fuse_finish_open ignores unknown open flags
returned from a filesystem, so passing FOPEN_STREAM to a kernel that is
not aware of this flag cannot hurt. In turn, a kernel that is not aware of
FOPEN_STREAM will be < v3.14, where FOPEN_NONSEEKABLE alone is sufficient
to implement streams without the read vs write deadlock.
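For the server side, a hedged sketch of what returning these flags could
look like at the raw FUSE protocol level (fill_stream_open() is an invented
helper; the struct layout and flag values mirror include/uapi/linux/fuse.h
as patched):

#include <stdint.h>

/* layout as in include/uapi/linux/fuse.h */
struct fuse_open_out {
	uint64_t fh;
	uint32_t open_flags;
	uint32_t padding;
};

#define FOPEN_NONSEEKABLE (1 << 2)
#define FOPEN_STREAM      (1 << 4)	/* added by this patch */

/* invented helper: fill the FUSE_OPEN reply for a stream-like file */
static void fill_stream_open(struct fuse_open_out *out, uint64_t fh)
{
	out->fh = fh;
	/*
	 * Kernels that understand FOPEN_STREAM may run read and write on
	 * this handle simultaneously; kernels that do not simply ignore
	 * the unknown bit, and FOPEN_NONSEEKABLE still marks the file as
	 * unseekable there.
	 */
	out->open_flags = FOPEN_STREAM | FOPEN_NONSEEKABLE;
	out->padding = 0;
}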
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Michael Kerrisk <mtk.manpages(a)gmail.com>
Cc: Yongzhi Pan <panyongzhi(a)gmail.com>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: David Vrabel <david.vrabel(a)citrix.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Kirill Tkhai <ktkhai(a)virtuozzo.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall(a)lip6.fr>
Cc: Nikolaus Rath <Nikolaus(a)rath.org>
Cc: Han-Wen Nienhuys <hanwen(a)google.com>
Cc: stable(a)vger.kernel.org # v3.14+
Signed-off-by: Kirill Smelkov <kirr(a)nexedi.com>
---
( resending the same patch with an updated description to reference stream_open,
which landed in master as 10dce8af3422; also added Cc: stable )
fs/fuse/file.c | 4 +++-
include/uapi/linux/fuse.h | 2 ++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 06096b60f1df..44de96cb7871 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -178,7 +178,9 @@ void fuse_finish_open(struct inode *inode, struct file *file)
if (!(ff->open_flags & FOPEN_KEEP_CACHE))
invalidate_inode_pages2(inode->i_mapping);
- if (ff->open_flags & FOPEN_NONSEEKABLE)
+ if (ff->open_flags & FOPEN_STREAM)
+ stream_open(inode, file);
+ else if (ff->open_flags & FOPEN_NONSEEKABLE)
nonseekable_open(inode, file);
if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
struct fuse_inode *fi = get_fuse_inode(inode);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 2ac598614a8f..26abf0a571c7 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -229,11 +229,13 @@ struct fuse_file_lock {
* FOPEN_KEEP_CACHE: don't invalidate the data cache on open
* FOPEN_NONSEEKABLE: the file is not seekable
* FOPEN_CACHE_DIR: allow caching this directory
+ * FOPEN_STREAM: the file is stream-like
*/
#define FOPEN_DIRECT_IO (1 << 0)
#define FOPEN_KEEP_CACHE (1 << 1)
#define FOPEN_NONSEEKABLE (1 << 2)
#define FOPEN_CACHE_DIR (1 << 3)
+#define FOPEN_STREAM (1 << 4)
/**
* INIT request/reply flags
--
2.21.0.765.geec228f530
Hi Greg and Sasha,
Please apply this commit to 4.4 through 5.0 (patches are threaded in
reply to this one), which will prevent Clang from emitting references
to compiler runtime functions and from performing some performance-killing
optimizations when CONFIG_CC_OPTIMIZE_FOR_SIZE is used.
Please let me know if I did something wrong or if there are any
objections.
Cheers,
Nathan
As the comment notes, the return codes for TSYNC and NEW_LISTENER conflict,
because they both return positive values, one in the case of success and
one in the case of error. So, let's disallow both of these flags together.
While this is technically a userspace break, all the users I know of are
still waiting on me to land this feature in libseccomp, so I think it'll be
safe. Also, at present my use case doesn't require TSYNC at all, so this
isn't a big deal to disallow. If someone wanted to support this, a path
forward would be to add a new flag like
TSYNC_AND_LISTENER_YES_I_UNDERSTAND_THAT_TSYNC_WILL_JUST_RETURN_EAGAIN, but
the use cases are so different I don't see it really happening.
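To make the ambiguity concrete, here is a hedged userspace sketch
(sys_seccomp() and install_filter() are invented wrappers; the flag and
operation names are the real UAPI ones and need 5.0+ headers):

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

static long sys_seccomp(unsigned int op, unsigned int flags, void *args)
{
	return syscall(__NR_seccomp, op, flags, args);
}

/* invented helper: try to install a filter with both flags set */
static long install_filter(struct sock_fprog *prog)
{
	long ret = sys_seccomp(SECCOMP_SET_MODE_FILTER,
			       SECCOMP_FILTER_FLAG_TSYNC |
			       SECCOMP_FILTER_FLAG_NEW_LISTENER, prog);
	/*
	 * If ret > 0, is it the new listener fd (NEW_LISTENER success) or
	 * the id of a thread that could not be synchronized (TSYNC
	 * failure)?  The caller cannot tell, which is why the kernel now
	 * rejects this combination with -EINVAL.
	 */
	return ret;
}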
Finally, it's worth noting that this does actually fix a UAF issue: at the end
of seccomp_set_mode_filter(), we have:
if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
if (ret < 0) {
listener_f->private_data = NULL;
fput(listener_f);
put_unused_fd(listener);
} else {
fd_install(listener, listener_f);
ret = listener;
}
}
out_free:
seccomp_filter_free(prepared);
But if ret > 0 because TSYNC raced, we'll install the listener fd and then free
the filter out from underneath it, causing a UAF when the task closes it or
dies. This patch also switches the condition to be simply if (ret), so that
if someone does add the flag mentioned above, they won't have to remember
to fix this too.
Signed-off-by: Tycho Andersen <tycho(a)tycho.ws>
Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
CC: stable(a)vger.kernel.org # v5.0+
---
kernel/seccomp.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index d0d355ded2f4..79bada51091b 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -500,7 +500,10 @@ seccomp_prepare_user_filter(const char __user *user_filter)
*
* Caller must be holding current->sighand->siglock lock.
*
- * Returns 0 on success, -ve on error.
+ * Returns 0 on success, -ve on error, or
+ * - in TSYNC mode: the pid of a thread which was either not in the correct
+ * seccomp mode or did not have an ancestral seccomp filter
+ * - in NEW_LISTENER mode: the fd of the new listener
*/
static long seccomp_attach_filter(unsigned int flags,
struct seccomp_filter *filter)
@@ -1256,6 +1259,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
if (flags & ~SECCOMP_FILTER_FLAG_MASK)
return -EINVAL;
+ /*
+ * In the successful case, NEW_LISTENER returns the new listener fd.
+ * But in the failure case, TSYNC returns the thread that died. If you
+ * combine these two flags, there's no way to tell whether something
+ * succeeded or failed. So, let's disallow this combination.
+ */
+ if ((flags & SECCOMP_FILTER_FLAG_TSYNC) &&
+ (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER))
+ return -EINVAL;
+
/* Prepare the new filter before holding any locks. */
prepared = seccomp_prepare_user_filter(filter);
if (IS_ERR(prepared))
@@ -1302,7 +1315,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
mutex_unlock(&current->signal->cred_guard_mutex);
out_put_fd:
if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
- if (ret < 0) {
+ if (ret) {
listener_f->private_data = NULL;
fput(listener_f);
put_unused_fd(listener);
--
2.19.1
Lars Persson <lists(a)bofh.nu> reported that a label was unused in
the 4.14 version of this patchset, and the issue was present in
the 4.19 patchset as well, so I'm sending a v2 that fixes it.
The original 4.19 patchset queued for stable is OK, and
can be used as is, but this v2 is a bit better: it fixes the
unused label issue and handles overlapping fragments better.
Sorry for the mess/v2.
=======================
Currently, 4.19 and earlier stable kernels contain a security fix
that is not fully IPv6 standard compliant.
This patchset backports IPv6 defrag fixes from 5.1-rc that restore
standards compliance.
Original 5.1 patchset: https://patchwork.ozlabs.org/cover/1029418/
v2 changes: handle overlapping fragments the way it is done upstream
Peter Oskolkov (3):
net: IP defrag: encapsulate rbtree defrag code into callable functions
net: IP6 defrag: use rbtrees for IPv6 defrag
net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c
include/net/inet_frag.h | 16 +-
include/net/ipv6_frag.h | 11 +-
net/ipv4/inet_fragment.c | 293 +++++++++++++++++++++++
net/ipv4/ip_fragment.c | 302 +++---------------------
net/ipv6/netfilter/nf_conntrack_reasm.c | 260 ++++++--------------
net/ipv6/reassembly.c | 240 ++++++-------------
6 files changed, 488 insertions(+), 634 deletions(-)
--
2.21.0.593.g511ec345e18-goog
Hello,
We ran automated tests on a patchset that was proposed for merging into this
kernel tree. The patches were applied to:
Kernel repo: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Commit: e4abcebedac3 - Linux 5.0.9
The results of these automated tests are provided below.
Overall result: PASSED
Merge: OK
Compile: OK
Tests: OK
Please reply to this email if you have any questions about the tests that we
ran or if you have any suggestions on how to make future tests more effective.
,-. ,-.
( C ) ( K ) Continuous
`-',-.`-' Kernel
( I ) Integration
`-'
______________________________________________________________________________
Merge testing
-------------
We cloned this repository and checked out a ref:
Repo: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Ref: e4abcebedac3 - Linux 5.0.9
We then merged the patchset with `git am`:
bonding-fix-event-handling-for-stacked-bonds.patch
failover-allow-name-change-on-iff_up-slave-interfaces.patch
net-atm-fix-potential-spectre-v1-vulnerabilities.patch
net-bridge-fix-per-port-af_packet-sockets.patch
net-bridge-multicast-use-rcu-to-access-port-list-from-br_multicast_start_querier.patch
net-fec-manage-ahb-clock-in-runtime-pm.patch
net-fix-missing-meta-data-in-skb-with-vlan-packet.patch
net-fou-do-not-use-guehdr-after-iptunnel_pull_offloads-in-gue_udp_recv.patch
tcp-tcp_grow_window-needs-to-respect-tcp_space.patch
team-set-slave-to-promisc-if-team-is-already-in-promisc-mode.patch
tipc-missing-entries-in-name-table-of-publications.patch
vhost-reject-zero-size-iova-range.patch
ipv4-recompile-ip-options-in-ipv4_link_failure.patch
ipv4-ensure-rcu_read_lock-in-ipv4_link_failure.patch
mlxsw-spectrum_switchdev-add-mdb-entries-in-prepare-phase.patch
mlxsw-core-do-not-use-wq_mem_reclaim-for-emad-workqueue.patch
mlxsw-core-do-not-use-wq_mem_reclaim-for-mlxsw-ordered-workqueue.patch
mlxsw-core-do-not-use-wq_mem_reclaim-for-mlxsw-workqueue.patch
mlxsw-spectrum_router-do-not-check-vrf-mac-address.patch
net-thunderx-raise-xdp-mtu-to-1508.patch
net-thunderx-don-t-allow-jumbo-frames-with-xdp.patch
net-tls-fix-the-iv-leaks.patch
net-tls-don-t-leak-partially-sent-record-in-device-mode.patch
net-strparser-partially-revert-strparser-call-skb_unclone-conditionally.patch
net-tls-fix-build-without-config_tls_device.patch
net-bridge-fix-netlink-export-of-vlan_stats_per_port-option.patch
net-mlx5e-xdp-avoid-checksum-complete-when-xdp-prog-is-loaded.patch
net-mlx5e-protect-against-non-uplink-representor-for-encap.patch
net-mlx5e-switch-to-toeplitz-rss-hash-by-default.patch
net-mlx5e-rx-fixup-skb-checksum-for-packets-with-tail-padding.patch
net-mlx5e-rx-check-ip-headers-sanity.patch
revert-net-mlx5e-enable-reporting-checksum-unnecessary-also-for-l3-packets.patch
net-mlx5-fpga-tls-hold-rcu-read-lock-a-bit-longer.patch
net-tls-prevent-bad-memory-access-in-tls_is_sk_tx_device_offloaded.patch
net-mlx5-fpga-tls-idr-remove-on-flow-delete.patch
route-avoid-crash-from-dereferencing-null-rt-from.patch
nfp-flower-replace-cfi-with-vlan-present.patch
nfp-flower-remove-vlan-cfi-bit-from-push-vlan-action.patch
sch_cake-use-tc_skb_protocol-helper-for-getting-packet-protocol.patch
sch_cake-make-sure-we-can-write-the-ip-header-before-changing-dscp-bits.patch
nfc-nci-add-some-bounds-checking-in-nci_hci_cmd_received.patch
nfc-nci-potential-off-by-one-in-pipes-array.patch
sch_cake-simplify-logic-in-cake_select_tin.patch
nfit-ars-remove-ars_start_flags.patch
nfit-ars-introduce-scrub_flags.patch
nfit-ars-allow-root-to-busy-poll-the-ars-state-machi.patch
nfit-ars-avoid-stale-ars-results.patch
tpm-tpm_i2c_atmel-return-e2big-when-the-transfer-is-.patch
tpm-fix-the-type-of-the-return-value-in-calc_tpm2_ev.patch
Compile testing
---------------
We compiled the kernel for 4 architectures:
aarch64:
build options: -j20 INSTALL_MOD_STRIP=1 targz-pkg
configuration: https://artifacts.cki-project.org/builds/aarch64/kernel-stable_queue-aarch6…
kernel build: https://artifacts.cki-project.org/builds/aarch64/kernel-stable_queue-aarch6…
ppc64le:
build options: -j20 INSTALL_MOD_STRIP=1 targz-pkg
configuration: https://artifacts.cki-project.org/builds/ppc64le/kernel-stable_queue-ppc64l…
kernel build: https://artifacts.cki-project.org/builds/ppc64le/kernel-stable_queue-ppc64l…
s390x:
build options: -j20 INSTALL_MOD_STRIP=1 targz-pkg
configuration: https://artifacts.cki-project.org/builds/s390x/kernel-stable_queue-s390x-e0…
kernel build: https://artifacts.cki-project.org/builds/s390x/kernel-stable_queue-s390x-e0…
x86_64:
build options: -j20 INSTALL_MOD_STRIP=1 targz-pkg
configuration: https://artifacts.cki-project.org/builds/x86_64/kernel-stable_queue-x86_64-…
kernel build: https://artifacts.cki-project.org/builds/x86_64/kernel-stable_queue-x86_64-…
Hardware testing
----------------
We booted each kernel and ran the following tests:
aarch64:
✅ Boot test [0]
✅ LTP lite [1]
✅ AMTU (Abstract Machine Test Utility) [2]
✅ httpd: mod_ssl smoke sanity [3]
✅ iotop: sanity [4]
✅ tuned: tune-processes-through-perf [5]
🚧 ✅ Networking route: pmtu [6]
🚧 ✅ audit: audit testsuite test [7]
🚧 ✅ httpd: php sanity [8]
🚧 ✅ stress: stress-ng [9]
ppc64le:
✅ Boot test [0]
✅ LTP lite [1]
✅ AMTU (Abstract Machine Test Utility) [2]
✅ httpd: mod_ssl smoke sanity [3]
✅ iotop: sanity [4]
✅ tuned: tune-processes-through-perf [5]
🚧 ✅ Networking route: pmtu [6]
🚧 ✅ audit: audit testsuite test [7]
🚧 ✅ httpd: php sanity [8]
🚧 ✅ selinux-policy: serge-testsuite [10]
🚧 ✅ stress: stress-ng [9]
s390x:
✅ Boot test [0]
✅ LTP lite [1]
✅ httpd: mod_ssl smoke sanity [3]
✅ iotop: sanity [4]
✅ tuned: tune-processes-through-perf [5]
🚧 ✅ Networking route: pmtu [6]
🚧 ✅ audit: audit testsuite test [7]
🚧 ✅ httpd: php sanity [8]
🚧 ✅ stress: stress-ng [9]
x86_64:
Test source:
[0]: https://github.com/CKI-project/tests-beaker/archive/master.zip#distribution…
[1]: https://github.com/CKI-project/tests-beaker/archive/master.zip#distribution…
[2]: https://github.com/CKI-project/tests-beaker/archive/master.zip#misc/amtu
[3]: https://github.com/CKI-project/tests-beaker/archive/master.zip#packages/htt…
[4]: https://github.com/CKI-project/tests-beaker/archive/master.zip#packages/iot…
[5]: https://github.com/CKI-project/tests-beaker/archive/master.zip#packages/tun…
[6]: https://github.com/CKI-project/tests-beaker/archive/master.zip#/networking/…
[7]: https://github.com/CKI-project/tests-beaker/archive/master.zip#packages/aud…
[8]: https://github.com/CKI-project/tests-beaker/archive/master.zip#packages/htt…
[9]: https://github.com/CKI-project/tests-beaker/archive/master.zip#stress/stres…
[10]: https://github.com/CKI-project/tests-beaker/archive/master.zip#/packages/se…
Waived tests (marked with 🚧)
-----------------------------
This test run included waived tests. Such tests are executed but their results
are not taken into account. Tests are waived when their results are not
reliable enough, e.g. when they're just introduced or are being fixed.
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2e8e19226398db8265a8e675fcc0118b9e80c9e8 Mon Sep 17 00:00:00 2001
From: Phil Auld <pauld(a)redhat.com>
Date: Tue, 19 Mar 2019 09:00:05 -0400
Subject: [PATCH] sched/fair: Limit sched_cfs_period_timer() loop to avoid hard
lockup
With an extremely short cfs_period_us setting on a parent task group with a
large number of children, the for loop in sched_cfs_period_timer() can run
until the watchdog fires. There is no guarantee that the call to
hrtimer_forward_now() will ever return 0. The large number of children can
make do_sched_cfs_period_timer() take longer than the period.
NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
RIP: 0010:tg_nop+0x0/0x10
<IRQ>
walk_tg_tree_from+0x29/0xb0
unthrottle_cfs_rq+0xe0/0x1a0
distribute_cfs_runtime+0xd3/0xf0
sched_cfs_period_timer+0xcb/0x160
? sched_cfs_slack_timer+0xd0/0xd0
__hrtimer_run_queues+0xfb/0x270
hrtimer_interrupt+0x122/0x270
smp_apic_timer_interrupt+0x6a/0x140
apic_timer_interrupt+0xf/0x20
</IRQ>
To prevent this we add protection to the loop that detects when the loop has
run too many times and scales the period and quota up, proportionally, so that
the timer can complete before the next period expires. This preserves the
relative runtime quota while preventing the hard lockup.
A warning is issued reporting this state and the new values.
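A worked example of the scaling (illustrative numbers, not from the patch):
with a period of 100us, one step gives new = old * 147 / 128 ~= 114.8us, and
the quota is rescaled as quota * new / old, e.g. 50us becomes ~57.4us, so
quota/period stays at ~50% while do_sched_cfs_period_timer() gets ~15% more
time per period.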
Signed-off-by: Phil Auld <pauld(a)redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Cc: Anton Blanchard <anton(a)ozlabs.org>
Cc: Ben Segall <bsegall(a)google.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40bd1e27b1b7..a4d9e14bf138 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4885,6 +4885,8 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
+extern const u64 max_cfs_quota_period;
+
static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
{
struct cfs_bandwidth *cfs_b =
@@ -4892,6 +4894,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
unsigned long flags;
int overrun;
int idle = 0;
+ int count = 0;
raw_spin_lock_irqsave(&cfs_b->lock, flags);
for (;;) {
@@ -4899,6 +4902,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
if (!overrun)
break;
+ if (++count > 3) {
+ u64 new, old = ktime_to_ns(cfs_b->period);
+
+ new = (old * 147) / 128; /* ~115% */
+ new = min(new, max_cfs_quota_period);
+
+ cfs_b->period = ns_to_ktime(new);
+
+ /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
+ cfs_b->quota *= new;
+ cfs_b->quota = div64_u64(cfs_b->quota, old);
+
+ pr_warn_ratelimited(
+ "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
+ smp_processor_id(),
+ div_u64(new, NSEC_PER_USEC),
+ div_u64(cfs_b->quota, NSEC_PER_USEC));
+
+ /* reset count so we don't come right back in here */
+ count = 0;
+ }
+
idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
}
if (idle)
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2e8e19226398db8265a8e675fcc0118b9e80c9e8 Mon Sep 17 00:00:00 2001
From: Phil Auld <pauld(a)redhat.com>
Date: Tue, 19 Mar 2019 09:00:05 -0400
Subject: [PATCH] sched/fair: Limit sched_cfs_period_timer() loop to avoid hard
lockup
With an extremely short cfs_period_us setting on a parent task group with a
large number of children, the for loop in sched_cfs_period_timer() can run
until the watchdog fires. There is no guarantee that the call to
hrtimer_forward_now() will ever return 0. The large number of children can
make do_sched_cfs_period_timer() take longer than the period.
NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
RIP: 0010:tg_nop+0x0/0x10
<IRQ>
walk_tg_tree_from+0x29/0xb0
unthrottle_cfs_rq+0xe0/0x1a0
distribute_cfs_runtime+0xd3/0xf0
sched_cfs_period_timer+0xcb/0x160
? sched_cfs_slack_timer+0xd0/0xd0
__hrtimer_run_queues+0xfb/0x270
hrtimer_interrupt+0x122/0x270
smp_apic_timer_interrupt+0x6a/0x140
apic_timer_interrupt+0xf/0x20
</IRQ>
To prevent this we add protection to the loop that detects when the loop has
run too many times and scales the period and quota up, proportionally, so that
the timer can complete before the next period expires. This preserves the
relative runtime quota while preventing the hard lockup.
A warning is issued reporting this state and the new values.
Signed-off-by: Phil Auld <pauld(a)redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Cc: Anton Blanchard <anton(a)ozlabs.org>
Cc: Ben Segall <bsegall(a)google.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40bd1e27b1b7..a4d9e14bf138 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4885,6 +4885,8 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
+extern const u64 max_cfs_quota_period;
+
static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
{
struct cfs_bandwidth *cfs_b =
@@ -4892,6 +4894,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
unsigned long flags;
int overrun;
int idle = 0;
+ int count = 0;
raw_spin_lock_irqsave(&cfs_b->lock, flags);
for (;;) {
@@ -4899,6 +4902,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
if (!overrun)
break;
+ if (++count > 3) {
+ u64 new, old = ktime_to_ns(cfs_b->period);
+
+ new = (old * 147) / 128; /* ~115% */
+ new = min(new, max_cfs_quota_period);
+
+ cfs_b->period = ns_to_ktime(new);
+
+ /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
+ cfs_b->quota *= new;
+ cfs_b->quota = div64_u64(cfs_b->quota, old);
+
+ pr_warn_ratelimited(
+ "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
+ smp_processor_id(),
+ div_u64(new, NSEC_PER_USEC),
+ div_u64(cfs_b->quota, NSEC_PER_USEC));
+
+ /* reset count so we don't come right back in here */
+ count = 0;
+ }
+
idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
}
if (idle)