January 2019 - Linux-stable-mirror

[PATCH] PCI: qcom: Don't deassert reset GPIO during probe

by Bjorn Andersson

Acquiring the reset GPIO low means that reset is being deasserted; this is
followed almost immediately by qcom_pcie_host_init() asserting it,
initializing it and then finally deasserting it again, for the link to
come up.

Some PCIe devices require a minimum time between the initial deassert
and subsequent reset cycles. In a platform that boots with the reset
GPIO asserted this requirement is being violated by this deassert/assert
pulse.

Acquiring the reset GPIO high will prevent this by matching the state to
the subsequent asserted state.

Cc: stable(a)vger.kernel.org
Signed-off-by: Bjorn Andersson <bjorn.andersson(a)linaro.org>
---
 drivers/pci/controller/dwc/pcie-qcom.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/controller/dwc/pcie-qcom.c b/drivers/pci/controller/dwc/pcie-qcom.c
index d185ea5fe996..a7f703556790 100644
--- a/drivers/pci/controller/dwc/pcie-qcom.c
+++ b/drivers/pci/controller/dwc/pcie-qcom.c
@@ -1228,7 +1228,7 @@ static int qcom_pcie_probe(struct platform_device *pdev)
 
 	pcie->ops = of_device_get_match_data(dev);
 
-	pcie->reset = devm_gpiod_get_optional(dev, "perst", GPIOD_OUT_LOW);
+	pcie->reset = devm_gpiod_get_optional(dev, "perst", GPIOD_OUT_HIGH);
 	if (IS_ERR(pcie->reset)) {
 		ret = PTR_ERR(pcie->reset);
 		goto err_pm_runtime_put;
-- 
2.18.0
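
The effect of the initial GPIO state can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `enum perst`, `count_deasserts()` and `deasserts_for_initial()` are hypothetical names and not kernel API, the platform is assumed to boot with reset asserted, and host init is reduced to the assert/deassert pair described in the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Logical state of the reset line: 1 = reset asserted, 0 = deasserted. */
enum perst { PERST_DEASSERTED = 0, PERST_ASSERTED = 1 };

/*
 * Count how often the endpoint sees reset being released while the line
 * moves from its boot state through the given sequence of requests.
 */
int count_deasserts(enum perst boot_state, const enum perst *seq, size_t n)
{
	enum perst cur = boot_state;
	int deasserts = 0;

	assert(seq != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (cur == PERST_ASSERTED && seq[i] == PERST_DEASSERTED)
			deasserts++;
		cur = seq[i];
	}
	return deasserts;
}

/*
 * Replay probe followed by host init for a platform that boots with
 * reset asserted (assumption of this model). "initial" is the state
 * requested when the GPIO is acquired in probe.
 */
int deasserts_for_initial(enum perst initial)
{
	const enum perst seq[] = {
		initial,          /* state requested when acquiring the GPIO */
		PERST_ASSERTED,   /* host init asserts reset */
		PERST_DEASSERTED, /* host init releases reset for link-up */
	};

	return count_deasserts(PERST_ASSERTED, seq, sizeof(seq) / sizeof(seq[0]));
}
```

In this model, acquiring the line deasserted yields two reset releases (the short extra pulse the patch removes), while acquiring it asserted yields exactly one.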


FAILED: patch "[PATCH] futex: Cure exit race" failed to apply to 4.14-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.

thanks,

greg k-h

------------------ original commit in Linus's tree ------------------

From da791a667536bf8322042e38ca85d55a78d3c273 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Mon, 10 Dec 2018 14:35:14 +0100
Subject: [PATCH] futex: Cure exit race

Stefan reported, that the glibc tst-robustpi4 test case fails
occasionally. That case creates the following race between
sys_exit() and sys_futex_lock_pi():

 CPU0                              CPU1

 sys_exit()                        sys_futex()
  do_exit()                         futex_lock_pi()
   exit_signals(tsk)                  No waiters:
    tsk->flags |= PF_EXITING;         *uaddr == 0x00000PID
  mm_release(tsk)                     Set waiter bit
   exit_robust_list(tsk) {            *uaddr = 0x80000PID;
      Set owner died                  attach_to_pi_owner() {
    *uaddr = 0xC0000000;               tsk = get_task(PID);
   }                                   if (!tsk->flags & PF_EXITING) {
  ...                                    attach();
  tsk->flags |= PF_EXITPIDONE;         } else {
                                         if (!(tsk->flags & PF_EXITPIDONE))
                                           return -EAGAIN;
                                         return -ESRCH; <--- FAIL
                                       }

ESRCH is returned all the way to user space, which triggers the glibc test
case assert. Returning ESRCH unconditionally is wrong here because the user
space value has been changed by the exiting task to 0xC0000000, i.e. the
FUTEX_OWNER_DIED bit is set and the futex PID value has been cleared. This
is a valid state and the kernel has to handle it, i.e. taking the futex.

Cure it by rereading the user space value when PF_EXITING and PF_EXITPIDONE
is set in the task which 'owns' the futex. If the value has changed, let
the kernel retry the operation, which includes all regular sanity checks
and correctly handles the FUTEX_OWNER_DIED case.

If it hasn't changed, then return ESRCH as there is no way to distinguish
this case from malfunctioning user space. This happens when the exiting
task did not have a robust list, the robust list was corrupted or the user
space value in the futex was simply bogus.

Reported-by: Stefan Liebler <stli(a)linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Acked-by: Peter Zijlstra <peterz(a)infradead.org>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Darren Hart <dvhart(a)infradead.org>
Cc: Ingo Molnar <mingo(a)kernel.org>
Cc: Sasha Levin <sashal(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200467
Link: https://lkml.kernel.org/r/20181210152311.986181245@linutronix.de

diff --git a/kernel/futex.c b/kernel/futex.c
index f423f9b6577e..5cc8083a4c89 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1148,11 +1148,65 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
 	return ret;
 }
 
+static int handle_exit_race(u32 __user *uaddr, u32 uval,
+			    struct task_struct *tsk)
+{
+	u32 uval2;
+
+	/*
+	 * If PF_EXITPIDONE is not yet set, then try again.
+	 */
+	if (tsk && !(tsk->flags & PF_EXITPIDONE))
+		return -EAGAIN;
+
+	/*
+	 * Reread the user space value to handle the following situation:
+	 *
+	 * CPU0                            CPU1
+	 *
+	 * sys_exit()                      sys_futex()
+	 *  do_exit()                       futex_lock_pi()
+	 *                                  futex_lock_pi_atomic()
+	 *   exit_signals(tsk)                No waiters:
+	 *    tsk->flags |= PF_EXITING;       *uaddr == 0x00000PID
+	 *  mm_release(tsk)                   Set waiter bit
+	 *   exit_robust_list(tsk) {          *uaddr = 0x80000PID;
+	 *      Set owner died                attach_to_pi_owner() {
+	 *    *uaddr = 0xC0000000;             tsk = get_task(PID);
+	 *   }                                 if (!tsk->flags & PF_EXITING) {
+	 *  ...                                  attach();
+	 *  tsk->flags |= PF_EXITPIDONE;       } else {
+	 *                                       if (!(tsk->flags & PF_EXITPIDONE))
+	 *                                         return -EAGAIN;
+	 *                                       return -ESRCH; <--- FAIL
+	 *                                     }
+	 *
+	 * Returning ESRCH unconditionally is wrong here because the
+	 * user space value has been changed by the exiting task.
+	 *
+	 * The same logic applies to the case where the exiting task is
+	 * already gone.
+	 */
+	if (get_futex_value_locked(&uval2, uaddr))
+		return -EFAULT;
+
+	/* If the user space value has changed, try again. */
+	if (uval2 != uval)
+		return -EAGAIN;
+
+	/*
+	 * The exiting task did not have a robust list, the robust list was
+	 * corrupted or the user space value in *uaddr is simply bogus.
+	 * Give up and tell user space.
+	 */
+	return -ESRCH;
+}
+
 /*
  * Lookup the task for the TID provided from user space and attach to
  * it after doing proper sanity checks.
  */
-static int attach_to_pi_owner(u32 uval, union futex_key *key,
+static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
 			      struct futex_pi_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
@@ -1162,12 +1216,15 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 	/*
 	 * We are the first waiter - try to look up the real owner and attach
 	 * the new pi_state to it, but bail out when TID = 0 [1]
+	 *
+	 * The !pid check is paranoid. None of the call sites should end up
+	 * with pid == 0, but better safe than sorry. Let the caller retry
 	 */
 	if (!pid)
-		return -ESRCH;
+		return -EAGAIN;
 	p = find_get_task_by_vpid(pid);
 	if (!p)
-		return -ESRCH;
+		return handle_exit_race(uaddr, uval, NULL);
 
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		put_task_struct(p);
@@ -1187,7 +1244,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 		 * set, we know that the task has finished the
 		 * cleanup:
 		 */
-		int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
+		int ret = handle_exit_race(uaddr, uval, p);
 
 		raw_spin_unlock_irq(&p->pi_lock);
 		put_task_struct(p);
@@ -1244,7 +1301,7 @@ static int lookup_pi_state(u32 __user *uaddr, u32 uval,
 	 * We are the first waiter - try to look up the owner based on
 	 * @uval and attach to it.
 	 */
-	return attach_to_pi_owner(uval, key, ps);
+	return attach_to_pi_owner(uaddr, uval, key, ps);
 }
 
 static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
@@ -1352,7 +1409,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 	 * attach to the owner. If that fails, no harm done, we only
 	 * set the FUTEX_WAITERS bit in the user space variable.
 	 */
-	return attach_to_pi_owner(uval, key, ps);
+	return attach_to_pi_owner(uaddr, newval, key, ps);
 }
 
 /**
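
The decision made by handle_exit_race() can be restated as a pure function for experimentation in user space. The following is a simplified sketch, not kernel code: `exit_race_result()` and its parameters are hypothetical, the task lookup and the reread of the futex word are passed in as plain values, and the -EFAULT path of get_futex_value_locked() is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define FUTEX_TID_MASK 0x3fffffffu /* TID bits of the futex word */

/*
 * task_found:   the owner task could still be looked up
 * exit_pi_done: that task has PF_EXITPIDONE set
 * uval:         futex value the owner lookup was based on
 * uval_now:     futex value after rereading it from user space
 */
int exit_race_result(bool task_found, bool exit_pi_done,
		     uint32_t uval, uint32_t uval_now)
{
	/* A TID of 0 is rejected before this point in the patched code. */
	assert((uval & FUTEX_TID_MASK) != 0);

	/* The exiting task has not finished its PI cleanup yet: retry. */
	if (task_found && !exit_pi_done)
		return -EAGAIN;

	/* The exiting task changed the value (e.g. owner died): retry. */
	if (uval_now != uval)
		return -EAGAIN;

	/* Nothing changed: no robust list, or a bogus user space value. */
	return -ESRCH;
}
```

With the values from the race diagram, a lookup based on 0x80000PID that rereads 0xC0000000 now results in a retry instead of ESRCH; only an unchanged value still fails.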


aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs*

by Paul Menzel

Dear Greg,


Commit ef86f3a7 (genirq/affinity: assign vectors to all possible CPUs) added
for Linux 4.14.56 causes the aacraid module to not detect the attached devices
anymore on a Dell PowerEdge R720 with two six core 24x E5-2630 @ 2.30GHz.

```
$ dmesg | grep raid
[    0.269768] raid6: sse2x1 gen() 7179 MB/s
[    0.290069] raid6: sse2x1 xor() 5636 MB/s
[    0.311068] raid6: sse2x2 gen() 9160 MB/s
[    0.332076] raid6: sse2x2 xor() 6375 MB/s
[    0.353075] raid6: sse2x4 gen() 11164 MB/s
[    0.374064] raid6: sse2x4 xor() 7429 MB/s
[    0.379001] raid6: using algorithm sse2x4 gen() 11164 MB/s
[    0.386001] raid6: .... xor() 7429 MB/s, rmw enabled
[    0.391008] raid6: using ssse3x2 recovery algorithm
[    3.559682] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
[    3.570061] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
[   10.725767] Adaptec aacraid driver 1.2.1[50834]-custom
[   10.731724] aacraid 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control
[   10.743295] aacraid: Comm Interface type3 enabled
$ lspci -nn | grep Adaptec
04:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
42:00.0 Serial Attached SCSI controller [0107]: Adaptec Smart Storage PQI 12G SAS/PCIe 3 [9005:028f] (rev 01)
```

But, it still works with a Dell PowerEdge R715 with two eight core AMD
Opteron 6136, the card below.

```
$ lspci -nn | grep Adaptec
22:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
```

Reverting the commit fixes the issue.

commit ef86f3a72adb8a7931f67335560740a7ad696d1d
Author: Christoph Hellwig <hch(a)lst.de>
Date:   Fri Jan 12 10:53:05 2018 +0800

    genirq/affinity: assign vectors to all possible CPUs

    commit 84676c1f21e8ff54befe985f4f14dc1edc10046b upstream.

    Currently we assign managed interrupt vectors to all present CPUs. This
    works fine for systems were we only online/offline CPUs. But in case of
    systems that support physical CPU hotplug (or the virtualized version of
    it) this means the additional CPUs covered for in the ACPI tables or on
    the command line are not catered for. To fix this we'd either need to
    introduce new hotplug CPU states just for this case, or we can start
    assining vectors to possible but not present CPUs.

    Reported-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
    Tested-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
    Tested-by: Stefan Haberland <sth(a)linux.vnet.ibm.com>
    Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
    Cc: linux-kernel(a)vger.kernel.org
    Cc: Thomas Gleixner <tglx(a)linutronix.de>
    Signed-off-by: Christoph Hellwig <hch(a)lst.de>
    Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
    Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>

The problem doesn’t happen with Linux 4.17.11, so there are commits in
Linux master fixing this. Unfortunately, my attempts to find out failed.

I was able to cherry-pick the three commits below on top of 4.14.62,
but the problem persists.

6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
355d7ecdea35 scsi: hpsa: fix selection of reply queue
e944e9615741 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity

Trying to cherry-pick the commits below, referencing the commit
in question, gave conflicts.

1. adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
2. d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible

To avoid further trial and error with the server with a slow firmware,
do you know what commits should fix the issue?


Kind regards,

Paul


PS: I couldn’t find, who suggested this for stable, that means how
it was picked to be added to stable. Is there an easy way to find that
out?


[PATCH] kprobe: safely access memory specified by userspace

by Changbin Du

The userspace can ask kprobe to intercept strings at any memory address,
including invalid kernel address. In this case, fetch_store_strlen()
would crash since it uses general usercopy function.

For example, we can crash the kernel by doing something as below:

$ sudo kprobe 'p:do_sys_open +0(+0(%si)):string'

[  103.620391] BUG: GPF in non-whitelisted uaccess (non-canonical address?)
[  103.622104] general protection fault: 0000 [#1] SMP PTI
[  103.623424] CPU: 10 PID: 1046 Comm: cat Not tainted 5.0.0-rc3-00130-gd73aba1-dirty #96
[  103.625321] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-2-g628b2e6-dirty-20190104_103505-linux 04/01/2014
[  103.628284] RIP: 0010:process_fetch_insn+0x1ab/0x4b0
[  103.629518] Code: 10 83 80 28 2e 00 00 01 31 d2 31 ff 48 8b 74 24 28 eb 0c 81 fa ff 0f 00 00 7f 1c 85 c0 75 18 66 66 90 0f ae e8 48 63 ca 89 f8 <8a> 0c 31 66 66 90 83 c2 01 84 c9 75 dc 89 54 24 34 89 44 24 28 48
[  103.634032] RSP: 0018:ffff88845eb37ce0 EFLAGS: 00010246
[  103.635312] RAX: 0000000000000000 RBX: ffff888456c4e5a8 RCX: 0000000000000000
[  103.637057] RDX: 0000000000000000 RSI: 2e646c2f6374652f RDI: 0000000000000000
[  103.638795] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  103.640556] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[  103.642297] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  103.644040] FS: 0000000000000000(0000) GS:ffff88846f000000(0000) knlGS:0000000000000000
[  103.646019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  103.647436] CR2: 00007ffc79758038 CR3: 0000000463360006 CR4: 0000000000020ee0
[  103.649147] Call Trace:
[  103.649781] ? sched_clock_cpu+0xc/0xa0
[  103.650747] ? do_sys_open+0x5/0x220
[  103.651635] kprobe_trace_func+0x303/0x380
[  103.652645] ? do_sys_open+0x5/0x220
[  103.653528] kprobe_dispatcher+0x45/0x50
[  103.654682] ? do_sys_open+0x1/0x220
[  103.655875] kprobe_ftrace_handler+0x90/0xf0
[  103.657282] ftrace_ops_assist_func+0x54/0xf0
[  103.658564] ? __call_rcu+0x1dc/0x280
[  103.659482] 0xffffffffc00000bf
[  103.660384] ? __ia32_sys_open+0x20/0x20
[  103.661682] ? do_sys_open+0x1/0x220
[  103.662863] do_sys_open+0x5/0x220
[  103.663988] do_syscall_64+0x60/0x210
[  103.665201] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  103.666862] RIP: 0033:0x7fc22fadccdd
[  103.668034] Code: 48 89 54 24 e0 41 83 e2 40 75 32 89 f0 25 00 00 41 00 3d 00 00 41 00 74 24 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 33 f3 c3 66 0f 1f 84 00 00 00 00 00 48 8d 44
[  103.674029] RSP: 002b:00007ffc7972c3a8 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
[  103.676512] RAX: ffffffffffffffda RBX: 0000562f86147a21 RCX: 00007fc22fadccdd
[  103.678853] RDX: 0000000000080000 RSI: 00007fc22fae1428 RDI: 00000000ffffff9c
[  103.681151] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
[  103.683489] R10: 0000000000000000 R11: 0000000000000287 R12: 00007fc22fce90a8
[  103.685774] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[  103.688056] Modules linked in:
[  103.689131] ---[ end trace 43792035c28984a1 ]---

This can be fixed by using probe_mem_read() instead.

Signed-off-by: Changbin Du <changbin.du(a)gmail.com>
Cc: stable(a)vger.kernel.org
---
 kernel/trace/trace_kprobe.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index d5fb09ebba8b..9eaf07f99212 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -861,22 +861,14 @@ static const struct file_operations kprobe_profile_ops = {
 static nokprobe_inline int
 fetch_store_strlen(unsigned long addr)
 {
-	mm_segment_t old_fs;
 	int ret, len = 0;
 	u8 c;
 
-	old_fs = get_fs();
-	set_fs(KERNEL_DS);
-	pagefault_disable();
-
 	do {
-		ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
+		ret = probe_mem_read(&c, (u8 *)addr + len, 1);
 		len++;
 	} while (c && ret == 0 && len < MAX_STRING_SIZE);
 
-	pagefault_enable();
-	set_fs(old_fs);
-
 	return (ret < 0) ? ret : len;
 }
 
-- 
2.19.1
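
The loop in fetch_store_strlen() depends on the byte read failing cleanly instead of faulting. A userspace sketch of the same loop shape is shown below. It rests on assumptions: `safe_read_byte()` is a hypothetical stand-in for a fault-tolerant read (it only bounds-checks a buffer), `model_strlen()` is an illustrative name, and `MAX_STRING_SIZE` here is a small example limit rather than the kernel's value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_STRING_SIZE 32 /* illustrative limit, not the kernel's value */

/*
 * Hypothetical fault-tolerant read: succeeds inside [base, base + size)
 * and reports -EFAULT outside of it instead of crashing.
 */
int safe_read_byte(const unsigned char *base, size_t size, size_t off,
		   unsigned char *out)
{
	if (off >= size)
		return -EFAULT;
	*out = base[off];
	return 0;
}

/*
 * Same loop shape as fetch_store_strlen(): returns the string length
 * including the terminating NUL, or a negative error if a read failed.
 */
int model_strlen(const unsigned char *base, size_t size, size_t start)
{
	unsigned char c = 0;
	int ret, len = 0;

	assert(base != NULL);
	do {
		ret = safe_read_byte(base, size, start + (size_t)len, &c);
		len++;
	} while (c && ret == 0 && len < MAX_STRING_SIZE);

	return (ret < 0) ? ret : len;
}
```

A terminated string inside the readable range yields its length plus one, while a start address outside the range, or a string that runs off the end without a NUL, yields -EFAULT rather than a fault.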


[PATCH v2] drm: Block fb changes for async plane updates

by Nicholas Kazlauskas

The prepare_fb call always happens on new_plane_state.

The drm_atomic_helper_cleanup_planes checks to see if
plane state pointer has changed when deciding to call cleanup_fb on
either the new_plane_state or the old_plane_state.

For a non-async atomic commit the state pointer is swapped, so this
helper calls prepare_fb on the new_plane_state and cleanup_fb on the
old_plane_state. This makes sense, since we want to prepare the
framebuffer we are going to use and cleanup the framebuffer we are
no longer using.

For the async atomic update helpers this differs. The async atomic
update helpers perform in-place updates on the existing state. They call
drm_atomic_helper_cleanup_planes but the state pointer is not swapped.
This means that prepare_fb is called on the new_plane_state and
cleanup_fb is called on the new_plane_state (not the old).

In the case where old_plane_state->fb == new_plane_state->fb then
there should be no behavioral difference between an async update
and a non-async commit. But there are issues that arise when
old_plane_state->fb != new_plane_state->fb.

The first is that the new_plane_state->fb is immediately cleaned up
after it has been prepared, so we're using a fb that we shouldn't
be.

The second occurs during a sequence of async atomic updates and
non-async regular atomic commits. Suppose there are two framebuffers
being interleaved in a double-buffering scenario, fb1 and fb2:

- Async update, oldfb = NULL, newfb = fb1, prepare fb1, cleanup fb1
- Async update, oldfb = fb1, newfb = fb2, prepare fb2, cleanup fb2
- Non-async commit, oldfb = fb2, newfb = fb1, prepare fb1, cleanup fb2

We call cleanup_fb on fb2 twice in this example scenario, and any
further use will result in use-after-free.

The simple fix to this problem is to block framebuffer changes
in the drm_atomic_helper_async_check function for now.

Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: Harry Wentland <harry.wentland(a)amd.com>
Cc: Andrey Grodzovsky <andrey.grodzovsky(a)amd.com>
Cc: <stable(a)vger.kernel.org> # v4.14+
Fixes: fef9df8b5945 ("drm/atomic: initial support for asynchronous plane update")
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
---
 drivers/gpu/drm/drm_atomic_helper.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 54e2ae614dcc..f4290f6b0c38 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -1602,6 +1602,15 @@ int drm_atomic_helper_async_check(struct drm_device *dev,
 	    old_plane_state->crtc != new_plane_state->crtc)
 		return -EINVAL;
 
+	/*
+	 * FIXME: Since prepare_fb and cleanup_fb are always called on
+	 * the new_plane_state for async updates we need to block framebuffer
+	 * changes. This prevents use of a fb that's been cleaned up and
+	 * double cleanups from occuring.
+	 */
+	if (old_plane_state->fb != new_plane_state->fb)
+		return -EINVAL;
+
 	funcs = plane->helper_private;
 	if (!funcs->atomic_async_update)
 		return -EINVAL;
-- 
2.17.1
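
The double cleanup from the three-step sequence can be reproduced with a toy reference count. This is a sketch under assumptions, not DRM code: `struct fb`, `struct plane`, `commit()`, `async_update()` and `replay_fb2_balance()` are hypothetical names, prepare_fb/cleanup_fb are reduced to a counter, and a rejected async update is assumed to be carried out as a regular commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fb { int prepared; };     /* prepare calls minus cleanup calls */
struct plane { struct fb *fb; }; /* framebuffer currently attached */

/* Non-async commit: prepare the new fb, clean up the old one. */
void commit(struct plane *p, struct fb *newfb)
{
	struct fb *oldfb = p->fb;

	assert(p != NULL && newfb != NULL);
	newfb->prepared++;
	if (oldfb)
		oldfb->prepared--;
	p->fb = newfb;
}

/* Async update as described above: prepare and cleanup hit the new fb. */
void async_update(struct plane *p, struct fb *newfb)
{
	assert(p != NULL && newfb != NULL);
	newfb->prepared++;
	newfb->prepared--;
	p->fb = newfb;
}

/* Async update with the proposed check: framebuffer changes are refused. */
void update(struct plane *p, struct fb *newfb, bool block_fb_change)
{
	if (block_fb_change && p->fb != newfb)
		commit(p, newfb); /* assumed fallback to a regular commit */
	else
		async_update(p, newfb);
}

/* Replay the sequence from the commit message; return fb2's balance. */
int replay_fb2_balance(bool block_fb_change)
{
	struct fb fb1 = { 0 }, fb2 = { 0 };
	struct plane p = { NULL };

	update(&p, &fb1, block_fb_change); /* async, oldfb = NULL */
	update(&p, &fb2, block_fb_change); /* async, oldfb = fb1 */
	commit(&p, &fb1);                  /* non-async, oldfb = fb2 */
	return fb2.prepared;
}
```

Without the check the balance for fb2 ends at -1, meaning one cleanup more than prepares; with framebuffer changes blocked it ends at 0.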


[PATCH] crypto: caam - Do not overwrite IV

by Sascha Hauer

In skcipher_decrypt() the IV passed in by the caller is overwritten and
the tcrypt module fails with:

alg: aead: decryption failed on test 1 for gcm_base(ctr-aes-caam,ghash-generic): ret=74
alg: aead: Failed to load transform for gcm(aes): -2

With this patch tcrypt runs without errors.

Fixes: 115957bb3e59 ("crypto: caam - fix IV DMA mapping and updating")

Signed-off-by: Sascha Hauer <s.hauer(a)pengutronix.de>
---
 drivers/crypto/caam/caamalg.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/crypto/caam/caamalg.c b/drivers/crypto/caam/caamalg.c
index 80ae69f906fb..493fa4169382 100644
--- a/drivers/crypto/caam/caamalg.c
+++ b/drivers/crypto/caam/caamalg.c
@@ -1735,7 +1735,6 @@ static int skcipher_decrypt(struct skcipher_request *req)
 	struct skcipher_edesc *edesc;
 	struct crypto_skcipher *skcipher = crypto_skcipher_reqtfm(req);
 	struct caam_ctx *ctx = crypto_skcipher_ctx(skcipher);
-	int ivsize = crypto_skcipher_ivsize(skcipher);
 	struct device *jrdev = ctx->jrdev;
 	u32 *desc;
 	int ret = 0;
@@ -1745,13 +1744,6 @@ static int skcipher_decrypt(struct skcipher_request *req)
 	if (IS_ERR(edesc))
 		return PTR_ERR(edesc);
 
-	/*
-	 * The crypto API expects us to set the IV (req->iv) to the last
-	 * ciphertext block.
-	 */
-	scatterwalk_map_and_copy(req->iv, req->src, req->cryptlen - ivsize,
-				 ivsize, 0);
-
 	/* Create and submit job descriptor*/
 	init_skcipher_job(req, edesc, false);
 	desc = edesc->hw_desc;
-- 
2.20.1


[PATCH] x86/speculation: Add document to describe Spectre and its mitigations

by Tim Chen

Thomas, Andi and I have made an update to our draft of the Spectre admin guide. We may be out on Christmas vacation for a while. But we want to send it out for everyone to take a look. Thanks. Tim From: Andi Kleen <ak(a)linux.intel.com> There are no document in admin guides describing Spectre v1 and v2 side channels and their mitigations in Linux. Create a document to describe Spectre and the mitigation methods used in the kernel. Signed-off-by: Andi Kleen <ak(a)linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen(a)linux.intel.com> --- Documentation/admin-guide/spectre.rst | 502 ++++++++++++++++++++++++++++++++++ 1 file changed, 502 insertions(+) create mode 100644 Documentation/admin-guide/spectre.rst diff --git a/Documentation/admin-guide/spectre.rst b/Documentation/admin-guide/spectre.rst new file mode 100644 index 0000000..0ba708e --- /dev/null +++ b/Documentation/admin-guide/spectre.rst @@ -0,0 +1,502 @@ +Spectre side channels +===================== + +Spectre is a class of side channel attacks against modern CPUs that +exploit branch prediction and speculative execution to read memory, +possibly bypassing access controls. These exploits do not modify memory. + +This document covers Spectre variant 1 and 2. + +Affected processors +------------------- + +The vulnerability affects a wide range of modern high performance +processors, since most modern high speed processors use branch prediction +and speculative execution. + +The following CPUs are vulnerable: + + - Intel Core, Atom, Pentium, Xeon CPUs + - AMD CPUs like Phenom, EPYC, Zen. + - IBM processors like POWER and zSeries + - Higher end ARM processors + - Apple CPUs + - Higher end MIPS CPUs + - Likely most other high performance CPUs. Contact your CPU vendor for details. + +This document describes the mitigations on Intel CPUs. Mitigations +on other architectures may be different. 
+ +Related CVEs +------------ + +The following CVE entries describe Spectre variants: + + ============= ======================= ========== + CVE-2017-5753 Bounds check bypass Spectre-V1 + CVE-2017-5715 Branch target injection Spectre-V2 + +Problem +------- + +CPUs have shared caches, such as buffers for branch prediction, which are +later used to guide speculative execution. These buffers are not flushed +over context switches or change in privilege levels. Malicious software +might influence these buffers and trigger specific speculative execution +in the kernel or different user processes. This speculative execution can +then be used to read data in memory and cause side effects, such as displacing +data in a data cache. The side effect can then later be measured by the +malicious software, and used to determine the memory values read speculatively. + +Spectre attacks allow tricking other software to disclose +values in their memory. + +In a typical Spectre variant 1 attack, the attacker passes an parameter +to a victim. The victim boundary checks the parameter and rejects illegal +values. However due to speculation over branch prediction the code path +for correct values might be speculatively executed, then reference memory +controlled by the input parameter and leave measurable side effects in +the caches. The attacker could then measure these side effects +and determine the leaked value. + +There are some extensions of Spectre variant 1 attacks for reading +data over the network, see [2]. However the attacks are very +difficult, low bandwidth and fragile and considered low risk. + +For Spectre variant 2 the attacker poisons the indirect branch +predictors of the CPU. Then control is passed to the victim, which +executes indirect branches. 
Due to the poisoned branch predictor data +the CPU can speculatively execute arbitrary code in the victim's +address space, such as a code sequence ("disclosure gadget") that +reads arbitrary data on some input parameter and causes a measurable +cache side effect based on the value. The attacker can then measure +this side effect after gaining control again and determine the value. + +The most useful gadgets take an attacker-controlled input parameter so +that the memory read can be controlled. Gadgets without input parameters +might be possible, but the attacker would have very little control over what +memory can be read, reducing the risk of the attack revealing useful data. + +Attack scenarios +---------------- + +Here is a list of attack scenarios that have been anticipated, but +may not cover all possible attack patterns. Reduing the occurrences of +attack pre-requisites listed can reduce the risk that a spectre attack +leaks useful data. + +1. Local User process attacking kernel +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Code in system calls often enforces access controls with conditional +branches based on user data. These branches are potential targets for +Spectre v2 exploits. Interrupt handlers, on the other hand, rarely +handle user data or enforce access controls, which makes them unlikely +exploit targets. + +For typical variant 2 attack, the attacker may poison the CPU branch +buffers first, and then enter the kernel and trick it into jumping to a +disclosure gadget through an indirect branch. If the attacker wants to control the +memory addresses leaked, it would also need to pass a parameter +to the gadget, either through a register or through a known address in +memory. Finally when it executes again it can measure the side effect. + +Necessary Prequisites: +1. Malicious local process passing parameters to kernel +2. Kernel has secrets. + +2. 
User process attacking another user process +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In this scenario an malicious user process wants to attack another +user process through a context switch. + +For variant 1 this generally requires passing some parameter between +the processes, which needs a data passing relationship, such a remote +procedure calls (RPC). + +For variant 2 the poisoning can happen through a context switch, or +on CPUs with simultaneous multi-threading (SMT) potentially on the +thread sibling executing in parallel on the same core. In either case, +controlling the memory leaked by the disclosure gadget also requires a data +passing relationship to the victim process, otherwise while it may +observe values through side effects, it won't know which memory +addresses they relate to. + +Necessary Prerequisites: +1. Malicious code running as local process +2. Victim processes containing secrets running on same core. + +3. User sandbox attacking runtime in process +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +A process, such as a web browser, might be running interpreted or JITed +untrusted code, such as javascript code downloaded from a website. +It uses restrictions in the JIT code generator and checks in a run time +to prevent the untrusted code from attacking the hosting process. + +The untrusted code might either use variant 1 or 2 to trick +a disclosure gadget in the run time to read memory inside the process. + +Necessary Prerequisites: +1. Sandbox in process running untrusted code. +2. Runtime in same process containing secrets. + +4. Kernel sandbox attacking kernel +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The kernel has support for running user-supplied programs within the +kernel. Specific rules (such as bounds checking) are enforced on these +programs by the kernel to ensure that they do not violate access controls. + +eBPF is a kernel sub-system that uses user-supplied program +to execute JITed untrusted byte code inside the kernel. 
eBPF is used +for manipulating and examining network packets, examining system call +parameters for sand boxes and other uses. + +A malicious local process could upload and trigger an malicious +eBPF script to the kernel, with the script attacking the kernel +using variant 1 or 2 and reading memory. + +Necessary Prerequisites: +1. Malicious local process +2. eBPF JIT enabled for unprivileged users, attacking kernel with secrets +on the same machine. + +5. Virtualization guest attacking host +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +An untrusted guest might attack the host through a hyper call +or other virtualization exit. + +Necessary Prerequisites: +1. Untrusted guest attacking host +2. Host has secrets on local machine. + +For variant 1 VM exits use appropriate mitigations +("bounds clipping") to prevent speculation leaking data +in kernel code. For variant 2 the kernel flushes the branch buffer. + +6. Virtualization guest attacking other guest +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +An untrusted guest attacking another guest containing +secrets. Mitigations are similar to when a guest attack +the host. + +Runtime vulnerability information +--------------------------------- + +The kernel reports the vulnerability and mitigation status in +/sys/devices/system/cpu/vulnerabilities/* + +The spectre_v1 file describes the always enabled variant 1 +mitigation: + +/sys/devices/system/cpu/vulnerabilities/spectre_v1 + +The value in this file: + + ======================================= ================================= + 'Mitigation: __user pointer sanitation' Protection in kernel on a case by + case base with explicit pointer + sanitation. + ======================================= ================================= + +The spectre_v2 kernel file reports if the kernel has been compiled with a +retpoline aware compiler, if the CPU has hardware mitigation, and if the +CPU has microcode support for additional process specific mitigations. 
+
+It also reports CPU features enabled by microcode to mitigate attacks
+between user processes:
+
+1. Indirect Branch Prediction Barrier (IBPB) to add additional
+   isolation between processes of different users
+2. Single Thread Indirect Branch Prediction (STIBP) to add additional
+   isolation between CPU threads running on the same core.
+
+These CPU features may impact performance when used and can
+be enabled per process on a case-by-case basis.
+
+/sys/devices/system/cpu/vulnerabilities/spectre_v2
+
+The values in this file:
+
+  - Kernel status:
+
+  ==================================== =================================
+  'Not affected'                       The processor is not vulnerable
+  'Vulnerable'                         Vulnerable, no mitigation
+  'Mitigation: Full generic retpoline' Software-focused mitigation
+  'Mitigation: Full AMD retpoline'     AMD-specific software mitigation
+  'Mitigation: Enhanced IBRS'          Hardware-focused mitigation
+  ==================================== =================================
+
+  - Firmware status:
+
+  ========== =============================================================
+  'IBRS_FW'  Protection against user program attacks when calling firmware
+  ========== =============================================================
+
+  - Indirect branch prediction barrier (IBPB) status for protection between
+    processes of different users. This feature can be controlled through
+    prctl per process, or through kernel command line options. For more details
+    see below.
+
+  =================== ========================================================
+  'IBPB: disabled'    IBPB unused
+  'IBPB: always-on'   Use IBPB on all tasks
+  'IBPB: conditional' Use IBPB on SECCOMP or indirect branch restricted tasks
+  =================== ========================================================
+
+  - Single threaded indirect branch prediction (STIBP) status for protection
+    between different hyper threads. This feature can be controlled through
+    prctl per process, or through kernel command line options. For more details
+    see below.
+
+  ==================== ========================================================
+  'STIBP: disabled'    STIBP unused
+  'STIBP: forced'      Use STIBP on all tasks
+  'STIBP: conditional' Use STIBP on SECCOMP or indirect branch restricted tasks
+  ==================== ========================================================
+
+  - Return stack buffer (RSB) protection status:
+
+  ============= ===========================================
+  'RSB filling' Protection of RSB on context switch enabled
+  ============= ===========================================
+
+Full mitigations might require a microcode update from the CPU
+vendor. When the necessary microcode is not available the kernel
+will report vulnerability.
+
+Kernel mitigation
+-----------------
+
+The kernel has default-on mitigations for Variant 1 and Variant 2
+against attacks from user programs or guests. For variant 1 it
+annotates vulnerable kernel code (as determined by the sparse code
+scanning tool and code audits) to use "bounds clipping" to avoid any
+usable disclosure gadgets.
+
+For variant 2 the kernel employs "retpoline" with compiler help to secure
+the indirect branches inside the kernel, when CONFIG_RETPOLINE is enabled
+and the compiler supports retpoline. On Intel Skylake-era systems the
+mitigation covers most, but not all, cases; see [1] for more details.
+
+On CPUs with hardware mitigations for variant 2, retpoline is
+automatically disabled at runtime.
+
+Using kernel address space randomization (CONFIG_RANDOMIZE_BASE=y
+and CONFIG_SLAB_FREELIST_RANDOM=y in the kernel configuration)
+makes attacks on the kernel generally more difficult.
+
+Host mitigation
+---------------
+
+The Linux kernel uses retpoline to eliminate attacks on indirect
+branches. It also flushes the Return Branch Stack on every VM exit to
+prevent guests from attacking the host kernel when retpoline is
+enabled.
+
+Variant 1 attacks are mitigated unconditionally.
+
+The kernel also allows guests to use any microcode based mitigations
+they choose to use (such as IBPB or STIBP), assuming the
+host has an updated microcode and reports the feature in
+/sys/devices/system/cpu/vulnerabilities/spectre_v2.
+
+Mitigation control at kernel build time
+---------------------------------------
+
+When the CONFIG_RETPOLINE option is enabled the kernel uses special
+code sequences to avoid attacks on indirect branches through
+Variant 2 attacks.
+
+The compiler also needs to support retpoline and support the
+-mindirect-branch=thunk-extern -mindirect-branch-register options
+for gcc, or -mretpoline-external-thunk option for clang.
+
+When the compiler doesn't support these options the kernel
+will report that it is vulnerable.
+
+Variant 1 mitigations and other side channel related user APIs are
+enabled unconditionally.
+
+Hardware mitigation
+-------------------
+
+Some CPUs have hardware mitigations (e.g. enhanced IBRS) for Spectre
+variant 2. The 4.19 kernel has support for detecting this capability
+and automatically disabling any unnecessary workarounds at runtime.
+
+User program mitigation
+-----------------------
+
+For variant 1 user programs can use LFENCE or bounds clipping. For more
+details see [3].
+
+For variant 2 user programs can be compiled with retpoline or can
+restrict their indirect branch speculation via prctl. (See
+Documentation/speculation.txt for detailed API.)
+
+User programs should use address space randomization
+(/proc/sys/kernel/randomize_va_space = 1 or 2) to make any attacks
+more difficult.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+Spectre v2 mitigations can be disabled or force-enabled on the kernel
+command line.
+
+    nospectre_v2    [X86] Disable all mitigations for the Spectre variant 2
+                    (indirect branch prediction) vulnerability. System may
+                    allow data leaks with this option, which is equivalent
+                    to spectre_v2=off.
+
+
+    spectre_v2=     [X86] Control mitigation of Spectre variant 2
+                    (indirect branch speculation) vulnerability.
+                    The default operation protects the kernel from
+                    user space attacks.
+
+                    on   - unconditionally enable, implies
+                           spectre_v2_user=on
+                    off  - unconditionally disable, implies
+                           spectre_v2_user=off
+                    auto - kernel detects whether your CPU model is
+                           vulnerable
+
+                    Selecting 'on' will, and 'auto' may, choose a
+                    mitigation method at run time according to the
+                    CPU, the available microcode, the setting of the
+                    CONFIG_RETPOLINE configuration option, and the
+                    compiler with which the kernel was built.
+
+                    Selecting 'on' will also enable the mitigation
+                    against user space to user space task attacks.
+
+                    Selecting 'off' will disable both the kernel and
+                    the user space protections.
+
+                    Specific mitigations can also be selected manually:
+
+                    retpoline         - replace indirect branches
+                    retpoline,generic - Google's original retpoline
+                    retpoline,amd     - AMD-specific minimal thunk
+
+                    Not specifying this option is equivalent to
+                    spectre_v2=auto.
+
+For user space mitigation:
+
+    spectre_v2_user=
+                    [X86] Control mitigation of Spectre variant 2
+                    (indirect branch speculation) vulnerability between
+                    user space tasks
+
+                    on      - Unconditionally enable mitigations. Is
+                              enforced by spectre_v2=on
+
+                    off     - Unconditionally disable mitigations. Is
+                              enforced by spectre_v2=off
+
+                    prctl   - Indirect branch speculation is enabled,
+                              but mitigation can be enabled via prctl
+                              per thread. The mitigation control state
+                              is inherited on fork.
+
+                    prctl,ibpb
+                            - Like "prctl" above, but only STIBP is
+                              controlled per thread. IBPB is issued
+                              always when switching between different user
+                              space processes.
+
+                    seccomp
+                            - Same as "prctl" above, but all seccomp
+                              threads will enable the mitigation unless
+                              they explicitly opt out.
+
+                    seccomp,ibpb
+                            - Like "seccomp" above, but only STIBP is
+                              controlled per thread. IBPB is issued
+                              always when switching between different
+                              user space processes.
+
+                    auto    - Kernel selects the mitigation depending on
+                              the available CPU features and vulnerability.
+
+                    Default mitigation:
+                    If CONFIG_SECCOMP=y then "seccomp", otherwise "prctl"
+
+                    Not specifying this option is equivalent to
+                    spectre_v2_user=auto.
+
+                    In general the kernel by default selects
+                    reasonable mitigations for the current CPU. To
+                    disable Spectre v2 mitigations boot with
+                    spectre_v2=off. Spectre v1 mitigations cannot
+                    be disabled.
+
+APIs for mitigation control of user process
+-------------------------------------------
+
+When enabling the "prctl" option for spectre_v2_user boot parameter,
+prctl can be used to restrict indirect branch speculation on a process.
+See Documentation/speculation.txt for detailed API.
+
+Processes containing secrets, such as cryptographic keys, may invoke
+this prctl for extra protection against Spectre v2.
+
+Before running untrusted processes, restricting their indirect branch
+speculation will prevent such processes from launching Spectre v2 attacks.
+
+Restricting indirect branch speculation on a process should be used only
+as needed, as restricting speculation reduces the performance of both the
+process and any process running on the sibling CPU thread.
+
+Under the "seccomp" option, the processes sandboxed with SECCOMP will
+have indirect branch speculation restricted automatically.
+
+References
+----------
+
+Intel white papers and documents on Spectre:
+
+https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf
+
+[1]
+https://software.intel.com/security-software-guidance/api-app/sites/default/files/Retpoline-A-Branch-Target-Injection-Mitigation.pdf
+
+https://www.intel.com/content/www/us/en/architecture-and-technology/facts-about-side-channel-analysis-and-intel-products.html
+
+[3] https://software.intel.com/security-software-guidance/
+
+https://software.intel.com/security-software-guidance/insights/deep-dive-single-thread-indirect-branch-predictors
+
+AMD white papers:
+
+https://developer.amd.com/wp-content/resources/90343-B_SoftwareTechniquesforManagingSpeculation_WP_7-18Update_FNL.pdf
+
+https://www.amd.com/en/corporate/security-updates
+
+ARM white papers:
+
+https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability/download-the-whitepaper
+
+https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability/latest-updates/cache-speculation-issues-update
+
+MIPS:
+
+https://www.mips.com/blog/mips-response-on-speculative-execution-and-side-channel-vulnerabilities/
+
+Academic papers:
+
+https://spectreattack.com/spectre.pdf [original spectre paper]
+
+[2] https://arxiv.org/abs/1807.10535 [NetSpectre]
+
+https://arxiv.org/abs/1811.05441 [generalization of Spectre]
+
+https://arxiv.org/abs/1807.07940 [Spectre RSB, a variant of Spectre v2]
--
2.9.4
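As a rough illustration of how a user space tool might consume the spectre_v2 sysfs file described in the patch, the sketch below classifies one status line by searching for the strings listed in the tables. The struct and function names are hypothetical, and the way the kernel joins the individual fields into one line (separators, ordering) is an assumption; the code only relies on the quoted strings themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical summary of one spectre_v2 status line; not a kernel type. */
struct spectre_v2_status {
	bool affected;      /* false only when the line starts with "Not affected" */
	bool mitigated;     /* line starts with "Mitigation:" */
	bool ibrs_fw;       /* 'IBRS_FW' present */
	bool ibpb_active;   /* 'IBPB: always-on' or 'IBPB: conditional' */
	bool stibp_active;  /* 'STIBP: forced' or 'STIBP: conditional' */
	bool rsb_filling;   /* 'RSB filling' present */
};

/* Classify a status line by looking for the strings from the tables. */
struct spectre_v2_status parse_spectre_v2(const char *line)
{
	struct spectre_v2_status st;

	assert(line != NULL);
	st.affected = strncmp(line, "Not affected", 12) != 0;
	st.mitigated = strncmp(line, "Mitigation:", 11) == 0;
	st.ibrs_fw = strstr(line, "IBRS_FW") != NULL;
	st.ibpb_active = strstr(line, "IBPB: always-on") != NULL ||
			 strstr(line, "IBPB: conditional") != NULL;
	st.stibp_active = strstr(line, "STIBP: forced") != NULL ||
			  strstr(line, "STIBP: conditional") != NULL;
	st.rsb_filling = strstr(line, "RSB filling") != NULL;
	return st;
}

/* Read the first line of the file at path; 0 on success, -1 on error. */
int read_spectre_v2(const char *path, struct spectre_v2_status *out)
{
	char buf[256];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	*out = parse_spectre_v2(buf);
	return 0;
}
```

A caller would pass /sys/devices/system/cpu/vulnerabilities/spectre_v2 to read_spectre_v2() and treat a -1 return as "status unknown" rather than "not affected".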

6 years, 5 months


[PATCH] ext4: Fix crash during online resizing

by Jan Kara

When computing maximum size of filesystem possible with given number of
group descriptor blocks, we forget to include s_first_data_block into
the number of blocks. Thus for filesystems with non-zero
s_first_data_block it can happen that computed maximum filesystem size
is actually lower than current filesystem size which confuses the code
and eventually leads to a BUG_ON in ext4_alloc_group_tables() hitting on
flex_gd->count == 0. The problem can be reproduced like:

truncate -s 100g /tmp/image
mkfs.ext4 -b 1024 -E resize=262144 /tmp/image 32768
mount -t ext4 -o loop /tmp/image /mnt
resize2fs /dev/loop0 262145
resize2fs /dev/loop0 300000

Fix the problem by properly including s_first_data_block into the
computed number of filesystem blocks.

CC: stable(a)vger.kernel.org
Fixes: 1c6bd7173d66 "ext4: convert file system to meta_bg if needed..."
Signed-off-by: Jan Kara <jack(a)suse.cz>
---
 fs/ext4/resize.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 48421de803b7..3d9b18505c0c 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -1960,7 +1960,8 @@ int ext4_resize_fs(struct super_block *sb, ext4_fsblk_t n_blocks_count)
 			le16_to_cpu(es->s_reserved_gdt_blocks);
 		n_group = n_desc_blocks * EXT4_DESC_PER_BLOCK(sb);
 		n_blocks_count = (ext4_fsblk_t)n_group *
-			EXT4_BLOCKS_PER_GROUP(sb);
+			EXT4_BLOCKS_PER_GROUP(sb) +
+			le32_to_cpu(es->s_first_data_block);
 		n_group--; /* set to last group number */
 	}

--
2.16.4
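The off-by-s_first_data_block arithmetic can be sketched in isolation. The helpers below are hypothetical stand-ins for the expression in ext4_resize_fs(), taking plain integers instead of a superblock. The sample values mentioned afterwards (32 descriptors per block, 8192 blocks per group, first data block 1) are assumptions chosen to resemble a 1 KiB block size layout; they are not read from the reproducer's on-disk state.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Block count as computed before the fix: number of groups that fit in
 * the descriptor blocks, times blocks per group.
 */
uint64_t max_blocks_before_fix(uint64_t n_desc_blocks,
			       uint64_t desc_per_block,
			       uint64_t blocks_per_group)
{
	uint64_t n_group;

	assert(desc_per_block > 0 && blocks_per_group > 0);
	n_group = n_desc_blocks * desc_per_block;
	return n_group * blocks_per_group;
}

/*
 * Block count as computed after the fix: the groups start at
 * first_data_block, so that offset is added to the total.
 */
uint64_t max_blocks_after_fix(uint64_t n_desc_blocks,
			      uint64_t desc_per_block,
			      uint64_t blocks_per_group,
			      uint32_t first_data_block)
{
	return max_blocks_before_fix(n_desc_blocks, desc_per_block,
				     blocks_per_group) + first_data_block;
}
```

With one descriptor block and the assumed values, the old expression yields 262144 blocks while the fixed one yields 262145, so a filesystem that already spans 262145 blocks no longer appears larger than the computed maximum.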

6 years, 5 months


Backporting dwc3 gadget fixes

by Evan Green

Hello stablers,

With the following revert being backported to stable:

a9c859033f6ec Revert "usb: gadget: ffs: Fix BUG when userland exits with submitted AIO transfers"

The original bug it fixed is back. I wonder if we should be backporting
the series that seems to quietly fix that issue:

fec9095bdef4e usb: dwc3: gadget: remove wait_end_transfer
d4f1afe5e896c usb: dwc3: gadget: move requests to cancelled_list
d5443bbf5fc8f usb: dwc3: gadget: introduce cancelled_list
7746a8dfb3f9c usb: dwc3: gadget: extract dwc3_gadget_ep_skip_trbs()
c3acd59014148 usb: dwc3: gadget: use num_trbs when skipping TRBs on ->dequeue()
09fe1f8d7e2f4 usb: dwc3: gadget: track number of TRBs per request
1a22ec6435806 usb: dwc3: gadget: combine unaligned and zero flags

(Patch 1/8 of the original series was already backported).

I know we saw this with 4.19, I'm not sure which other versions it would
go into. I'll re-paste the stack from the original commit that got
reverted. I can easily reproduce this by connecting a host when our
device is in gadget mode, then attempting to gracefully reboot the
system:

[ 382.200896] BUG: scheduling while atomic: screen/1808/0x00000100
[ 382.207124] 4 locks held by screen/1808:
[ 382.211266] #0: (rcu_callback){....}, at: [<c10b4ff0>] rcu_process_callbacks+0x260/0x440
[ 382.219949] #1: (rcu_read_lock_sched){....}, at: [<c1358ba0>] percpu_ref_switch_to_atomic_rcu+0xb0/0x130
[ 382.230034] #2: (&(&ctx->ctx_lock)->rlock){....}, at: [<c11f0c73>] free_ioctx_users+0x23/0xd0
[ 382.230096] #3: (&(&ffs->eps_lock)->rlock){....}, at: [<f81e7710>] ffs_aio_cancel+0x20/0x60 [usb_f_fs]
[ 382.230160] Modules linked in: usb_f_fs libcomposite configfs bnep btsdio bluetooth ecdh_generic brcmfmac brcmutil intel_powerclamp coretemp dwc3 kvm_intel ulpi udc_core kvm irqbypass crc32_pclmul crc32c_intel pcbc dwc3_pci aesni_intel aes_i586 crypto_simd cryptd ehci_pci ehci_hcd gpio_keys usbcore basincove_gpadc industrialio usb_common
[ 382.230407] CPU: 1 PID: 1808 Comm: screen Not tainted 4.14.0-edison+ #117
[ 382.230416] Hardware name: Intel Corporation Merrifield/BODEGA BAY, BIOS 542 2015.01.21:18.19.48
[ 382.230425] Call Trace:
[ 382.230438] <SOFTIRQ>
[ 382.230466] dump_stack+0x47/0x62
[ 382.230498] __schedule_bug+0x61/0x80
[ 382.230522] __schedule+0x43/0x7a0
[ 382.230587] schedule+0x5f/0x70
[ 382.230625] dwc3_gadget_ep_dequeue+0x14c/0x270 [dwc3]
[ 382.230669] ? do_wait_intr_irq+0x70/0x70
[ 382.230724] usb_ep_dequeue+0x19/0x90 [udc_core]
[ 382.230770] ffs_aio_cancel+0x37/0x60 [usb_f_fs]
[ 382.230798] kiocb_cancel+0x31/0x40
[ 382.230822] free_ioctx_users+0x4d/0xd0
[ 382.230858] percpu_ref_switch_to_atomic_rcu+0x10a/0x130
[ 382.230881] ? percpu_ref_exit+0x40/0x40
[ 382.230904] rcu_process_callbacks+0x2b3/0x440
[ 382.230965] __do_softirq+0xf8/0x26b
[ 382.231011] ? __softirqentry_text_start+0x8/0x8
[ 382.231033] do_softirq_own_stack+0x22/0x30
[ 382.231042] </SOFTIRQ>
[ 382.231071] irq_exit+0x45/0xc0
[ 382.231089] smp_apic_timer_interrupt+0x13c/0x150
[ 382.231118] apic_timer_interrupt+0x35/0x3c

Felipe/others, any thoughts about this?

-Evan

6 years, 5 months


[PATCH for v4.9] fs: don't scan the inode cache before SB_BORN is set

by Aaron Lu

One of our servers recently hit a kernel crash and the callstack is:

[6469391.997662] BUG: unable to handle kernel NULL pointer dereference at 0000000000000070
[6469392.005693] IP: [<ffffffff811cad80>] shmem_unused_huge_count+0x10/0x20
[6469392.012412] PGD 1000c21067
[6469392.015203] PUD ffc306067
[6469392.018089] PMD 0
[6469392.018627]
[6469392.020303] Oops: 0000 [#1] SMP
[6469392.023621] Modules linked in: kpatch_6iljwh9b(OE) memcg_force_swapin(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nfsd auth_rpcgss nfs_acl [last unloaded: memcg_force_swapin]
[6469392.040177] CPU: 2 PID: 89058 Comm: ilogtail Tainted: G OE K 4.9.93-010.ali3000.alios7.x86_64 #1
[6469392.049996] Hardware name: Inventec K900-1G /B900G2-1G , BIOS A2.32 10/09/2014
[6469392.060334] task: ffff8802217b1800 task.stack: ffffc9004ea88000
[6469392.066418] RIP: 0010:[<ffffffff811cad80>] [<ffffffff811cad80>] shmem_unused_huge_count+0x10/0x20
[6469392.075563] RSP: 0018:ffffc9004ea8b6c0 EFLAGS: 00010282
[6469392.081041] RAX: 0000000000000000 RBX: 0000000000000020 RCX: 0000000000000001
[6469392.088339] RDX: 0000000000000001 RSI: ffffc9004ea8b780 RDI: ffff881749bd2000
[6469392.095635] RBP: ffffc9004ea8b6c0 R08: 28f5c28f5c28f5c3 R09: ffff88173bf3fce0
[6469392.102934] R10: ffff88207ffd4000 R11: 0000000000000000 R12: ffff881749bd24c0
[6469392.110233] R13: ffffc9004ea8b780 R14: 0000000000000000 R15: ffff88207ffd4000
[6469392.117533] FS: 00007fe260420700(0000) GS:ffff88103fa80000(0000) knlGS:0000000000000000
[6469392.125792] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[6469392.131703] CR2: 0000000000000070 CR3: 00000005bb46d000 CR4: 00000000001606f0
[6469392.138999] Stack:
[6469392.141185] ffffc9004ea8b6f0 ffffffff81247bee 0000000000000020 0000000000000400
[6469392.148811] ffff881749bd24c0 0000000000000000 ffffc9004ea8b7d0 ffffffff811c431c
[6469392.156436] 0000000000000020 0000000000000000 ffff88207b82c000 0000000000000001
[6469392.164063] Call Trace:
[6469392.166692] [<ffffffff81247bee>] super_cache_count+0x3e/0xe0
[6469392.172607] [<ffffffff811c431c>] shrink_slab.part.38+0x11c/0x420
[6469392.178875] [<ffffffff811c4649>] shrink_slab+0x29/0x30
[6469392.184273] [<ffffffff811c93cf>] shrink_node+0xff/0x300
[6469392.189756] [<ffffffff811c96dd>] do_try_to_free_pages+0x10d/0x330
[6469392.196104] [<ffffffff811c9b65>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
[6469392.203063] [<ffffffff81230b5d>] try_charge+0x14d/0x720
[6469392.208551] [<ffffffff8121b8e3>] ? kmem_cache_alloc+0xd3/0x1a0
[6469392.214642] [<ffffffff811b14e5>] ? mempool_alloc_slab+0x15/0x20
[6469392.220825] [<ffffffff81235b4e>] mem_cgroup_try_charge+0x6e/0x1b0
[6469392.227177] [<ffffffff811ae174>] __add_to_page_cache_locked+0x64/0x220
[6469392.233961] [<ffffffff811ae39e>] add_to_page_cache_lru+0x4e/0xe0
[6469392.240242] [<ffffffffa03ce2d1>] ext4_mpage_readpages+0x151/0x980 [ext4]
[6469392.247211] [<ffffffffa037edb5>] ext4_readpages+0x35/0x40 [ext4]
[6469392.253474] [<ffffffff811be9e7>] __do_page_cache_readahead+0x197/0x240
[6469392.260260] [<ffffffff811ae45c>] ? pagecache_get_page+0x2c/0x2a0
[6469392.266523] [<ffffffff811b0f4b>] filemap_fault+0x4db/0x590
[6469392.272282] [<ffffffffa0388fd6>] ext4_filemap_fault+0x36/0x50 [ext4]
[6469392.278896] [<ffffffff811e4a90>] __do_fault+0x80/0x170
[6469392.284292] [<ffffffff811e87b2>] do_fault+0x4c2/0x720
[6469392.289603] [<ffffffff8111513f>] ? futex_wait_queue_me+0x9f/0x120
[6469392.295954] [<ffffffff811e9162>] handle_mm_fault+0x512/0xc90
[6469392.301874] [<ffffffff8106eb8b>] __do_page_fault+0x24b/0x4d0
[6469392.307796] [<ffffffff811184c5>] ? SyS_futex+0x85/0x170
[6469392.313280] [<ffffffff8106ee40>] do_page_fault+0x30/0x80
[6469392.318850] [<ffffffff81003bf4>] ? do_syscall_64+0x74/0x180
[6469392.324679] [<ffffffff81722b68>] page_fault+0x28/0x30
[6469392.329986] Code: 00 48 83 43 38 01 4c 89 e7 c6 <48> 8b 40 70 5d c3 66 2e 0f 1f 84
[6469392.338183] RIP [<ffffffff811cad80>] shmem_unused_huge_count+0x10/0x20
[6469392.344990] RSP <ffffc9004ea8b6c0>
[6469392.348656] CR2: 0000000000000070

Google showed me Dave Chinner's fix and I think it is the right fix for
our problem (it is not easy to reproduce in our production environment,
so I haven't been able to confirm). Unfortunately, this commit was only
backported to the v4.14 and v4.16 stable kernels, not the v4.9 stable
kernel, presumably due to the rename of MS_BORN to SB_BORN starting from
v4.14.

To make this patch work on v4.9, I have made one minor change to Dave's
commit: it keeps using MS_BORN. I think this is correct, but since I know
very little about fs code, please kindly review. Thanks a lot for your
time.

From 5cdf1679c9120a173a2bc9dff214332e99f741bc Mon Sep 17 00:00:00 2001
From: Dave Chinner <dchinner(a)redhat.com>
Date: Fri, 11 May 2018 11:20:57 +1000
Subject: [PATCH] fs: don't scan the inode cache before SB_BORN is set

commit 79f546a696bff2590169fb5684e23d65f4d9f591 upstream.

We recently had an oops reported on a 4.14 kernel in
xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
and so the m_perag_tree lookup walked into lala land. It produces
an oops down this path during the failed mount:

  radix_tree_gang_lookup_tag+0xc4/0x130
  xfs_perag_get_tag+0x37/0xf0
  xfs_reclaim_inodes_count+0x32/0x40
  xfs_fs_nr_cached_objects+0x11/0x20
  super_cache_count+0x35/0xc0
  shrink_slab.part.66+0xb1/0x370
  shrink_node+0x7e/0x1a0
  try_to_free_pages+0x199/0x470
  __alloc_pages_slowpath+0x3a1/0xd20
  __alloc_pages_nodemask+0x1c3/0x200
  cache_grow_begin+0x20b/0x2e0
  fallback_alloc+0x160/0x200
  kmem_cache_alloc+0x111/0x4e0

The problem is that the superblock shrinker is running before the
filesystem structures it depends on have been fully set up. i.e.
the shrinker is registered in sget(), before ->fill_super() has been
called, and the shrinker can call into the filesystem before
fill_super() does its setup work. Essentially we are exposed to
both use-after-free and use-before-initialisation bugs here.

To fix this, add a check for the SB_BORN flag in super_cache_count.
In general, this flag is not set until ->fs_mount() completes
successfully, so we know that it is set after the filesystem
setup has completed. This matches the trylock_super() behaviour
which will not let super_cache_scan() run if SB_BORN is not set, and
hence will not allow the superblock shrinker from entering the
filesystem while it is being set up or after it has failed setup
and is being torn down.

Cc: stable(a)kernel.org
Signed-Off-By: Dave Chinner <dchinner(a)redhat.com>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Aaron Lu <aaron.lu(a)linux.alibaba.com>
---
 fs/super.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 7e9beab77259..abe2541fb28c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -119,13 +119,23 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	sb = container_of(shrink, struct super_block, s_shrink);
 
 	/*
-	 * Don't call trylock_super as it is a potential
-	 * scalability bottleneck. The counts could get updated
-	 * between super_cache_count and super_cache_scan anyway.
-	 * Call to super_cache_count with shrinker_rwsem held
-	 * ensures the safety of call to list_lru_shrink_count() and
-	 * s_op->nr_cached_objects().
+	 * We don't call trylock_super() here as it is a scalability bottleneck,
+	 * so we're exposed to partial setup state. The shrinker rwsem does not
+	 * protect filesystem operations backing list_lru_shrink_count() or
+	 * s_op->nr_cached_objects(). Counts can change between
+	 * super_cache_count and super_cache_scan, so we really don't need locks
+	 * here.
+	 *
+	 * However, if we are currently mounting the superblock, the underlying
+	 * filesystem might be in a state of partial construction and hence it
+	 * is dangerous to access it. trylock_super() uses a MS_BORN check to
+	 * avoid this situation, so do the same here. The memory barrier is
+	 * matched with the one in mount_fs() as we don't hold locks here.
 	 */
+	if (!(sb->s_flags & MS_BORN))
+		return 0;
+	smp_rmb();
+
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb, sc);
 
@@ -1193,6 +1203,14 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
 	sb = root->d_sb;
 	BUG_ON(!sb);
 	WARN_ON(!sb->s_bdi);
+
+	/*
+	 * Write barrier is for super_cache_count(). We place it before setting
+	 * MS_BORN as the data dependency between the two functions is the
+	 * superblock structure contents that we just set up, not the MS_BORN
+	 * flag.
+	 */
+	smp_wmb();
 	sb->s_flags |= MS_BORN;
 
 	error = security_sb_kern_mount(sb, flags, secdata);
--
2.19.1.3.ge56e4f7
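The flag-plus-barrier pairing in this patch can be approximated in user space with C11 atomics. The sketch below is only an analogy, not kernel code: struct fake_sb, FAKE_BORN and the three functions are hypothetical names, and the C11 release and acquire fences stand in for smp_wmb() and smp_rmb(), which they resemble but are somewhat stronger than.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define FAKE_BORN 0x1u	/* hypothetical stand-in for MS_BORN */

/* Hypothetical miniature superblock. */
struct fake_sb {
	long nr_cached;		/* stands in for filesystem-private state */
	atomic_uint flags;	/* stands in for sb->s_flags */
};

/* Start in the "not yet born" state with deliberately bogus contents. */
void fake_sb_init(struct fake_sb *sb)
{
	assert(sb != NULL);
	sb->nr_cached = -1;
	atomic_init(&sb->flags, 0u);
}

/* Writer side, like mount_fs(): set up the state, fence, then publish. */
void fake_mount_finish(struct fake_sb *sb, long nr_cached)
{
	assert(sb != NULL);
	sb->nr_cached = nr_cached;
	atomic_thread_fence(memory_order_release);	/* plays the role of smp_wmb() */
	atomic_fetch_or_explicit(&sb->flags, FAKE_BORN, memory_order_relaxed);
}

/* Reader side, like super_cache_count(): check the flag, fence, then read. */
long fake_cache_count(struct fake_sb *sb)
{
	assert(sb != NULL);
	if (!(atomic_load_explicit(&sb->flags, memory_order_relaxed) & FAKE_BORN))
		return 0;	/* not fully set up yet: report nothing */
	atomic_thread_fence(memory_order_acquire);	/* plays the role of smp_rmb() */
	return sb->nr_cached;
}
```

The point of the ordering is that a reader that observes the flag is also guaranteed to observe the state written before the flag was set, while a reader that does not observe it never touches the partially constructed state.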

6 years, 5 months

