The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a2308c11ecbc3471ebb7435ee8075815b1502ef0 Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Date: Mon, 18 Nov 2019 13:09:52 +0100
Subject: [PATCH] s390/smp,vdso: fix ASCE handling
When a secondary CPU is brought up it must initialize its control
registers. CPU A, which triggers bringing up secondary CPU B, stores
its control register contents into the lowcore of the new CPU B, which
then loads these values on startup.
This is problematic in various ways: the control register which
contains the home space ASCE will correctly contain the kernel ASCE;
however, the control registers for the primary and secondary ASCEs are
initialized with whatever values were present in CPU A.
Typically:
- the primary ASCE will contain the user process ASCE of the process
that triggered onlining of CPU B.
- the secondary ASCE will contain the percpu VDSO ASCE of CPU A.
Due to lazy ASCE handling we may also end up with other combinations.
When CPU B then switches to a different process (!= idle) it will
fix up the primary ASCE. However, the problem is that the (wrong) ASCE
from CPU A was loaded into control register 1: as soon as an ASCE is
attached (aka loaded) a CPU is free to generate TLB entries using that
address space.
Even though it is very unlikely that CPU B will actually generate such
entries, this could result in TLB entries of the address space of the
process that ran on CPU A. These entries shouldn't exist at all and
could cause problems later on.
Furthermore the secondary ASCE of CPU B will not be updated correctly.
This means that processes may see wrong results or even crash if they
access VDSO data on CPU B. The correct VDSO ASCE will eventually be
loaded on return to user space as soon as the kernel executes a call
to strnlen_user or an atomic futex operation on CPU B.
Fix both issues by initializing the to-be-loaded control register
contents with the correct ASCEs, and also enforce (re-)loading of the
ASCEs upon the first context switch and return to user space.
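A minimal userspace sketch of the pattern (hypothetical names, not the
actual s390 code): the new CPU's register save area is seeded from CPU
A's registers, so the two ASCE slots must be patched with CPU B's own
values before CPU B ever loads them, mirroring what the hunk below does
for cregs_save_area[1] and cregs_save_area[7].

#include <stdio.h>
#include <string.h>

struct cpu_state {
	unsigned long cregs[16];	/* control register save area */
	unsigned long kernel_asce;	/* correct boot-time value for cr1 */
	unsigned long vdso_asce;	/* correct per-cpu value for cr7 */
};

static void prepare_secondary(const struct cpu_state *a, struct cpu_state *b)
{
	/* Buggy seed: B inherits A's cr1 (primary ASCE of whatever
	 * process ran on A) and cr7 (A's per-cpu vdso ASCE). */
	memcpy(b->cregs, a->cregs, sizeof(b->cregs));

	/* The fix: overwrite the ASCE slots with B's own values before
	 * B ever loads the control registers. */
	b->cregs[1] = b->kernel_asce;
	b->cregs[7] = b->vdso_asce;
}

int main(void)
{
	struct cpu_state a = { .cregs = { [1] = 0xaaa1, [7] = 0xaaa7 } };
	struct cpu_state b = { .kernel_asce = 0xbbb1, .vdso_asce = 0xbbb7 };

	prepare_secondary(&a, &b);
	printf("CPU B: cr1=%#lx cr7=%#lx\n", b.cregs[1], b.cregs[7]);
	return 0;
}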
Fixes: 0aaba41b58bc ("s390: remove all code using the access register mode")
Cc: stable(a)vger.kernel.org # v4.15+
Signed-off-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Signed-off-by: Vasily Gorbik <gor(a)linux.ibm.com>
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 6acdcf1d4074..06dddd7c4290 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -262,10 +262,13 @@ static void pcpu_prepare_secondary(struct pcpu *pcpu, int cpu)
lc->spinlock_index = 0;
lc->percpu_offset = __per_cpu_offset[cpu];
lc->kernel_asce = S390_lowcore.kernel_asce;
+ lc->user_asce = S390_lowcore.kernel_asce;
lc->machine_flags = S390_lowcore.machine_flags;
lc->user_timer = lc->system_timer =
lc->steal_timer = lc->avg_steal_timer = 0;
__ctl_store(lc->cregs_save_area, 0, 15);
+ lc->cregs_save_area[1] = lc->kernel_asce;
+ lc->cregs_save_area[7] = lc->vdso_asce;
save_access_regs((unsigned int *) lc->access_regs_save_area);
memcpy(lc->stfle_fac_list, S390_lowcore.stfle_fac_list,
sizeof(lc->stfle_fac_list));
@@ -844,6 +847,8 @@ static void smp_init_secondary(void)
S390_lowcore.last_update_clock = get_tod_clock();
restore_access_regs(S390_lowcore.access_regs_save_area);
+ set_cpu_flag(CIF_ASCE_PRIMARY);
+ set_cpu_flag(CIF_ASCE_SECONDARY);
cpu_init();
preempt_disable();
init_cpu_timer();
From: Leo Yan <leo.yan(a)linaro.org>
The test case 'Read backward ring buffer' failed on 32-bit
architectures, as found by LKFT perf testing. The test failed on the
arm32 x15 device, qemu_arm32, and qemu_i386, and failed intermittently
on i386; the failure log is as below:
50: Read backward ring buffer :
--- start ---
test child forked, pid 510
Using CPUID GenuineIntel-6-9E-9
mmap size 1052672B
mmap size 8192B
Finished reading overwrite ring buffer: rewind
free(): invalid next size (fast)
test child interrupted
---- end ----
Read backward ring buffer: FAILED!
The log hints at a memory usage issue, hence free() reporting the
'invalid next size' error and the test case exiting directly. This
issue was finally root-caused to an out-of-bounds memory access on the
data array 'evsel->id'.
The backward ring buffer test invokes do_test() twice. 'evsel->id' is
allocated at the first call with the flow:
  test__backward_ring_buffer()
    `-> do_test()
          `-> evlist__mmap()
                `-> evlist__mmap_ex()
                      `-> perf_evsel__alloc_id()
So 'evsel->id' is allocated with one item, and it will be used in
function perf_evlist__id_add():
  evsel->id[0] = id
  evsel->ids = 1
At the second call to do_test(), initialization of 'evsel->id' is
skipped and the array allocated in the first call is reused. But
'evsel->ids' still contains the stale value. Thus:
  evsel->id[1] = id -> out-of-bounds access
  evsel->ids = 2
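A minimal userspace sketch of this bug pattern (hypothetical names, not
the perf code itself; the bounds check in id_add() only reports the
write that the unfixed test actually performed):

#include <stdio.h>
#include <stdlib.h>

struct evsel_like {
	unsigned long *id;	/* id array, sized on the first allocation */
	int ids;		/* element count, never reset between runs */
	int nr_alloc;		/* capacity of the id array */
};

static void alloc_id_once(struct evsel_like *e)
{
	if (!e->id) {		/* the second run skips this, as in the test */
		e->nr_alloc = 1;
		e->id = calloc(e->nr_alloc, sizeof(*e->id));
	}
}

static void id_add(struct evsel_like *e, unsigned long id)
{
	if (e->ids >= e->nr_alloc) {
		/* The unfixed test wrote evsel->id[1] here, one element
		 * past the allocation, corrupting heap metadata. */
		fprintf(stderr, "would write out of bounds at id[%d]\n",
			e->ids);
		e->ids++;
		return;
	}
	e->id[e->ids++] = id;
}

int main(void)
{
	struct evsel_like e = { 0 };
	int run;

	for (run = 0; run < 2; run++) {	/* do_test() is invoked twice */
		alloc_id_once(&e);
		id_add(&e, 0x1234);	/* run 1: id[0]; run 2: out of bounds */
	}
	free(e.id);
	return 0;
}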
To fix this issue, use the evlist__open() and evlist__close() pair of
functions to prepare and clean up the context for the evlist; this way
'evsel->id' and 'evsel->ids' are initialized properly for each
do_test() invocation and the out-of-bounds memory access is avoided.
Fixes: ee74701ed8ad ("perf tests: Add test to check backward ring buffer")
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
Reviewed-by: Jiri Olsa <jolsa(a)kernel.org>
Cc: Alexander Shishkin <alexander.shishkin(a)linux.intel.com>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Namhyung Kim <namhyung(a)kernel.org>
Cc: Naresh Kamboju <naresh.kamboju(a)linaro.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Wang Nan <wangnan0(a)huawei.com>
Cc: stable(a)vger.kernel.org # v4.10+
Link: http://lore.kernel.org/lkml/20191107020244.2427-1-leo.yan@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme(a)redhat.com>
---
tools/perf/tests/backward-ring-buffer.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/tools/perf/tests/backward-ring-buffer.c b/tools/perf/tests/backward-ring-buffer.c
index a4cd30c0beb3..15cea518f5ad 100644
--- a/tools/perf/tests/backward-ring-buffer.c
+++ b/tools/perf/tests/backward-ring-buffer.c
@@ -148,6 +148,15 @@ int test__backward_ring_buffer(struct test *test __maybe_unused, int subtest __m
goto out_delete_evlist;
}
+ evlist__close(evlist);
+
+ err = evlist__open(evlist);
+ if (err < 0) {
+ pr_debug("perf_evlist__open: %s\n",
+ str_error_r(errno, sbuf, sizeof(sbuf)));
+ goto out_delete_evlist;
+ }
+
err = do_test(evlist, 1, &sample_count, &comm_count);
if (err != TEST_OK)
goto out_delete_evlist;
--
2.21.0
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 625110b5e9dae9074d8a7e67dd07f821a053eed7 Mon Sep 17 00:00:00 2001
From: Thomas Hellstrom <thellstrom(a)vmware.com>
Date: Sat, 30 Nov 2019 17:51:32 -0800
Subject: [PATCH] mm/memory.c: fix a huge pud insertion race during faulting
A huge pud page can theoretically be faulted in, racing with pmd_alloc()
in __handle_mm_fault(). That will lead to pmd_alloc() returning an
invalid pmd pointer.
Fix this by adding a pud_trans_unstable() function, similar to
pmd_trans_unstable(), and checking whether the pud is really stable
before using the pmd pointer.
Race:
  Thread 1:            Thread 2:            Comment
  create_huge_pud()                         Fallback - not taken.
                       create_huge_pud()    Taken.
  pmd_alloc()                               Returns an invalid pointer.
This will result in user-visible huge page data corruption.
Note that this was caught during a code audit rather than as a problem
experienced in practice. It looks to me like the only implementation
that currently creates huge pud pagetable entries is
dev_dax_huge_fault(), which doesn't appear to care much about private
(COW) mappings or write-tracking, which is, I believe, a prerequisite
for create_huge_pud() falling back in thread 1, but not in thread 2.
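A minimal single-threaded userspace sketch of the retry pattern (all
names hypothetical; simulate_race() stands in for the competing thread
in the table above):

#include <stdio.h>

enum entry_state { ENTRY_NONE, ENTRY_TABLE, ENTRY_HUGE };

static int racer_fired;

/* Stands in for pud_trans_unstable(): anything that is no longer a
 * plain page-table entry must be treated as unstable. */
static int parent_is_unstable(enum entry_state e)
{
	return e != ENTRY_TABLE;
}

/* Models thread 2 from the table above: while thread 1 sits in
 * "pmd_alloc()", the racer installs a huge entry in the parent. */
static void simulate_race(enum entry_state *pud)
{
	if (!racer_fired) {
		racer_fired = 1;
		*pud = ENTRY_HUGE;
	}
}

static int handle_fault(enum entry_state *pud)
{
retry_pud:
	if (*pud == ENTRY_NONE) {
		*pud = ENTRY_TABLE;	/* fallback: install a pud table */
	} else if (*pud == ENTRY_HUGE) {
		puts("handled as a huge pud fault");
		return 0;
	}
	simulate_race(pud);		/* the "pmd_alloc()" window */
	if (parent_is_unstable(*pud)) {
		puts("pud went unstable under us, retrying");
		goto retry_pud;		/* same retry the patch adds */
	}
	puts("walking the pmd level safely");
	return 0;
}

int main(void)
{
	enum entry_state pud = ENTRY_NONE;

	return handle_fault(&pud);
}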
Link: http://lkml.kernel.org/r/20191115115808.21181-2-thomas_os@shipmail.org
Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
Signed-off-by: Thomas Hellstrom <thellstrom(a)vmware.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 3127f9028f54..798ea36a0549 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -938,6 +938,31 @@ static inline int pud_trans_huge(pud_t pud)
}
#endif
+/* See pmd_none_or_trans_huge_or_clear_bad for discussion. */
+static inline int pud_none_or_trans_huge_or_dev_or_clear_bad(pud_t *pud)
+{
+ pud_t pudval = READ_ONCE(*pud);
+
+ if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval))
+ return 1;
+ if (unlikely(pud_bad(pudval))) {
+ pud_clear_bad(pud);
+ return 1;
+ }
+ return 0;
+}
+
+/* See pmd_trans_unstable for discussion. */
+static inline int pud_trans_unstable(pud_t *pud)
+{
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
+ defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+ return pud_none_or_trans_huge_or_dev_or_clear_bad(pud);
+#else
+ return 0;
+#endif
+}
+
#ifndef pmd_read_atomic
static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
{
diff --git a/mm/memory.c b/mm/memory.c
index 62b5cce653f6..c3902201989f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4010,6 +4010,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
vmf.pud = pud_alloc(mm, p4d, address);
if (!vmf.pud)
return VM_FAULT_OOM;
+retry_pud:
if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) {
ret = create_huge_pud(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
@@ -4036,6 +4037,11 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
vmf.pmd = pmd_alloc(mm, vmf.pud, address);
if (!vmf.pmd)
return VM_FAULT_OOM;
+
+ /* Huge pud page fault raced with pmd_alloc? */
+ if (pud_trans_unstable(vmf.pud))
+ goto retry_pud;
+
if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
ret = create_huge_pmd(&vmf);
if (!(ret & VM_FAULT_FALLBACK))