The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 06c5fe9b12dde1b62821f302f177c972bb1c81f9 Mon Sep 17 00:00:00 2001
From: Xiaochen Shen <xiaochen.shen(a)intel.com>
Date: Fri, 4 Dec 2020 14:27:59 +0800
Subject: [PATCH] x86/resctrl: Fix incorrect local bandwidth when mba_sc is
enabled
The MBA software controller (mba_sc) is a feedback loop which
periodically reads MBM counters and tries to restrict the bandwidth
below a user-specified value. It tags along the MBM counter overflow
handler to do the updates with 1s interval in mbm_update() and
update_mba_bw().
The purpose of mbm_update() is to periodically read the MBM counters to
make sure that the hardware counter doesn't wrap around more than once
between user samplings. mbm_update() calls __mon_event_count() for local
bandwidth updating when mba_sc is not enabled, but calls mbm_bw_count()
instead when mba_sc is enabled. __mon_event_count() will not be called
for local bandwidth updating in MBM counter overflow handler, but it is
still called when reading MBM local bandwidth counter file
'mbm_local_bytes', the call path is as below:
rdtgroup_mondata_show()
mon_event_read()
mon_event_count()
__mon_event_count()
In __mon_event_count(), m->chunks is updated by delta chunks which is
calculated from previous MSR value (m->prev_msr) and current MSR value.
When mba_sc is enabled, m->chunks is also updated in mbm_update() by
mistake by the delta chunks which is calculated from m->prev_bw_msr
instead of m->prev_msr. But m->chunks is not used in update_mba_bw() in
the mba_sc feedback loop.
When reading MBM local bandwidth counter file, m->chunks was changed
unexpectedly by mbm_bw_count(). As a result, the incorrect local
bandwidth counter which calculated from incorrect m->chunks is shown to
the user.
Fix this by removing incorrect m->chunks updating in mbm_bw_count() in
MBM counter overflow handler, and always calling __mon_event_count() in
mbm_update() to make sure that the hardware local bandwidth counter
doesn't wrap around.
Test steps:
# Run workload with aggressive memory bandwidth (e.g., 10 GB/s)
git clone https://github.com/intel/intel-cmt-cat && cd intel-cmt-cat
&& make
./tools/membw/membw -c 0 -b 10000 --read
# Enable MBA software controller
mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
# Create control group c1
mkdir /sys/fs/resctrl/c1
# Set MB throttle to 6 GB/s
echo "MB:0=6000;1=6000" > /sys/fs/resctrl/c1/schemata
# Write PID of the workload to tasks file
echo `pidof membw` > /sys/fs/resctrl/c1/tasks
# Read local bytes counters twice with 1s interval, the calculated
# local bandwidth is not as expected (approaching to 6 GB/s):
local_1=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
sleep 1
local_2=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
echo "local b/w (bytes/s):" `expr $local_2 - $local_1`
Before fix:
local b/w (bytes/s): 11076796416
After fix:
local b/w (bytes/s): 5465014272
Fixes: ba0f26d8529c (x86/intel_rdt/mba_sc: Prepare for feedback loop)
Signed-off-by: Xiaochen Shen <xiaochen.shen(a)intel.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Reviewed-by: Tony Luck <tony.luck(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/1607063279-19437-1-git-send-email-xiaochen.shen@i…
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 54dffe574e67..a98519a3a2e6 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -279,7 +279,6 @@ static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
return;
chunks = mbm_overflow_count(m->prev_bw_msr, tval, rr->r->mbm_width);
- m->chunks += chunks;
cur_bw = (chunks * r->mon_scale) >> 20;
if (m->delta_comp)
@@ -450,15 +449,14 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
}
if (is_mbm_local_enabled()) {
rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
+ __mon_event_count(rmid, &rr);
/*
* Call the MBA software controller only for the
* control groups and when user has enabled
* the software controller explicitly.
*/
- if (!is_mba_sc(NULL))
- __mon_event_count(rmid, &rr);
- else
+ if (is_mba_sc(NULL))
mbm_bw_count(rmid, &rr);
}
}
先生/女士,
我可以获取有关一项合法业务交易建议的非常重要的信息,该提议将使我们双方受益,£15,500,000.00 百万磅。 如果我可以独自完成任务,那么我就不会麻烦与您联系。
最终,我需要一个诚实的外国人在完成这项业务交易中发挥重要作用。
回复此邮件以获取有关此业务交易的更多信息:galvan.johnny(a)outlook.com
问候,
约翰·加尔文先生
-----------------------------------------------
Sir/Madam,
I have access to very vital information for a Legit business transaction proposal that will benefits both of us with the sum of £15,500,000.00 Million Pounds. If it was possible for me to do it alone, I would not have bothered contacting you.
Ultimately, I need an honest foreigner to play an important role in the completion of this business transaction.
Reply to this mail for more information about this business transaction: galvan.johnny(a)outlook.com
Regards,
Mr. John Galvan
It seems to me that most RSEQ membarrier users will expect any
stores done before the membarrier() syscall to be visible to the
target task(s). While this is extremely likely to be true in
practice, nothing actually guarantees it by a strict reading of the
x86 manuals. Rather than providing this guarantee by accident and
potentially causing a problem down the road, just add an explicit
barrier.
Cc: stable(a)vger.kernel.org
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
---
kernel/sched/membarrier.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 5a40b3828ff2..6251d3d12abe 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -168,6 +168,14 @@ static void ipi_mb(void *info)
static void ipi_rseq(void *info)
{
+ /*
+ * Ensure that all stores done by the calling thread are visible
+ * to the current task before the current task resumes. We could
+ * probably optimize this away on most architectures, but by the
+ * time we've already sent an IPI, the cost of the extra smp_mb()
+ * is negligible.
+ */
+ smp_mb();
rseq_preempt(current);
}
--
2.28.0
The JZ4760 has the HPTLB as well, but has a XBurst CPU with a D0 CPUID.
Disable the HPTLB for all XBurst CPUs with a D0 CPUID. In the case where
there is no HPTLB (e.g. for older SoCs), this won't have any side
effect.
Fixes: b02efeb05699 ("MIPS: Ingenic: Disable abandoned HPTLB function.")
Cc: <stable(a)vger.kernel.org> # 5.4
Signed-off-by: Paul Cercueil <paul(a)crapouillou.net>
---
arch/mips/kernel/cpu-probe.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/arch/mips/kernel/cpu-probe.c b/arch/mips/kernel/cpu-probe.c
index e6853697a056..31cb9199197c 100644
--- a/arch/mips/kernel/cpu-probe.c
+++ b/arch/mips/kernel/cpu-probe.c
@@ -1830,16 +1830,17 @@ static inline void cpu_probe_ingenic(struct cpuinfo_mips *c, unsigned int cpu)
*/
case PRID_COMP_INGENIC_D0:
c->isa_level &= ~MIPS_CPU_ISA_M32R2;
- break;
+ fallthrough;
/*
* The config0 register in the XBurst CPUs with a processor ID of
- * PRID_COMP_INGENIC_D1 has an abandoned huge page tlb mode, this
- * mode is not compatible with the MIPS standard, it will cause
- * tlbmiss and into an infinite loop (line 21 in the tlb-funcs.S)
- * when starting the init process. After chip reset, the default
- * is HPTLB mode, Write 0xa9000000 to cp0 register 5 sel 4 to
- * switch back to VTLB mode to prevent getting stuck.
+ * PRID_COMP_INGENIC_D0 or PRID_COMP_INGENIC_D1 has an abandoned
+ * huge page tlb mode, this mode is not compatible with the MIPS
+ * standard, it will cause tlbmiss and into an infinite loop
+ * (line 21 in the tlb-funcs.S) when starting the init process.
+ * After chip reset, the default is HPTLB mode, Write 0xa9000000
+ * to cp0 register 5 sel 4 to switch back to VTLB mode to prevent
+ * getting stuck.
*/
case PRID_COMP_INGENIC_D1:
write_c0_page_ctrl(XBURST_PAGECTRL_HPTLB_DIS);
--
2.29.2
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e91d8d78237de8d7120c320b3645b7100848f24d Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan(a)kernel.org>
Date: Sat, 5 Dec 2020 22:14:51 -0800
Subject: [PATCH] mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING
While I was doing zram testing, I found sometimes decompression failed
since the compression buffer was corrupted. With investigation, I found
below commit calls cond_resched unconditionally so it could make a
problem in atomic context if the task is reschedule.
BUG: sleeping function called from invalid context at mm/vmalloc.c:108
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
3 locks held by memhog/946:
#0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
#1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
#2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
Call Trace:
unmap_kernel_range_noflush+0x2eb/0x350
unmap_kernel_range+0x14/0x30
zs_unmap_object+0xd5/0xe0
zram_bvec_rw.isra.0+0x38c/0x8e0
zram_rw_page+0x90/0x101
bdev_write_page+0x92/0xe0
__swap_writepage+0x94/0x4a0
pageout+0xe3/0x3a0
shrink_page_list+0xb94/0xd60
shrink_inactive_list+0x158/0x460
We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
contains the offending calling code) from zsmalloc.
Even though this option showed some amount improvement(e.g., 30%) in
some arm32 platforms, it has been headache to maintain since it have
abused APIs[1](e.g., unmap_kernel_range in atomic context).
Since we are approaching to deprecate 32bit machines and already made
the config option available for only builtin build since v5.8, lastly it
has been not default option in zsmalloc, it's time to drop the option
for better maintenance.
[1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org
Fixes: e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
Signed-off-by: Minchan Kim <minchan(a)kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky(a)gmail.com>
Cc: Tony Lindgren <tony(a)atomide.com>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: Harish Sriram <harish(a)linux.ibm.com>
Cc: Uladzislau Rezki <urezki(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.com
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/arch/arm/configs/omap2plus_defconfig b/arch/arm/configs/omap2plus_defconfig
index 34793aabdb65..58df9fd79a76 100644
--- a/arch/arm/configs/omap2plus_defconfig
+++ b/arch/arm/configs/omap2plus_defconfig
@@ -81,7 +81,6 @@ CONFIG_PARTITION_ADVANCED=y
CONFIG_BINFMT_MISC=y
CONFIG_CMA=y
CONFIG_ZSMALLOC=m
-CONFIG_ZSMALLOC_PGTABLE_MAPPING=y
CONFIG_NET=y
CONFIG_PACKET=y
CONFIG_UNIX=y
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 0fdbf653b173..4807ca4d52e0 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -20,7 +20,6 @@
* zsmalloc mapping modes
*
* NOTE: These only make a difference when a mapped object spans pages.
- * They also have no effect when ZSMALLOC_PGTABLE_MAPPING is selected.
*/
enum zs_mapmode {
ZS_MM_RW, /* normal read-write mapping */
diff --git a/mm/Kconfig b/mm/Kconfig
index d42423f884a7..390165ffbb0f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -707,19 +707,6 @@ config ZSMALLOC
returned by an alloc(). This handle must be mapped in order to
access the allocated space.
-config ZSMALLOC_PGTABLE_MAPPING
- bool "Use page table mapping to access object in zsmalloc"
- depends on ZSMALLOC=y
- help
- By default, zsmalloc uses a copy-based object mapping method to
- access allocations that span two pages. However, if a particular
- architecture (ex, ARM) performs VM mapping faster than copying,
- then you should select this. This causes zsmalloc to use page table
- mapping rather than copying for object mapping.
-
- You can check speed with zsmalloc benchmark:
- https://github.com/spartacus06/zsmapbench
-
config ZSMALLOC_STAT
bool "Export zsmalloc statistics"
depends on ZSMALLOC
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 918c7b019b3d..cdfaaadea8ff 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -293,11 +293,7 @@ struct zspage {
};
struct mapping_area {
-#ifdef CONFIG_ZSMALLOC_PGTABLE_MAPPING
- struct vm_struct *vm; /* vm area for mapping object that span pages */
-#else
char *vm_buf; /* copy buffer for objects that span pages */
-#endif
char *vm_addr; /* address of kmap_atomic()'ed pages */
enum zs_mapmode vm_mm; /* mapping mode */
};
@@ -1113,54 +1109,6 @@ static struct zspage *find_get_zspage(struct size_class *class)
return zspage;
}
-#ifdef CONFIG_ZSMALLOC_PGTABLE_MAPPING
-static inline int __zs_cpu_up(struct mapping_area *area)
-{
- /*
- * Make sure we don't leak memory if a cpu UP notification
- * and zs_init() race and both call zs_cpu_up() on the same cpu
- */
- if (area->vm)
- return 0;
- area->vm = get_vm_area(PAGE_SIZE * 2, 0);
- if (!area->vm)
- return -ENOMEM;
-
- /*
- * Populate ptes in advance to avoid pte allocation with GFP_KERNEL
- * in non-preemtible context of zs_map_object.
- */
- return apply_to_page_range(&init_mm, (unsigned long)area->vm->addr,
- PAGE_SIZE * 2, NULL, NULL);
-}
-
-static inline void __zs_cpu_down(struct mapping_area *area)
-{
- if (area->vm)
- free_vm_area(area->vm);
- area->vm = NULL;
-}
-
-static inline void *__zs_map_object(struct mapping_area *area,
- struct page *pages[2], int off, int size)
-{
- unsigned long addr = (unsigned long)area->vm->addr;
-
- BUG_ON(map_kernel_range(addr, PAGE_SIZE * 2, PAGE_KERNEL, pages) < 0);
- area->vm_addr = area->vm->addr;
- return area->vm_addr + off;
-}
-
-static inline void __zs_unmap_object(struct mapping_area *area,
- struct page *pages[2], int off, int size)
-{
- unsigned long addr = (unsigned long)area->vm_addr;
-
- unmap_kernel_range(addr, PAGE_SIZE * 2);
-}
-
-#else /* CONFIG_ZSMALLOC_PGTABLE_MAPPING */
-
static inline int __zs_cpu_up(struct mapping_area *area)
{
/*
@@ -1241,8 +1189,6 @@ static void __zs_unmap_object(struct mapping_area *area,
pagefault_enable();
}
-#endif /* CONFIG_ZSMALLOC_PGTABLE_MAPPING */
-
static int zs_cpu_prepare(unsigned int cpu)
{
struct mapping_area *area;
From: Nicholas Piggin <npiggin(a)gmail.com>
[ Upstream commit 8ff00399b153440c1c83e20c43020385b416415b ]
powerpc/64s keeps a counter in the mm which counts bits set in
mm_cpumask as well as other things. This means it can't use generic code
to clear bits out of the mask and doesn't adjust the arch specific
counter.
Add an arch override that allows powerpc/64s to use
clear_tasks_mm_cpumask().
Signed-off-by: Nicholas Piggin <npiggin(a)gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20201126102530.691335-4-npiggin@gmail.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/cpu.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index d8c77bfb6e7e4..e1d10629022a5 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -772,6 +772,10 @@ void __init cpuhp_threads_init(void)
}
#ifdef CONFIG_HOTPLUG_CPU
+#ifndef arch_clear_mm_cpumask_cpu
+#define arch_clear_mm_cpumask_cpu(cpu, mm) cpumask_clear_cpu(cpu, mm_cpumask(mm))
+#endif
+
/**
* clear_tasks_mm_cpumask - Safely clear tasks' mm_cpumask for a CPU
* @cpu: a CPU id
@@ -807,7 +811,7 @@ void clear_tasks_mm_cpumask(int cpu)
t = find_lock_task_mm(p);
if (!t)
continue;
- cpumask_clear_cpu(cpu, mm_cpumask(t->mm));
+ arch_clear_mm_cpumask_cpu(cpu, t->mm);
task_unlock(t);
}
rcu_read_unlock();
--
2.27.0