Dear stable team (aka Greg)
Please backport
a04ac8273665 ("drm/i915/gt: Fixup tgl mocs for PTE tracking")
Note that this needs
4d8a5cfe3b13 ("drm/i915/gt: Initialize reserved and unspecified MOCS indices")
but that one already has a Cc: stable; unfortunately, the bugfix didn't.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
The patch below does not apply to the 5.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 06c5fe9b12dde1b62821f302f177c972bb1c81f9 Mon Sep 17 00:00:00 2001
From: Xiaochen Shen <xiaochen.shen(a)intel.com>
Date: Fri, 4 Dec 2020 14:27:59 +0800
Subject: [PATCH] x86/resctrl: Fix incorrect local bandwidth when mba_sc is
enabled
The MBA software controller (mba_sc) is a feedback loop which
periodically reads MBM counters and tries to restrict the bandwidth
below a user-specified value. It tags along the MBM counter overflow
handler to do the updates with 1s interval in mbm_update() and
update_mba_bw().
The purpose of mbm_update() is to periodically read the MBM counters to
make sure that the hardware counter doesn't wrap around more than once
between user samplings. mbm_update() calls __mon_event_count() for local
bandwidth updating when mba_sc is not enabled, but calls mbm_bw_count()
instead when mba_sc is enabled. __mon_event_count() will not be called
for local bandwidth updating in MBM counter overflow handler, but it is
still called when reading the MBM local bandwidth counter file
'mbm_local_bytes'. The call path is as follows:
  rdtgroup_mondata_show()
    mon_event_read()
      mon_event_count()
        __mon_event_count()
In __mon_event_count(), m->chunks is updated by delta chunks which is
calculated from previous MSR value (m->prev_msr) and current MSR value.
When mba_sc is enabled, m->chunks is also updated in mbm_update() by
mistake by the delta chunks which is calculated from m->prev_bw_msr
instead of m->prev_msr. But m->chunks is not used in update_mba_bw() in
the mba_sc feedback loop.
When the MBM local bandwidth counter file is read, m->chunks has already
been changed unexpectedly by mbm_bw_count(). As a result, an incorrect
local bandwidth counter, calculated from the incorrect m->chunks, is shown
to the user.
Fix this by removing incorrect m->chunks updating in mbm_bw_count() in
MBM counter overflow handler, and always calling __mon_event_count() in
mbm_update() to make sure that the hardware local bandwidth counter
doesn't wrap around.
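
To illustrate the double counting described above, here is a minimal
user-space sketch. It is a model, not the kernel code: overflow_count()
mirrors the delta-chunk calculation for a counter that is only "width" bits
wide, while struct mbm_model and the three path functions are hypothetical
names, and the 24-bit width used in the note below is an assumption made for
the example. The model also leaves out the unconditional
__mon_event_count() call that the fix adds to mbm_update().

```c
#include <stdint.h>

/*
 * Delta between two samples of a hardware counter that is only
 * `width` bits wide, tolerating a single wrap-around.
 */
static uint64_t overflow_count(uint64_t prev, uint64_t cur, unsigned int width)
{
	unsigned int shift = 64 - width;

	return ((cur << shift) - (prev << shift)) >> shift;
}

/* Hypothetical model of the per-RMID state named in the message. */
struct mbm_model {
	uint64_t prev_msr;    /* last value seen by the read path */
	uint64_t prev_bw_msr; /* last value seen by the mba_sc path */
	uint64_t chunks;      /* running total shown to the user */
};

/* Read path: accumulate the delta since prev_msr. */
static void read_path(struct mbm_model *m, uint64_t msr, unsigned int width)
{
	m->chunks += overflow_count(m->prev_msr, msr, width);
	m->prev_msr = msr;
}

/* mba_sc path before the fix: also adds its own delta to chunks. */
static void overflow_path_buggy(struct mbm_model *m, uint64_t msr,
				unsigned int width)
{
	m->chunks += overflow_count(m->prev_bw_msr, msr, width); /* the bug */
	m->prev_bw_msr = msr;
}

/* mba_sc path after the fix: only prev_bw_msr is tracked. */
static void overflow_path_fixed(struct mbm_model *m, uint64_t msr,
				unsigned int width)
{
	(void)width;
	m->prev_bw_msr = msr;
}
```

In this model, if both paths sample the same counter value of 100, the buggy
variant ends up with 200 chunks and the fixed variant with 100. That is
consistent with the roughly doubled bandwidth figure in the test output
below, although the model makes no claim about the exact numbers.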
Test steps:
# Run workload with aggressive memory bandwidth (e.g., 10 GB/s)
git clone https://github.com/intel/intel-cmt-cat && cd intel-cmt-cat && make
./tools/membw/membw -c 0 -b 10000 --read
# Enable MBA software controller
mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
# Create control group c1
mkdir /sys/fs/resctrl/c1
# Set MB throttle to 6 GB/s
echo "MB:0=6000;1=6000" > /sys/fs/resctrl/c1/schemata
# Write PID of the workload to tasks file
echo `pidof membw` > /sys/fs/resctrl/c1/tasks
# Read local bytes counters twice with 1s interval, the calculated
# local bandwidth is not as expected (approaching to 6 GB/s):
local_1=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
sleep 1
local_2=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
echo "local b/w (bytes/s):" `expr $local_2 - $local_1`
Before fix:
local b/w (bytes/s): 11076796416
After fix:
local b/w (bytes/s): 5465014272
Fixes: ba0f26d8529c (x86/intel_rdt/mba_sc: Prepare for feedback loop)
Signed-off-by: Xiaochen Shen <xiaochen.shen(a)intel.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Reviewed-by: Tony Luck <tony.luck(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/1607063279-19437-1-git-send-email-xiaochen.shen@i…
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 54dffe574e67..a98519a3a2e6 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -279,7 +279,6 @@ static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
return;
chunks = mbm_overflow_count(m->prev_bw_msr, tval, rr->r->mbm_width);
- m->chunks += chunks;
cur_bw = (chunks * r->mon_scale) >> 20;
if (m->delta_comp)
@@ -450,15 +449,14 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
}
if (is_mbm_local_enabled()) {
rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
+ __mon_event_count(rmid, &rr);
/*
* Call the MBA software controller only for the
* control groups and when user has enabled
* the software controller explicitly.
*/
- if (!is_mba_sc(NULL))
- __mon_event_count(rmid, &rr);
- else
+ if (is_mba_sc(NULL))
mbm_bw_count(rmid, &rr);
}
}
commit 758c9373d84168dc7d039cf85a0e920046b17b41 upstream
membarrier() does not explicitly sync_core() remote CPUs; instead, it
relies on the assumption that an IPI will result in a core sync. On x86,
this may be true in practice, but it's not architecturally reliable. In
particular, the SDM and APM do not appear to guarantee that interrupt
delivery is serializing. While IRET does serialize, IPI return can
schedule, thereby switching to another task in the same mm that was
sleeping in a syscall. The new task could then SYSRET back to usermode
without ever executing IRET.
Make this more robust by explicitly calling sync_core_before_usermode()
on remote cores. (This also helps people who search the kernel tree for
instances of sync_core() and sync_core_before_usermode() -- one might be
surprised that the core membarrier code doesn't currently show up in
such a search.)
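
For context, this is roughly how user space reaches the code path changed
here. The sketch below assumes a Linux system whose kernel headers provide
<linux/membarrier.h> with the SYNC_CORE commands; membarrier_call() and
sync_core_all_threads() are hypothetical helper names, and the sketch goes
through syscall() directly rather than relying on a libc wrapper.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw system call. */
static int membarrier_call(int cmd, unsigned int flags)
{
	return (int)syscall(__NR_membarrier, cmd, flags);
}

/*
 * Register for, then issue, a core-serializing expedited barrier for
 * all threads of the calling process.
 * Returns 0 on success or a negative errno value.
 */
static int sync_core_all_threads(void)
{
	if (membarrier_call(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0))
		return -errno;
	if (membarrier_call(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0))
		return -errno;
	return 0;
}
```

A typical caller would be a JIT that has just rewritten code and wants every
thread to observe the new instructions. On kernels or architectures without
support, the calls are expected to fail and the helper returns the negative
errno value.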
Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/776b448d5f7bd6b12690707f5ed67bcda7f1d427.16070583…
---
My stable membarrier series depends on commit 2a36ab717e8f
("rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ"). I don't
think it makes much sense to backport that feature, so here's a backport of
the patch that doesn't need it.
kernel/sched/membarrier.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 168479a7d61b..be0ca3306be8 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -30,6 +30,23 @@ static void ipi_mb(void *info)
smp_mb(); /* IPIs should be serializing but paranoid. */
}
+static void ipi_sync_core(void *info)
+{
+ /*
+ * The smp_mb() in membarrier after all the IPIs is supposed to
+ * ensure that memory on remote CPUs that occur before the IPI
+ * become visible to membarrier()'s caller -- see scenario B in
+ * the big comment at the top of this file.
+ *
+ * A sync_core() would provide this guarantee, but
+ * sync_core_before_usermode() might end up being deferred until
+ * after membarrier()'s smp_mb().
+ */
+ smp_mb(); /* IPIs should be serializing but paranoid. */
+
+ sync_core_before_usermode();
+}
+
static void ipi_sync_rq_state(void *info)
{
struct mm_struct *mm = (struct mm_struct *) info;
@@ -134,6 +151,7 @@ static int membarrier_private_expedited(int flags)
int cpu;
cpumask_var_t tmpmask;
struct mm_struct *mm = current->mm;
+ smp_call_func_t ipi_func = ipi_mb;
if (flags & MEMBARRIER_FLAG_SYNC_CORE) {
if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
@@ -141,6 +159,7 @@ static int membarrier_private_expedited(int flags)
if (!(atomic_read(&mm->membarrier_state) &
MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
return -EPERM;
+ ipi_func = ipi_sync_core;
} else {
if (!(atomic_read(&mm->membarrier_state) &
MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY))
@@ -181,7 +200,7 @@ static int membarrier_private_expedited(int flags)
rcu_read_unlock();
preempt_disable();
- smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
+ smp_call_function_many(tmpmask, ipi_func, NULL, 1);
preempt_enable();
free_cpumask_var(tmpmask);
--
2.29.2
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ec9d78070de986ecf581ea204fd322af4d2477ec Mon Sep 17 00:00:00 2001
From: Fangrui Song <maskray(a)google.com>
Date: Thu, 29 Oct 2020 11:19:51 -0700
Subject: [PATCH] arm64: Change .weak to SYM_FUNC_START_WEAK_PI for
arch/arm64/lib/mem*.S
Commit 39d114ddc682 ("arm64: add KASAN support") added .weak directives to
arch/arm64/lib/mem*.S instead of changing the existing SYM_FUNC_START_PI
macros. This can lead to the assembly snippet `.weak memcpy ... .globl
memcpy` which will produce a STB_WEAK memcpy with GNU as but STB_GLOBAL
memcpy with LLVM's integrated assembler before LLVM 12. LLVM 12 (since
https://reviews.llvm.org/D90108) will error on such an overridden symbol
binding.
Use the appropriate SYM_FUNC_START_WEAK_PI instead.
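
As a rough C-level analogy for the symbol binding involved, the sketch below
defines a function with weak binding using the GCC/Clang weak attribute on
an ELF target. The name demo_memset() is hypothetical. The patch itself
deals with assembler macros; this only shows what a weak definition means to
the linker.

```c
#include <stddef.h>

/*
 * Weak definition (STB_WEAK in the ELF symbol table): if another object
 * file provides a strong definition of demo_memset(), the linker picks
 * that one; otherwise this definition is used.
 */
__attribute__((weak)) void *demo_memset(void *dst, int c, size_t n)
{
	unsigned char *p = dst;

	while (n--)
		*p++ = (unsigned char)c;
	return dst;
}
```

When this file is linked on its own, the weak definition is simply the one
that runs.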
Fixes: 39d114ddc682 ("arm64: add KASAN support")
Reported-by: Sami Tolvanen <samitolvanen(a)google.com>
Signed-off-by: Fangrui Song <maskray(a)google.com>
Tested-by: Sami Tolvanen <samitolvanen(a)google.com>
Tested-by: Nick Desaulniers <ndesaulniers(a)google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20201029181951.1866093-1-maskray@google.com
Signed-off-by: Will Deacon <will(a)kernel.org>
diff --git a/arch/arm64/lib/memcpy.S b/arch/arm64/lib/memcpy.S
index e0bf83d556f2..dc8d2a216a6e 100644
--- a/arch/arm64/lib/memcpy.S
+++ b/arch/arm64/lib/memcpy.S
@@ -56,9 +56,8 @@
stp \reg1, \reg2, [\ptr], \val
.endm
- .weak memcpy
SYM_FUNC_START_ALIAS(__memcpy)
-SYM_FUNC_START_PI(memcpy)
+SYM_FUNC_START_WEAK_PI(memcpy)
#include "copy_template.S"
ret
SYM_FUNC_END_PI(memcpy)
diff --git a/arch/arm64/lib/memmove.S b/arch/arm64/lib/memmove.S
index 02cda2e33bde..1035dce4bdaf 100644
--- a/arch/arm64/lib/memmove.S
+++ b/arch/arm64/lib/memmove.S
@@ -45,9 +45,8 @@ C_h .req x12
D_l .req x13
D_h .req x14
- .weak memmove
SYM_FUNC_START_ALIAS(__memmove)
-SYM_FUNC_START_PI(memmove)
+SYM_FUNC_START_WEAK_PI(memmove)
cmp dstin, src
b.lo __memcpy
add tmp1, src, count
diff --git a/arch/arm64/lib/memset.S b/arch/arm64/lib/memset.S
index 77c3c7ba0084..a9c1c9a01ea9 100644
--- a/arch/arm64/lib/memset.S
+++ b/arch/arm64/lib/memset.S
@@ -42,9 +42,8 @@ dst .req x8
tmp3w .req w9
tmp3 .req x9
- .weak memset
SYM_FUNC_START_ALIAS(__memset)
-SYM_FUNC_START_PI(memset)
+SYM_FUNC_START_WEAK_PI(memset)
mov dst, dstin /* Preserve return value. */
and A_lw, val, #255
orr A_lw, A_lw, A_lw, lsl #8
Dear stable kernel maintainers,
Please consider applying the following backports of commit
e0d5896bd356 ("arm64: lse: fix LSE atomics with LLVM's integrated
assembler") which first landed in v5.6-rc1 and was already picked up
into linux-5.4.y as f68668292496 in v5.4.22 (adjusted for a conflict
due to commit addfc38672c7 ("arm64: atomics: avoid out-of-line ll/sc
atomics") which landed in v5.4-rc1).
Also contains a fix for that first patch which cherry-picks cleanly,
commit dd1f6308b28e ("arm64: lse: Fix LSE atomics with LLVM").
The attached patches allow for Android and CrOS to build with
LLVM_IAS=1 for arm64 for v4.19.y (modulo one small patch that I will
send tomorrow).
--
Thanks,
~Nick Desaulniers
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a34a0a632dd991a371fec56431d73279f9c54029 Mon Sep 17 00:00:00 2001
From: Xin Xiong <xiongx18(a)fudan.edu.cn>
Date: Sun, 19 Jul 2020 23:45:45 +0800
Subject: [PATCH] drm: fix drm_dp_mst_port refcount leaks in
drm_dp_mst_allocate_vcpi
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
drm_dp_mst_allocate_vcpi() invokes
drm_dp_mst_topology_get_port_validated(), which increases the refcount
of the "port".
These reference counting issues occur in two separate error handling
paths. When "slots" is less than 0, or when drm_dp_init_vcpi() returns a
negative value, the function forgets to drop the reference taken by
drm_dp_mst_topology_get_port_validated(), which results in a refcount
leak.
Fix these issues by pulling up the error handling when "slots" is less
than 0, and calling drm_dp_mst_topology_put_port() before termination
when drm_dp_init_vcpi() returns a negative value.
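
The two error paths can be modelled in a few lines of user-space C. All
names below (struct demo_port, demo_port_get_validated(), demo_port_put(),
demo_init_vcpi(), demo_allocate_vcpi()) are hypothetical stand-ins, and the
failure condition in demo_init_vcpi() is invented for the example. As a
simplifying assumption the model also drops the reference on success, so
that a balanced refcount is easy to check; it says nothing about the real
function's success path.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_port {
	int refcount;
	bool valid;
	int vcpi;
};

/* Takes a reference if the port is still valid, like a "get validated". */
static struct demo_port *demo_port_get_validated(struct demo_port *port)
{
	if (!port || !port->valid)
		return NULL;
	port->refcount++;
	return port;
}

static void demo_port_put(struct demo_port *port)
{
	port->refcount--;
}

/* Stand-in for the init step: fails for more than 63 slots. */
static int demo_init_vcpi(struct demo_port *port, int slots)
{
	if (slots > 63)
		return -1;
	port->vcpi = slots;
	return 0;
}

static bool demo_allocate_vcpi(struct demo_port *port, int slots)
{
	if (slots < 0)		/* checked before any reference is taken */
		return false;

	port = demo_port_get_validated(port);
	if (!port)
		return false;

	if (demo_init_vcpi(port, slots)) {
		demo_port_put(port);	/* drop the reference on the error path */
		return false;
	}

	demo_port_put(port);	/* model-only simplification, see lead-in */
	return true;
}
```

With this ordering, every return path leaves the refcount where it started.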
Fixes: 1e797f556c61 ("drm/dp: Split drm_dp_mst_allocate_vcpi")
Cc: <stable(a)vger.kernel.org> # v4.12+
Signed-off-by: Xiyu Yang <xiyuyang19(a)fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf(a)gmail.com>
Signed-off-by: Xin Xiong <xiongx18(a)fudan.edu.cn>
Reviewed-by: Lyude Paul <lyude(a)redhat.com>
Signed-off-by: Lyude Paul <lyude(a)redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200719154545.GA41231@xin-vi…
diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c b/drivers/gpu/drm/drm_dp_mst_topology.c
index 09b32289497e..b23cb2fec3f3 100644
--- a/drivers/gpu/drm/drm_dp_mst_topology.c
+++ b/drivers/gpu/drm/drm_dp_mst_topology.c
@@ -4308,11 +4308,11 @@ bool drm_dp_mst_allocate_vcpi(struct drm_dp_mst_topology_mgr *mgr,
{
int ret;
- port = drm_dp_mst_topology_get_port_validated(mgr, port);
- if (!port)
+ if (slots < 0)
return false;
- if (slots < 0)
+ port = drm_dp_mst_topology_get_port_validated(mgr, port);
+ if (!port)
return false;
if (port->vcpi.vcpi > 0) {
@@ -4328,6 +4328,7 @@ bool drm_dp_mst_allocate_vcpi(struct drm_dp_mst_topology_mgr *mgr,
if (ret) {
DRM_DEBUG_KMS("failed to init vcpi slots=%d max=63 ret=%d\n",
DIV_ROUND_UP(pbn, mgr->pbn_div), ret);
+ drm_dp_mst_topology_put_port(port);
goto out;
}
DRM_DEBUG_KMS("initing vcpi for pbn=%d slots=%d\n",
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4387b3dbb079d482d3c2b43a703ceed4dd27ed28 Mon Sep 17 00:00:00 2001
From: Brant Merryman <brant.merryman(a)silabs.com>
Date: Fri, 26 Jun 2020 04:22:58 +0000
Subject: [PATCH] USB: serial: cp210x: enable usb generic throttle/unthrottle
Assign the generic throttle and unthrottle functions to .throttle and
.unthrottle in the driver structure to prevent data loss that can
otherwise occur if the host does not enable USB throttling.
Signed-off-by: Brant Merryman <brant.merryman(a)silabs.com>
Co-developed-by: Phu Luu <phu.luu(a)silabs.com>
Signed-off-by: Phu Luu <phu.luu(a)silabs.com>
Link: https://lore.kernel.org/r/57401AF3-9961-461F-95E1-F8AFC2105F5E@silabs.com
[ johan: fix up tags ]
Fixes: 39a66b8d22a3 ("[PATCH] USB: CP2101 Add support for flow control")
Cc: stable <stable(a)vger.kernel.org> # 2.6.12
Signed-off-by: Johan Hovold <johan(a)kernel.org>
diff --git a/drivers/usb/serial/cp210x.c b/drivers/usb/serial/cp210x.c
index f5143eedbc48..bcceb4ad8be0 100644
--- a/drivers/usb/serial/cp210x.c
+++ b/drivers/usb/serial/cp210x.c
@@ -272,6 +272,8 @@ static struct usb_serial_driver cp210x_device = {
.break_ctl = cp210x_break_ctl,
.set_termios = cp210x_set_termios,
.tx_empty = cp210x_tx_empty,
+ .throttle = usb_serial_generic_throttle,
+ .unthrottle = usb_serial_generic_unthrottle,
.tiocmget = cp210x_tiocmget,
.tiocmset = cp210x_tiocmset,
.attach = cp210x_attach,
From: Johannes Weiner <hannes(a)cmpxchg.org>
[ Upstream commit a983b5ebee57209c99f68c8327072f25e0e6e3da ]
mm: memcontrol: fix excessive complexity in memory.stat reporting
We've seen memory.stat reads in top-level cgroups take up to fourteen
seconds during a userspace bug that created tens of thousands of ghost
cgroups pinned by lingering page cache.
Even with a more reasonable number of cgroups, aggregating memory.stat
is unnecessarily heavy. The complexity is this:
nr_cgroups * nr_stat_items * nr_possible_cpus
where the stat items are ~70 at this point. With 128 cgroups and 128
CPUs - decent, not enormous setups - reading the top-level memory.stat
has to aggregate over a million per-cpu counters. This doesn't scale.
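
The figure quoted above follows directly from the formula. A tiny helper
makes the arithmetic explicit; memory_stat_reads() is a hypothetical name
and 70 is the approximate item count given in the text.

```c
/* Number of per-cpu counters touched by one top-level memory.stat read. */
static unsigned long memory_stat_reads(unsigned long nr_cgroups,
				       unsigned long nr_stat_items,
				       unsigned long nr_possible_cpus)
{
	return nr_cgroups * nr_stat_items * nr_possible_cpus;
}
```

For 128 cgroups, about 70 items and 128 CPUs this gives 128 * 70 * 128 =
1,146,880 counter reads.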
Instead of spreading the source of truth across all CPUs, use the
per-cpu counters merely to batch updates to shared atomic counters.
This is the same as the per-cpu stocks we use for charging memory to the
shared atomic page_counters, and also the way the global vmstat counters
are implemented.
Vmstat has elaborate spilling thresholds that depend on the number of
CPUs, amount of memory, and memory pressure - carefully balancing the
cost of counter updates with the amount of per-cpu error. That's
because the vmstat counters are system-wide, but also used for decisions
inside the kernel (e.g. NR_FREE_PAGES in the allocator). Neither is
true for the memory controller.
Use the same static batch size we already use for page_counter updates
during charging. The per-cpu error in the stats will be 128k, which is
an acceptable ratio of cores to memory accounting granularity.
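
The batching scheme can be sketched in user space as follows. This is a
simplified model under stated assumptions: the per-cpu storage is a plain
array indexed by a caller-supplied CPU number, DEMO_NR_CPUS is arbitrary,
and all demo_* names are hypothetical. Only the batch size of 32 is taken
from the patch.

```c
#include <stdatomic.h>
#include <stdlib.h>

#define DEMO_NR_CPUS		4	/* hypothetical CPU count for the model */
#define DEMO_CHARGE_BATCH	32	/* same value as MEMCG_CHARGE_BATCH */

struct demo_counter {
	atomic_long shared;		/* what readers look at */
	long percpu[DEMO_NR_CPUS];	/* pending, not yet published deltas */
};

/*
 * Add val on behalf of `cpu`. The shared counter is only touched once
 * the pending per-cpu delta exceeds the batch size.
 */
static void demo_mod_state(struct demo_counter *c, int cpu, int val)
{
	long x = val + c->percpu[cpu];

	if (labs(x) > DEMO_CHARGE_BATCH) {
		atomic_fetch_add(&c->shared, x);
		x = 0;
	}
	c->percpu[cpu] = x;
}

/* Reader: a single atomic load, independent of the number of CPUs. */
static long demo_read_state(struct demo_counter *c)
{
	long x = atomic_load(&c->shared);

	return x < 0 ? 0 : x;
}
```

Each CPU can hold back at most 32 pages worth of updates. Assuming 4 KiB
pages, that is 32 * 4 KiB = 128 KiB, which matches the 128k per-cpu error
mentioned above.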
[hannes(a)cmpxchg.org: fix warning in __this_cpu_xchg() calls]
Link: http://lkml.kernel.org/r/20171201135750.GB8097@cmpxchg.org
Link: http://lkml.kernel.org/r/20171103153336.24044-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: stable(a)vger.kernel.org c9019e9: mm: memcontrol: eliminate raw access to stat and event counters
Cc: stable(a)vger.kernel.org 2845426: mm: memcontrol: implement lruvec stat functions on top of each other
Cc: stable(a)vger.kernel.org
[shaoyi(a)amazon.com: resolved the conflict brought by commit 17ffa29c355658c8e9b19f56cbf0388500ca7905 in mm/memcontrol.c by contextual fix]
Signed-off-by: Shaoying Xu <shaoyi(a)amazon.com>
---
The excessive complexity in memory.stat reporting was fixed in v4.16, but
the fix does not appear to have made it to the 4.14 stable tree. Backporting
this patch hits a small conflict caused by commit
17ffa29c355658c8e9b19f56cbf0388500ca7905 in free_mem_cgroup_per_node_info()
in mm/memcontrol.c, which can be resolved with a contextual fix.
include/linux/memcontrol.h | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------
mm/memcontrol.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------------------------
2 files changed, 113 insertions(+), 84 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1ffc54ac4cc9..882046863581 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -108,7 +108,10 @@ struct lruvec_stat {
*/
struct mem_cgroup_per_node {
struct lruvec lruvec;
- struct lruvec_stat __percpu *lruvec_stat;
+
+ struct lruvec_stat __percpu *lruvec_stat_cpu;
+ atomic_long_t lruvec_stat[NR_VM_NODE_STAT_ITEMS];
+
unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];
@@ -227,10 +230,10 @@ struct mem_cgroup {
spinlock_t move_lock;
struct task_struct *move_lock_task;
unsigned long move_lock_flags;
- /*
- * percpu counter.
- */
- struct mem_cgroup_stat_cpu __percpu *stat;
+
+ struct mem_cgroup_stat_cpu __percpu *stat_cpu;
+ atomic_long_t stat[MEMCG_NR_STAT];
+ atomic_long_t events[MEMCG_NR_EVENTS];
unsigned long socket_pressure;
@@ -265,6 +268,12 @@ struct mem_cgroup {
/* WARNING: nodeinfo must be the last member here */
};
+/*
+ * size of first charge trial. "32" comes from vmscan.c's magic value.
+ * TODO: maybe necessary to use big numbers in big irons.
+ */
+#define MEMCG_CHARGE_BATCH 32U
+
extern struct mem_cgroup *root_mem_cgroup;
static inline bool mem_cgroup_disabled(void)
@@ -485,32 +494,38 @@ void unlock_page_memcg(struct page *page);
static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
int idx)
{
- long val = 0;
- int cpu;
-
- for_each_possible_cpu(cpu)
- val += per_cpu(memcg->stat->count[idx], cpu);
-
- if (val < 0)
- val = 0;
-
- return val;
+ long x = atomic_long_read(&memcg->stat[idx]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
}
/* idx can be of type enum memcg_stat_item or node_stat_item */
static inline void __mod_memcg_state(struct mem_cgroup *memcg,
int idx, int val)
{
- if (!mem_cgroup_disabled())
- __this_cpu_add(memcg->stat->count[idx], val);
+ long x;
+
+ if (mem_cgroup_disabled())
+ return;
+
+ x = val + __this_cpu_read(memcg->stat_cpu->count[idx]);
+ if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+ atomic_long_add(x, &memcg->stat[idx]);
+ x = 0;
+ }
+ __this_cpu_write(memcg->stat_cpu->count[idx], x);
}
/* idx can be of type enum memcg_stat_item or node_stat_item */
static inline void mod_memcg_state(struct mem_cgroup *memcg,
int idx, int val)
{
- if (!mem_cgroup_disabled())
- this_cpu_add(memcg->stat->count[idx], val);
+ preempt_disable();
+ __mod_memcg_state(memcg, idx, val);
+ preempt_enable();
}
/**
@@ -548,26 +563,25 @@ static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
enum node_stat_item idx)
{
struct mem_cgroup_per_node *pn;
- long val = 0;
- int cpu;
+ long x;
if (mem_cgroup_disabled())
return node_page_state(lruvec_pgdat(lruvec), idx);
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- for_each_possible_cpu(cpu)
- val += per_cpu(pn->lruvec_stat->count[idx], cpu);
-
- if (val < 0)
- val = 0;
-
- return val;
+ x = atomic_long_read(&pn->lruvec_stat[idx]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
}
static inline void __mod_lruvec_state(struct lruvec *lruvec,
enum node_stat_item idx, int val)
{
struct mem_cgroup_per_node *pn;
+ long x;
/* Update node */
__mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
@@ -581,7 +595,12 @@ static inline void __mod_lruvec_state(struct lruvec *lruvec,
__mod_memcg_state(pn->memcg, idx, val);
/* Update lruvec */
- __this_cpu_add(pn->lruvec_stat->count[idx], val);
+ x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
+ if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+ atomic_long_add(x, &pn->lruvec_stat[idx]);
+ x = 0;
+ }
+ __this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
}
static inline void mod_lruvec_state(struct lruvec *lruvec,
@@ -624,16 +643,25 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
static inline void __count_memcg_events(struct mem_cgroup *memcg,
int idx, unsigned long count)
{
- if (!mem_cgroup_disabled())
- __this_cpu_add(memcg->stat->events[idx], count);
+ unsigned long x;
+
+ if (mem_cgroup_disabled())
+ return;
+
+ x = count + __this_cpu_read(memcg->stat_cpu->events[idx]);
+ if (unlikely(x > MEMCG_CHARGE_BATCH)) {
+ atomic_long_add(x, &memcg->events[idx]);
+ x = 0;
+ }
+ __this_cpu_write(memcg->stat_cpu->events[idx], x);
}
-/* idx can be of type enum memcg_event_item or vm_event_item */
static inline void count_memcg_events(struct mem_cgroup *memcg,
int idx, unsigned long count)
{
- if (!mem_cgroup_disabled())
- this_cpu_add(memcg->stat->events[idx], count);
+ preempt_disable();
+ __count_memcg_events(memcg, idx, count);
+ preempt_enable();
}
/* idx can be of type enum memcg_event_item or vm_event_item */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index eba9dc4795b5..4e763cdccb33 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -542,39 +542,10 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
return mz;
}
-/*
- * Return page count for single (non recursive) @memcg.
- *
- * Implementation Note: reading percpu statistics for memcg.
- *
- * Both of vmstat[] and percpu_counter has threshold and do periodic
- * synchronization to implement "quick" read. There are trade-off between
- * reading cost and precision of value. Then, we may have a chance to implement
- * a periodic synchronization of counter in memcg's counter.
- *
- * But this _read() function is used for user interface now. The user accounts
- * memory usage by memory cgroup and he _always_ requires exact value because
- * he accounts memory. Even if we provide quick-and-fuzzy read, we always
- * have to visit all online cpus and make sum. So, for now, unnecessary
- * synchronization is not implemented. (just implemented for cpu hotplug)
- *
- * If there are kernel internal actions which can make use of some not-exact
- * value, and reading all cpu value can be performance bottleneck in some
- * common workload, threshold and synchronization as vmstat[] should be
- * implemented.
- *
- * The parameter idx can be of type enum memcg_event_item or vm_event_item.
- */
-
static unsigned long memcg_sum_events(struct mem_cgroup *memcg,
int event)
{
- unsigned long val = 0;
- int cpu;
-
- for_each_possible_cpu(cpu)
- val += per_cpu(memcg->stat->events[event], cpu);
- return val;
+ return atomic_long_read(&memcg->events[event]);
}
static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
@@ -606,7 +577,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
nr_pages = -nr_pages; /* for event */
}
- __this_cpu_add(memcg->stat->nr_page_events, nr_pages);
+ __this_cpu_add(memcg->stat_cpu->nr_page_events, nr_pages);
}
unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
@@ -642,8 +613,8 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
{
unsigned long val, next;
- val = __this_cpu_read(memcg->stat->nr_page_events);
- next = __this_cpu_read(memcg->stat->targets[target]);
+ val = __this_cpu_read(memcg->stat_cpu->nr_page_events);
+ next = __this_cpu_read(memcg->stat_cpu->targets[target]);
/* from time_after() in jiffies.h */
if ((long)(next - val) < 0) {
switch (target) {
@@ -659,7 +630,7 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
default:
break;
}
- __this_cpu_write(memcg->stat->targets[target], next);
+ __this_cpu_write(memcg->stat_cpu->targets[target], next);
return true;
}
return false;
@@ -1726,11 +1697,6 @@ void unlock_page_memcg(struct page *page)
}
EXPORT_SYMBOL(unlock_page_memcg);
-/*
- * size of first charge trial. "32" comes from vmscan.c's magic value.
- * TODO: maybe necessary to use big numbers in big irons.
- */
-#define CHARGE_BATCH 32U
struct memcg_stock_pcp {
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
@@ -1758,7 +1724,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
unsigned long flags;
bool ret = false;
- if (nr_pages > CHARGE_BATCH)
+ if (nr_pages > MEMCG_CHARGE_BATCH)
return ret;
local_irq_save(flags);
@@ -1827,7 +1793,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
}
stock->nr_pages += nr_pages;
- if (stock->nr_pages > CHARGE_BATCH)
+ if (stock->nr_pages > MEMCG_CHARGE_BATCH)
drain_stock(stock);
local_irq_restore(flags);
@@ -1877,9 +1843,44 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
static int memcg_hotplug_cpu_dead(unsigned int cpu)
{
struct memcg_stock_pcp *stock;
+ struct mem_cgroup *memcg;
stock = &per_cpu(memcg_stock, cpu);
drain_stock(stock);
+
+ for_each_mem_cgroup(memcg) {
+ int i;
+
+ for (i = 0; i < MEMCG_NR_STAT; i++) {
+ int nid;
+ long x;
+
+ x = this_cpu_xchg(memcg->stat_cpu->count[i], 0);
+ if (x)
+ atomic_long_add(x, &memcg->stat[i]);
+
+ if (i >= NR_VM_NODE_STAT_ITEMS)
+ continue;
+
+ for_each_node(nid) {
+ struct mem_cgroup_per_node *pn;
+
+ pn = mem_cgroup_nodeinfo(memcg, nid);
+ x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
+ if (x)
+ atomic_long_add(x, &pn->lruvec_stat[i]);
+ }
+ }
+
+ for (i = 0; i < MEMCG_NR_EVENTS; i++) {
+ long x;
+
+ x = this_cpu_xchg(memcg->stat_cpu->events[i], 0);
+ if (x)
+ atomic_long_add(x, &memcg->events[i]);
+ }
+ }
+
return 0;
}
@@ -1900,7 +1901,7 @@ static void high_work_func(struct work_struct *work)
struct mem_cgroup *memcg;
memcg = container_of(work, struct mem_cgroup, high_work);
- reclaim_high(memcg, CHARGE_BATCH, GFP_KERNEL);
+ reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
}
/*
@@ -1924,7 +1925,7 @@ void mem_cgroup_handle_over_high(void)
static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int nr_pages)
{
- unsigned int batch = max(CHARGE_BATCH, nr_pages);
+ unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct mem_cgroup *mem_over_limit;
struct page_counter *counter;
@@ -4203,8 +4204,8 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return 1;
- pn->lruvec_stat = alloc_percpu(struct lruvec_stat);
- if (!pn->lruvec_stat) {
+ pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
+ if (!pn->lruvec_stat_cpu) {
kfree(pn);
return 1;
}
@@ -4225,7 +4226,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return;
- free_percpu(pn->lruvec_stat);
+ free_percpu(pn->lruvec_stat_cpu);
kfree(pn);
}
@@ -4235,7 +4236,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
for_each_node(node)
free_mem_cgroup_per_node_info(memcg, node);
- free_percpu(memcg->stat);
+ free_percpu(memcg->stat_cpu);
kfree(memcg);
}
@@ -4264,8 +4265,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
if (memcg->id.id < 0)
goto fail;
- memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
- if (!memcg->stat)
+ memcg->stat_cpu = alloc_percpu(struct mem_cgroup_stat_cpu);
+ if (!memcg->stat_cpu)
goto fail;
for_each_node(node)
@@ -5686,7 +5687,7 @@ static void uncharge_batch(const struct uncharge_gather *ug)
__mod_memcg_state(ug->memcg, MEMCG_RSS_HUGE, -ug->nr_huge);
__mod_memcg_state(ug->memcg, NR_SHMEM, -ug->nr_shmem);
__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
- __this_cpu_add(ug->memcg->stat->nr_page_events, nr_pages);
+ __this_cpu_add(ug->memcg->stat_cpu->nr_page_events, nr_pages);
memcg_check_events(ug->memcg, ug->dummy_page);
local_irq_restore(flags);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 34c0f6f2695a2db81e09a3ab7bdb2853f45d4d3d Mon Sep 17 00:00:00 2001
From: "Maciej S. Szmigiero" <maciej.szmigiero(a)oracle.com>
Date: Sat, 5 Dec 2020 01:48:08 +0100
Subject: [PATCH] KVM: mmu: Fix SPTE encoding of MMIO generation upper half
Commit cae7ed3c2cb0 ("KVM: x86: Refactor the MMIO SPTE generation handling")
cleaned up the computation of MMIO generation SPTE masks; however, it
introduced a bug in how the upper part was encoded:
SPTE bits 52-61 were supposed to contain bits 10-19 of the current
generation number, however a missing shift encoded bits 1-10 there instead
(mostly duplicating the lower part of the encoded generation number that
then consisted of bits 1-9).
In the meantime, the upper part was shrunk by one bit and moved by
subsequent commits to become an upper half of the encoded generation number
(bits 9-17 of bits 0-17 encoded in a SPTE).
In addition to the above, commit 56871d444bc4 ("KVM: x86: fix overlap between SPTE_MMIO_MASK and generation")
has changed the SPTE bit range assigned to encode the generation number and
the total number of bits encoded but did not update them in the comment
attached to their defines, nor in the KVM MMU doc.
Let's do it here, too, since it is too trivial a thing to warrant a
separate commit.
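
The encoding can be illustrated with a small user-space model. It assumes
the layout described above: an 18-bit generation whose bits 0-8 go to SPTE
bits 3-11 and whose bits 9-17 go to the upper range. The SPTE range 52-60
used for the upper half is an assumption of this sketch, and all DEMO_* and
demo_* names are hypothetical. The "buggy" variant models shifting by the
start bit of the range instead of by the distance between the source and
destination bit positions.

```c
#include <stdint.h>

/* Bit ranges; the 52-60 range for the high half is an assumption. */
#define DEMO_GEN_LOW_START	3
#define DEMO_GEN_LOW_END	11
#define DEMO_GEN_HIGH_START	52
#define DEMO_GEN_HIGH_END	60

#define DEMO_BITS_MASK(start, end) \
	(((~0ULL) >> (63 - (end))) & ((~0ULL) << (start)))

#define DEMO_GEN_LOW_MASK \
	DEMO_BITS_MASK(DEMO_GEN_LOW_START, DEMO_GEN_LOW_END)
#define DEMO_GEN_HIGH_MASK \
	DEMO_BITS_MASK(DEMO_GEN_HIGH_START, DEMO_GEN_HIGH_END)

#define DEMO_GEN_LOW_BITS	(DEMO_GEN_LOW_END - DEMO_GEN_LOW_START + 1)

/* Everything is derived from the ranges: shifts are 3 and 52 - 9 = 43. */
#define DEMO_GEN_LOW_SHIFT	(DEMO_GEN_LOW_START - 0)
#define DEMO_GEN_HIGH_SHIFT	(DEMO_GEN_HIGH_START - DEMO_GEN_LOW_BITS)

static uint64_t demo_gen_to_spte(uint64_t gen)
{
	uint64_t spte;

	spte = (gen << DEMO_GEN_LOW_SHIFT) & DEMO_GEN_LOW_MASK;
	spte |= (gen << DEMO_GEN_HIGH_SHIFT) & DEMO_GEN_HIGH_MASK;
	return spte;
}

static uint64_t demo_spte_to_gen(uint64_t spte)
{
	uint64_t gen;

	gen = (spte & DEMO_GEN_LOW_MASK) >> DEMO_GEN_LOW_SHIFT;
	gen |= (spte & DEMO_GEN_HIGH_MASK) >> DEMO_GEN_HIGH_SHIFT;
	return gen;
}

/* Model of the bug: the upper half is shifted by the start bit (52). */
static uint64_t demo_gen_to_spte_buggy(uint64_t gen)
{
	uint64_t spte;

	spte = (gen << DEMO_GEN_LOW_START) & DEMO_GEN_LOW_MASK;
	spte |= (gen << DEMO_GEN_HIGH_START) & DEMO_GEN_HIGH_MASK;
	return spte;
}
```

In this model a generation with only bit 9 set (0x200) is encoded as SPTE
bit 52 by the fixed variant, while the buggy variant loses it entirely,
because the upper range ends up holding a copy of generation bits 0-8.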
Fixes: cae7ed3c2cb0 ("KVM: x86: Refactor the MMIO SPTE generation handling")
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero(a)oracle.com>
Message-Id: <156700708db2a5296c5ed7a8b9ac71f1e9765c85.1607129096.git.maciej.szmigiero(a)oracle.com>
Cc: stable(a)vger.kernel.org
[Reorganize macros so that everything is computed from the bit ranges. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst
index 1c030dbac7c4..5bfe28b0728e 100644
--- a/Documentation/virt/kvm/mmu.rst
+++ b/Documentation/virt/kvm/mmu.rst
@@ -455,7 +455,7 @@ If the generation number of the spte does not equal the global generation
number, it will ignore the cached MMIO information and handle the page
fault through the slow path.
-Since only 19 bits are used to store generation-number on mmio spte, all
+Since only 18 bits are used to store generation-number on mmio spte, all
pages are zapped when there is an overflow.
Unfortunately, a single memory access might access kvm_memslots(kvm) multiple
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index fcac2cac78fe..c51ad544f25b 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -40,8 +40,8 @@ static u64 generation_mmio_spte_mask(u64 gen)
WARN_ON(gen & ~MMIO_SPTE_GEN_MASK);
BUILD_BUG_ON((MMIO_SPTE_GEN_HIGH_MASK | MMIO_SPTE_GEN_LOW_MASK) & SPTE_SPECIAL_MASK);
- mask = (gen << MMIO_SPTE_GEN_LOW_START) & MMIO_SPTE_GEN_LOW_MASK;
- mask |= (gen << MMIO_SPTE_GEN_HIGH_START) & MMIO_SPTE_GEN_HIGH_MASK;
+ mask = (gen << MMIO_SPTE_GEN_LOW_SHIFT) & MMIO_SPTE_GEN_LOW_MASK;
+ mask |= (gen << MMIO_SPTE_GEN_HIGH_SHIFT) & MMIO_SPTE_GEN_HIGH_MASK;
return mask;
}
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 5c75a451c000..2b3a30bd38b0 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -56,11 +56,11 @@
#define SPTE_MMU_WRITEABLE (1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
/*
- * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
+ * Due to limited space in PTEs, the MMIO generation is a 18 bit subset of
* the memslots generation and is derived as follows:
*
* Bits 0-8 of the MMIO generation are propagated to spte bits 3-11
- * Bits 9-18 of the MMIO generation are propagated to spte bits 52-61
+ * Bits 9-17 of the MMIO generation are propagated to spte bits 54-62
*
* The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
* the MMIO generation number, as doing so would require stealing a bit from
@@ -69,18 +69,29 @@
* requires a full MMU zap). The flag is instead explicitly queried when
* checking for MMIO spte cache hits.
*/
-#define MMIO_SPTE_GEN_MASK GENMASK_ULL(17, 0)
#define MMIO_SPTE_GEN_LOW_START 3
#define MMIO_SPTE_GEN_LOW_END 11
-#define MMIO_SPTE_GEN_LOW_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
- MMIO_SPTE_GEN_LOW_START)
#define MMIO_SPTE_GEN_HIGH_START PT64_SECOND_AVAIL_BITS_SHIFT
#define MMIO_SPTE_GEN_HIGH_END 62
+
+#define MMIO_SPTE_GEN_LOW_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
+ MMIO_SPTE_GEN_LOW_START)
#define MMIO_SPTE_GEN_HIGH_MASK GENMASK_ULL(MMIO_SPTE_GEN_HIGH_END, \
MMIO_SPTE_GEN_HIGH_START)
+#define MMIO_SPTE_GEN_LOW_BITS (MMIO_SPTE_GEN_LOW_END - MMIO_SPTE_GEN_LOW_START + 1)
+#define MMIO_SPTE_GEN_HIGH_BITS (MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1)
+
+/* remember to adjust the comment above as well if you change these */
+static_assert(MMIO_SPTE_GEN_LOW_BITS == 9 && MMIO_SPTE_GEN_HIGH_BITS == 9);
+
+#define MMIO_SPTE_GEN_LOW_SHIFT (MMIO_SPTE_GEN_LOW_START - 0)
+#define MMIO_SPTE_GEN_HIGH_SHIFT (MMIO_SPTE_GEN_HIGH_START - MMIO_SPTE_GEN_LOW_BITS)
+
+#define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
+
extern u64 __read_mostly shadow_nx_mask;
extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
extern u64 __read_mostly shadow_user_mask;
@@ -228,8 +239,8 @@ static inline u64 get_mmio_spte_generation(u64 spte)
{
u64 gen;
- gen = (spte & MMIO_SPTE_GEN_LOW_MASK) >> MMIO_SPTE_GEN_LOW_START;
- gen |= (spte & MMIO_SPTE_GEN_HIGH_MASK) >> MMIO_SPTE_GEN_HIGH_START;
+ gen = (spte & MMIO_SPTE_GEN_LOW_MASK) >> MMIO_SPTE_GEN_LOW_SHIFT;
+ gen |= (spte & MMIO_SPTE_GEN_HIGH_MASK) >> MMIO_SPTE_GEN_HIGH_SHIFT;
return gen;
}
The patch titled
Subject: mm/hugetlb: fix deadlock in hugetlb_cow error path
has been added to the -mm tree. Its filename is
mm-hugetlb-fix-deadlock-in-hugetlb_cow-error-path.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-hugetlb-fix-deadlock-in-hugetl…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-hugetlb-fix-deadlock-in-hugetl…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: mm/hugetlb: fix deadlock in hugetlb_cow error path
syzbot reported the deadlock here [1]. The issue is in hugetlb cow error
handling when there are not enough huge pages for the faulting task which
took the original reservation. It is possible that other (child) tasks
could have consumed pages associated with the reservation. In this case,
we want the task which took the original reservation to succeed. So, we
unmap any associated pages in children so that they can be used by the
faulting task that owns the reservation.
The unmapping code needs to hold i_mmap_rwsem in write mode. However, due
to commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") we are already holding i_mmap_rwsem in read mode when
hugetlb_cow is called. Technically, i_mmap_rwsem does not need to be held
in read mode for COW mappings as they can not share pmd's. Modifying the
fault code to not take i_mmap_rwsem in read mode for COW (and other
non-sharable) mappings is too involved for a stable fix. Instead, we
simply drop the hugetlb_fault_mutex and i_mmap_rwsem before unmapping.
This is OK as holding i_mmap_rwsem is technically not needed here. They are reacquired after
unmapping as expected by calling code. Since this is done in an uncommon
error path, the overhead of dropping and reacquiring mutexes is
acceptable.
While making changes, remove redundant BUG_ON after unmap_ref_private.
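The drop-and-reacquire ordering described above can be illustrated with a toy, single-threaded model. Nothing here is kernel code: toy_rwsem, toy_mutex and every toy_* function are hypothetical stand-ins, and "would deadlock" is modelled as a false return rather than an actual hang.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded stand-ins for i_mmap_rwsem and the fault mutex. */
struct toy_rwsem { int readers; bool writer; };
struct toy_mutex { bool locked; };

static void toy_lock_read(struct toy_rwsem *s)
{
	assert(!s->writer);
	s->readers++;
}

static void toy_unlock_read(struct toy_rwsem *s)
{
	assert(s->readers > 0);
	s->readers--;
}

static void toy_mutex_lock(struct toy_mutex *m)
{
	assert(!m->locked);
	m->locked = true;
}

static void toy_mutex_unlock(struct toy_mutex *m)
{
	assert(m->locked);
	m->locked = false;
}

/*
 * Stand-in for the unmapping step, which needs the rwsem in write mode.
 * Returns false where a real rwsem would block forever behind a reader.
 */
static bool toy_unmap(struct toy_rwsem *s)
{
	if (s->readers > 0 || s->writer)
		return false;
	s->writer = true;	/* write-side critical section */
	s->writer = false;
	return true;
}

/* Entered and left with the rwsem (read mode) and the mutex held. */
static bool toy_cow_error_path(struct toy_rwsem *s, struct toy_mutex *m,
			       bool drop_locks)
{
	bool ok;

	if (drop_locks) {
		toy_mutex_unlock(m);
		toy_unlock_read(s);
	}
	ok = toy_unmap(s);
	if (drop_locks) {
		toy_lock_read(s);
		toy_mutex_lock(m);
	}
	return ok;
}

/* Sets up the "both locks held" entry state and runs the error path. */
static bool toy_demo(bool drop_locks)
{
	struct toy_rwsem s = { .readers = 1, .writer = false };
	struct toy_mutex m = { .locked = true };
	bool ok = toy_cow_error_path(&s, &m, drop_locks);

	assert(s.readers == 1 && m.locked);	/* caller's view is unchanged */
	return ok;
}
```

Calling toy_demo(false) models the pre-patch path, where the write-mode step runs into the caller's own read hold; toy_demo(true) models the patched path.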
[1] https://lkml.kernel.org/r/000000000000b73ccc05b5cf8558@google.com
Link: https://lkml.kernel.org/r/4c5781b8-3b00-761e-c0c7-c5edebb6ec1a@oracle.com
Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reported-by: syzbot+5eee4145df3c15e96625(a)syzkaller.appspotmail.com
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)
--- a/mm/hugetlb.c~mm-hugetlb-fix-deadlock-in-hugetlb_cow-error-path
+++ a/mm/hugetlb.c
@@ -4105,10 +4105,30 @@ retry_avoidcopy:
* may get SIGKILLed if it later faults.
*/
if (outside_reserve) {
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ pgoff_t idx;
+ u32 hash;
+
put_page(old_page);
BUG_ON(huge_pte_none(pte));
+ /*
+ * Drop hugetlb_fault_mutex and i_mmap_rwsem before
+ * unmapping. unmapping needs to hold i_mmap_rwsem
+ * in write mode. Dropping i_mmap_rwsem in read mode
+ * here is OK as COW mappings do not interact with
+ * PMD sharing.
+ *
+ * Reacquire both after unmap operation.
+ */
+ idx = vma_hugecache_offset(h, vma, haddr);
+ hash = hugetlb_fault_mutex_hash(mapping, idx);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+
unmap_ref_private(mm, vma, old_page, haddr);
- BUG_ON(huge_pte_none(pte));
+
+ i_mmap_lock_read(mapping);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
spin_lock(ptl);
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (likely(ptep &&
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are
mm-hugetlb-fix-deadlock-in-hugetlb_cow-error-path.patch
The patch titled
Subject: lib/zlib: fix inflating zlib streams on s390
has been added to the -mm tree. Its filename is
lib-zlib-fix-inflating-zlib-streams-on-s390.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/lib-zlib-fix-inflating-zlib-strea…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/lib-zlib-fix-inflating-zlib-strea…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Ilya Leoshkevich <iii(a)linux.ibm.com>
Subject: lib/zlib: fix inflating zlib streams on s390
Decompressing zlib streams on s390 fails with an "incorrect data check"
error.
Userspace zlib checks inflate_state.flags in order to byteswap checksums
only for zlib streams, and s390 hardware inflate code, which was ported
from there, tries to match this behavior. At the same time, kernel zlib
does not use inflate_state.flags, so it contains essentially random
values. For many use cases either the zlib stream is zeroed out or the
checksum is not used, so this problem is masked, but at least SquashFS is
still affected.
Fix by always passing a checksum to and from the hardware as is, which
matches zlib_inflate()'s expectations.
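As a rough illustration of why an uninitialized flags field matters, the following userspace sketch compares the conditional byte swap with the pass-through. The names reverse32(), to_hw_old() and to_hw_new() are hypothetical; reverse32() is written in the style of a 32-bit byte swap and is not copied from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Swap the byte order of a 32-bit value. */
static uint32_t reverse32(uint32_t q)
{
	return ((q >> 24) & 0xff) | ((q >> 8) & 0xff00) |
	       ((q & 0xff00) << 8) | ((q & 0xff) << 24);
}

/* Pre-fix behaviour: the result depends on whatever flags happens to hold. */
static uint32_t to_hw_old(uint32_t check, int flags)
{
	return flags ? reverse32(check) : check;
}

/* Post-fix behaviour: the checksum is handed over unchanged. */
static uint32_t to_hw_new(uint32_t check)
{
	return check;
}

/* True if the pre-fix result is the same for two different flags values. */
static int old_is_flags_independent(uint32_t check, int flags_a, int flags_b)
{
	assert(flags_a != flags_b);
	return to_hw_old(check, flags_a) == to_hw_old(check, flags_b);
}
```

For a checksum whose bytes are not palindromic, the old variant yields two different values depending on flags, which is the failure mode the commit message describes.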
Link: https://lkml.kernel.org/r/20201215155551.894884-1-iii@linux.ibm.com
Fixes: 126196100063 ("lib/zlib: add s390 hardware support for kernel zlib_inflate")
Signed-off-by: Ilya Leoshkevich <iii(a)linux.ibm.com>
Tested-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Acked-by: Mikhail Zaslonko <zaslonko(a)linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Cc: Heiko Carstens <hca(a)linux.ibm.com>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: Mikhail Zaslonko <zaslonko(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org> [5.6+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/zlib_dfltcc/dfltcc_inflate.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/lib/zlib_dfltcc/dfltcc_inflate.c~lib-zlib-fix-inflating-zlib-streams-on-s390
+++ a/lib/zlib_dfltcc/dfltcc_inflate.c
@@ -125,7 +125,7 @@ dfltcc_inflate_action dfltcc_inflate(
param->ho = (state->write - state->whave) & ((1 << HB_BITS) - 1);
if (param->hl)
param->nt = 0; /* Honor history for the first block */
- param->cv = state->flags ? REVERSE(state->check) : state->check;
+ param->cv = state->check;
/* Inflate */
do {
@@ -138,7 +138,7 @@ dfltcc_inflate_action dfltcc_inflate(
state->bits = param->sbb;
state->whave = param->hl;
state->write = (param->ho + param->hl) & ((1 << HB_BITS) - 1);
- state->check = state->flags ? REVERSE(param->cv) : param->cv;
+ state->check = param->cv;
if (cc == DFLTCC_CC_OP2_CORRUPT && param->oesc != 0) {
/* Report an error if stream is corrupted */
state->mode = BAD;
_
Patches currently in -mm which might be from iii(a)linux.ibm.com are
lib-zlib-fix-inflating-zlib-streams-on-s390.patch
From: "Steven Rostedt (VMware)" <rostedt(a)goodmis.org>
The logic for truncating the log file for emailing based on the
MAIL_MAX_SIZE option is confusing and incorrect. Simplify it and have the
tail of the log file truncated to the max size specified in the config.
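The simplified logic amounts to computing a negative offset from the end of the log. A minimal sketch in C, assuming a hypothetical helper tail_offset() and using a negative max_size to mean "no MAIL_MAX_SIZE configured" (that convention is an assumption of this sketch, not of ktest.pl):

```c
#include <assert.h>

/*
 * Offset to use with a seek relative to the end of the file so that only
 * the tail of the current test's log is read.
 */
static long tail_offset(long log_size, long test_log_start, long max_size)
{
	long size = log_size - test_log_start;

	assert(size >= 0);
	if (max_size >= 0 && size > max_size)
		size = max_size;
	return -size;
}
```

The result would be passed to something like fseek(fp, offset, SEEK_END), mirroring the whence value of 2 used in the patch.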
Cc: stable(a)vger.kernel.org
Fixes: 855d8abd2e8ff ("ktest.pl: Change the logic to control the size of the log file emailed")
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
---
tools/testing/ktest/ktest.pl | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/tools/testing/ktest/ktest.pl b/tools/testing/ktest/ktest.pl
index 54f7d008e840..4e2450964517 100755
--- a/tools/testing/ktest/ktest.pl
+++ b/tools/testing/ktest/ktest.pl
@@ -1499,17 +1499,16 @@ sub dodie {
my $log_file;
if (defined($opt{"LOG_FILE"})) {
- my $whence = 0; # beginning of file
- my $pos = $test_log_start;
+ my $whence = 2; # End of file
+ my $log_size = tell LOG;
+ my $size = $log_size - $test_log_start;
if (defined($mail_max_size)) {
- my $log_size = tell LOG;
- $log_size -= $test_log_start;
- if ($log_size > $mail_max_size) {
- $whence = 2; # end of file
- $pos = - $mail_max_size;
+ if ($size > $mail_max_size) {
+ $size = $mail_max_size;
}
}
+ my $pos = - $size;
$log_file = "$tmpdir/log";
open (L, "$opt{LOG_FILE}") or die "Can't open $opt{LOG_FILE} to read)";
open (O, "> $tmpdir/log") or die "Can't open $tmpdir/log\n";
--
2.29.2
From: "Steven Rostedt (VMware)" <rostedt(a)goodmis.org>
If the size of the error log is too big to send via email, and the sending
fails, it won't email any result. This can be confusing for the user who is
waiting for an email on the completion of the tests.
If it fails to send email, then try again without the log file, stating that
it failed to send the log. Obviously this will not be of use if the sending
of email failed for some other reason, but it will at least give the user
some information when it fails for the most common reason.
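The retry idea can be sketched independently of Perl. In the following C sketch, send_fn, send_with_fallback() and reject_attachments() are hypothetical names; unlike the script, the retry here sends the original message unchanged rather than appending a failure notice, to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Transport callback: returns true when the mail was accepted. */
typedef bool (*send_fn)(const char *message, const char *attachment);

/* Returns the number of send attempts made (1 or 2). */
static int send_with_fallback(send_fn send, const char *message,
			      const char *attachment)
{
	assert(send != NULL);
	if (send(message, attachment))
		return 1;
	if (attachment == NULL)
		return 1;	/* nothing to strip, so no retry */
	send(message, NULL);	/* try again without the file */
	return 2;
}

/* Hypothetical transport that rejects any mail carrying an attachment. */
static bool reject_attachments(const char *message, const char *attachment)
{
	(void)message;
	return attachment == NULL;
}
```

The retry is attempted only when an attachment was present, which matches the defined($file) check in the patch.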
Cc: stable(a)vger.kernel.org
Fixes: c2d84ddb338c8 ("ktest.pl: Add MAIL_COMMAND option to define how to send email")
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
---
tools/testing/ktest/ktest.pl | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/tools/testing/ktest/ktest.pl b/tools/testing/ktest/ktest.pl
index 54188ee16c48..54f7d008e840 100755
--- a/tools/testing/ktest/ktest.pl
+++ b/tools/testing/ktest/ktest.pl
@@ -4253,7 +4253,12 @@ sub do_send_mail {
$mail_command =~ s/\$SUBJECT/$subject/g;
$mail_command =~ s/\$MESSAGE/$message/g;
- run_command $mail_command;
+ my $ret = run_command $mail_command;
+ if (!$ret && defined($file)) {
+ # try again without the file
+ $message .= "\n\n*** FAILED TO SEND LOG ***\n\n";
+ do_send_email($subject, $message);
+ }
}
sub send_email {
--
2.29.2
Since we allow removing the timeline map at runtime, there is a risk
that rq->hwsp points into a stale page. To control that risk, we hold
the RCU read lock while reading *rq->hwsp, but we missed a couple of
important barriers. First, the unpinning / removal of the timeline map
must be after all RCU readers into that map are complete, i.e. after an
rcu barrier (in this case courtesy of call_rcu()). Secondly, we must
make sure that the rq->hwsp we are about to dereference under the RCU
lock is valid. In this case, we make the rq->hwsp pointer safe during
i915_request_retire() and so we know that rq->hwsp may become invalid
only after the request has been signaled. Therefore, if the request is
not yet signaled when we acquire rq->hwsp under the RCU lock, we know that
rq->hwsp will remain valid for the duration of the RCU read lock.
This is a very small window that may lead to either considering the
request not completed (causing a delay until the request is checked
again, any wait for the request is not affected) or dereferencing an
invalid pointer.
Fixes: 3adac4689f58 ("drm/i915: Introduce concept of per-timeline (context) HWSP")
Signed-off-by: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v5.1+
---
drivers/gpu/drm/i915/gt/intel_breadcrumbs.c | 11 ++----
drivers/gpu/drm/i915/gt/intel_timeline.c | 6 ++--
drivers/gpu/drm/i915/i915_request.h | 37 ++++++++++++++++++---
3 files changed, 39 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
index 3c62fd6daa76..f96cd7d9b419 100644
--- a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
+++ b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
@@ -134,11 +134,6 @@ static bool remove_signaling_context(struct intel_breadcrumbs *b,
return true;
}
-static inline bool __request_completed(const struct i915_request *rq)
-{
- return i915_seqno_passed(__hwsp_seqno(rq), rq->fence.seqno);
-}
-
__maybe_unused static bool
check_signal_order(struct intel_context *ce, struct i915_request *rq)
{
@@ -245,7 +240,7 @@ static void signal_irq_work(struct irq_work *work)
list_for_each_entry_rcu(rq, &ce->signals, signal_link) {
bool release;
- if (!__request_completed(rq))
+ if (!__i915_request_is_complete(rq))
break;
if (!test_and_clear_bit(I915_FENCE_FLAG_SIGNAL,
@@ -380,7 +375,7 @@ static void insert_breadcrumb(struct i915_request *rq)
* straight onto a signaled list, and queue the irq worker for
* its signal completion.
*/
- if (__request_completed(rq)) {
+ if (__i915_request_is_complete(rq)) {
irq_signal_request(rq, b);
return;
}
@@ -468,7 +463,7 @@ void i915_request_cancel_breadcrumb(struct i915_request *rq)
if (release)
intel_context_put(ce);
- if (__request_completed(rq))
+ if (__i915_request_is_complete(rq))
irq_signal_request(rq, b);
i915_request_put(rq);
diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c b/drivers/gpu/drm/i915/gt/intel_timeline.c
index 512afacd2bdc..a0ce2fb8737a 100644
--- a/drivers/gpu/drm/i915/gt/intel_timeline.c
+++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
@@ -126,6 +126,10 @@ static void __rcu_cacheline_free(struct rcu_head *rcu)
struct intel_timeline_cacheline *cl =
container_of(rcu, typeof(*cl), rcu);
+ /* Must wait until after all *rq->hwsp are complete before removing */
+ i915_gem_object_unpin_map(cl->hwsp->vma->obj);
+ i915_vma_put(cl->hwsp->vma);
+
i915_active_fini(&cl->active);
kfree(cl);
}
@@ -134,8 +138,6 @@ static void __idle_cacheline_free(struct intel_timeline_cacheline *cl)
{
GEM_BUG_ON(!i915_active_is_idle(&cl->active));
- i915_gem_object_unpin_map(cl->hwsp->vma->obj);
- i915_vma_put(cl->hwsp->vma);
__idle_hwsp_free(cl->hwsp, ptr_unmask_bits(cl->vaddr, CACHELINE_BITS));
call_rcu(&cl->rcu, __rcu_cacheline_free);
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 92e4320c50c4..7c4453e60323 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -440,7 +440,7 @@ static inline u32 hwsp_seqno(const struct i915_request *rq)
static inline bool __i915_request_has_started(const struct i915_request *rq)
{
- return i915_seqno_passed(hwsp_seqno(rq), rq->fence.seqno - 1);
+ return i915_seqno_passed(__hwsp_seqno(rq), rq->fence.seqno - 1);
}
/**
@@ -471,11 +471,19 @@ static inline bool __i915_request_has_started(const struct i915_request *rq)
*/
static inline bool i915_request_started(const struct i915_request *rq)
{
+ bool result;
+
if (i915_request_signaled(rq))
return true;
- /* Remember: started but may have since been preempted! */
- return __i915_request_has_started(rq);
+ result = true;
+ rcu_read_lock(); /* the HWSP may be freed at runtime */
+ if (likely(!i915_request_signaled(rq)))
+ /* Remember: started but may have since been preempted! */
+ result = __i915_request_has_started(rq);
+ rcu_read_unlock();
+
+ return result;
}
/**
@@ -488,10 +496,16 @@ static inline bool i915_request_started(const struct i915_request *rq)
*/
static inline bool i915_request_is_running(const struct i915_request *rq)
{
+ bool result;
+
if (!i915_request_is_active(rq))
return false;
- return __i915_request_has_started(rq);
+ rcu_read_lock();
+ result = __i915_request_has_started(rq) && i915_request_is_active(rq);
+ rcu_read_unlock();
+
+ return result;
}
/**
@@ -515,12 +529,25 @@ static inline bool i915_request_is_ready(const struct i915_request *rq)
return !list_empty(&rq->sched.link);
}
+static inline bool __i915_request_is_complete(const struct i915_request *rq)
+{
+ return i915_seqno_passed(__hwsp_seqno(rq), rq->fence.seqno);
+}
+
static inline bool i915_request_completed(const struct i915_request *rq)
{
+ bool result;
+
if (i915_request_signaled(rq))
return true;
- return i915_seqno_passed(hwsp_seqno(rq), rq->fence.seqno);
+ result = true;
+ rcu_read_lock(); /* the HWSP may be freed at runtime */
+ if (likely(!i915_request_signaled(rq)))
+ result = __i915_request_is_complete(rq);
+ rcu_read_unlock();
+
+ return result;
}
static inline void i915_request_mark_complete(struct i915_request *rq)
--
2.20.1
The purpose of io_uring_cancel_files() is to wait for all requests
matching ->files to go/be cancelled. We should first drop files of a
request in io_req_drop_files() and only then make it undiscoverable for
io_uring_cancel_files.
First drop, then delete from list. It's ok to leave req->id->files
dangling, because it's not dereferenced by cancellation code, only
compared against. The cancellation code would potentially go to sleep and be
awakened by the following wake_up() in io_req_drop_files().
Fixes: 0f2122045b946 ("io_uring: don't rely on weak ->files references")
Cc: <stable(a)vger.kernel.org> # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
---
fs/io_uring.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8cf6f22afc5e..b74957856e68 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6098,15 +6098,15 @@ static void io_req_drop_files(struct io_kiocb *req)
struct io_uring_task *tctx = req->task->io_uring;
unsigned long flags;
+ put_files_struct(req->work.identity->files);
+ put_nsproxy(req->work.identity->nsproxy);
spin_lock_irqsave(&ctx->inflight_lock, flags);
list_del(&req->inflight_entry);
- if (atomic_read(&tctx->in_idle))
- wake_up(&tctx->wait);
spin_unlock_irqrestore(&ctx->inflight_lock, flags);
req->flags &= ~REQ_F_INFLIGHT;
- put_files_struct(req->work.identity->files);
- put_nsproxy(req->work.identity->nsproxy);
req->work.flags &= ~IO_WQ_WORK_FILES;
+ if (atomic_read(&tctx->in_idle))
+ wake_up(&tctx->wait);
}
static void __io_clean_op(struct io_kiocb *req)
--
2.24.0
This reverts
commit f1f028ff89cb ("DTS: ARM: gta04: introduce legacy spi-cs-high to make display work again")
which had to be introduced after
commit 6953c57ab172 ("gpio: of: Handle SPI chipselect legacy bindings")
broke the GTA04 display. This contradicted the data sheet but was the only
way to get it as an spi client operational again.
The panel data sheet defines the chip-select to be active low.
Now, with the arrival of
commit 766c6b63aa04 ("spi: fix client driver breakages when using GPIO descriptors")
the logic of interaction between spi-cs-high and the gpio descriptor flags
has been changed a second time, making the display broken again. So we have
to remove the original fix which in retrospect was a workaround of a bug in
the spi subsystem and not a feature of the panel or bug in the device tree.
With this fix the device tree is back in sync with the data sheet and
spi subsystem code.
Fixes: 766c6b63aa04 ("spi: fix client driver breakages when using GPIO descriptors")
CC: stable(a)vger.kernel.org
Signed-off-by: H. Nikolaus Schaller <hns(a)goldelico.com>
---
arch/arm/boot/dts/omap3-gta04.dtsi | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/arm/boot/dts/omap3-gta04.dtsi b/arch/arm/boot/dts/omap3-gta04.dtsi
index c8745bc800f71..003202d129907 100644
--- a/arch/arm/boot/dts/omap3-gta04.dtsi
+++ b/arch/arm/boot/dts/omap3-gta04.dtsi
@@ -124,7 +124,6 @@ lcd: td028ttec1@0 {
spi-max-frequency = <100000>;
spi-cpol;
spi-cpha;
- spi-cs-high;
backlight= <&backlight>;
label = "lcd";
--
2.26.2
On Fri, Dec 18, 2020 at 05:18:16AM +0800, Young Hsieh wrote:
> Hi Greg,
>
> Thanks. I am looking for the Essential, RAS & Perf patches for AMD Milan as follows:
>
I don't see anything here :(
> I am not familiar the rules for stable kernel patches, can you help to elaborate ? Thanks for your assistance! :)
Please read:
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for what stable kernels are all about.
thanks,
greg k-h
The patch titled
Subject: kasan: fix memory leak of kasan quarantine
has been added to the -mm tree. Its filename is
kasan-fix-memory-leak-of-kasan-quarantine.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/kasan-fix-memory-leak-of-kasan-qu…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/kasan-fix-memory-leak-of-kasan-qu…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Kuan-Ying Lee <Kuan-Ying.Lee(a)mediatek.com>
Subject: kasan: fix memory leak of kasan quarantine
When a CPU is going offline, q->offline is set to true; an interrupt may
then occur and call quarantine_put(). But quarantine_put() does not free
the object in that case, so the object is leaked.
Add qlink_free() to free the object.
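The ownership rule behind the fix (an object handed to the quarantine must end up either queued or freed) can be shown with a small userspace model. All names below (toy_quarantine, toy_quarantine_put(), toy_put_one()) are hypothetical, and free() merely plays the role of qlink_free().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_QUEUE_MAX 8

struct toy_quarantine {
	bool offline;
	int count;
	void *held[TOY_QUEUE_MAX];
};

/* Takes ownership of obj: it ends up queued or freed, never dropped. */
static bool toy_quarantine_put(struct toy_quarantine *q, void *obj)
{
	if (q->offline) {
		free(obj);	/* plays the role of the added qlink_free() call */
		return false;
	}
	assert(q->count < TOY_QUEUE_MAX);
	q->held[q->count++] = obj;
	return true;
}

/* Puts one object and reports how many the quarantine still holds. */
static int toy_put_one(bool offline)
{
	struct toy_quarantine q = { .offline = offline };
	int held;

	toy_quarantine_put(&q, malloc(16));
	held = q.count;
	while (q.count > 0)
		free(q.held[--q.count]);	/* drain what was queued */
	return held;
}
```

Without the free() in the offline branch, the object passed in would be unreachable after the early return, which is the leak the patch addresses.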
Link: https://lkml.kernel.org/r/1608207487-30537-2-git-send-email-Kuan-Ying.Lee@m…
Fixes: 6c82d45c7f03 ("kasan: fix object remaining in offline per-cpu quarantine")
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee(a)mediatek.com>
Cc: Andrey Ryabinin <aryabinin(a)virtuozzo.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Matthias Brugger <matthias.bgg(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [5.10-]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kasan/quarantine.c | 1 +
1 file changed, 1 insertion(+)
--- a/mm/kasan/quarantine.c~kasan-fix-memory-leak-of-kasan-quarantine
+++ a/mm/kasan/quarantine.c
@@ -191,6 +191,7 @@ void quarantine_put(struct kasan_free_me
q = this_cpu_ptr(&cpu_quarantine);
if (q->offline) {
+ qlink_free(&info->quarantine_link, cache);
local_irq_restore(flags);
return;
}
_
Patches currently in -mm which might be from Kuan-Ying.Lee(a)mediatek.com are
kasan-fix-memory-leak-of-kasan-quarantine.patch
On Fri, Dec 18, 2020 at 07:34:07AM +0800, Young Hsieh wrote:
>Hi Sasha,
>
>The other reason we are wondering if 5.4 has backporting is we are not very comfortable about using the bleeding edge. Do you think 5.10 is stable currently? Thanks for your help again. :)
Hi Young,
See https://www.kernel.org/releases.html for a list of the LTS branches.
The 5.10 kernel is already designated as LTS and should be used by most
stable tree users; this is the preferable kernel to adopt right now.
The AMD hardware enablement stuff won't be backported on top of 5.4.
--
Thanks,
Sasha
Hello,
This is Young Hsieh from Uber; I am currently on the Uber infra team and in charge of server system design. Nice to e-meet you! :)
We are working on the AMD Milan platform with Debian, and noticed there are some patches for performance and security improvements which are not yet in the LTS kernels (4.14/4.19/5.4). On our side, we prefer to use the general LTS kernel release instead of a customized kernel, so that we stay aligned on major fixes down the road. So we would like to know if there is any plan to backport these patches and, if so, what the timeline is. Thanks a lot again for any advice.
Cheers,
****************************************
Young Hsieh
Uber Hardware Engineer
****************************************
Xattr code using inodes with large xattr data can end up dropping the last
inode reference (and thus deleting the inode) from places like
ext4_xattr_set_entry(). That function is called with a transaction started
and so ext4_evict_inode() can deadlock against fs freezing like:
CPU1                                    CPU2
removexattr()                           freeze_super()
  vfs_removexattr()
    ext4_xattr_set()
      handle = ext4_journal_start()
      ...
      ext4_xattr_set_entry()
        iput(old_ea_inode)
          ext4_evict_inode(old_ea_inode)
                                          sb->s_writers.frozen = SB_FREEZE_FS;
                                          sb_wait_write(sb, SB_FREEZE_FS);
                                          ext4_freeze()
                                            jbd2_journal_lock_updates()
                                              -> blocks waiting for all
                                                 handles to stop
            sb_start_intwrite()
              -> blocks as sb is already in SB_FREEZE_FS state
Generally it is advisable to delete inodes from a separate transaction,
as it can consume quite some credits; however, in this case it would be
quite clumsy and, furthermore, the credits for inode deletion are quite
limited and already accounted for. So just tweak ext4_evict_inode() to
avoid freeze protection if we already have a transaction started, in which
case it is not really needed anyway.
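The "take protection only if we do not already have it, and release only what we took" pattern can be sketched as a toy model. The names toy_sb, toy_evict() and holds_during_evict() are hypothetical; the in_transaction flag stands in for the check on the current journal handle.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy superblock that only counts freeze-protection holds. */
struct toy_sb { int intwrite_holders; };

static void toy_start_intwrite(struct toy_sb *sb)
{
	sb->intwrite_holders++;
}

static void toy_end_intwrite(struct toy_sb *sb)
{
	assert(sb->intwrite_holders > 0);
	sb->intwrite_holders--;
}

/* Returns the number of holds in place while the eviction work runs. */
static int toy_evict(struct toy_sb *sb, bool in_transaction)
{
	bool freeze_protected = false;
	int held;

	if (!in_transaction) {
		toy_start_intwrite(sb);
		freeze_protected = true;
	}
	held = sb->intwrite_holders;
	/* ... inode deletion work would go here ... */
	if (freeze_protected)
		toy_end_intwrite(sb);
	return held;
}

/* Runs one eviction on a fresh superblock and checks it is balanced. */
static int holds_during_evict(bool in_transaction)
{
	struct toy_sb sb = { 0 };
	int held = toy_evict(&sb, in_transaction);

	assert(sb.intwrite_holders == 0);	/* every start has a matching end */
	return held;
}
```

Tracking the decision in a local flag keeps every exit path balanced, which is why the patch guards each sb_end_intwrite() call with freeze_protected.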
CC: stable(a)vger.kernel.org
Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
CC: Tahsin Erdogan <tahsin(a)google.com>
Signed-off-by: Jan Kara <jack(a)suse.cz>
---
fs/ext4/inode.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 72534319fae5..777eb08b29cd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -175,6 +175,7 @@ void ext4_evict_inode(struct inode *inode)
*/
int extra_credits = 6;
struct ext4_xattr_inode_array *ea_inode_array = NULL;
+ bool freeze_protected = false;
trace_ext4_evict_inode(inode);
@@ -232,9 +233,14 @@ void ext4_evict_inode(struct inode *inode)
/*
* Protect us against freezing - iput() caller didn't have to have any
- * protection against it
+ * protection against it. When we are in a running transaction though,
+ * we are already protected against freezing and we cannot grab further
+ * protection due to lock ordering constraints.
*/
- sb_start_intwrite(inode->i_sb);
+ if (!ext4_journal_current_handle()) {
+ sb_start_intwrite(inode->i_sb);
+ freeze_protected = true;
+ }
if (!IS_NOQUOTA(inode))
extra_credits += EXT4_MAXQUOTAS_DEL_BLOCKS(inode->i_sb);
@@ -253,7 +259,8 @@ void ext4_evict_inode(struct inode *inode)
* cleaned up.
*/
ext4_orphan_del(NULL, inode);
- sb_end_intwrite(inode->i_sb);
+ if (freeze_protected)
+ sb_end_intwrite(inode->i_sb);
goto no_delete;
}
@@ -294,7 +301,8 @@ void ext4_evict_inode(struct inode *inode)
stop_handle:
ext4_journal_stop(handle);
ext4_orphan_del(NULL, inode);
- sb_end_intwrite(inode->i_sb);
+ if (freeze_protected)
+ sb_end_intwrite(inode->i_sb);
ext4_xattr_inode_array_free(ea_inode_array);
goto no_delete;
}
@@ -323,7 +331,8 @@ void ext4_evict_inode(struct inode *inode)
else
ext4_free_inode(handle, inode);
ext4_journal_stop(handle);
- sb_end_intwrite(inode->i_sb);
+ if (freeze_protected)
+ sb_end_intwrite(inode->i_sb);
ext4_xattr_inode_array_free(ea_inode_array);
return;
no_delete:
--
2.16.4
The patch below does not apply to the 5.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 34c0f6f2695a2db81e09a3ab7bdb2853f45d4d3d Mon Sep 17 00:00:00 2001
From: "Maciej S. Szmigiero" <maciej.szmigiero(a)oracle.com>
Date: Sat, 5 Dec 2020 01:48:08 +0100
Subject: [PATCH] KVM: mmu: Fix SPTE encoding of MMIO generation upper half
Commit cae7ed3c2cb0 ("KVM: x86: Refactor the MMIO SPTE generation handling")
cleaned up the computation of MMIO generation SPTE masks, but it
introduced a bug in how the upper part was encoded:
SPTE bits 52-61 were supposed to contain bits 10-19 of the current
generation number, however a missing shift encoded bits 1-10 there instead
(mostly duplicating the lower part of the encoded generation number that
then consisted of bits 1-9).
In the meantime, the upper part was shrunk by one bit and moved by
subsequent commits to become an upper half of the encoded generation number
(bits 9-17 of bits 0-17 encoded in a SPTE).
In addition to the above, commit 56871d444bc4 ("KVM: x86: fix overlap between SPTE_MMIO_MASK and generation")
has changed the SPTE bit range assigned to encode the generation number and
the total number of bits encoded but did not update them in the comment
attached to their defines, nor in the KVM MMU doc.
Let's do it here, too, since it is too trivial a thing to warrant a separate
commit.
Fixes: cae7ed3c2cb0 ("KVM: x86: Refactor the MMIO SPTE generation handling")
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero(a)oracle.com>
Message-Id: <156700708db2a5296c5ed7a8b9ac71f1e9765c85.1607129096.git.maciej.szmigiero(a)oracle.com>
Cc: stable(a)vger.kernel.org
[Reorganize macros so that everything is computed from the bit ranges. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst
index 1c030dbac7c4..5bfe28b0728e 100644
--- a/Documentation/virt/kvm/mmu.rst
+++ b/Documentation/virt/kvm/mmu.rst
@@ -455,7 +455,7 @@ If the generation number of the spte does not equal the global generation
number, it will ignore the cached MMIO information and handle the page
fault through the slow path.
-Since only 19 bits are used to store generation-number on mmio spte, all
+Since only 18 bits are used to store generation-number on mmio spte, all
pages are zapped when there is an overflow.
Unfortunately, a single memory access might access kvm_memslots(kvm) multiple
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index fcac2cac78fe..c51ad544f25b 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -40,8 +40,8 @@ static u64 generation_mmio_spte_mask(u64 gen)
WARN_ON(gen & ~MMIO_SPTE_GEN_MASK);
BUILD_BUG_ON((MMIO_SPTE_GEN_HIGH_MASK | MMIO_SPTE_GEN_LOW_MASK) & SPTE_SPECIAL_MASK);
- mask = (gen << MMIO_SPTE_GEN_LOW_START) & MMIO_SPTE_GEN_LOW_MASK;
- mask |= (gen << MMIO_SPTE_GEN_HIGH_START) & MMIO_SPTE_GEN_HIGH_MASK;
+ mask = (gen << MMIO_SPTE_GEN_LOW_SHIFT) & MMIO_SPTE_GEN_LOW_MASK;
+ mask |= (gen << MMIO_SPTE_GEN_HIGH_SHIFT) & MMIO_SPTE_GEN_HIGH_MASK;
return mask;
}
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 5c75a451c000..2b3a30bd38b0 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -56,11 +56,11 @@
#define SPTE_MMU_WRITEABLE (1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
/*
- * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
+ * Due to limited space in PTEs, the MMIO generation is a 18 bit subset of
* the memslots generation and is derived as follows:
*
* Bits 0-8 of the MMIO generation are propagated to spte bits 3-11
- * Bits 9-18 of the MMIO generation are propagated to spte bits 52-61
+ * Bits 9-17 of the MMIO generation are propagated to spte bits 54-62
*
* The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
* the MMIO generation number, as doing so would require stealing a bit from
@@ -69,18 +69,29 @@
* requires a full MMU zap). The flag is instead explicitly queried when
* checking for MMIO spte cache hits.
*/
-#define MMIO_SPTE_GEN_MASK GENMASK_ULL(17, 0)
#define MMIO_SPTE_GEN_LOW_START 3
#define MMIO_SPTE_GEN_LOW_END 11
-#define MMIO_SPTE_GEN_LOW_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
- MMIO_SPTE_GEN_LOW_START)
#define MMIO_SPTE_GEN_HIGH_START PT64_SECOND_AVAIL_BITS_SHIFT
#define MMIO_SPTE_GEN_HIGH_END 62
+
+#define MMIO_SPTE_GEN_LOW_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
+ MMIO_SPTE_GEN_LOW_START)
#define MMIO_SPTE_GEN_HIGH_MASK GENMASK_ULL(MMIO_SPTE_GEN_HIGH_END, \
MMIO_SPTE_GEN_HIGH_START)
+#define MMIO_SPTE_GEN_LOW_BITS (MMIO_SPTE_GEN_LOW_END - MMIO_SPTE_GEN_LOW_START + 1)
+#define MMIO_SPTE_GEN_HIGH_BITS (MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1)
+
+/* remember to adjust the comment above as well if you change these */
+static_assert(MMIO_SPTE_GEN_LOW_BITS == 9 && MMIO_SPTE_GEN_HIGH_BITS == 9);
+
+#define MMIO_SPTE_GEN_LOW_SHIFT (MMIO_SPTE_GEN_LOW_START - 0)
+#define MMIO_SPTE_GEN_HIGH_SHIFT (MMIO_SPTE_GEN_HIGH_START - MMIO_SPTE_GEN_LOW_BITS)
+
+#define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
+
extern u64 __read_mostly shadow_nx_mask;
extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
extern u64 __read_mostly shadow_user_mask;
@@ -228,8 +239,8 @@ static inline u64 get_mmio_spte_generation(u64 spte)
{
u64 gen;
- gen = (spte & MMIO_SPTE_GEN_LOW_MASK) >> MMIO_SPTE_GEN_LOW_START;
- gen |= (spte & MMIO_SPTE_GEN_HIGH_MASK) >> MMIO_SPTE_GEN_HIGH_START;
+ gen = (spte & MMIO_SPTE_GEN_LOW_MASK) >> MMIO_SPTE_GEN_LOW_SHIFT;
+ gen |= (spte & MMIO_SPTE_GEN_HIGH_MASK) >> MMIO_SPTE_GEN_HIGH_SHIFT;
return gen;
}
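The encoding that the KVM patch above corrects can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: the GEN_* names are shortened stand-ins for the MMIO_SPTE_GEN_* macros, GENMASK_ULL is re-implemented locally, and the value 54 for the start of the high field is inferred from the patched comment ("spte bits 54-62") rather than taken from PT64_SECOND_AVAIL_BITS_SHIFT itself. encode_gen_buggy() models the pre-fix behaviour of shifting by the field start.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's GENMASK_ULL(h, l): bits l..h set. */
#define GENMASK_ULL(h, l) ((~0ULL << (l)) & (~0ULL >> (63 - (h))))

/* Bit ranges taken from the patched comment in spte.h. */
#define GEN_LOW_START   3
#define GEN_LOW_END     11
#define GEN_HIGH_START  54  /* assumption: inferred from "spte bits 54-62" */
#define GEN_HIGH_END    62

#define GEN_LOW_MASK    GENMASK_ULL(GEN_LOW_END, GEN_LOW_START)
#define GEN_HIGH_MASK   GENMASK_ULL(GEN_HIGH_END, GEN_HIGH_START)
#define GEN_LOW_BITS    (GEN_LOW_END - GEN_LOW_START + 1)
#define GEN_HIGH_BITS   (GEN_HIGH_END - GEN_HIGH_START + 1)

static_assert(GEN_LOW_BITS == 9 && GEN_HIGH_BITS == 9, "field widths changed");

/* A field's shift is its start bit minus the generation bits stored below it. */
#define GEN_LOW_SHIFT   (GEN_LOW_START - 0)
#define GEN_HIGH_SHIFT  (GEN_HIGH_START - GEN_LOW_BITS)

/* Spread an 18-bit generation over the two SPTE fields (post-fix logic). */
uint64_t encode_gen(uint64_t gen)
{
	uint64_t mask;

	mask = (gen << GEN_LOW_SHIFT) & GEN_LOW_MASK;
	mask |= (gen << GEN_HIGH_SHIFT) & GEN_HIGH_MASK;
	return mask;
}

/* Inverse of encode_gen(): gather both fields back into one number. */
uint64_t decode_gen(uint64_t spte)
{
	uint64_t gen;

	gen = (spte & GEN_LOW_MASK) >> GEN_LOW_SHIFT;
	gen |= (spte & GEN_HIGH_MASK) >> GEN_HIGH_SHIFT;
	return gen;
}

/* Pre-fix logic: shifts by the field start, so the high field repeats low bits. */
uint64_t encode_gen_buggy(uint64_t gen)
{
	uint64_t mask;

	mask = (gen << GEN_LOW_START) & GEN_LOW_MASK;
	mask |= (gen << GEN_HIGH_START) & GEN_HIGH_MASK;
	return mask;
}
```

With these definitions, generation bit 9 lands in SPTE bit 54 through encode_gen(), whereas encode_gen_buggy() drops it entirely and instead copies generation bit 0 into both fields.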
From: Fenghua Yu <fenghua.yu(a)intel.com>
Currently when moving a task to a resource group the PQR_ASSOC MSR
is updated with the new closid and rmid in an added task callback.
If the task is running the work is run as soon as possible. If the
task is not running the work is executed later in the kernel exit path
when the kernel returns to the task again.
Updating the PQR_ASSOC MSR as soon as possible on the CPU a moved task
is running on is the right thing to do. Queueing work for a task that is
not running is unnecessary (the PQR_ASSOC MSR is already updated when the
task is scheduled in) and wastes system resources because of the way in
which it is implemented: Work to update the PQR_ASSOC register is queued
every time the user writes a task id to the "tasks" file, even if the task
already belongs to the resource group. This could result in multiple pending
work items associated with a single task even if they are all identical and
even though only a single update with most recent values is needed.
Specifically, even if a task is moved between different resource groups
while it is sleeping then it is only the last move that is relevant but
yet a work item is queued during each move.
This unnecessary queueing of work items could result in significant system
resource waste, especially on tasks sleeping for a long time. For example,
as demonstrated by Shakeel Butt in [1] writing the same task id to the
"tasks" file can quickly consume significant memory. The same problem
(wasted system resources) occurs when moving a task between different
resource groups.
As pointed out by Valentin Schneider in [2] there is an additional issue with
the way in which the queueing of work is done in that the task_struct update
is currently done after the work is queued, resulting in a race with the
register update possibly done before the data needed by the update is available.
To solve these issues, the PQR_ASSOC MSR is updated in a synchronous way
right after the new closid and rmid are ready during the task movement,
only if the task is running. If a moved task is not running nothing is
done since the PQR_ASSOC MSR will be updated next time the task is scheduled.
This is the same way used to update the register when tasks are moved as
part of resource group removal.
[1] https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3…
[2] https://lore.kernel.org/lkml/20201123022433.17905-1-valentin.schneider@arm.…
Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files")
Reported-by: Shakeel Butt <shakeelb(a)google.com>
Reported-by: Valentin Schneider <valentin.schneider(a)arm.com>
Signed-off-by: Fenghua Yu <fenghua.yu(a)intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre(a)intel.com>
Reviewed-by: Tony Luck <tony.luck(a)intel.com>
Cc: stable(a)vger.kernel.org
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 123 ++++++++++---------------
1 file changed, 50 insertions(+), 73 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 68db7d2dec8f..9d62f1fadcc3 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -525,6 +525,16 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)
kfree(rdtgrp);
}
+static void _update_task_closid_rmid(void *task)
+{
+ /*
+ * If the task is still current on this CPU, update PQR_ASSOC MSR.
+ * Otherwise, the MSR is updated when the task is scheduled in.
+ */
+ if (task == current)
+ resctrl_sched_in();
+}
+
#ifdef CONFIG_SMP
/* Get the CPU if the task is on it. */
static bool task_on_cpu(struct task_struct *t, int *cpu)
@@ -552,94 +562,61 @@ static void set_task_cpumask(struct task_struct *t, struct cpumask *mask)
if (mask && task_on_cpu(t, &cpu))
cpumask_set_cpu(cpu, mask);
}
-#else
-static inline void
-set_task_cpumask(struct task_struct *t, struct cpumask *mask) { }
-#endif
-
-struct task_move_callback {
- struct callback_head work;
- struct rdtgroup *rdtgrp;
-};
-static void move_myself(struct callback_head *head)
+static void update_task_closid_rmid(struct task_struct *t)
{
- struct task_move_callback *callback;
- struct rdtgroup *rdtgrp;
-
- callback = container_of(head, struct task_move_callback, work);
- rdtgrp = callback->rdtgrp;
-
- /*
- * If resource group was deleted before this task work callback
- * was invoked, then assign the task to root group and free the
- * resource group.
- */
- if (atomic_dec_and_test(&rdtgrp->waitcount) &&
- (rdtgrp->flags & RDT_DELETED)) {
- current->closid = 0;
- current->rmid = 0;
- rdtgroup_remove(rdtgrp);
- }
+ int cpu;
- if (unlikely(current->flags & PF_EXITING))
- goto out;
+ if (task_on_cpu(t, &cpu))
+ smp_call_function_single(cpu, _update_task_closid_rmid, t, 1);
+}
- preempt_disable();
- /* update PQR_ASSOC MSR to make resource group go into effect */
- resctrl_sched_in();
- preempt_enable();
+#else
+static inline void
+set_task_cpumask(struct task_struct *t, struct cpumask *mask) { }
-out:
- kfree(callback);
+static void update_task_closid_rmid(struct task_struct *t)
+{
+ _update_task_closid_rmid(t);
}
+#endif
static int __rdtgroup_move_task(struct task_struct *tsk,
struct rdtgroup *rdtgrp)
{
- struct task_move_callback *callback;
- int ret;
-
- callback = kzalloc(sizeof(*callback), GFP_KERNEL);
- if (!callback)
- return -ENOMEM;
- callback->work.func = move_myself;
- callback->rdtgrp = rdtgrp;
-
/*
- * Take a refcount, so rdtgrp cannot be freed before the
- * callback has been invoked.
+ * Set the task's closid/rmid before the PQR_ASSOC MSR can be
+ * updated by them.
+ *
+ * For ctrl_mon groups, move both closid and rmid.
+ * For monitor groups, can move the tasks only from
+ * their parent CTRL group.
*/
- atomic_inc(&rdtgrp->waitcount);
- ret = task_work_add(tsk, &callback->work, TWA_RESUME);
- if (ret) {
- /*
- * Task is exiting. Drop the refcount and free the callback.
- * No need to check the refcount as the group cannot be
- * deleted before the write function unlocks rdtgroup_mutex.
- */
- atomic_dec(&rdtgrp->waitcount);
- kfree(callback);
- rdt_last_cmd_puts("Task exited\n");
- } else {
- /*
- * For ctrl_mon groups move both closid and rmid.
- * For monitor groups, can move the tasks only from
- * their parent CTRL group.
- */
- if (rdtgrp->type == RDTCTRL_GROUP) {
- tsk->closid = rdtgrp->closid;
+
+ if (rdtgrp->type == RDTCTRL_GROUP) {
+ tsk->closid = rdtgrp->closid;
+ tsk->rmid = rdtgrp->mon.rmid;
+ } else if (rdtgrp->type == RDTMON_GROUP) {
+ if (rdtgrp->mon.parent->closid == tsk->closid) {
tsk->rmid = rdtgrp->mon.rmid;
- } else if (rdtgrp->type == RDTMON_GROUP) {
- if (rdtgrp->mon.parent->closid == tsk->closid) {
- tsk->rmid = rdtgrp->mon.rmid;
- } else {
- rdt_last_cmd_puts("Can't move task to different control group\n");
- ret = -EINVAL;
- }
+ } else {
+ rdt_last_cmd_puts("Can't move task to different control group\n");
+ return -EINVAL;
}
+ } else {
+ rdt_last_cmd_puts("Invalid resource group type\n");
+ return -EINVAL;
}
- return ret;
+
+ /*
+ * By now, the task's closid and rmid are set. If the task is current
+ * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
+ * group go into effect. If the task is not current, the MSR will be
+ * updated when the task is scheduled in.
+ */
+ update_task_closid_rmid(tsk);
+
+ return 0;
}
static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)
--
2.26.2
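The ordering that the resctrl patch above establishes (store the new closid/rmid in the task first, then refresh the register only where the task is currently running) can be modelled in plain C. This is a toy sketch, not kernel code: struct task, struct cpu, struct group, sched_in() and move_task() are hypothetical stand-ins, and the loop over an array of CPUs replaces the task_on_cpu() plus smp_call_function_single() pair used by the patch.

```c
#include <errno.h>
#include <stddef.h>

enum group_type { CTRL_GROUP, MON_GROUP };

struct group {
	enum group_type type;
	unsigned int closid;
	unsigned int rmid;
	const struct group *parent;	/* only consulted for MON_GROUP */
};

struct task {
	unsigned int closid, rmid;
};

struct cpu {
	const struct task *curr;		/* task running on this CPU, or NULL */
	unsigned int pqr_closid, pqr_rmid;	/* models the PQR_ASSOC MSR */
};

/* Models resctrl_sched_in(): load the running task's ids into the "MSR". */
void sched_in(struct cpu *c)
{
	c->pqr_closid = c->curr->closid;
	c->pqr_rmid = c->curr->rmid;
}

int move_task(struct task *t, const struct group *g,
	      struct cpu *cpus, size_t ncpus)
{
	size_t i;

	/* Step 1: update the task's ids before any register update can use them. */
	if (g->type == CTRL_GROUP) {
		t->closid = g->closid;
		t->rmid = g->rmid;
	} else if (g->type == MON_GROUP) {
		if (g->parent->closid != t->closid)
			return -EINVAL;	/* different control group */
		t->rmid = g->rmid;
	} else {
		return -EINVAL;
	}

	/* Step 2: synchronous update, but only if the task is current somewhere. */
	for (i = 0; i < ncpus; i++)
		if (cpus[i].curr == t)
			sched_in(&cpus[i]);

	return 0;
}
```

Nothing is allocated or queued for a task that is not running, so repeated moves of a sleeping task cost nothing beyond the two stores.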
When a CPU is going offline, q->offline is set to true. If an interrupt
then arrives, it may call quarantine_put(), but quarantine_put() does not
free the object, so the object is leaked.
Add qlink_free() to free the object.
Kuan-Ying Lee (1):
kasan: fix memory leak of kasan quarantine
mm/kasan/quarantine.c | 1 +
1 file changed, 1 insertion(+)
--
2.18.0
From: "Steven Rostedt (VMware)" <rostedt(a)goodmis.org>
It was believed that metag was the only architecture that required the ring
buffer to keep 8 byte words aligned on 8 byte architectures, and with its
removal, it was assumed that the ring buffer code did not need to handle
this case. It appears that sparc64 also requires this.
The following was reported on a sparc64 boot up:
kernel: futex hash table entries: 65536 (order: 9, 4194304 bytes, linear)
kernel: Running postponed tracer tests:
kernel: Testing tracer function:
kernel: Kernel unaligned access at TPC[552a20] trace_function+0x40/0x140
kernel: Kernel unaligned access at TPC[552a24] trace_function+0x44/0x140
kernel: Kernel unaligned access at TPC[552a20] trace_function+0x40/0x140
kernel: Kernel unaligned access at TPC[552a24] trace_function+0x44/0x140
kernel: Kernel unaligned access at TPC[552a20] trace_function+0x40/0x140
kernel: PASSED
Need to put back the 64BIT aligned code for the ring buffer.
Link: https://lore.kernel.org/r/CADxRZqzXQRYgKc=y-KV=S_yHL+Y8Ay2mh5ezeZUnpRvg+syW…
Cc: stable(a)vger.kernel.org
Fixes: 86b3de60a0b6 ("ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS")
Reported-by: Anatoly Pugachev <matorola(a)gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
---
arch/Kconfig | 16 ++++++++++++++++
kernel/trace/ring_buffer.c | 17 +++++++++++++----
2 files changed, 29 insertions(+), 4 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 56b6ccc0e32d..fa716994f77e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -143,6 +143,22 @@ config UPROBES
managed by the kernel and kept transparent to the probed
application. )
+config HAVE_64BIT_ALIGNED_ACCESS
+ def_bool 64BIT && !HAVE_EFFICIENT_UNALIGNED_ACCESS
+ help
+ Some architectures require 64 bit accesses to be 64 bit
+ aligned, which also requires structs containing 64 bit values
+ to be 64 bit aligned too. This includes some 32 bit
+ architectures which can do 64 bit accesses, as well as 64 bit
+ architectures without unaligned access.
+
+ This symbol should be selected by an architecture if 64 bit
+ accesses are required to be 64 bit aligned in this way even
+ though it is not a 64 bit architecture.
+
+ See Documentation/unaligned-memory-access.txt for more
+ information on the topic of unaligned memory accesses.
+
config HAVE_EFFICIENT_UNALIGNED_ACCESS
bool
help
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index e03bc4e5d482..926845eb5ab5 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -130,7 +130,16 @@ int ring_buffer_print_entry_header(struct trace_seq *s)
#define RB_ALIGNMENT 4U
#define RB_MAX_SMALL_DATA (RB_ALIGNMENT * RINGBUF_TYPE_DATA_TYPE_LEN_MAX)
#define RB_EVNT_MIN_SIZE 8U /* two 32bit words */
-#define RB_ALIGN_DATA __aligned(RB_ALIGNMENT)
+
+#ifndef CONFIG_HAVE_64BIT_ALIGNED_ACCESS
+# define RB_FORCE_8BYTE_ALIGNMENT 0
+# define RB_ARCH_ALIGNMENT RB_ALIGNMENT
+#else
+# define RB_FORCE_8BYTE_ALIGNMENT 1
+# define RB_ARCH_ALIGNMENT 8U
+#endif
+
+#define RB_ALIGN_DATA __aligned(RB_ARCH_ALIGNMENT)
/* define RINGBUF_TYPE_DATA for 'case RINGBUF_TYPE_DATA:' */
#define RINGBUF_TYPE_DATA 0 ... RINGBUF_TYPE_DATA_TYPE_LEN_MAX
@@ -2718,7 +2727,7 @@ rb_update_event(struct ring_buffer_per_cpu *cpu_buffer,
event->time_delta = delta;
length -= RB_EVNT_HDR_SIZE;
- if (length > RB_MAX_SMALL_DATA) {
+ if (length > RB_MAX_SMALL_DATA || RB_FORCE_8BYTE_ALIGNMENT) {
event->type_len = 0;
event->array[0] = length;
} else
@@ -2733,11 +2742,11 @@ static unsigned rb_calculate_event_length(unsigned length)
if (!length)
length++;
- if (length > RB_MAX_SMALL_DATA)
+ if (length > RB_MAX_SMALL_DATA || RB_FORCE_8BYTE_ALIGNMENT)
length += sizeof(event.array[0]);
length += RB_EVNT_HDR_SIZE;
- length = ALIGN(length, RB_ALIGNMENT);
+ length = ALIGN(length, RB_ARCH_ALIGNMENT);
/*
* In case the time delta is larger than the 27 bits for it
--
2.29.2
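The effect of the ring buffer patch above on event sizes can be sketched outside the kernel. The following is an approximation of the patched rb_calculate_event_length() under stated assumptions: the value 28 for RINGBUF_TYPE_DATA_TYPE_LEN_MAX, the 4-byte header size and the 4-byte array[0] size are not shown in the patch and are assumed here, the compile-time RB_FORCE_8BYTE_ALIGNMENT switch is turned into a runtime parameter for demonstration, and the time-delta adjustment that follows in the real function is left out.

```c
#include <assert.h>

/* Round x up to a multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a)		(((x) + (a) - 1) & ~((a) - 1))

#define RB_ALIGNMENT		4U
#define RB_TYPE_LEN_MAX		28U	/* assumed RINGBUF_TYPE_DATA_TYPE_LEN_MAX */
#define RB_MAX_SMALL_DATA	(RB_ALIGNMENT * RB_TYPE_LEN_MAX)
#define RB_EVNT_HDR_SIZE	4U	/* assumed: one 32-bit header word */
#define RB_ARRAY0_SIZE		4U	/* assumed sizeof(event.array[0]) */

/*
 * force_8byte plays the role of RB_FORCE_8BYTE_ALIGNMENT. When set, the
 * length always goes into array[0], so header plus length word occupy
 * 8 bytes and the payload starts on an 8-byte boundary.
 */
unsigned int event_length(unsigned int length, int force_8byte)
{
	unsigned int align = force_8byte ? 8U : RB_ALIGNMENT;

	/* Zero-length events still occupy one byte of payload. */
	if (!length)
		length++;

	/* Large (or forced) events carry their length in array[0]. */
	if (length > RB_MAX_SMALL_DATA || force_8byte)
		length += RB_ARRAY0_SIZE;

	length += RB_EVNT_HDR_SIZE;
	length = ALIGN_UP(length, align);

	assert(length % align == 0);
	return length;
}
```

Under these assumptions a 4-byte payload takes 8 bytes in the default configuration and 16 bytes when 8-byte alignment is forced, which is the space cost of avoiding the unaligned accesses reported on sparc64.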
The vfio_ap device driver registers a group notifier with VFIO when the
file descriptor for a VFIO mediated device for a KVM guest is opened to
receive notification that the KVM pointer is set (VFIO_GROUP_NOTIFY_SET_KVM
event). When the KVM pointer is set, the vfio_ap driver takes the
following actions:
1. Stashes the KVM pointer in the vfio_ap_mdev struct that holds the state
of the mediated device.
2. Calls the kvm_get_kvm() function to increment its reference counter.
3. Sets the function pointer to the function that handles interception of
the instruction that enables/disables interrupt processing.
4. Sets the masks in the KVM guest's CRYCB to pass AP resources through to
the guest.
In order to avoid memory leaks, when the notifier is called to receive
notification that the KVM pointer has been set to NULL, the vfio_ap device
driver should reverse the actions taken when the KVM pointer was set.
Fixes: 258287c994de ("s390: vfio-ap: implement mediated device open callback")
Signed-off-by: Tony Krowiak <akrowiak(a)linux.ibm.com>
---
drivers/s390/crypto/vfio_ap_ops.c | 29 ++++++++++++++++++++---------
1 file changed, 20 insertions(+), 9 deletions(-)
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index e0bde8518745..cd22e85588e1 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -1037,8 +1037,6 @@ static int vfio_ap_mdev_set_kvm(struct ap_matrix_mdev *matrix_mdev,
{
struct ap_matrix_mdev *m;
- mutex_lock(&matrix_dev->lock);
-
list_for_each_entry(m, &matrix_dev->mdev_list, node) {
if ((m != matrix_mdev) && (m->kvm == kvm)) {
mutex_unlock(&matrix_dev->lock);
@@ -1049,7 +1047,6 @@ static int vfio_ap_mdev_set_kvm(struct ap_matrix_mdev *matrix_mdev,
matrix_mdev->kvm = kvm;
kvm_get_kvm(kvm);
kvm->arch.crypto.pqap_hook = &matrix_mdev->pqap_hook;
- mutex_unlock(&matrix_dev->lock);
return 0;
}
@@ -1083,35 +1080,49 @@ static int vfio_ap_mdev_iommu_notifier(struct notifier_block *nb,
return NOTIFY_DONE;
}
+static void vfio_ap_mdev_unset_kvm(struct ap_matrix_mdev *matrix_mdev)
+{
+ kvm_arch_crypto_clear_masks(matrix_mdev->kvm);
+ matrix_mdev->kvm->arch.crypto.pqap_hook = NULL;
+ vfio_ap_mdev_reset_queues(matrix_mdev->mdev);
+ kvm_put_kvm(matrix_mdev->kvm);
+ matrix_mdev->kvm = NULL;
+}
+
static int vfio_ap_mdev_group_notifier(struct notifier_block *nb,
unsigned long action, void *data)
{
- int ret;
+ int ret, notify_rc = NOTIFY_DONE;
struct ap_matrix_mdev *matrix_mdev;
if (action != VFIO_GROUP_NOTIFY_SET_KVM)
return NOTIFY_OK;
matrix_mdev = container_of(nb, struct ap_matrix_mdev, group_notifier);
+ mutex_lock(&matrix_dev->lock);
if (!data) {
- matrix_mdev->kvm = NULL;
- return NOTIFY_OK;
+ if (matrix_mdev->kvm)
+ vfio_ap_mdev_unset_kvm(matrix_mdev);
+ notify_rc = NOTIFY_OK;
+ goto notify_done;
}
ret = vfio_ap_mdev_set_kvm(matrix_mdev, data);
if (ret)
- return NOTIFY_DONE;
+ goto notify_done;
/* If there is no CRYCB pointer, then we can't copy the masks */
if (!matrix_mdev->kvm->arch.crypto.crycbd)
- return NOTIFY_DONE;
+ goto notify_done;
kvm_arch_crypto_set_masks(matrix_mdev->kvm, matrix_mdev->matrix.apm,
matrix_mdev->matrix.aqm,
matrix_mdev->matrix.adm);
- return NOTIFY_OK;
+notify_done:
+ mutex_unlock(&matrix_dev->lock);
+ return notify_rc;
}
static void vfio_ap_irq_disable_apqn(int apqn)
--
2.21.1
If the Makefile cannot find any of the vmlinux files in its VMLINUX_BTF_PATHS
list, it tries to run bpftool incorrectly, with VMLINUX_BTF unset:
bpftool btf dump file $(VMLINUX_BTF) format c
As a result, the keyword 'format' is misinterpreted as the path to vmlinux.
The resulting build error message is fairly cryptic:
GEN vmlinux.h
Error: failed to load BTF from format: No such file or directory
This patch makes the failure reason clearer by yielding this instead:
Makefile:...: *** cannot find a vmlinux for VMLINUX_BTF at any of
"{paths}". Stop.
Fixes: acbd06206bbb ("selftests/bpf: Add vmlinux.h selftest exercising tracing of syscalls")
Cc: stable(a)vger.kernel.org # 5.7+
Signed-off-by: Kamal Mostafa <kamal(a)canonical.com>
---
tools/testing/selftests/bpf/Makefile | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 542768f5195b..93ed34ef6e3f 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -196,6 +196,9 @@ $(BUILD_DIR)/libbpf $(BUILD_DIR)/bpftool $(BUILD_DIR)/resolve_btfids $(INCLUDE_D
$(INCLUDE_DIR)/vmlinux.h: $(VMLINUX_BTF) | $(BPFTOOL) $(INCLUDE_DIR)
ifeq ($(VMLINUX_H),)
$(call msg,GEN,,$@)
+ifeq ($(VMLINUX_BTF),)
+$(error cannot find a vmlinux for VMLINUX_BTF at any of "$(VMLINUX_BTF_PATHS)")
+endif
$(Q)$(BPFTOOL) btf dump file $(VMLINUX_BTF) format c > $@
else
$(call msg,CP,,$@)
--
2.17.1
A widget's "dirty" list_head, much like its "list" list_head, eventually
chains back to a list_head on the snd_soc_card itself. This means that
the list can stick around even after the widget (or all widgets) have
been freed. Currently, however, widgets that are in the dirty list when
freed remain there, corrupting the entire list and leading to memory
errors and undefined behavior when the list is next accessed or
modified.
I encountered this issue when a component failed to probe relatively
late in snd_soc_bind_card(), causing it to bail out and call
soc_cleanup_card_resources(), which eventually called
snd_soc_dapm_free() with widgets that were still dirty from when they'd
been added.
Fixes: db432b414e20 ("ASoC: Do DAPM power checks only for widgets changed since last run")
Cc: stable(a)vger.kernel.org
Signed-off-by: Thomas Hebb <tommyhebb(a)gmail.com>
---
sound/soc/soc-dapm.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/sound/soc/soc-dapm.c b/sound/soc/soc-dapm.c
index 7f87b449f950..148c095df27b 100644
--- a/sound/soc/soc-dapm.c
+++ b/sound/soc/soc-dapm.c
@@ -2486,6 +2486,7 @@ void snd_soc_dapm_free_widget(struct snd_soc_dapm_widget *w)
enum snd_soc_dapm_direction dir;
list_del(&w->list);
+ list_del(&w->dirty);
/*
* remove source and sink paths associated to this widget.
* While removing the path, remove reference to it from both
--
2.29.2
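The problem described in the ASoC patch above, a node that sits on two lists but is unlinked from only one, can be reproduced with a minimal doubly linked list. This is a self-contained sketch, not the kernel's list.h: struct card, struct widget and the helper names are illustrative, and list_del() here clears the node's pointers instead of poisoning them.

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;
}

int list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* The card owns the heads of both lists; every widget has a node on each. */
struct card {
	struct list_head widgets;
	struct list_head dapm_dirty;
};

struct widget {
	struct list_head list;
	struct list_head dirty;
};

void card_init(struct card *c)
{
	list_init(&c->widgets);
	list_init(&c->dapm_dirty);
}

/* A newly added widget is linked into the card and marked dirty. */
void widget_add(struct card *c, struct widget *w)
{
	list_add_tail(&w->list, &c->widgets);
	list_add_tail(&w->dirty, &c->dapm_dirty);
}

/* Pre-fix behaviour: the card's dirty list keeps pointing at the widget. */
void widget_unlink_buggy(struct widget *w)
{
	list_del(&w->list);
}

/* Post-fix behaviour: unlink from both lists before the widget is freed. */
void widget_unlink(struct widget *w)
{
	list_del(&w->list);
	list_del(&w->dirty);
}
```

After widget_unlink_buggy() the card's dapm_dirty head still references the widget's dirty node, so freeing the widget would leave a dangling pointer on the card; after widget_unlink() both lists are empty.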
I'm announcing the release of the 5.4.84 kernel.
All users of the 5.4 kernel series must upgrade.
The updated 5.4.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.4.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 5 +
arch/arc/kernel/stacktrace.c | 23 ++++---
arch/arm64/boot/dts/broadcom/stingray/stingray-usb.dtsi | 20 +++---
arch/arm64/boot/dts/nvidia/tegra186-p2771-0000.dts | 12 ---
arch/arm64/boot/dts/rockchip/rk3399.dtsi | 3
arch/powerpc/Makefile | 1
arch/x86/include/asm/pgtable_types.h | 1
arch/x86/include/asm/sync_core.h | 9 +-
arch/x86/kernel/apic/vector.c | 24 ++++---
arch/x86/lib/memcpy_64.S | 4 -
arch/x86/lib/memmove_64.S | 4 -
arch/x86/lib/memset_64.S | 4 -
arch/x86/mm/mem_encrypt_identity.c | 4 -
arch/x86/mm/tlb.c | 10 ++-
drivers/gpu/drm/i915/display/intel_dp.c | 2
drivers/input/misc/cm109.c | 7 +-
drivers/input/serio/i8042-x86ia64io.h | 42 +++++++++++++
drivers/interconnect/qcom/qcs404.c | 4 -
drivers/irqchip/irq-gic-v3-its.c | 16 -----
drivers/mmc/core/block.c | 2
drivers/net/can/m_can/m_can.c | 2
drivers/net/ethernet/ibm/ibmvnic.c | 6 +
drivers/net/wireless/intel/iwlwifi/iwl-csr.h | 10 +++
drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c | 2
drivers/net/wireless/intel/iwlwifi/pcie/ctxt-info-gen3.c | 20 ++++++
drivers/net/wireless/intel/iwlwifi/pcie/trans.c | 36 ++++++++---
drivers/pinctrl/pinctrl-amd.c | 7 --
drivers/platform/x86/acer-wmi.c | 1
drivers/platform/x86/intel-vbtn.c | 6 +
drivers/platform/x86/thinkpad_acpi.c | 10 ++-
drivers/platform/x86/touchscreen_dmi.c | 23 +++++++
drivers/scsi/be2iscsi/be_main.c | 4 -
drivers/scsi/ufs/ufshcd.c | 7 ++
drivers/soc/fsl/dpio/dpio-driver.c | 5 -
drivers/spi/spi-nxp-fspi.c | 7 ++
fs/proc/task_mmu.c | 8 +-
include/linux/build_bug.h | 5 +
include/linux/compiler-clang.h | 6 -
include/linux/compiler-gcc.h | 19 ------
include/linux/compiler.h | 18 +++++
include/linux/zsmalloc.h | 1
mm/Kconfig | 13 ----
mm/zsmalloc.c | 46 ---------------
tools/testing/ktest/ktest.pl | 2
44 files changed, 269 insertions(+), 192 deletions(-)
Andy Lutomirski (1):
x86/membarrier: Get rid of a dubious optimization
Arnd Bergmann (1):
kbuild: avoid static_assert for genksyms
Arvind Sankar (2):
x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP
compiler.h: fix barrier_data() on clang
Bean Huo (1):
mmc: block: Fixup condition for CMD13 polling for RPMB requests
Can Guo (1):
scsi: ufs: Make sure clk scaling happens only when HBA is runtime ACTIVE
Chris Chiu (1):
Input: i8042 - add Acer laptops to the i8042 reset list
Coiby Xu (1):
pinctrl: amd: remove debounce filter setting in IRQ type setting
Dan Carpenter (1):
scsi: be2iscsi: Revert "Fix a theoretical leak in beiscsi_create_eqs()"
Dmitry Torokhov (1):
Input: cm109 - do not stomp on control URB
Fangrui Song (1):
x86/lib: Change .weak to SYM_FUNC_START_WEAK for arch/x86/lib/mem*_64.S
Georgi Djakov (1):
interconnect: qcom: qcs404: Remove GPU and display RPM IDs
Greg Kroah-Hartman (1):
Linux 5.4.84
Hans de Goede (3):
platform/x86: thinkpad_acpi: Do not report SW_TABLET_MODE on Yoga 11e
platform/x86: thinkpad_acpi: Add BAT1 is primary battery quirk for Thinkpad Yoga 11e 4th gen
platform/x86: touchscreen_dmi: Add info for the Irbis TW118 tablet
Hao Si (1):
soc: fsl: dpio: Get the cpumask through cpumask_of(cpu)
Johannes Berg (2):
iwlwifi: pcie: limit memory read spin time
iwlwifi: pcie: set LTR to avoid completion timeout
Jon Hunter (1):
arm64: tegra: Disable the ACONNECT for Jetson TX2
Libo Chen (1):
ktest.pl: Fix incorrect reboot for grub2bls
Lijun Pan (1):
ibmvnic: skip tx timeout reset while in resetting
Manasi Navare (1):
drm/i915/display/dp: Compute the correct slice count for VDSC on DP
Markus Reichl (1):
arm64: dts: rockchip: Assign a fixed index to mmc devices on rk3399 boards.
Max Verevkin (1):
platform/x86: intel-vbtn: Support for tablet mode on HP Pavilion 13 x360 PC
Michael Ellerman (1):
powerpc: Drop -me200 addition to build flags
Miles Chen (1):
proc: use untagged_addr() for pagemap_read addresses
Minchan Kim (1):
mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING
Nick Desaulniers (1):
Kbuild: do not emit debug info for assembly with LLVM_IAS=1
Pankaj Sharma (1):
can: m_can: m_can_dev_setup(): add support for bosch mcan version 3.3.0
Ran Wang (1):
spi: spi-nxp-fspi: fix fspi panic by unexpected interrupts
Sara Sharon (1):
iwlwifi: mvm: fix kernel panic in case of assert during CSA
Thomas Gleixner (1):
x86/apic/vector: Fix ordering in vector assignment
Timo Witte (1):
platform/x86: acer-wmi: add automatic keyboard background light toggle key as KEY_LIGHTS_TOGGLE
Vineet Gupta (1):
ARC: stack unwinding: don't assume non-current task is sleeping
Xu Qiang (1):
irqchip/gic-v3-its: Unconditionally save/restore the ITS state on suspend
Zhen Lei (1):
arm64: dts: broadcom: clear the warnings caused by empty dma-ranges
This is the start of the stable review cycle for the 5.10.1 release.
There are 2 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Monday, 14 Dec 2020 18:04:42 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.10.1-rc1…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.10.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.10.1-rc1
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "dm raid: fix discard limits for raid1 and raid10"
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "md: change mddev 'chunk_sectors' from int to unsigned"
-------------
Diffstat:
Makefile | 4 ++--
drivers/md/dm-raid.c | 12 +++++-------
drivers/md/md.h | 4 ++--
3 files changed, 9 insertions(+), 11 deletions(-)
Xattr code using inodes with large xattr data can end up dropping last
inode reference (and thus deleting the inode) from places like
ext4_xattr_set_entry(). That function is called with transaction started
and so ext4_evict_inode() can deadlock against fs freezing like:
CPU1 CPU2
removexattr() freeze_super()
vfs_removexattr()
ext4_xattr_set()
handle = ext4_journal_start()
...
ext4_xattr_set_entry()
iput(old_ea_inode)
ext4_evict_inode(old_ea_inode)
sb->s_writers.frozen = SB_FREEZE_FS;
sb_wait_write(sb, SB_FREEZE_FS);
ext4_freeze()
jbd2_journal_lock_updates()
-> blocks waiting for all
handles to stop
sb_start_intwrite()
-> blocks as sb is already in SB_FREEZE_FS state
Generally it is advisable to delete inodes from a separate transaction
as it can consume quite some credits; however, in this case it would be
quite clumsy, and furthermore the credits for inode deletion are quite
limited and already accounted for. So just tweak ext4_evict_inode() to
avoid freeze protection if we have a transaction already started, in which
case it is not really needed anyway.
CC: stable(a)vger.kernel.org
Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
CC: Tahsin Erdogan <tahsin(a)google.com>
Signed-off-by: Jan Kara <jack(a)suse.cz>
---
fs/ext4/inode.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 72534319fae5..777eb08b29cd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -175,6 +175,7 @@ void ext4_evict_inode(struct inode *inode)
*/
int extra_credits = 6;
struct ext4_xattr_inode_array *ea_inode_array = NULL;
+ bool freeze_protected = false;
trace_ext4_evict_inode(inode);
@@ -232,9 +233,14 @@ void ext4_evict_inode(struct inode *inode)
/*
* Protect us against freezing - iput() caller didn't have to have any
- * protection against it
+ * protection against it. When we are in a running transaction though,
+ * we are already protected against freezing and we cannot grab further
+ * protection due to lock ordering constraints.
*/
- sb_start_intwrite(inode->i_sb);
+ if (!ext4_journal_current_handle()) {
+ sb_start_intwrite(inode->i_sb);
+ freeze_protected = true;
+ }
if (!IS_NOQUOTA(inode))
extra_credits += EXT4_MAXQUOTAS_DEL_BLOCKS(inode->i_sb);
@@ -253,7 +259,8 @@ void ext4_evict_inode(struct inode *inode)
* cleaned up.
*/
ext4_orphan_del(NULL, inode);
- sb_end_intwrite(inode->i_sb);
+ if (freeze_protected)
+ sb_end_intwrite(inode->i_sb);
goto no_delete;
}
@@ -294,7 +301,8 @@ void ext4_evict_inode(struct inode *inode)
stop_handle:
ext4_journal_stop(handle);
ext4_orphan_del(NULL, inode);
- sb_end_intwrite(inode->i_sb);
+ if (freeze_protected)
+ sb_end_intwrite(inode->i_sb);
ext4_xattr_inode_array_free(ea_inode_array);
goto no_delete;
}
@@ -323,7 +331,8 @@ void ext4_evict_inode(struct inode *inode)
else
ext4_free_inode(handle, inode);
ext4_journal_stop(handle);
- sb_end_intwrite(inode->i_sb);
+ if (freeze_protected)
+ sb_end_intwrite(inode->i_sb);
ext4_xattr_inode_array_free(ea_inode_array);
return;
no_delete:
--
2.16.4
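The freeze_protected flag introduced by the ext4 patch above follows a common pattern: take a protection only when the caller does not already hold an equivalent one, remember whether it was taken, and release it on every exit path only if it was. A toy model in C, in which struct fake_sb, its counter and evict() are hypothetical stand-ins for the superblock, sb_start_intwrite()/sb_end_intwrite() and ext4_evict_inode():

```c
#include <stdbool.h>

/* Counts how many callers currently hold the modelled freeze protection. */
struct fake_sb {
	int intwrite_holders;
};

/*
 * handle_running models a non-NULL ext4_journal_current_handle(): the
 * caller is inside a transaction and therefore already protected.
 * Returns the number of holders seen while the eviction work runs.
 */
int evict(struct fake_sb *sb, bool handle_running)
{
	bool freeze_protected = false;
	int held_during_work;

	if (!handle_running) {
		sb->intwrite_holders++;		/* sb_start_intwrite() stand-in */
		freeze_protected = true;
	}

	/* The actual eviction work would happen here. */
	held_during_work = sb->intwrite_holders;

	if (freeze_protected)
		sb->intwrite_holders--;		/* sb_end_intwrite() stand-in */

	return held_during_work;
}
```

Because the release is keyed on the local flag rather than on re-checking the handle, acquire and release stay balanced on each path through the function.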
The patch titled
Subject: z3fold: simplify freeing slots
has been removed from the -mm tree. Its filename was
z3fold-simplify-freeing-slots.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Vitaly Wool <vitaly.wool(a)konsulko.com>
Subject: z3fold: simplify freeing slots
Patch series "z3fold: stability / rt fixes".
Address z3fold stability issues under stress load, primarily in the
reclaim and free aspects. Besides, it fixes the locking problems that
were only seen in real-time kernel configuration.
This patch (of 3):
There used to be two places in the code where slots could be freed, namely
when freeing the last allocated handle from the slots and when releasing
the z3fold header these slots are linked to. The logic to decide on
whether to free certain slots was complicated and error prone in both
functions and it led to failures in RT case.
To fix that, make free_handle() the single point of freeing slots.
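
As a rough illustration of the "single point of freeing" idea, here is a
userspace sketch. It is not the z3fold code: struct slots, struct header,
NSLOTS and free_handle_model() are hypothetical stand-ins, locking is left
out, and foreign handles are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NSLOTS 4 /* hypothetical; stands in for BUDDY_MASK + 1 */

struct slots {
	unsigned long slot[NSLOTS];
};

struct header {
	struct slots *slots; /* set to NULL once the container is freed */
};

/*
 * Clear one slot. If no slot remains in use, free the container here
 * and nowhere else. Returns true when the container was freed.
 */
static bool free_handle_model(struct header *hdr, unsigned int idx)
{
	struct slots *s = hdr->slots;
	unsigned int i;

	assert(s && idx < NSLOTS);
	s->slot[idx] = 0;

	for (i = 0; i < NSLOTS; i++)
		if (s->slot[i])
			return false; /* another handle is still allocated */

	hdr->slots = NULL;
	free(s);
	return true;
}

/* Usage: a container with two handles in use, freed one after the other. */
static int frees_after_two_handles(void)
{
	struct header hdr;
	int freed = 0;

	hdr.slots = calloc(1, sizeof(*hdr.slots));
	if (!hdr.slots)
		return -1;
	hdr.slots->slot[0] = 0x1000; /* arbitrary non-zero handle values */
	hdr.slots->slot[2] = 0x2000;

	freed += free_handle_model(&hdr, 0); /* slot 2 still in use */
	freed += free_handle_model(&hdr, 2); /* last handle: container freed */
	return freed;
}
```

Because the header's pointer is cleared when the container goes away, a
later allocation has to notice the NULL and allocate a fresh container,
which is what the hunk near the "lookup:" label in the patch does.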
Link: https://lkml.kernel.org/r/20201209145151.18994-1-vitaly.wool@konsulko.com
Link: https://lkml.kernel.org/r/20201209145151.18994-2-vitaly.wool@konsulko.com
Signed-off-by: Vitaly Wool <vitaly.wool(a)konsulko.com>
Tested-by: Mike Galbraith <efault(a)gmx.de>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/z3fold.c | 55 +++++++++++---------------------------------------
1 file changed, 13 insertions(+), 42 deletions(-)
--- a/mm/z3fold.c~z3fold-simplify-freeing-slots
+++ a/mm/z3fold.c
@@ -90,7 +90,7 @@ struct z3fold_buddy_slots {
* be enough slots to hold all possible variants
*/
unsigned long slot[BUDDY_MASK + 1];
- unsigned long pool; /* back link + flags */
+ unsigned long pool; /* back link */
rwlock_t lock;
};
#define HANDLE_FLAG_MASK (0x03)
@@ -182,13 +182,6 @@ enum z3fold_page_flags {
};
/*
- * handle flags, go under HANDLE_FLAG_MASK
- */
-enum z3fold_handle_flags {
- HANDLES_ORPHANED = 0,
-};
-
-/*
* Forward declarations
*/
static struct z3fold_header *__z3fold_alloc(struct z3fold_pool *, size_t, bool);
@@ -303,10 +296,9 @@ static inline void put_z3fold_header(str
z3fold_page_unlock(zhdr);
}
-static inline void free_handle(unsigned long handle)
+static inline void free_handle(unsigned long handle, struct z3fold_header *zhdr)
{
struct z3fold_buddy_slots *slots;
- struct z3fold_header *zhdr;
int i;
bool is_free;
@@ -316,22 +308,13 @@ static inline void free_handle(unsigned
if (WARN_ON(*(unsigned long *)handle == 0))
return;
- zhdr = handle_to_z3fold_header(handle);
slots = handle_to_slots(handle);
write_lock(&slots->lock);
*(unsigned long *)handle = 0;
- if (zhdr->slots == slots) {
- write_unlock(&slots->lock);
- return; /* simple case, nothing else to do */
- }
+ if (zhdr->slots != slots)
+ zhdr->foreign_handles--;
- /* we are freeing a foreign handle if we are here */
- zhdr->foreign_handles--;
is_free = true;
- if (!test_bit(HANDLES_ORPHANED, &slots->pool)) {
- write_unlock(&slots->lock);
- return;
- }
for (i = 0; i <= BUDDY_MASK; i++) {
if (slots->slot[i]) {
is_free = false;
@@ -343,6 +326,8 @@ static inline void free_handle(unsigned
if (is_free) {
struct z3fold_pool *pool = slots_to_pool(slots);
+ if (zhdr->slots == slots)
+ zhdr->slots = NULL;
kmem_cache_free(pool->c_handle, slots);
}
}
@@ -525,8 +510,6 @@ static void __release_z3fold_page(struct
{
struct page *page = virt_to_page(zhdr);
struct z3fold_pool *pool = zhdr_to_pool(zhdr);
- bool is_free = true;
- int i;
WARN_ON(!list_empty(&zhdr->buddy));
set_bit(PAGE_STALE, &page->private);
@@ -536,21 +519,6 @@ static void __release_z3fold_page(struct
list_del_init(&page->lru);
spin_unlock(&pool->lock);
- /* If there are no foreign handles, free the handles array */
- read_lock(&zhdr->slots->lock);
- for (i = 0; i <= BUDDY_MASK; i++) {
- if (zhdr->slots->slot[i]) {
- is_free = false;
- break;
- }
- }
- if (!is_free)
- set_bit(HANDLES_ORPHANED, &zhdr->slots->pool);
- read_unlock(&zhdr->slots->lock);
-
- if (is_free)
- kmem_cache_free(pool->c_handle, zhdr->slots);
-
if (locked)
z3fold_page_unlock(zhdr);
@@ -973,6 +941,9 @@ lookup:
}
}
+ if (zhdr && !zhdr->slots)
+ zhdr->slots = alloc_slots(pool,
+ can_sleep ? GFP_NOIO : GFP_ATOMIC);
return zhdr;
}
@@ -1270,7 +1241,7 @@ static void z3fold_free(struct z3fold_po
}
if (!page_claimed)
- free_handle(handle);
+ free_handle(handle, zhdr);
if (kref_put(&zhdr->refcount, release_z3fold_page_locked_list)) {
atomic64_dec(&pool->pages_nr);
return;
@@ -1429,19 +1400,19 @@ static int z3fold_reclaim_page(struct z3
ret = pool->ops->evict(pool, middle_handle);
if (ret)
goto next;
- free_handle(middle_handle);
+ free_handle(middle_handle, zhdr);
}
if (first_handle) {
ret = pool->ops->evict(pool, first_handle);
if (ret)
goto next;
- free_handle(first_handle);
+ free_handle(first_handle, zhdr);
}
if (last_handle) {
ret = pool->ops->evict(pool, last_handle);
if (ret)
goto next;
- free_handle(last_handle);
+ free_handle(last_handle, zhdr);
}
next:
if (test_bit(PAGE_HEADLESS, &page->private)) {
_
syzbot reported the deadlock here [1]. The issue is in hugetlb cow
error handling when there are not enough huge pages for the faulting
task which took the original reservation. It is possible that other
(child) tasks could have consumed pages associated with the reservation.
In this case, we want the task which took the original reservation to
succeed. So, we unmap any associated pages in children so that they
can be used by the faulting task that owns the reservation.
The unmapping code needs to hold i_mmap_rwsem in write mode. However,
due to commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd
sharing synchronization") we are already holding i_mmap_rwsem in read
mode when hugetlb_cow is called. Technically, i_mmap_rwsem does not
need to be held in read mode for COW mappings as they can not share
pmd's. Modifying the fault code to not take i_mmap_rwsem in read mode
for COW (and other non-sharable) mappings is too involved for a stable
fix. Instead, we simply drop the hugetlb_fault_mutex and i_mmap_rwsem
before unmapping. This is OK as it is technically not needed. They
are reacquired after unmapping as expected by calling code. Since this
is done in an uncommon error path, the overhead of dropping and
reacquiring mutexes is acceptable.
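
The locking problem and the fix can be sketched with a toy,
single-threaded model of a reader/writer semaphore. Everything below is
hypothetical illustration code, not kernel API: try_lock_write() returns
false where a real down_write() would block forever, and the
hugetlb_fault_mutex is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a reader/writer semaphore such as i_mmap_rwsem. */
struct rwsem_model {
	int readers;
	bool writer;
};

static void lock_read(struct rwsem_model *s)
{
	assert(!s->writer);
	s->readers++;
}

static void unlock_read(struct rwsem_model *s)
{
	assert(s->readers > 0);
	s->readers--;
}

/* A writer can only proceed when nobody holds the lock. */
static bool try_lock_write(struct rwsem_model *s)
{
	if (s->readers || s->writer)
		return false;
	s->writer = true;
	return true;
}

static void unlock_write(struct rwsem_model *s)
{
	assert(s->writer);
	s->writer = false;
}

/* Stand-in for the unmapping step, which needs the lock in write mode. */
static bool unmap_model(struct rwsem_model *s)
{
	if (!try_lock_write(s))
		return false; /* the real code would deadlock here */
	unlock_write(s);
	return true;
}

/* Before the fix: unmap while the fault path still holds the read lock. */
static bool cow_path_before(void)
{
	struct rwsem_model s = { 0, false };
	bool ok;

	lock_read(&s);   /* taken earlier by the fault path */
	ok = unmap_model(&s);
	unlock_read(&s);
	return ok;
}

/* After the fix: drop the read lock, unmap, then reacquire it. */
static bool cow_path_after(void)
{
	struct rwsem_model s = { 0, false };
	bool ok;

	lock_read(&s);   /* taken earlier by the fault path */
	unlock_read(&s); /* dropped just before unmapping */
	ok = unmap_model(&s);
	lock_read(&s);   /* reacquired, as the calling code expects */
	unlock_read(&s); /* released later by the fault path */
	return ok;
}
```

In this model cow_path_before() reports failure because the write lock is
requested while the same task still counts as a reader, and
cow_path_after() succeeds.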
While making changes, remove redundant BUG_ON after unmap_ref_private.
[1] https://lkml.kernel.org/r/000000000000b73ccc05b5cf8558@google.com
Reported-by: syzbot+5eee4145df3c15e96625(a)syzkaller.appspotmail.com
Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
---
mm/hugetlb.c | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d029d938d26d..8713f8ef0f4c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4106,10 +4106,30 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
* may get SIGKILLed if it later faults.
*/
if (outside_reserve) {
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ pgoff_t idx;
+ u32 hash;
+
put_page(old_page);
BUG_ON(huge_pte_none(pte));
+ /*
+ * Drop hugetlb_fault_mutex and i_mmap_rwsem before
+ * unmapping. unmapping needs to hold i_mmap_rwsem
+ * in write mode. Dropping i_mmap_rwsem in read mode
+ * here is OK as COW mappings do not interact with
+ * PMD sharing.
+ *
+ * Reacquire both after unmap operation.
+ */
+ idx = vma_hugecache_offset(h, vma, haddr);
+ hash = hugetlb_fault_mutex_hash(mapping, idx);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(vma->vm_file->f_mapping);
+
unmap_ref_private(mm, vma, old_page, haddr);
- BUG_ON(huge_pte_none(pte));
+
+ i_mmap_lock_read(vma->vm_file->f_mapping);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
spin_lock(ptl);
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (likely(ptep &&
--
2.29.2
When inserting a VMA, we restrict the placement to the low 4G unless the
caller opts into using the full range. This was done to allow userspace
the opportunity to transition slowly from a 32b address space, and to
avoid breaking inherent 32b assumptions of some commands.
However, for insert we limited ourselves to 4G-4K, but on verification
we allowed the full 4G. This causes some attempts to bind a new buffer
to sporadically fail with -ENOSPC, but at other times be bound
successfully.
commit 48ea1e32c39d ("drm/i915/gen9: Set PIN_ZONE_4G end to 4GB - 1
page") suggests that there is a genuine problem with stateless addressing
that cannot utilize the last page in 4G and so we purposefully excluded
it.
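
The mismatch between the two limits can be checked with a standalone
sketch of the arithmetic. The helper names are hypothetical; start and
size stand in for vma->node.start and vma->node.size, and only the shift
expressions are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old verification: rejects only nodes whose last byte is at or above 4G. */
static bool misplaced_old(uint64_t start, uint64_t size)
{
	assert(size > 0);
	return ((start + size - 1) >> 32) != 0;
}

/* New verification: also rejects nodes that use the last 4K page below 4G. */
static bool misplaced_new(uint64_t start, uint64_t size)
{
	assert(size > 0);
	return ((start + size + 4095) >> 32) != 0;
}
```

For a 4K node placed at 4G - 4K, the old expression evaluates to 0 (the
node is accepted even though insertion would never have placed it there),
while the new expression is non-zero. A 4K node at 4G - 8K ends exactly at
4G - 4K and is accepted by both.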
Reported-by: CQ Tang <cq.tang(a)intel.com>
Fixes: 48ea1e32c39d ("drm/i915/gen9: Set PIN_ZONE_4G end to 4GB - 1 page")
Signed-off-by: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: CQ Tang <cq.tang(a)intel.com>
Cc: stable(a)vger.kernel.org
---
drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 193996144c84..2ff32daa50bd 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -382,7 +382,7 @@ eb_vma_misplaced(const struct drm_i915_gem_exec_object2 *entry,
return true;
if (!(flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS) &&
- (vma->node.start + vma->node.size - 1) >> 32)
+ (vma->node.start + vma->node.size + 4095) >> 32)
return true;
if (flags & __EXEC_OBJECT_NEEDS_MAP &&
--
2.20.1
This is the start of the stable review cycle for the 5.4.84 release.
There are 36 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed, 16 Dec 2020 17:25:32 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.84-rc1…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.4.84-rc1
Arvind Sankar <nivedita(a)alum.mit.edu>
compiler.h: fix barrier_data() on clang
Minchan Kim <minchan(a)kernel.org>
mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING
Thomas Gleixner <tglx(a)linutronix.de>
x86/apic/vector: Fix ordering in vector assignment
Andy Lutomirski <luto(a)kernel.org>
x86/membarrier: Get rid of a dubious optimization
Arvind Sankar <nivedita(a)alum.mit.edu>
x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP
Dan Carpenter <dan.carpenter(a)oracle.com>
scsi: be2iscsi: Revert "Fix a theoretical leak in beiscsi_create_eqs()"
Miles Chen <miles.chen(a)mediatek.com>
proc: use untagged_addr() for pagemap_read addresses
Arnd Bergmann <arnd(a)arndb.de>
kbuild: avoid static_assert for genksyms
Manasi Navare <manasi.d.navare(a)intel.com>
drm/i915/display/dp: Compute the correct slice count for VDSC on DP
Bean Huo <beanhuo(a)micron.com>
mmc: block: Fixup condition for CMD13 polling for RPMB requests
Coiby Xu <coiby.xu(a)gmail.com>
pinctrl: amd: remove debounce filter setting in IRQ type setting
Chris Chiu <chiu(a)endlessos.org>
Input: i8042 - add Acer laptops to the i8042 reset list
Dmitry Torokhov <dmitry.torokhov(a)gmail.com>
Input: cm109 - do not stomp on control URB
Libo Chen <libo.chen(a)oracle.com>
ktest.pl: Fix incorrect reboot for grub2bls
Pankaj Sharma <pankj.sharma(a)samsung.com>
can: m_can: m_can_dev_setup(): add support for bosch mcan version 3.3.0
Hans de Goede <hdegoede(a)redhat.com>
platform/x86: touchscreen_dmi: Add info for the Irbis TW118 tablet
Max Verevkin <me(a)maxverevkin.tk>
platform/x86: intel-vbtn: Support for tablet mode on HP Pavilion 13 x360 PC
Timo Witte <timo.witte(a)gmail.com>
platform/x86: acer-wmi: add automatic keyboard background light toggle key as KEY_LIGHTS_TOGGLE
Hans de Goede <hdegoede(a)redhat.com>
platform/x86: thinkpad_acpi: Add BAT1 is primary battery quirk for Thinkpad Yoga 11e 4th gen
Hans de Goede <hdegoede(a)redhat.com>
platform/x86: thinkpad_acpi: Do not report SW_TABLET_MODE on Yoga 11e
Jon Hunter <jonathanh(a)nvidia.com>
arm64: tegra: Disable the ACONNECT for Jetson TX2
Hao Si <si.hao(a)zte.com.cn>
soc: fsl: dpio: Get the cpumask through cpumask_of(cpu)
Ran Wang <ran.wang_1(a)nxp.com>
spi: spi-nxp-fspi: fix fspi panic by unexpected interrupts
Xu Qiang <xuqiang36(a)huawei.com>
irqchip/gic-v3-its: Unconditionally save/restore the ITS state on suspend
Lijun Pan <ljp(a)linux.ibm.com>
ibmvnic: skip tx timeout reset while in resetting
Georgi Djakov <georgi.djakov(a)linaro.org>
interconnect: qcom: qcs404: Remove GPU and display RPM IDs
Can Guo <cang(a)codeaurora.org>
scsi: ufs: Make sure clk scaling happens only when HBA is runtime ACTIVE
Vineet Gupta <vgupta(a)synopsys.com>
ARC: stack unwinding: don't assume non-current task is sleeping
Zhen Lei <thunder.leizhen(a)huawei.com>
arm64: dts: broadcom: clear the warnings caused by empty dma-ranges
Michael Ellerman <mpe(a)ellerman.id.au>
powerpc: Drop -me200 addition to build flags
Sara Sharon <sara.sharon(a)intel.com>
iwlwifi: mvm: fix kernel panic in case of assert during CSA
Johannes Berg <johannes.berg(a)intel.com>
iwlwifi: pcie: set LTR to avoid completion timeout
Markus Reichl <m.reichl(a)fivetechno.de>
arm64: dts: rockchip: Assign a fixed index to mmc devices on rk3399 boards.
Johannes Berg <johannes.berg(a)intel.com>
iwlwifi: pcie: limit memory read spin time
Fangrui Song <maskray(a)google.com>
x86/lib: Change .weak to SYM_FUNC_START_WEAK for arch/x86/lib/mem*_64.S
Nick Desaulniers <ndesaulniers(a)google.com>
Kbuild: do not emit debug info for assembly with LLVM_IAS=1
-------------
Diffstat:
Makefile | 7 +++-
arch/arc/kernel/stacktrace.c | 23 +++++++----
.../boot/dts/broadcom/stingray/stingray-usb.dtsi | 20 +++++-----
arch/arm64/boot/dts/nvidia/tegra186-p2771-0000.dts | 12 ------
arch/arm64/boot/dts/rockchip/rk3399.dtsi | 3 ++
arch/powerpc/Makefile | 1 -
arch/x86/include/asm/pgtable_types.h | 1 +
arch/x86/include/asm/sync_core.h | 9 +++--
arch/x86/kernel/apic/vector.c | 24 ++++++-----
arch/x86/lib/memcpy_64.S | 4 +-
arch/x86/lib/memmove_64.S | 4 +-
arch/x86/lib/memset_64.S | 4 +-
arch/x86/mm/mem_encrypt_identity.c | 4 +-
arch/x86/mm/tlb.c | 10 ++++-
drivers/gpu/drm/i915/display/intel_dp.c | 2 +-
drivers/input/misc/cm109.c | 7 +++-
drivers/input/serio/i8042-x86ia64io.h | 42 ++++++++++++++++++++
drivers/interconnect/qcom/qcs404.c | 4 +-
drivers/irqchip/irq-gic-v3-its.c | 16 ++------
drivers/mmc/core/block.c | 2 +-
drivers/net/can/m_can/m_can.c | 2 +
drivers/net/ethernet/ibm/ibmvnic.c | 6 +++
drivers/net/wireless/intel/iwlwifi/iwl-csr.h | 10 +++++
drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c | 2 +-
.../wireless/intel/iwlwifi/pcie/ctxt-info-gen3.c | 20 ++++++++++
drivers/net/wireless/intel/iwlwifi/pcie/trans.c | 36 ++++++++++++-----
drivers/pinctrl/pinctrl-amd.c | 7 ----
drivers/platform/x86/acer-wmi.c | 1 +
drivers/platform/x86/intel-vbtn.c | 6 +++
drivers/platform/x86/thinkpad_acpi.c | 10 ++++-
drivers/platform/x86/touchscreen_dmi.c | 23 +++++++++++
drivers/scsi/be2iscsi/be_main.c | 4 +-
drivers/scsi/ufs/ufshcd.c | 7 ++++
drivers/soc/fsl/dpio/dpio-driver.c | 5 +--
drivers/spi/spi-nxp-fspi.c | 7 ++++
fs/proc/task_mmu.c | 8 +++-
include/linux/build_bug.h | 5 +++
include/linux/compiler-clang.h | 6 ---
include/linux/compiler-gcc.h | 19 ---------
include/linux/compiler.h | 18 ++++++++-
include/linux/zsmalloc.h | 1 -
mm/Kconfig | 13 ------
mm/zsmalloc.c | 46 ----------------------
tools/testing/ktest/ktest.pl | 2 +-
44 files changed, 270 insertions(+), 193 deletions(-)
While running btrfs/011 in a loop I would often ASSERT() while trying to
add a new free space entry that already existed, or get an -EEXIST while
adding a new block to the extent tree, which is another indication of
double allocation.
This occurs because when we do the free space tree population, we create
the new root and then populate the tree and commit the transaction.
The problem is when you create a new root, the root node and commit root
node are the same. This means that caching a block group before the
transaction is committed can race with other operations modifying the
free space tree, and thus you can get double adds and other sorts of
shenanigans. This is only a problem for the first transaction, once
we've committed the transaction we created the free space tree in we're
OK to use the free space tree to cache block groups.
Fix this by marking the fs_info as unsafe to load the free space tree,
and fall back on the old slow method. We could be smarter than this,
for example caching the block group while we're populating the free
space tree, but since this is a serious problem I've opted for the
simplest solution.
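
The resulting decision in the caching thread can be sketched as below.
This is a simplified model and not btrfs code: struct fs_model and its two
booleans are hypothetical stand-ins for the FREE_SPACE_TREE compat_ro bit
and the new BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED flag.

```c
#include <assert.h>
#include <stdbool.h>

enum loader { LOAD_EXTENT_TREE_FREE, LOAD_FREE_SPACE_TREE };

struct fs_model {
	bool compat_free_space_tree; /* tree exists on this filesystem */
	bool tree_untrusted;         /* creating transaction not yet committed */
};

/* Use the free space tree only when it exists and can be trusted. */
static enum loader pick_loader(const struct fs_model *fs)
{
	assert(fs);
	if (fs->compat_free_space_tree && !fs->tree_untrusted)
		return LOAD_FREE_SPACE_TREE;
	return LOAD_EXTENT_TREE_FREE; /* the old slow method */
}

/* Loader a caching thread would pick if it ran before the commit. */
static enum loader loader_before_commit(void)
{
	struct fs_model fs = { false, false };

	fs.tree_untrusted = true;         /* set when creation starts */
	fs.compat_free_space_tree = true; /* set once the tree is populated */
	return pick_loader(&fs);
}

/* Loader picked once the creating transaction has committed. */
static enum loader loader_after_commit(void)
{
	struct fs_model fs = { true, true };

	fs.tree_untrusted = false;        /* cleared after the commit returns */
	return pick_loader(&fs);
}
```

The model only captures the flag logic; the actual race is between the
caching thread reading the commit root and other tasks modifying the same
blocks, which a single-threaded sketch cannot show.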
cc: stable(a)vger.kernel.org
Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
---
fs/btrfs/block-group.c | 11 ++++++++++-
fs/btrfs/ctree.h | 3 +++
fs/btrfs/free-space-tree.c | 10 +++++++++-
3 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 52f2198d44c9..b8bbdd95743e 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -673,7 +673,16 @@ static noinline void caching_thread(struct btrfs_work *work)
wake_up(&caching_ctl->wait);
}
- if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
+ /*
+ * If we are in the transaction that populated the free space tree we
+ * can't actually cache from the free space tree as our commit root and
+ * real root are the same, so we could change the contents of the blocks
+ * while caching. Instead do the slow caching in this case, and after
+ * the transaction has committed we will be safe.
+ */
+ if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE) &&
+ !(test_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
+ &fs_info->flags)))
ret = load_free_space_tree(caching_ctl);
else
ret = load_extent_tree_free(caching_ctl);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3935d297d198..4a60d81da5cb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -562,6 +562,9 @@ enum {
/* Indicate that we need to cleanup space cache v1 */
BTRFS_FS_CLEANUP_SPACE_CACHE_V1,
+
+ /* Indicate that we can't trust the free space tree for caching yet. */
+ BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
};
/*
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index e33a65bd9a0c..a33bca94d133 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -1150,6 +1150,7 @@ int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
return PTR_ERR(trans);
set_bit(BTRFS_FS_CREATING_FREE_SPACE_TREE, &fs_info->flags);
+ set_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);
free_space_root = btrfs_create_tree(trans,
BTRFS_FREE_SPACE_TREE_OBJECTID);
if (IS_ERR(free_space_root)) {
@@ -1171,11 +1172,18 @@ int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE);
btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE_VALID);
clear_bit(BTRFS_FS_CREATING_FREE_SPACE_TREE, &fs_info->flags);
+ ret = btrfs_commit_transaction(trans);
- return btrfs_commit_transaction(trans);
+ /*
+ * Now that we've committed the transaction any reading of our commit
+ * root will be safe, so we can cache from the free space tree now.
+ */
+ clear_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);
+ return ret;
abort:
clear_bit(BTRFS_FS_CREATING_FREE_SPACE_TREE, &fs_info->flags);
+ clear_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);
btrfs_abort_transaction(trans, ret);
btrfs_end_transaction(trans);
return ret;
--
2.26.2
Make sure to always cancel the control URB in write() so that it can be
reused after a timeout or spurious CMD_ACK.
Currently any further write requests after a timeout would fail after
triggering a WARN() in usb_submit_urb() when attempting to submit the
already active URB.
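
A toy model of the failure mode, for illustration only: struct urb_model,
submit_model() and kill_model() are hypothetical and merely mimic the
"submit fails while the URB is still active" behaviour described above.
The -1 return value is a placeholder, not the kernel's error code.

```c
#include <assert.h>
#include <stdbool.h>

struct urb_model {
	bool active;
};

/* Fails when the URB is still active, where the kernel would WARN(). */
static int submit_model(struct urb_model *u)
{
	assert(u);
	if (u->active)
		return -1;
	u->active = true;
	return 0;
}

/* Stand-in for cancelling the URB: afterwards it is idle again. */
static void kill_model(struct urb_model *u)
{
	u->active = false;
}

/*
 * Two writes in a row where the first one times out, so no completion
 * ever marks the URB idle. Returns the result of the second submit.
 */
static int second_write_result(bool kill_after_wait)
{
	struct urb_model u = { false };

	submit_model(&u);       /* first write, times out */
	if (kill_after_wait)
		kill_model(&u); /* the fix: always cancel after waiting */
	return submit_model(&u);
}
```

Without the cancel step the second submit fails; with it the URB can be
reused.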
Reported-by: syzbot+e87ebe0f7913f71f2ea5(a)syzkaller.appspotmail.com
Fixes: 6bc235a2e24a ("USB: add driver for Meywa-Denki & Kayac YUREX")
Cc: stable <stable(a)vger.kernel.org> # 2.6.37
Signed-off-by: Johan Hovold <johan(a)kernel.org>
---
drivers/usb/misc/yurex.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/usb/misc/yurex.c b/drivers/usb/misc/yurex.c
index 73ebfa6e9715..c640f98d20c5 100644
--- a/drivers/usb/misc/yurex.c
+++ b/drivers/usb/misc/yurex.c
@@ -496,6 +496,9 @@ static ssize_t yurex_write(struct file *file, const char __user *user_buffer,
timeout = schedule_timeout(YUREX_WRITE_TIMEOUT);
finish_wait(&dev->waitq, &wait);
+ /* make sure URB is idle after timeout or (spurious) CMD_ACK */
+ usb_kill_urb(dev->cntl_urb);
+
mutex_unlock(&dev->io_mutex);
if (retval < 0) {
--
2.26.2