The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5812b32e01c6d86ba7a84110702b46d8a8531fe9 Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan(a)kernel.org>
Date: Mon, 23 Nov 2020 11:23:12 +0100
Subject: [PATCH] of: fix linker-section match-table corruption
Specify type alignment when declaring linker-section match-table entries
to prevent gcc from increasing alignment and corrupting the various
tables with padding (e.g. timers, irqchips, clocks, reserved memory).
This is specifically needed on x86 where gcc (typically) aligns larger
objects like struct of_device_id with static extent on 32-byte
boundaries which at best prevents matching on anything but the first
entry. Specifying alignment when declaring variables suppresses this
optimisation.
Here's a 64-bit example where all entries are corrupt as 16 bytes of
padding has been inserted before the first entry:
ffffffff8266b4b0 D __clk_of_table
ffffffff8266b4c0 d __of_table_fixed_factor_clk
ffffffff8266b5a0 d __of_table_fixed_clk
ffffffff8266b680 d __clk_of_table_sentinel
And here's a 32-bit example where the 8-byte-aligned table happens to be
placed on a 32-byte boundary so that all but the first entry are corrupt
due to the 28 bytes of padding inserted between entries:
812b3ec0 D __irqchip_of_table
812b3ec0 d __of_table_irqchip1
812b3fa0 d __of_table_irqchip2
812b4080 d __of_table_irqchip3
812b4160 d irqchip_of_match_end
Verified on x86 using gcc-9.3 and gcc-4.9 (which uses 64-byte
alignment), and on arm using gcc-7.2.
Note that there are no in-tree users of these tables on x86 currently
(even if they are included in the image).
Fixes: 54196ccbe0ba ("of: consolidate linker section OF match table declarations")
Fixes: f6e916b82022 ("irqchip: add basic infrastructure")
Cc: stable <stable(a)vger.kernel.org> # 3.9
Signed-off-by: Johan Hovold <johan(a)kernel.org>
Link: https://lore.kernel.org/r/20201123102319.8090-2-johan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
diff --git a/include/linux/of.h b/include/linux/of.h
index 5d51891cbf1a..af655d264f10 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -1300,6 +1300,7 @@ static inline int of_get_available_child_count(const struct device_node *np)
#define _OF_DECLARE(table, name, compat, fn, fn_type) \
static const struct of_device_id __of_table_##name \
__used __section("__" #table "_of_table") \
+ __aligned(__alignof__(struct of_device_id)) \
= { .compatible = compat, \
.data = (fn == (fn_type)NULL) ? fn : fn }
#else
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5812b32e01c6d86ba7a84110702b46d8a8531fe9 Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan(a)kernel.org>
Date: Mon, 23 Nov 2020 11:23:12 +0100
Subject: [PATCH] of: fix linker-section match-table corruption
Specify type alignment when declaring linker-section match-table entries
to prevent gcc from increasing alignment and corrupting the various
tables with padding (e.g. timers, irqchips, clocks, reserved memory).
This is specifically needed on x86 where gcc (typically) aligns larger
objects like struct of_device_id with static extent on 32-byte
boundaries which at best prevents matching on anything but the first
entry. Specifying alignment when declaring variables suppresses this
optimisation.
Here's a 64-bit example where all entries are corrupt as 16 bytes of
padding has been inserted before the first entry:
ffffffff8266b4b0 D __clk_of_table
ffffffff8266b4c0 d __of_table_fixed_factor_clk
ffffffff8266b5a0 d __of_table_fixed_clk
ffffffff8266b680 d __clk_of_table_sentinel
And here's a 32-bit example where the 8-byte-aligned table happens to be
placed on a 32-byte boundary so that all but the first entry are corrupt
due to the 28 bytes of padding inserted between entries:
812b3ec0 D __irqchip_of_table
812b3ec0 d __of_table_irqchip1
812b3fa0 d __of_table_irqchip2
812b4080 d __of_table_irqchip3
812b4160 d irqchip_of_match_end
Verified on x86 using gcc-9.3 and gcc-4.9 (which uses 64-byte
alignment), and on arm using gcc-7.2.
Note that there are no in-tree users of these tables on x86 currently
(even if they are included in the image).
Fixes: 54196ccbe0ba ("of: consolidate linker section OF match table declarations")
Fixes: f6e916b82022 ("irqchip: add basic infrastructure")
Cc: stable <stable(a)vger.kernel.org> # 3.9
Signed-off-by: Johan Hovold <johan(a)kernel.org>
Link: https://lore.kernel.org/r/20201123102319.8090-2-johan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
diff --git a/include/linux/of.h b/include/linux/of.h
index 5d51891cbf1a..af655d264f10 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -1300,6 +1300,7 @@ static inline int of_get_available_child_count(const struct device_node *np)
#define _OF_DECLARE(table, name, compat, fn, fn_type) \
static const struct of_device_id __of_table_##name \
__used __section("__" #table "_of_table") \
+ __aligned(__alignof__(struct of_device_id)) \
= { .compatible = compat, \
.data = (fn == (fn_type)NULL) ? fn : fn }
#else
From: Johannes Weiner <hannes(a)cmpxchg.org>
[ Upstream commit a983b5ebee57209c99f68c8327072f25e0e6e3da ]
We've seen memory.stat reads in top-level cgroups take up to fourteen
seconds during a userspace bug that created tens of thousands of ghost
cgroups pinned by lingering page cache.
Even with a more reasonable number of cgroups, aggregating memory.stat
is unnecessarily heavy. The complexity is this:
nr_cgroups * nr_stat_items * nr_possible_cpus
where the stat items are ~70 at this point. With 128 cgroups and 128
CPUs - decent, not enormous setups - reading the top-level memory.stat
has to aggregate over a million per-cpu counters. This doesn't scale.
Instead of spreading the source of truth across all CPUs, use the
per-cpu counters merely to batch updates to shared atomic counters.
This is the same as the per-cpu stocks we use for charging memory to the
shared atomic page_counters, and also the way the global vmstat counters
are implemented.
Vmstat has elaborate spilling thresholds that depend on the number of
CPUs, amount of memory, and memory pressure - carefully balancing the
cost of counter updates with the amount of per-cpu error. That's
because the vmstat counters are system-wide, but also used for decisions
inside the kernel (e.g. NR_FREE_PAGES in the allocator). Neither is
true for the memory controller.
Use the same static batch size we already use for page_counter updates
during charging. The per-cpu error in the stats will be 128k, which is
an acceptable ratio of cores to memory accounting granularity.
[hannes(a)cmpxchg.org: fix warning in __this_cpu_xchg() calls]
Link: http://lkml.kernel.org/r/20171201135750.GB8097@cmpxchg.org
Link: http://lkml.kernel.org/r/20171103153336.24044-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: stable(a)vger.kernel.org c9019e9: mm: memcontrol: eliminate raw access to stat and event counters
Cc: stable(a)vger.kernel.org 2845426: mm: memcontrol: implement lruvec stat functions on top of each other
Cc: stable(a)vger.kernel.org
[shaoyi(a)amazon.com: resolved the conflict brought by commit 17ffa29c355658c8e9b19f56cbf0388500ca7905 in mm/memcontrol.c by contextual fix]
Signed-off-by: Shaoying Xu <shaoyi(a)amazon.com>
---
The excessive complexity in memory.stat reporting was fixed in v4.16 but didn't appear to make it to 4.14 stable. When backporting this patch, there is a small conflict brought by commit 17ffa29c355658c8e9b19f56cbf0388500ca7905 within free_mem_cgroup_per_node_info() of mm/memcontrol.c and can be resolved by contextual fix.
include/linux/memcontrol.h | 96 +++++++++++++++++++++++++++---------------
mm/memcontrol.c | 101 +++++++++++++++++++++++----------------------
2 files changed, 113 insertions(+), 84 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1ffc54ac4cc9..882046863581 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -108,7 +108,10 @@ struct lruvec_stat {
*/
struct mem_cgroup_per_node {
struct lruvec lruvec;
- struct lruvec_stat __percpu *lruvec_stat;
+
+ struct lruvec_stat __percpu *lruvec_stat_cpu;
+ atomic_long_t lruvec_stat[NR_VM_NODE_STAT_ITEMS];
+
unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];
@@ -227,10 +230,10 @@ struct mem_cgroup {
spinlock_t move_lock;
struct task_struct *move_lock_task;
unsigned long move_lock_flags;
- /*
- * percpu counter.
- */
- struct mem_cgroup_stat_cpu __percpu *stat;
+
+ struct mem_cgroup_stat_cpu __percpu *stat_cpu;
+ atomic_long_t stat[MEMCG_NR_STAT];
+ atomic_long_t events[MEMCG_NR_EVENTS];
unsigned long socket_pressure;
@@ -265,6 +268,12 @@ struct mem_cgroup {
/* WARNING: nodeinfo must be the last member here */
};
+/*
+ * size of first charge trial. "32" comes from vmscan.c's magic value.
+ * TODO: maybe necessary to use big numbers in big irons.
+ */
+#define MEMCG_CHARGE_BATCH 32U
+
extern struct mem_cgroup *root_mem_cgroup;
static inline bool mem_cgroup_disabled(void)
@@ -485,32 +494,38 @@ void unlock_page_memcg(struct page *page);
static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
int idx)
{
- long val = 0;
- int cpu;
-
- for_each_possible_cpu(cpu)
- val += per_cpu(memcg->stat->count[idx], cpu);
-
- if (val < 0)
- val = 0;
-
- return val;
+ long x = atomic_long_read(&memcg->stat[idx]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
}
/* idx can be of type enum memcg_stat_item or node_stat_item */
static inline void __mod_memcg_state(struct mem_cgroup *memcg,
int idx, int val)
{
- if (!mem_cgroup_disabled())
- __this_cpu_add(memcg->stat->count[idx], val);
+ long x;
+
+ if (mem_cgroup_disabled())
+ return;
+
+ x = val + __this_cpu_read(memcg->stat_cpu->count[idx]);
+ if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+ atomic_long_add(x, &memcg->stat[idx]);
+ x = 0;
+ }
+ __this_cpu_write(memcg->stat_cpu->count[idx], x);
}
/* idx can be of type enum memcg_stat_item or node_stat_item */
static inline void mod_memcg_state(struct mem_cgroup *memcg,
int idx, int val)
{
- if (!mem_cgroup_disabled())
- this_cpu_add(memcg->stat->count[idx], val);
+ preempt_disable();
+ __mod_memcg_state(memcg, idx, val);
+ preempt_enable();
}
/**
@@ -548,26 +563,25 @@ static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
enum node_stat_item idx)
{
struct mem_cgroup_per_node *pn;
- long val = 0;
- int cpu;
+ long x;
if (mem_cgroup_disabled())
return node_page_state(lruvec_pgdat(lruvec), idx);
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- for_each_possible_cpu(cpu)
- val += per_cpu(pn->lruvec_stat->count[idx], cpu);
-
- if (val < 0)
- val = 0;
-
- return val;
+ x = atomic_long_read(&pn->lruvec_stat[idx]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
}
static inline void __mod_lruvec_state(struct lruvec *lruvec,
enum node_stat_item idx, int val)
{
struct mem_cgroup_per_node *pn;
+ long x;
/* Update node */
__mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
@@ -581,7 +595,12 @@ static inline void __mod_lruvec_state(struct lruvec *lruvec,
__mod_memcg_state(pn->memcg, idx, val);
/* Update lruvec */
- __this_cpu_add(pn->lruvec_stat->count[idx], val);
+ x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
+ if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+ atomic_long_add(x, &pn->lruvec_stat[idx]);
+ x = 0;
+ }
+ __this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
}
static inline void mod_lruvec_state(struct lruvec *lruvec,
@@ -624,16 +643,25 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
static inline void __count_memcg_events(struct mem_cgroup *memcg,
int idx, unsigned long count)
{
- if (!mem_cgroup_disabled())
- __this_cpu_add(memcg->stat->events[idx], count);
+ unsigned long x;
+
+ if (mem_cgroup_disabled())
+ return;
+
+ x = count + __this_cpu_read(memcg->stat_cpu->events[idx]);
+ if (unlikely(x > MEMCG_CHARGE_BATCH)) {
+ atomic_long_add(x, &memcg->events[idx]);
+ x = 0;
+ }
+ __this_cpu_write(memcg->stat_cpu->events[idx], x);
}
-/* idx can be of type enum memcg_event_item or vm_event_item */
static inline void count_memcg_events(struct mem_cgroup *memcg,
int idx, unsigned long count)
{
- if (!mem_cgroup_disabled())
- this_cpu_add(memcg->stat->events[idx], count);
+ preempt_disable();
+ __count_memcg_events(memcg, idx, count);
+ preempt_enable();
}
/* idx can be of type enum memcg_event_item or vm_event_item */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index eba9dc4795b5..4e763cdccb33 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -542,39 +542,10 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
return mz;
}
-/*
- * Return page count for single (non recursive) @memcg.
- *
- * Implementation Note: reading percpu statistics for memcg.
- *
- * Both of vmstat[] and percpu_counter has threshold and do periodic
- * synchronization to implement "quick" read. There are trade-off between
- * reading cost and precision of value. Then, we may have a chance to implement
- * a periodic synchronization of counter in memcg's counter.
- *
- * But this _read() function is used for user interface now. The user accounts
- * memory usage by memory cgroup and he _always_ requires exact value because
- * he accounts memory. Even if we provide quick-and-fuzzy read, we always
- * have to visit all online cpus and make sum. So, for now, unnecessary
- * synchronization is not implemented. (just implemented for cpu hotplug)
- *
- * If there are kernel internal actions which can make use of some not-exact
- * value, and reading all cpu value can be performance bottleneck in some
- * common workload, threshold and synchronization as vmstat[] should be
- * implemented.
- *
- * The parameter idx can be of type enum memcg_event_item or vm_event_item.
- */
-
static unsigned long memcg_sum_events(struct mem_cgroup *memcg,
int event)
{
- unsigned long val = 0;
- int cpu;
-
- for_each_possible_cpu(cpu)
- val += per_cpu(memcg->stat->events[event], cpu);
- return val;
+ return atomic_long_read(&memcg->events[event]);
}
static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
@@ -606,7 +577,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
nr_pages = -nr_pages; /* for event */
}
- __this_cpu_add(memcg->stat->nr_page_events, nr_pages);
+ __this_cpu_add(memcg->stat_cpu->nr_page_events, nr_pages);
}
unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
@@ -642,8 +613,8 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
{
unsigned long val, next;
- val = __this_cpu_read(memcg->stat->nr_page_events);
- next = __this_cpu_read(memcg->stat->targets[target]);
+ val = __this_cpu_read(memcg->stat_cpu->nr_page_events);
+ next = __this_cpu_read(memcg->stat_cpu->targets[target]);
/* from time_after() in jiffies.h */
if ((long)(next - val) < 0) {
switch (target) {
@@ -659,7 +630,7 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
default:
break;
}
- __this_cpu_write(memcg->stat->targets[target], next);
+ __this_cpu_write(memcg->stat_cpu->targets[target], next);
return true;
}
return false;
@@ -1726,11 +1697,6 @@ void unlock_page_memcg(struct page *page)
}
EXPORT_SYMBOL(unlock_page_memcg);
-/*
- * size of first charge trial. "32" comes from vmscan.c's magic value.
- * TODO: maybe necessary to use big numbers in big irons.
- */
-#define CHARGE_BATCH 32U
struct memcg_stock_pcp {
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
@@ -1758,7 +1724,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
unsigned long flags;
bool ret = false;
- if (nr_pages > CHARGE_BATCH)
+ if (nr_pages > MEMCG_CHARGE_BATCH)
return ret;
local_irq_save(flags);
@@ -1827,7 +1793,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
}
stock->nr_pages += nr_pages;
- if (stock->nr_pages > CHARGE_BATCH)
+ if (stock->nr_pages > MEMCG_CHARGE_BATCH)
drain_stock(stock);
local_irq_restore(flags);
@@ -1877,9 +1843,44 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
static int memcg_hotplug_cpu_dead(unsigned int cpu)
{
struct memcg_stock_pcp *stock;
+ struct mem_cgroup *memcg;
stock = &per_cpu(memcg_stock, cpu);
drain_stock(stock);
+
+ for_each_mem_cgroup(memcg) {
+ int i;
+
+ for (i = 0; i < MEMCG_NR_STAT; i++) {
+ int nid;
+ long x;
+
+ x = this_cpu_xchg(memcg->stat_cpu->count[i], 0);
+ if (x)
+ atomic_long_add(x, &memcg->stat[i]);
+
+ if (i >= NR_VM_NODE_STAT_ITEMS)
+ continue;
+
+ for_each_node(nid) {
+ struct mem_cgroup_per_node *pn;
+
+ pn = mem_cgroup_nodeinfo(memcg, nid);
+ x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
+ if (x)
+ atomic_long_add(x, &pn->lruvec_stat[i]);
+ }
+ }
+
+ for (i = 0; i < MEMCG_NR_EVENTS; i++) {
+ long x;
+
+ x = this_cpu_xchg(memcg->stat_cpu->events[i], 0);
+ if (x)
+ atomic_long_add(x, &memcg->events[i]);
+ }
+ }
+
return 0;
}
@@ -1900,7 +1901,7 @@ static void high_work_func(struct work_struct *work)
struct mem_cgroup *memcg;
memcg = container_of(work, struct mem_cgroup, high_work);
- reclaim_high(memcg, CHARGE_BATCH, GFP_KERNEL);
+ reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
}
/*
@@ -1924,7 +1925,7 @@ void mem_cgroup_handle_over_high(void)
static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int nr_pages)
{
- unsigned int batch = max(CHARGE_BATCH, nr_pages);
+ unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct mem_cgroup *mem_over_limit;
struct page_counter *counter;
@@ -4203,8 +4204,8 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return 1;
- pn->lruvec_stat = alloc_percpu(struct lruvec_stat);
- if (!pn->lruvec_stat) {
+ pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
+ if (!pn->lruvec_stat_cpu) {
kfree(pn);
return 1;
}
@@ -4225,7 +4226,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return;
- free_percpu(pn->lruvec_stat);
+ free_percpu(pn->lruvec_stat_cpu);
kfree(pn);
}
@@ -4235,7 +4236,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
for_each_node(node)
free_mem_cgroup_per_node_info(memcg, node);
- free_percpu(memcg->stat);
+ free_percpu(memcg->stat_cpu);
kfree(memcg);
}
@@ -4264,8 +4265,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
if (memcg->id.id < 0)
goto fail;
- memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
- if (!memcg->stat)
+ memcg->stat_cpu = alloc_percpu(struct mem_cgroup_stat_cpu);
+ if (!memcg->stat_cpu)
goto fail;
for_each_node(node)
@@ -5686,7 +5687,7 @@ static void uncharge_batch(const struct uncharge_gather *ug)
__mod_memcg_state(ug->memcg, MEMCG_RSS_HUGE, -ug->nr_huge);
__mod_memcg_state(ug->memcg, NR_SHMEM, -ug->nr_shmem);
__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
- __this_cpu_add(ug->memcg->stat->nr_page_events, nr_pages);
+ __this_cpu_add(ug->memcg->stat_cpu->nr_page_events, nr_pages);
memcg_check_events(ug->memcg, ug->dummy_page);
local_irq_restore(flags);
--
2.16.6
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 9996bd494794a2fe393e97e7a982388c6249aa76 Mon Sep 17 00:00:00 2001
From: SeongJae Park <sjpark(a)amazon.de>
Date: Mon, 14 Dec 2020 10:08:40 +0100
Subject: [PATCH] xenbus/xenbus_backend: Disallow pending watch messages
'xenbus_backend' watches 'state' of devices, which is writable by
guests. Hence, if guests intensively updates it, dom0 will have lots of
pending events that exhausting memory of dom0. In other words, guests
can trigger dom0 memory pressure. This is known as XSA-349. However,
the watch callback of it, 'frontend_changed()', reads only 'state', so
doesn't need to have the pending events.
To avoid the problem, this commit disallows pending watch messages for
'xenbus_backend' using the 'will_handle()' watch callback.
This is part of XSA-349
Cc: stable(a)vger.kernel.org
Signed-off-by: SeongJae Park <sjpark(a)amazon.de>
Reported-by: Michael Kurth <mku(a)amazon.de>
Reported-by: Pawel Wieczorkiewicz <wipawel(a)amazon.de>
Reviewed-by: Juergen Gross <jgross(a)suse.com>
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/xenbus/xenbus_probe_backend.c b/drivers/xen/xenbus/xenbus_probe_backend.c
index 2ba699897e6d..5abded97e1a7 100644
--- a/drivers/xen/xenbus/xenbus_probe_backend.c
+++ b/drivers/xen/xenbus/xenbus_probe_backend.c
@@ -180,6 +180,12 @@ static int xenbus_probe_backend(struct xen_bus_type *bus, const char *type,
return err;
}
+static bool frontend_will_handle(struct xenbus_watch *watch,
+ const char *path, const char *token)
+{
+ return watch->nr_pending == 0;
+}
+
static void frontend_changed(struct xenbus_watch *watch,
const char *path, const char *token)
{
@@ -191,6 +197,7 @@ static struct xen_bus_type xenbus_backend = {
.levels = 3, /* backend/type/<frontend>/<id> */
.get_bus_id = backend_bus_id,
.probe = xenbus_probe_backend,
+ .otherend_will_handle = frontend_will_handle,
.otherend_changed = frontend_changed,
.bus = {
.name = "xen-backend",
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3dc86ca6b4c8cfcba9da7996189d1b5a358a94fc Mon Sep 17 00:00:00 2001
From: SeongJae Park <sjpark(a)amazon.de>
Date: Mon, 14 Dec 2020 10:07:13 +0100
Subject: [PATCH] xen/xenbus: Count pending messages for each watch
This commit adds a counter of pending messages for each watch in the
struct. It is used to skip unnecessary pending messages lookup in
'unregister_xenbus_watch()'. It could also be used in 'will_handle'
callback.
This is part of XSA-349
Cc: stable(a)vger.kernel.org
Signed-off-by: SeongJae Park <sjpark(a)amazon.de>
Reported-by: Michael Kurth <mku(a)amazon.de>
Reported-by: Pawel Wieczorkiewicz <wipawel(a)amazon.de>
Reviewed-by: Juergen Gross <jgross(a)suse.com>
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index e8bdbd0a1e26..12e02eb01f59 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -711,6 +711,7 @@ int xs_watch_msg(struct xs_watch_event *event)
event->path, event->token))) {
spin_lock(&watch_events_lock);
list_add_tail(&event->list, &watch_events);
+ event->handle->nr_pending++;
wake_up(&watch_events_waitq);
spin_unlock(&watch_events_lock);
} else
@@ -768,6 +769,8 @@ int register_xenbus_watch(struct xenbus_watch *watch)
sprintf(token, "%lX", (long)watch);
+ watch->nr_pending = 0;
+
down_read(&xs_watch_rwsem);
spin_lock(&watches_lock);
@@ -817,11 +820,14 @@ void unregister_xenbus_watch(struct xenbus_watch *watch)
/* Cancel pending watch events. */
spin_lock(&watch_events_lock);
- list_for_each_entry_safe(event, tmp, &watch_events, list) {
- if (event->handle != watch)
- continue;
- list_del(&event->list);
- kfree(event);
+ if (watch->nr_pending) {
+ list_for_each_entry_safe(event, tmp, &watch_events, list) {
+ if (event->handle != watch)
+ continue;
+ list_del(&event->list);
+ kfree(event);
+ }
+ watch->nr_pending = 0;
}
spin_unlock(&watch_events_lock);
@@ -868,7 +874,6 @@ void xs_suspend_cancel(void)
static int xenwatch_thread(void *unused)
{
- struct list_head *ent;
struct xs_watch_event *event;
xenwatch_pid = current->pid;
@@ -883,13 +888,15 @@ static int xenwatch_thread(void *unused)
mutex_lock(&xenwatch_mutex);
spin_lock(&watch_events_lock);
- ent = watch_events.next;
- if (ent != &watch_events)
- list_del(ent);
+ event = list_first_entry_or_null(&watch_events,
+ struct xs_watch_event, list);
+ if (event) {
+ list_del(&event->list);
+ event->handle->nr_pending--;
+ }
spin_unlock(&watch_events_lock);
- if (ent != &watch_events) {
- event = list_entry(ent, struct xs_watch_event, list);
+ if (event) {
event->handle->callback(event->handle, event->path,
event->token);
kfree(event);
diff --git a/include/xen/xenbus.h b/include/xen/xenbus.h
index c8574d1b814c..00c7235ae93e 100644
--- a/include/xen/xenbus.h
+++ b/include/xen/xenbus.h
@@ -61,6 +61,8 @@ struct xenbus_watch
/* Path being watched. */
const char *node;
+ unsigned int nr_pending;
+
/*
* Called just before enqueing new event while a spinlock is held.
* The event will be discarded if this callback returns false.
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 1c728719a4da6e654afb9cc047164755072ed7c9 Mon Sep 17 00:00:00 2001
From: Pawel Wieczorkiewicz <wipawel(a)amazon.de>
Date: Mon, 14 Dec 2020 10:25:57 +0100
Subject: [PATCH] xen-blkback: set ring->xenblkd to NULL after kthread_stop()
When xen_blkif_disconnect() is called, the kernel thread behind the
block interface is stopped by calling kthread_stop(ring->xenblkd).
The ring->xenblkd thread pointer being non-NULL determines if the
thread has been already stopped.
Normally, the thread's function xen_blkif_schedule() sets the
ring->xenblkd to NULL, when the thread's main loop ends.
However, when the thread has not been started yet (i.e.
wake_up_process() has not been called on it), the xen_blkif_schedule()
function would not be called yet.
In such case the kthread_stop() call returns -EINTR and the
ring->xenblkd remains dangling.
When this happens, any consecutive call to xen_blkif_disconnect (for
example in frontend_changed() callback) leads to a kernel crash in
kthread_stop() (e.g. NULL pointer dereference in exit_creds()).
This is XSA-350.
Cc: <stable(a)vger.kernel.org> # 4.12
Fixes: a24fa22ce22a ("xen/blkback: don't use xen_blkif_get() in xen-blkback kthread")
Reported-by: Olivier Benjamin <oliben(a)amazon.com>
Reported-by: Pawel Wieczorkiewicz <wipawel(a)amazon.de>
Signed-off-by: Pawel Wieczorkiewicz <wipawel(a)amazon.de>
Reviewed-by: Julien Grall <jgrall(a)amazon.com>
Reviewed-by: Juergen Gross <jgross(a)suse.com>
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 1d8b8d24496c..9860d4842f36 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -274,6 +274,7 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
if (ring->xenblkd) {
kthread_stop(ring->xenblkd);
+ ring->xenblkd = NULL;
wake_up(&ring->shutdown_wq);
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From bf8975837dac156c33a4d15d46602700998cb6dd Mon Sep 17 00:00:00 2001
From: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Date: Tue, 24 Nov 2020 12:57:07 +0100
Subject: [PATCH] dma-buf/dma-resv: Respect num_fences when initializing the
shared fence list.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
We hardcode the maximum number of shared fences to 4, instead of
respecting num_fences. Use a minimum of 4, but more if num_fences
is higher.
This seems to have been an oversight when first implementing the
api.
Fixes: 04a5faa8cbe5 ("reservation: update api and add some helpers")
Cc: <stable(a)vger.kernel.org> # v3.17+
Reported-by: Niranjana Vishwanathapura <niranjana.vishwanathapura(a)intel.com>
Signed-off-by: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20201124115707.406917-1-maart…
diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index bb5a42b10c29..6ddbeb5dfbf6 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -200,7 +200,7 @@ int dma_resv_reserve_shared(struct dma_resv *obj, unsigned int num_fences)
max = max(old->shared_count + num_fences,
old->shared_max * 2);
} else {
- max = 4;
+ max = max(4ul, roundup_pow_of_two(num_fences));
}
new = dma_resv_list_alloc(max);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From bf8975837dac156c33a4d15d46602700998cb6dd Mon Sep 17 00:00:00 2001
From: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Date: Tue, 24 Nov 2020 12:57:07 +0100
Subject: [PATCH] dma-buf/dma-resv: Respect num_fences when initializing the
shared fence list.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
We hardcode the maximum number of shared fences to 4, instead of
respecting num_fences. Use a minimum of 4, but more if num_fences
is higher.
This seems to have been an oversight when first implementing the
api.
Fixes: 04a5faa8cbe5 ("reservation: update api and add some helpers")
Cc: <stable(a)vger.kernel.org> # v3.17+
Reported-by: Niranjana Vishwanathapura <niranjana.vishwanathapura(a)intel.com>
Signed-off-by: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20201124115707.406917-1-maart…
diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index bb5a42b10c29..6ddbeb5dfbf6 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -200,7 +200,7 @@ int dma_resv_reserve_shared(struct dma_resv *obj, unsigned int num_fences)
max = max(old->shared_count + num_fences,
old->shared_max * 2);
} else {
- max = 4;
+ max = max(4ul, roundup_pow_of_two(num_fences));
}
new = dma_resv_list_alloc(max);