Sachin reports [1] a crash in SLUB __slab_alloc():
BUG: Kernel NULL pointer dereference on read at 0x000073b0
Faulting instruction address: 0xc0000000003d55f4
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
NIP: c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000
REGS: c0000008b37836d0 TRAP: 0300 Not tainted (5.6.0-rc2-next-20200218-autotest)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004844 XER: 00000000
CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500
GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620
GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000
GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000
GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002
GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122
GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8
GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180
NIP [c0000000003d55f4] ___slab_alloc+0x1f4/0x760
LR [c0000000003d5b94] __slab_alloc+0x34/0x60
Call Trace:
[c0000008b3783960] [c0000000003d5734] ___slab_alloc+0x334/0x760 (unreliable)
[c0000008b3783a40] [c0000000003d5b94] __slab_alloc+0x34/0x60
[c0000008b3783a70] [c0000000003d6fa0] __kmalloc_node+0x110/0x490
[c0000008b3783af0] [c0000000003443d8] kvmalloc_node+0x58/0x110
[c0000008b3783b30] [c0000000003fee38] mem_cgroup_css_online+0x108/0x270
[c0000008b3783b90] [c000000000235aa8] online_css+0x48/0xd0
[c0000008b3783bc0] [c00000000023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
[c0000008b3783ca0] [c000000000242318] cgroup_mkdir+0x228/0x5f0
[c0000008b3783d10] [c00000000051e170] kernfs_iop_mkdir+0x90/0xf0
[c0000008b3783d50] [c00000000043dc00] vfs_mkdir+0x110/0x230
[c0000008b3783da0] [c000000000441c90] do_mkdirat+0xb0/0x1a0
[c0000008b3783e20] [c00000000000b278] system_call+0x5c/0x68
This is a PowerPC platform with following NUMA topology:
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 35247 MB
node 1 free: 30907 MB
node distances:
node 0 1
0: 10 40
1: 40 10
possible numa nodes: 0-31
This only happens with a mmotm patch "mm/memcontrol.c: allocate shrinker_map on
appropriate NUMA node" [2] which effectively calls kmalloc_node for each
possible node. SLUB however only allocates kmem_cache_node on online
N_NORMAL_MEMORY nodes, and relies on node_to_mem_node to return such valid node
for other nodes since commit a561ce00b09e ("slub: fall back to
node_to_mem_node() node if allocating on memoryless node"). This is however not
true in this configuration where the _node_numa_mem_ array is not initialized
for nodes 0 and 2-31, thus it contains zeroes and get_partial() ends up
accessing non-allocated kmem_cache_node.
A related issue was reported by Bharata (originally by Ramachandran) [3] where
a similar PowerPC configuration, but with mainline kernel without patch [2]
ends up allocating large amounts of pages by kmalloc-1k kmalloc-512. This seems
to have the same underlying issue with node_to_mem_node() not behaving as
expected, and might probably also lead to an infinite loop with
CONFIG_SLUB_CPU_PARTIAL [4].
This patch should fix both issues by not relying on node_to_mem_node() anymore
and instead simply falling back to NUMA_NO_NODE, when kmalloc_node(node) is
attempted for a node that's not online, or has no usable memory. The "usable
memory" condition is also changed from node_present_pages() to N_NORMAL_MEMORY
node state, as that is exactly the condition that SLUB uses to allocate
kmem_cache_node structures. The check in get_partial() is removed completely,
as the checks in ___slab_alloc() are now sufficient to prevent get_partial()
being reached with an invalid node.
[1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@lin…
[2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtu…
[3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/
[4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.…
Reported-and-tested-by: Sachin Sant <sachinp(a)linux.vnet.ibm.com>
Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy(a)in.ibm.com>
Tested-by: Bharata B Rao <bharata(a)linux.ibm.com>
Debugged-by: Srikar Dronamraju <srikar(a)linux.vnet.ibm.com>
Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz>
Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node")
Cc: stable(a)vger.kernel.org
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Christopher Lameter <cl(a)linux.com>
Cc: linuxppc-dev(a)lists.ozlabs.org
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Kirill Tkhai <ktkhai(a)virtuozzo.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Nathan Lynch <nathanl(a)linux.ibm.com>
---
Hi, after the RFC got positive test results from reporters, I'm resending
as a proper patch.
Thanks,
Vlastimil
mm/slub.c | 26 +++++++++++++++++---------
1 file changed, 17 insertions(+), 9 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 17dc00e33115..7113b1f9cd77 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
if (node == NUMA_NO_NODE)
searchnode = numa_mem_id();
- else if (!node_present_pages(node))
- searchnode = node_to_mem_node(node);
object = get_partial_node(s, get_node(s, searchnode), c, flags);
if (object || node != NUMA_NO_NODE)
@@ -2563,17 +2561,27 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
struct page *page;
page = c->page;
- if (!page)
+ if (!page) {
+ /*
+ * if the node is not online or has no normal memory, just
+ * ignore the node constraint
+ */
+ if (unlikely(node != NUMA_NO_NODE &&
+ !node_state(node, N_NORMAL_MEMORY)))
+ node = NUMA_NO_NODE;
goto new_slab;
+ }
redo:
if (unlikely(!node_match(page, node))) {
- int searchnode = node;
-
- if (node != NUMA_NO_NODE && !node_present_pages(node))
- searchnode = node_to_mem_node(node);
-
- if (unlikely(!node_match(page, searchnode))) {
+ /*
+ * same as above but node_match() being false already
+ * implies node != NUMA_NO_NODE
+ */
+ if (!node_state(node, N_NORMAL_MEMORY)) {
+ node = NUMA_NO_NODE;
+ goto redo;
+ } else {
stat(s, ALLOC_NODE_MISMATCH);
deactivate_slab(s, page, c->freelist, c);
goto new_slab;
--
2.25.1
When a cfs rq is throttled, the latter and its child are removed from the
leaf list but their nr_running is not changed which includes staying higher
than 1. When a task is enqueued in this throttled branch, the cfs rqs must
be added back in order to ensure correct ordering in the list but this can
only happens if nr_running == 1.
When cfs bandwidth is used, we call unconditionnaly list_add_leaf_cfs_rq()
when enqueuing an entity to make sure that the complete branch will be
added.
Similarly unthrottle_cfs_rq() can stop adding cfs in the list when a parent
is throttled. Iterate the remaining entity to ensure that the complete
branch will be added in the list.
Reported-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Tested-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann(a)arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann(a)arm.com>
Cc: stable(a)vger.kernel.org #v5.1+
Signed-off-by: Vincent Guittot <vincent.guittot(a)linaro.org>
---
Change since v1:
- added the iteration at the end of unthrottle_cfs_rq() as Dietmar
reported use case which triggered assert_list_leaf_cfs_rq()
kernel/sched/fair.c | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fcc968669aea..ae50a09cbead 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4117,6 +4117,7 @@ static inline void check_schedstat_required(void)
#endif
}
+static inline bool cfs_bandwidth_used(void);
/*
* MIGRATION
@@ -4195,10 +4196,16 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
__enqueue_entity(cfs_rq, se);
se->on_rq = 1;
- if (cfs_rq->nr_running == 1) {
+ /*
+ * When bandwidth control is enabled, cfs might have been removed
+ * because of a parent been throttled but cfs->nr_running > 1. Try to
+ * add it unconditionnally.
+ */
+ if (cfs_rq->nr_running == 1 || cfs_bandwidth_used())
list_add_leaf_cfs_rq(cfs_rq);
+
+ if (cfs_rq->nr_running == 1)
check_enqueue_throttle(cfs_rq);
- }
}
static void __clear_buddies_last(struct sched_entity *se)
@@ -4779,11 +4786,22 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
break;
}
- assert_list_leaf_cfs_rq(rq);
-
if (!se)
add_nr_running(rq, task_delta);
+ /*
+ * The cfs_rq_throttled() breaks in the above iteration can result in
+ * incomplete leaf list maintenance, resulting in triggering the
+ * assertion below.
+ */
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+
+ list_add_leaf_cfs_rq(cfs_rq);
+ }
+
+ assert_list_leaf_cfs_rq(rq);
+
/* Determine whether we need to wake up potentially idle CPU: */
if (rq->curr == rq->idle && rq->cfs.nr_running)
resched_curr(rq);
--
2.17.1
Hello,
We ran automated tests on a recent commit from this kernel tree:
Kernel repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
Commit: c0ef7f763513 - Linux 5.5.11-rc2
The results of these automated tests are provided below.
Overall result: FAILED (see details below)
Merge: OK
Compile: FAILED
All kernel binaries, config files, and logs are available for download here:
https://cki-artifacts.s3.us-east-2.amazonaws.com/index.html?prefix=dataware…
We attempted to compile the kernel for multiple architectures, but the compile
failed on one or more architectures:
aarch64: FAILED (see build-aarch64.log.xz attachment)
We hope that these logs can help you find the problem quickly. For the full
detail on our testing procedures, please scroll to the bottom of this message.
Please reply to this email if you have any questions about the tests that we
ran or if you have any suggestions on how to make future tests more effective.
,-. ,-.
( C ) ( K ) Continuous
`-',-.`-' Kernel
( I ) Integration
`-'
______________________________________________________________________________
Compile testing
---------------
We compiled the kernel for 4 architectures:
aarch64:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
ppc64le:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
s390x:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
x86_64:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
Hello,
We ran automated tests on a recent commit from this kernel tree:
Kernel repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
Commit: e9842f9e58e1 - Linux 5.5.11-rc1
The results of these automated tests are provided below.
Overall result: FAILED (see details below)
Merge: OK
Compile: FAILED
All kernel binaries, config files, and logs are available for download here:
https://cki-artifacts.s3.us-east-2.amazonaws.com/index.html?prefix=dataware…
We attempted to compile the kernel for multiple architectures, but the compile
failed on one or more architectures:
aarch64: FAILED (see build-aarch64.log.xz attachment)
We hope that these logs can help you find the problem quickly. For the full
detail on our testing procedures, please scroll to the bottom of this message.
Please reply to this email if you have any questions about the tests that we
ran or if you have any suggestions on how to make future tests more effective.
,-. ,-.
( C ) ( K ) Continuous
`-',-.`-' Kernel
( I ) Integration
`-'
______________________________________________________________________________
Compile testing
---------------
We compiled the kernel for 4 architectures:
aarch64:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
ppc64le:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
s390x:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
x86_64:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
The patch below does not apply to the 5.5-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 1292e3efb149ee21d8d33d725eeed4e6b1ade963 Mon Sep 17 00:00:00 2001
From: Ulf Hansson <ulf.hansson(a)linaro.org>
Date: Tue, 10 Mar 2020 12:49:43 +0100
Subject: [PATCH] mmc: core: Allow host controllers to require R1B for CMD6
It has turned out that some host controllers can't use R1B for CMD6 and
other commands that have R1B associated with them. Therefore invent a new
host cap, MMC_CAP_NEED_RSP_BUSY to let them specify this.
In __mmc_switch(), let's check the flag and use it to prevent R1B responses
from being converted into R1. Note that, this also means that the host are
on its own, when it comes to manage the busy timeout.
Suggested-by: Sowjanya Komatineni <skomatineni(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Tested-by: Anders Roxell <anders.roxell(a)linaro.org>
Tested-by: Sowjanya Komatineni <skomatineni(a)nvidia.com>
Tested-by: Faiz Abbas <faiz_abbas(a)ti.com>
Tested-By: Peter Geis <pgwipeout(a)gmail.com>
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index da425ee2d9bf..e025604e17d4 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -542,9 +542,11 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
* If the max_busy_timeout of the host is specified, make sure it's
* enough to fit the used timeout_ms. In case it's not, let's instruct
* the host to avoid HW busy detection, by converting to a R1 response
- * instead of a R1B.
+ * instead of a R1B. Note, some hosts requires R1B, which also means
+ * they are on their own when it comes to deal with the busy timeout.
*/
- if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout))
+ if (!(host->caps & MMC_CAP_NEED_RSP_BUSY) && host->max_busy_timeout &&
+ (timeout_ms > host->max_busy_timeout))
use_r1b_resp = false;
cmd.opcode = MMC_SWITCH;
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index ba703384bea0..4c5eb3aa8e72 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -333,6 +333,7 @@ struct mmc_host {
MMC_CAP_UHS_SDR50 | MMC_CAP_UHS_SDR104 | \
MMC_CAP_UHS_DDR50)
#define MMC_CAP_SYNC_RUNTIME_PM (1 << 21) /* Synced runtime PM suspends. */
+#define MMC_CAP_NEED_RSP_BUSY (1 << 22) /* Commands with R1B can't use R1. */
#define MMC_CAP_DRIVER_TYPE_A (1 << 23) /* Host supports Driver Type A */
#define MMC_CAP_DRIVER_TYPE_C (1 << 24) /* Host supports Driver Type C */
#define MMC_CAP_DRIVER_TYPE_D (1 << 25) /* Host supports Driver Type D */