Sometimes, firmware may expose interleaved memory layout like this:
Early memory node ranges
node 1: [mem 0x0000000000000000-0x000000011fffffff]
node 2: [mem 0x0000000120000000-0x000000014fffffff]
node 1: [mem 0x0000000150000000-0x00000001ffffffff]
node 0: [mem 0x0000000200000000-0x000000048fffffff]
node 2: [mem 0x0000000490000000-0x00000007ffffffff]
In that case, we can see memory blocks assigned to multiple nodes in sysfs:
$ ls -l /sys/devices/system/memory/memory21
total 0
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
-rw-r--r-- 1 root root 65536 Aug 24 05:27 online
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
drwxr-xr-x 2 root root 0 Aug 24 05:27 power
-r--r--r-- 1 root root 65536 Aug 24 05:27 removable
-rw-r--r-- 1 root root 65536 Aug 24 05:27 state
lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory
-rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
-r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
The same applies in the node's directory with a memory21 link in both the
node1 and node2's directory.
This is wrong but doesn't prevent the system to run. However when later,
one of these memory blocks is hot-unplugged and then hot-plugged, the
system is detecting an inconsistency in the sysfs layout and a BUG_ON() is
raised:
------------[ cut here ]------------
kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
NIP: c000000000403f34 LR: c000000000403f2c CTR: 0000000000000000
REGS: c0000004876e3660 TRAP: 0700 Not tainted (5.9.0-rc1+)
MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24000448 XER: 20040000
CFAR: c000000000846d20 IRQMASK: 0
GPR00: c000000000403f2c c0000004876e38f0 c0000000012f6f00 ffffffffffffffef
GPR04: 0000000000000227 c0000004805ae680 0000000000000000 00000004886f0000
GPR08: 0000000000000226 0000000000000003 0000000000000002 fffffffffffffffd
GPR12: 0000000088000484 c00000001ec96280 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000004 0000000000000003
GPR20: c00000047814ffe0 c0000007ffff7c08 0000000000000010 c0000000013332c8
GPR24: 0000000000000000 c0000000011f6cc0 0000000000000000 0000000000000000
GPR28: ffffffffffffffef 0000000000000001 0000000150000000 0000000010000000
NIP [c000000000403f34] add_memory_resource+0x244/0x340
LR [c000000000403f2c] add_memory_resource+0x23c/0x340
Call Trace:
[c0000004876e38f0] [c000000000403f2c] add_memory_resource+0x23c/0x340 (unreliable)
[c0000004876e39c0] [c00000000040408c] __add_memory+0x5c/0xf0
[c0000004876e39f0] [c0000000000e2b94] dlpar_add_lmb+0x1b4/0x500
[c0000004876e3ad0] [c0000000000e3888] dlpar_memory+0x1f8/0xb80
[c0000004876e3b60] [c0000000000dc0d0] handle_dlpar_errorlog+0xc0/0x190
[c0000004876e3bd0] [c0000000000dc398] dlpar_store+0x198/0x4a0
[c0000004876e3c90] [c00000000072e630] kobj_attr_store+0x30/0x50
[c0000004876e3cb0] [c00000000051f954] sysfs_kf_write+0x64/0x90
[c0000004876e3cd0] [c00000000051ee40] kernfs_fop_write+0x1b0/0x290
[c0000004876e3d20] [c000000000438dd8] vfs_write+0xe8/0x290
[c0000004876e3d70] [c0000000004391ac] ksys_write+0xdc/0x130
[c0000004876e3dc0] [c000000000034e40] system_call_exception+0x160/0x270
[c0000004876e3e20] [c00000000000d740] system_call_common+0xf0/0x27c
Instruction dump:
48442e35 60000000 0b030000 3cbe0001 7fa3eb78 7bc48402 38a5fffe 7ca5fa14
78a58402 48442db1 60000000 7c7c1b78 <0b030000> 7f23cb78 4bda371d 60000000
---[ end trace 562fd6c109cd0fb2 ]---
This has been seen on PowerPC LPAR.
The root cause of this issue is that when node's memory is registered, the
range used can overlap another node's range, thus the memory block is
registered to multiple nodes in sysfs.
There are 2 issues here:
a. The sysfs memory and node's layouts are broken due to these multiple
links
b. The link errors in link_mem_sections() should not lead to a system
panic.
To address a. register_mem_sect_under_node should not rely on the system
state to detect whether the link operation is triggered by a hot plug
operation or not. This is addressed by the patches 1 and 2 of this series.
The patch 3 is addressing the point b.
Thanks,
Laurent
Since v2:
- Address David's comments
- Fix stupid build errors in patch 1
Since v1:
- change context enum's name from Michal's comment
- use 2 callbacks in link_mem_sections from David's comment
- use dev_err_ratelimited from Greg's comment
Laurent Dufour (3):
mm: replace memmap_context by memplug_context
mm: don't rely on system state to detect hot-plug operations
mm: don't panic when links can't be created in sysfs
arch/ia64/mm/init.c | 6 +--
drivers/base/node.c | 98 ++++++++++++++++++++++++++++--------------
include/linux/mm.h | 2 +-
include/linux/mmzone.h | 11 +++--
include/linux/node.h | 13 +++---
mm/memory_hotplug.c | 6 +--
mm/page_alloc.c | 10 ++---
7 files changed, 93 insertions(+), 53 deletions(-)
--
2.28.0
This is a note to let you know that I've just added the patch titled
ehci-hcd: Move include to keep CRC stable
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 29231826f3bd65500118c473fccf31c0cf14dbc0 Mon Sep 17 00:00:00 2001
From: Quentin Perret <qperret(a)google.com>
Date: Wed, 16 Sep 2020 18:18:25 +0100
Subject: ehci-hcd: Move include to keep CRC stable
The CRC calculation done by genksyms is triggered when the parser hits
EXPORT_SYMBOL*() macros. At this point, genksyms recursively expands the
types of the function parameters, and uses that as the input for the CRC
calculation. In the case of forward-declared structs, the type expands
to 'UNKNOWN'. Following this, it appears that the result of the
expansion of each type is cached somewhere, and seems to be re-used
when/if the same type is seen again for another exported symbol in the
same C file.
Unfortunately, this can cause CRC 'stability' issues when a struct
definition becomes visible in the middle of a C file. For example, let's
assume code with the following pattern:
struct foo;
int bar(struct foo *arg)
{
/* Do work ... */
}
EXPORT_SYMBOL_GPL(bar);
/* This contains struct foo's definition */
#include "foo.h"
int baz(struct foo *arg)
{
/* Do more work ... */
}
EXPORT_SYMBOL_GPL(baz);
Here, baz's CRC will be computed using the expansion of struct foo that
was cached after bar's CRC calculation ('UNKOWN' here). But if
EXPORT_SYMBOL_GPL(bar) is removed from the file (because of e.g. symbol
trimming using CONFIG_TRIM_UNUSED_KSYMS), struct foo will be expanded
late, during baz's CRC calculation, which now has visibility over the
full struct definition, hence resulting in a different CRC for baz.
The proper fix for this certainly is in genksyms, but that will take me
some time to get right. In the meantime, we have seen one occurrence of
this in the ehci-hcd code which hits this problem because of the way it
includes C files halfway through the code together with an unlucky mix
of symbol trimming.
In order to workaround this, move the include done in ehci-hub.c early
in ehci-hcd.c, hence making sure the struct definitions are visible to
the entire file. This improves CRC stability of the ehci-hcd exports
even when symbol trimming is enabled.
Acked-by: Alan Stern <stern(a)rowland.harvard.edu>
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Quentin Perret <qperret(a)google.com>
Link: https://lore.kernel.org/r/20200916171825.3228122-1-qperret@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/host/ehci-hcd.c | 1 +
drivers/usb/host/ehci-hub.c | 1 -
2 files changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c
index 6257be4110ca..3575b7201881 100644
--- a/drivers/usb/host/ehci-hcd.c
+++ b/drivers/usb/host/ehci-hcd.c
@@ -22,6 +22,7 @@
#include <linux/interrupt.h>
#include <linux/usb.h>
#include <linux/usb/hcd.h>
+#include <linux/usb/otg.h>
#include <linux/moduleparam.h>
#include <linux/dma-mapping.h>
#include <linux/debugfs.h>
diff --git a/drivers/usb/host/ehci-hub.c b/drivers/usb/host/ehci-hub.c
index ce0eaf7d7c12..087402aec5cb 100644
--- a/drivers/usb/host/ehci-hub.c
+++ b/drivers/usb/host/ehci-hub.c
@@ -14,7 +14,6 @@
*/
/*-------------------------------------------------------------------------*/
-#include <linux/usb/otg.h>
#define PORT_WAKE_BITS (PORT_WKOC_E|PORT_WKDISC_E|PORT_WKCONN_E)
--
2.28.0
On Wed, Sep 16, 2020 at 11:48:42PM +0100, Andrew Cooper wrote:
> Every day is a school day.
Tell me about it...
> This is very definitely one to be filed in obscure x86 corner cases.
>
> The code snippet above is actually wrong for the kernel, as it uses one
> slot of the red-zone. Recompiling with -mno-red-zone makes something
> which looks safe stack-wise, give or take this behaviour.
Right, we recently disabled red zone in the early decompression stage,
for SEV-ES:
https://git.kernel.org/tip/6ba0efa46047936afa81460489cfd24bc95dd863
I probably should go audit that for similar funnies:
$ objdump -d arch/x86/boot/compressed/vmlinux | grep -E "pop.*\(%[er]?sp"
$
Nope, nothing. Because building your snippet with:
$ gcc -Wall -O2 -mno-red-zone -o flags{,.c}
still does use that one slot:
0000000000001050 <main>:
1050: 48 83 ec 18 sub $0x18,%rsp
1054: 48 8d 3d a9 0f 00 00 lea 0xfa9(%rip),%rdi # 2004 <_IO_stdin_used+0x4>
105b: 31 c0 xor %eax,%eax
105d: 9c pushfq
105e: 8f 44 24 08 popq 0x8(%rsp)
1062: 48 8b 74 24 08 mov 0x8(%rsp),%rsi
Wonder if that flag -mno-red-zone even does anything...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette