The patch titled
Subject: zram: fix race between backing_dev_show and backing_dev_store
has been removed from the -mm tree. Its filename was
zram-fix-race-between-backing_dev_show-and-backing_dev_store.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Chenwandun <chenwandun(a)huawei.com>
Subject: zram: fix race between backing_dev_show and backing_dev_store
CPU0:                               CPU1:
backing_dev_show                    backing_dev_store
    ......                              ......
    file = zram->backing_dev;
    down_read(&zram->init_lock);        down_write(&zram->init_lock);
    file_path(file, ...);               zram->backing_dev = backing_dev;
    up_read(&zram->init_lock);          up_write(&zram->init_lock);
backing_dev_show reads zram->backing_dev before taking init_lock, so it
can observe a value that is NULL at the beginning and non-NULL later.
backtrace:
[<ffffff8570e0f3ec>] d_path+0xcc/0x174
[<ffffff8570decd90>] file_path+0x10/0x18
[<ffffff85712f7630>] backing_dev_show+0x40/0xb4
[<ffffff85712c776c>] dev_attr_show+0x20/0x54
[<ffffff8570e835e4>] sysfs_kf_seq_show+0x9c/0x10c
[<ffffff8570e82b98>] kernfs_seq_show+0x28/0x30
[<ffffff8570e1c580>] seq_read+0x184/0x488
[<ffffff8570e81ec4>] kernfs_fop_read+0x5c/0x1a4
[<ffffff8570dee0fc>] __vfs_read+0x44/0x128
[<ffffff8570dee310>] vfs_read+0xa0/0x138
[<ffffff8570dee860>] SyS_read+0x54/0xb4
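A minimal sketch of the corrected read side (abridged from the diff
below; the only change is that the pointer is now loaded under
init_lock):

        static ssize_t backing_dev_show(struct device *dev,
                        struct device_attribute *attr, char *buf)
        {
                struct file *file;
                struct zram *zram = dev_to_zram(dev);
                /* ... */
                down_read(&zram->init_lock);
                file = zram->backing_dev;       /* snapshot taken under the lock */
                if (!file) {
                        memcpy(buf, "none\n", 5);
                        up_read(&zram->init_lock);
                        return 5;
                }
                /* ... */
        }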
Link: http://lkml.kernel.org/r/1571046839-16814-1-git-send-email-chenwandun@huawe…
Signed-off-by: Chenwandun <chenwandun(a)huawei.com>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work(a)gmail.com>
Cc: Jens Axboe <axboe(a)kernel.dk>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/block/zram/zram_drv.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- a/drivers/block/zram/zram_drv.c~zram-fix-race-between-backing_dev_show-and-backing_dev_store
+++ a/drivers/block/zram/zram_drv.c
@@ -413,13 +413,14 @@ static void reset_bdev(struct zram *zram
static ssize_t backing_dev_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
+ struct file *file;
struct zram *zram = dev_to_zram(dev);
- struct file *file = zram->backing_dev;
char *p;
ssize_t ret;
down_read(&zram->init_lock);
- if (!zram->backing_dev) {
+ file = zram->backing_dev;
+ if (!file) {
memcpy(buf, "none\n", 5);
up_read(&zram->init_lock);
return 5;
_
Patches currently in -mm which might be from chenwandun(a)huawei.com are
The patch titled
Subject: ocfs2: fix panic due to ocfs2_wq is null
has been removed from the -mm tree. Its filename was
ocfs2-fix-panic-due-to-ocfs2_wq-is-null.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Yi Li <yilikernel(a)gmail.com>
Subject: ocfs2: fix panic due to ocfs2_wq is null
mount.ocfs2 fails when an error is encountered while reading the ocfs2
filesystem superblock: ocfs2_initialize_super() returns before ocfs2_wq
has been allocated, and ocfs2_dismount_volume() then triggers the
following panic.
Oct 15 16:09:27 cnwarekv-205120 kernel: On-disk corruption
discovered. Please run fsck.ocfs2 once the filesystem is unmounted.
Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44):
ocfs2_read_locked_inode:537 ERROR: status = -30
Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44):
ocfs2_init_global_system_inodes:458 ERROR: status = -30
Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44):
ocfs2_init_global_system_inodes:491 ERROR: status = -30
Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44):
ocfs2_initialize_super:2313 ERROR: status = -30
Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44):
ocfs2_fill_super:1033 ERROR: status = -30
------------[ cut here ]------------
Oops: 0002 [#1] SMP NOPTI
Modules linked in: ocfs2 rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs fscache
lockd grace ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager
ocfs2_stackglue configfs sunrpc ipt_REJECT nf_reject_ipv4
nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT
nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack
ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 ovmapi ppdev
parport_pc parport fb_sys_fops sysimgblt sysfillrect syscopyarea
acpi_cpufreq pcspkr i2c_piix4 i2c_core sg ext4 jbd2 mbcache2 sr_mod cdrom
CPU: 1 PID: 11753 Comm: mount.ocfs2 Tainted: G E
4.14.148-200.ckv.x86_64 #1
Hardware name: Sugon H320-G30/35N16-US, BIOS 0SSDX017 12/21/2018
task: ffff967af0520000 task.stack: ffffa5f05484000
RIP: 0010:mutex_lock+0x19/0x20
Call Trace:
flush_workqueue+0x81/0x460
ocfs2_shutdown_local_alloc+0x47/0x440 [ocfs2]
ocfs2_dismount_volume+0x84/0x400 [ocfs2]
ocfs2_fill_super+0xa4/0x1270 [ocfs2]
? ocfs2_initialize_super.isa.211+0xf20/0xf20 [ocfs2]
mount_bdev+0x17f/0x1c0
mount_fs+0x3a/0x160
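Since osb->ocfs2_wq is still NULL when ocfs2_initialize_super() bails
out before allocating it, the fix is simply to guard both flushes; a
minimal sketch of the pattern applied in each call site:

        /* ocfs2_wq is only created once the superblock initializes fully */
        if (osb->ocfs2_wq)
                flush_workqueue(osb->ocfs2_wq);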
Link: http://lkml.kernel.org/r/1571139611-24107-1-git-send-email-yili@winhong.com
Signed-off-by: Yi Li <yilikernel(a)gmail.com>
Reviewed-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Gang He <ghe(a)suse.com>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/journal.c | 3 ++-
fs/ocfs2/localalloc.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/journal.c~ocfs2-fix-panic-due-to-ocfs2_wq-is-null
+++ a/fs/ocfs2/journal.c
@@ -217,7 +217,8 @@ void ocfs2_recovery_exit(struct ocfs2_su
/* At this point, we know that no more recovery threads can be
* launched, so wait for any recovery completion work to
* complete. */
- flush_workqueue(osb->ocfs2_wq);
+ if (osb->ocfs2_wq)
+ flush_workqueue(osb->ocfs2_wq);
/*
* Now that recovery is shut down, and the osb is about to be
--- a/fs/ocfs2/localalloc.c~ocfs2-fix-panic-due-to-ocfs2_wq-is-null
+++ a/fs/ocfs2/localalloc.c
@@ -377,7 +377,8 @@ void ocfs2_shutdown_local_alloc(struct o
struct ocfs2_dinode *alloc = NULL;
cancel_delayed_work(&osb->la_enable_wq);
- flush_workqueue(osb->ocfs2_wq);
+ if (osb->ocfs2_wq)
+ flush_workqueue(osb->ocfs2_wq);
if (osb->local_alloc_state == OCFS2_LA_UNUSED)
goto out;
_
Patches currently in -mm which might be from yilikernel(a)gmail.com are
The patch titled
Subject: hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic()
has been removed from the -mm tree. Its filename was
hugetlbfs-dont-access-uninitialized-memmaps-in-pfn_range_valid_gigantic.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: David Hildenbrand <david(a)redhat.com>
Subject: hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic()
Uninitialized memmaps contain garbage and in the worst case trigger kernel
BUGs, especially with CONFIG_PAGE_POISONING. They should not get touched.
Let's make sure that we only consider online memory (managed by the buddy)
that has initialized memmaps. ZONE_DEVICE is not applicable.
page_zone() will call page_to_nid(), which will trigger
VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING and
CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps. This can be
the case when an offline memory block (e.g., never onlined) is spanned by
a zone.
Note: As explained by Michal in [1], alloc_contig_range() will verify the
range. So it boils down to the wrong access in this function.
[1] http://lkml.kernel.org/r/20180423000943.GO17484@dhcp22.suse.cz
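pfn_to_online_page() returns NULL unless the pfn is valid and its
section is online, which guarantees an initialized memmap; a minimal
sketch of the corrected check (abridged from the diff below):

        for (i = start_pfn; i < end_pfn; i++) {
                page = pfn_to_online_page(i);
                if (!page)      /* offline => possibly uninitialized memmap */
                        return false;
                if (page_zone(page) != z)
                        return false;
                /* ... */
        }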
Link: http://lkml.kernel.org/r/20191015120717.4858-1-david@redhat.com
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Reported-by: Michal Hocko <mhocko(a)kernel.org>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: <stable(a)vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
--- a/mm/hugetlb.c~hugetlbfs-dont-access-uninitialized-memmaps-in-pfn_range_valid_gigantic
+++ a/mm/hugetlb.c
@@ -1084,11 +1084,10 @@ static bool pfn_range_valid_gigantic(str
struct page *page;
for (i = start_pfn; i < end_pfn; i++) {
- if (!pfn_valid(i))
+ page = pfn_to_online_page(i);
+ if (!page)
return false;
- page = pfn_to_page(i);
-
if (page_zone(page) != z)
return false;
_
Patches currently in -mm which might be from david(a)redhat.com are
mm-memory_hotplug-export-generic_online_page.patch
hv_balloon-use-generic_online_page.patch
mm-memory_hotplug-remove-__online_page_free-and-__online_page_increment_counters.patch
mm-memory_hotplug-dont-access-uninitialized-memmaps-in-shrink_zone_span.patch
mm-memory_hotplug-shrink-zones-when-offlining-memory.patch
mm-memory_hotplug-poison-memmap-in-remove_pfn_range_from_zone.patch
mm-memory_hotplug-we-always-have-a-zone-in-find_smallestbiggest_section_pfn.patch
mm-memory_hotplug-dont-check-for-all-holes-in-shrink_zone_span.patch
mm-memory_hotplug-drop-local-variables-in-shrink_zone_span.patch
mm-memory_hotplug-cleanup-__remove_pages.patch
The patch titled
Subject: mm: memblock: do not enforce current limit for memblock_phys* family
has been removed from the -mm tree. Its filename was
mm-memblock-do-not-enforce-current-limit-for-memblock_phys-family.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Mike Rapoport <rppt(a)linux.ibm.com>
Subject: mm: memblock: do not enforce current limit for memblock_phys* family
Until commit 92d12f9544b7 ("memblock: refactor internal allocation
functions"), the maximal address for memblock allocations was forced to
memblock.current_limit only for the allocation functions returning a
virtual address. The changes introduced by that commit moved the limit
enforcement into the allocation core, and as a result the allocation
functions returning a physical address also started to limit
allocations to memblock.current_limit.
This caused breakage of etnaviv GPU driver:
[ 3.682347] etnaviv etnaviv: bound 130000.gpu (ops gpu_ops)
[ 3.688669] etnaviv etnaviv: bound 134000.gpu (ops gpu_ops)
[ 3.695099] etnaviv etnaviv: bound 2204000.gpu (ops gpu_ops)
[ 3.700800] etnaviv-gpu 130000.gpu: model: GC2000, revision: 5108
[ 3.723013] etnaviv-gpu 130000.gpu: command buffer outside valid
memory window
[ 3.731308] etnaviv-gpu 134000.gpu: model: GC320, revision: 5007
[ 3.752437] etnaviv-gpu 134000.gpu: command buffer outside valid
memory window
[ 3.760583] etnaviv-gpu 2204000.gpu: model: GC355, revision: 1215
[ 3.766766] etnaviv-gpu 2204000.gpu: Ignoring GPU with VG and FE2.0
Restore the behaviour of the memblock_phys* family so that these
functions no longer enforce memblock.current_limit.
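With the fix, the clamping to memblock.current_limit is done only in
memblock_alloc_internal(), i.e. only for the virtual-address API; a
minimal sketch of the caller-visible behaviour (hypothetical sizes):

        /* physical API: no longer clamped to memblock.current_limit */
        phys_addr_t pa = memblock_phys_alloc_range(SZ_1M, SZ_1M, 0,
                                                   MEMBLOCK_ALLOC_ANYWHERE);

        /* virtual API: still clamped to memblock.current_limit */
        void *va = memblock_alloc(SZ_1M, SZ_1M);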
Link: http://lkml.kernel.org/r/1570915861-17633-1-git-send-email-rppt@kernel.org
Fixes: 92d12f9544b7 ("memblock: refactor internal allocation functions")
Signed-off-by: Mike Rapoport <rppt(a)linux.ibm.com>
Reported-by: Adam Ford <aford173(a)gmail.com>
Tested-by: Adam Ford <aford173(a)gmail.com> [imx6q-logicpd]
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Fabio Estevam <festevam(a)gmail.com>
Cc: Lucas Stach <l.stach(a)pengutronix.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memblock.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/mm/memblock.c~mm-memblock-do-not-enforce-current-limit-for-memblock_phys-family
+++ a/mm/memblock.c
@@ -1356,9 +1356,6 @@ static phys_addr_t __init memblock_alloc
align = SMP_CACHE_BYTES;
}
- if (end > memblock.current_limit)
- end = memblock.current_limit;
-
again:
found = memblock_find_in_range_node(size, align, start, end, nid,
flags);
@@ -1469,6 +1466,9 @@ static void * __init memblock_alloc_inte
if (WARN_ON_ONCE(slab_is_available()))
return kzalloc_node(size, GFP_NOWAIT, nid);
+ if (max_addr > memblock.current_limit)
+ max_addr = memblock.current_limit;
+
alloc = memblock_alloc_range_nid(size, align, min_addr, max_addr, nid);
/* retry allocation without lower limit */
_
Patches currently in -mm which might be from rppt(a)linux.ibm.com are
The patch titled
Subject: mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
has been removed from the -mm tree. Its filename was
mm-vmscan-get-number-of-pages-on-the-lru-list-in-memcgroup-base-on-lru_zone_size.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Honglei Wang <honglei.wang(a)oracle.com>
Subject: mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
1a61ab8038e72 ("mm: memcontrol: replace zone summing with
lruvec_page_state()") made lruvec_page_state() use per-cpu counters
instead of calculating the value directly from lru_zone_size, with the
idea that this would be more efficient. Tim has reported that this is
not really the case for their database benchmark, which shows the
opposite result: lruvec_page_state takes up a huge chunk of CPU cycles
(about 25% of the system time, which is roughly 7% of total cpu cycles)
on 5.3 kernels. The workload is running on a large machine (96 cpus),
has many cgroups (500), and is heavily direct-reclaim bound.
Tim Chen said:
: The problem can also be reproduced by running simple multi-threaded
: pmbench benchmark with a fast Optane SSD swap (see profile below).
:
:
: 6.15% 3.08% pmbench [kernel.vmlinux] [k] lruvec_lru_size
: |
: |--3.07%--lruvec_lru_size
: | |
: | |--2.11%--cpumask_next
: | | |
: | | --1.66%--find_next_bit
: | |
: | --0.57%--call_function_interrupt
: | |
: | --0.55%--smp_call_function_interrupt
: |
: |--1.59%--0x441f0fc3d009
: | _ops_rdtsc_init_base_freq
: | access_histogram
: | page_fault
: | __do_page_fault
: | handle_mm_fault
: | __handle_mm_fault
: | |
: | --1.54%--do_swap_page
: | swapin_readahead
: | swap_cluster_readahead
: | |
: | --1.53%--read_swap_cache_async
: | __read_swap_cache_async
: | alloc_pages_vma
: | __alloc_pages_nodemask
: | __alloc_pages_slowpath
: | try_to_free_pages
: | do_try_to_free_pages
: | shrink_node
: | shrink_node_memcg
: | |
: | |--0.77%--lruvec_lru_size
: | |
: | --0.76%--inactive_list_is_low
: | |
: | --0.76%--lruvec_lru_size
: |
: --1.50%--measure_read
: page_fault
: __do_page_fault
: handle_mm_fault
: __handle_mm_fault
: do_swap_page
: swapin_readahead
: swap_cluster_readahead
: |
: --1.48%--read_swap_cache_async
: __read_swap_cache_async
: alloc_pages_vma
: __alloc_pages_nodemask
: __alloc_pages_slowpath
: try_to_free_pages
: do_try_to_free_pages
: shrink_node
: shrink_node_memcg
: |
: |--0.75%--inactive_list_is_low
: | |
: | --0.75%--lruvec_lru_size
: |
: --0.73%--lruvec_lru_size
The likely culprit is the cache traffic that lruvec_page_state_local()
generates. Dave Hansen says:
: I was thinking purely of the cache footprint. If it's reading
: pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192
: bytes of cache *96 CPUs = 18k of data, mostly read-only. 1 cgroup would
: be 18k of data for the whole system and the caching would be pretty
: efficient and all 18k would probably survive a tight page fault loop in
: the L1. 500 cgroups would be ~90k of data per CPU thread which doesn't
: fit in the L1 and probably wouldn't survive a tight page fault loop if
: both logical threads were banging on different cgroups.
:
: It's just a theory, but it's why I noted the number of cgroups when I
: initially saw this show up in profiles
Fix the regression by partially reverting the said commit and
calculating the lru size explicitly.
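A minimal sketch of the resulting calculation (abridged from the diff
below): with memcg enabled, the lru size is summed from the per-zone
lru counters instead of being read from the per-cpu vmstat counters:

        if (!mem_cgroup_disabled()) {
                for (zid = 0; zid < MAX_NR_ZONES; zid++)
                        lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
        } else
                lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);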
Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
Signed-off-by: Honglei Wang <honglei.wang(a)oracle.com>
Reported-by: Tim Chen <tim.c.chen(a)linux.intel.com>
Acked-by: Tim Chen <tim.c.chen(a)linux.intel.com>
Tested-by: Tim Chen <tim.c.chen(a)linux.intel.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: <stable(a)vger.kernel.org> [5.2+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmscan.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
--- a/mm/vmscan.c~mm-vmscan-get-number-of-pages-on-the-lru-list-in-memcgroup-base-on-lru_zone_size
+++ a/mm/vmscan.c
@@ -351,12 +351,13 @@ unsigned long zone_reclaimable_pages(str
*/
unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
{
- unsigned long lru_size;
+ unsigned long lru_size = 0;
int zid;
- if (!mem_cgroup_disabled())
- lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
- else
+ if (!mem_cgroup_disabled()) {
+ for (zid = 0; zid < MAX_NR_ZONES; zid++)
+ lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
+ } else
lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
_
Patches currently in -mm which might be from honglei.wang(a)oracle.com are
The patch titled
Subject: mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release
has been removed from the -mm tree. Its filename was
mm-memcg-slab-fix-panic-in-__free_slab-caused-by-premature-memcg-pointer-release.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Roman Gushchin <guro(a)fb.com>
Subject: mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release
Karsten reported the following panic in __free_slab(), happening on an
s390x machine:
349.361168 Unable to handle kernel pointer dereference in virtual kernel address space
349.361210 Failing address: 0000000000000000 TEID: 0000000000000483
349.361223 Fault in home space mode while using kernel ASCE.
349.361240 AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
349.361340 Oops: 0004 ilc:3 [#1] PREEMPT SMP
349.361349 Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT \
nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle \
ip6table_raw ip6table_security iptable_at nf_nat
349.361436 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
349.361445 Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
349.361450 Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
349.361464 R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
349.361470 Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
349.361475 0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
349.361481 0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
349.361486 000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
349.361500 Krnl Code: 00000000003cada6: e310a1400004 lg %r1,320(%r10)
349.361500 00000000003cadac: c0e50046c286 brasl %r14,ca32b8
349.361500 #00000000003cadb2: a7f4fe36 brc 15,3caa1e
349.361500 >00000000003cadb6: e32060800024 stg %r2,128(%r6)
349.361500 00000000003cadbc: a7f4fd9e brc 15,3ca8f8
349.361500 00000000003cadc0: c0e50046790c brasl %r14,c99fd8
349.361500 00000000003cadc6: a7f4fe2c brc 15,3caa1e
349.361500 00000000003cadca: ecb1ffff00d9 aghik %r11,%r1,-1
349.361619 Call Trace:
349.361627 (<00000000003cabcc> __free_slab+0x49c/0x6b0)
349.361634 <00000000001f5886> rcu_core+0x5a6/0x7e0
349.361643 <0000000000ca2dea> __do_softirq+0xf2/0x5c0
349.361652 <0000000000152644> irq_exit+0x104/0x130
349.361659 <000000000010d222> do_IRQ+0x9a/0xf0
349.361667 <0000000000ca2344> ext_int_handler+0x130/0x134
349.361674 <0000000000103648> enabled_wait+0x58/0x128
349.361681 (<0000000000103634> enabled_wait+0x44/0x128)
349.361688 <0000000000103b00> arch_cpu_idle+0x40/0x58
349.361695 <0000000000ca0544> default_idle_call+0x3c/0x68
349.361704 <000000000018eaa4> do_idle+0xec/0x1c0
349.361748 <000000000018ee0e> cpu_startup_entry+0x36/0x40
349.361756 <000000000122df34> arch_call_rest_init+0x5c/0x88
349.361761 <0000000000000000> 0x0
349.361765 INFO: lockdep is turned off.
349.361769 Last Breaking-Event-Address:
349.361774 <00000000003ca8f4> __free_slab+0x1c4/0x6b0
349.361781 Kernel panic - not syncing: Fatal exception in interrupt
The kernel panics on an attempt to dereference the NULL memcg pointer.
When shutdown_cache() is called from the kmem_cache_destroy() context, a
memcg kmem_cache might have empty slab pages in a partial list, which are
still charged to the memory cgroup. These pages are released by
free_partial() at the beginning of shutdown_cache(): either directly or by
scheduling a RCU-delayed work (if the kmem_cache has the
SLAB_TYPESAFE_BY_RCU flag). The latter case is when the reported panic
can happen: memcg_unlink_cache() is called immediately after shrinking
partial lists, without waiting for scheduled RCU works. It sets the
kmem_cache->memcg_params.memcg pointer to NULL, and the following attempt
to dereference it by __free_slab() from the RCU work context causes the
panic.
To fix the issue, let's postpone the release of the memcg pointer to
destroy_memcg_params(). It's called from a separate work context by
slab_caches_to_rcu_destroy_workfn(), which contains a full RCU barrier.
This guarantees that all scheduled page release RCU works will complete
before the memcg pointer will be zeroed.
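A minimal sketch of the resulting teardown (abridged from the diff
below; destroy_memcg_params() runs from the work context that already
contains the full RCU barrier):

        static void destroy_memcg_params(struct kmem_cache *s)
        {
                if (is_root_cache(s)) {
                        kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
                } else {
                        /* all RCU-delayed __free_slab() works have completed */
                        mem_cgroup_put(s->memcg_params.memcg);
                        WRITE_ONCE(s->memcg_params.memcg, NULL);
                        percpu_ref_exit(&s->memcg_params.refcnt);
                }
        }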
Big thanks to Karsten for the perfect report containing all necessary
information, for his help with the analysis of the problem, and for
testing the fix.
Link: http://lkml.kernel.org/r/20191010160549.1584316-1-guro@fb.com
Fixes: fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
Signed-off-by: Roman Gushchin <guro(a)fb.com>
Reported-by: Karsten Graul <kgraul(a)linux.ibm.com>
Tested-by: Karsten Graul <kgraul(a)linux.ibm.com>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Shakeel Butt <shakeelb(a)google.com>
Cc: Karsten Graul <kgraul(a)linux.ibm.com>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slab_common.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
--- a/mm/slab_common.c~mm-memcg-slab-fix-panic-in-__free_slab-caused-by-premature-memcg-pointer-release
+++ a/mm/slab_common.c
@@ -178,10 +178,13 @@ static int init_memcg_params(struct kmem
static void destroy_memcg_params(struct kmem_cache *s)
{
- if (is_root_cache(s))
+ if (is_root_cache(s)) {
kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
- else
+ } else {
+ mem_cgroup_put(s->memcg_params.memcg);
+ WRITE_ONCE(s->memcg_params.memcg, NULL);
percpu_ref_exit(&s->memcg_params.refcnt);
+ }
}
static void free_memcg_params(struct rcu_head *rcu)
@@ -253,8 +256,6 @@ static void memcg_unlink_cache(struct km
} else {
list_del(&s->memcg_params.children_node);
list_del(&s->memcg_params.kmem_caches_node);
- mem_cgroup_put(s->memcg_params.memcg);
- WRITE_ONCE(s->memcg_params.memcg, NULL);
}
}
#else
_
Patches currently in -mm which might be from guro(a)fb.com are
The patch titled
Subject: mm/memunmap: don't access uninitialized memmap in memunmap_pages()
has been removed from the -mm tree. Its filename was
mm-memunmap-dont-access-uninitialized-memmap-in-memunmap_pages.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.ibm.com>
Subject: mm/memunmap: don't access uninitialized memmap in memunmap_pages()
Patch series "mm/memory_hotplug: Shrink zones before removing memory", v6.
This series fixes the access of uninitialized memmaps when shrinking
zones/nodes and when removing memory. Also, it contains all fixes for
crashes that can be triggered when removing certain namespaces using
memunmap_pages() (ZONE_DEVICE), as reported by Aneesh.
We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
more involved (we don't have SECTION_IS_ONLINE as an indicator), and
shrinking is only of limited use (set_zone_contiguous() cannot detect the
ZONE_DEVICE as contiguous).
We continue shrinking !ZONE_DEVICE zones; however, I reduced the amount
of code to a minimum. Shrinking is especially necessary to keep
zone->contiguous set where possible, e.g., on memory unplug of DIMMs at
zone boundaries.
--------------------------------------------------------------------------
Zones are now properly shrunk when offlining memory blocks or when
onlining failed. This allows zones to be shrunk properly on memory
unplug even if the separate memory blocks of a DIMM were onlined to
different zones or re-onlined to a different zone after offlining.
Example:
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 0
present 0
managed 0
:/# echo "online_movable" > /sys/devices/system/memory/memory41/state
:/# echo "online_movable" > /sys/devices/system/memory/memory43/state
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 98304
present 65536
managed 65536
:/# echo 0 > /sys/devices/system/memory/memory43/online
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 32768
present 32768
managed 32768
:/# echo 0 > /sys/devices/system/memory/memory41/online
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 0
present 0
managed 0
This patch (of 10):
With an altmap, the memmap pages falling into the reserved altmap space
are not initialized and therefore contain a garbage NID and a garbage
zone. Make sure to read the NID/zone from a memmap that was
initialized.
This fixes a kernel crash that is observed when destroying a namespace:
[ 81.356173] kernel BUG at include/linux/mm.h:1107!
cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
pc: c0000000004b9728: memunmap_pages+0x238/0x340
lr: c0000000004b9724: memunmap_pages+0x234/0x340
...
pid = 3669, comm = ndctl
kernel BUG at include/linux/mm.h:1107!
[c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
[c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
[c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
[c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
[c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
[c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
[c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
[c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
[c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
[c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
[c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
The "page_zone(pfn_to_page(pfn)" was introduced by 69324b8f4833 ("mm,
devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support"), however, I
think we will never have driver reserved memory with
MEMORY_DEVICE_PRIVATE (no altmap AFAIKS).
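A minimal sketch of the corrected lookup (abridged from the diff below;
pfn_first() accounts for the altmap reservation, so the memmap entry it
points at is initialized):

        /* make sure to access a memmap that was actually initialized */
        first_page = pfn_to_page(pfn_first(pgmap));
        nid = page_to_nid(first_page);
        /* ... */
        __remove_pages(page_zone(first_page), PHYS_PFN(res->start),
                       PHYS_PFN(resource_size(res)), NULL);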
[david(a)redhat.com: minimize code changes, rephrase description]
Link: http://lkml.kernel.org/r/20191006085646.5768-2-david@redhat.com
Fixes: 2c2a5af6fed2 ("mm, memory_hotplug: add nid parameter to arch_remove_memory")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: Logan Gunthorpe <logang(a)deltatee.com>
Cc: Ira Weiny <ira.weiny(a)intel.com>
Cc: Damian Tometzki <damian.tometzki(a)gmail.com>
Cc: Alexander Duyck <alexander.h.duyck(a)linux.intel.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Christian Borntraeger <borntraeger(a)de.ibm.com>
Cc: Christophe Leroy <christophe.leroy(a)c-s.fr>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Fenghua Yu <fenghua.yu(a)intel.com>
Cc: Gerald Schaefer <gerald.schaefer(a)de.ibm.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Halil Pasic <pasic(a)linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Jun Yao <yaojun8558363(a)gmail.com>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Mike Rapoport <rppt(a)linux.ibm.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Pankaj Gupta <pagupta(a)redhat.com>
Cc: Paul Mackerras <paulus(a)samba.org>
Cc: Pavel Tatashin <pasha.tatashin(a)soleen.com>
Cc: Pavel Tatashin <pavel.tatashin(a)microsoft.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Qian Cai <cai(a)lca.pw>
Cc: Rich Felker <dalias(a)libc.org>
Cc: Robin Murphy <robin.murphy(a)arm.com>
Cc: Steve Capper <steve.capper(a)arm.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Tom Lendacky <thomas.lendacky(a)amd.com>
Cc: Tony Luck <tony.luck(a)intel.com>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Wei Yang <richard.weiyang(a)gmail.com>
Cc: Wei Yang <richardw.yang(a)linux.intel.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Yoshinori Sato <ysato(a)users.sourceforge.jp>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memremap.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
--- a/mm/memremap.c~mm-memunmap-dont-access-uninitialized-memmap-in-memunmap_pages
+++ a/mm/memremap.c
@@ -123,6 +123,7 @@ static void dev_pagemap_cleanup(struct d
void memunmap_pages(struct dev_pagemap *pgmap)
{
struct resource *res = &pgmap->res;
+ struct page *first_page;
unsigned long pfn;
int nid;
@@ -131,14 +132,16 @@ void memunmap_pages(struct dev_pagemap *
put_page(pfn_to_page(pfn));
dev_pagemap_cleanup(pgmap);
+ /* make sure to access a memmap that was actually initialized */
+ first_page = pfn_to_page(pfn_first(pgmap));
+
/* pages are dead and unused, undo the arch mapping */
- nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
+ nid = page_to_nid(first_page);
mem_hotplug_begin();
if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
- pfn = PHYS_PFN(res->start);
- __remove_pages(page_zone(pfn_to_page(pfn)), pfn,
- PHYS_PFN(resource_size(res)), NULL);
+ __remove_pages(page_zone(first_page), PHYS_PFN(res->start),
+ PHYS_PFN(resource_size(res)), NULL);
} else {
arch_remove_memory(nid, res->start, resource_size(res),
pgmap_altmap(pgmap));
_
Patches currently in -mm which might be from aneesh.kumar(a)linux.ibm.com are
mm-pgmap-use-correct-alignment-when-looking-at-first-pfn-from-a-region.patch
mm-memmap_init-update-variable-name-in-memmap_init_zone.patch
The patch titled
Subject: mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()
has been removed from the -mm tree. Its filename was
mm-memory_hotplug-dont-access-uninitialized-memmaps-in-shrink_pgdat_span.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: David Hildenbrand <david(a)redhat.com>
Subject: mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()
We might use the nid of memmaps that were never initialized. For example,
if the memmap was poisoned, we will crash the kernel in pfn_to_nid() right
now. Let's use the calculated boundaries of the separate zones instead.
This now also avoids having to iterate over a whole bunch of subsections
again, after shrinking one zone.
Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug"),
the memmap was initialized to 0 and the node was set to the right value.
After that commit, the node might be garbage.
We'll have to fix shrink_zone_span() next.
Link: http://lkml.kernel.org/r/20191006085646.5768-4-david@redhat.com
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [d0dc12e86b319]
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Reported-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Pavel Tatashin <pasha.tatashin(a)soleen.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Wei Yang <richardw.yang(a)linux.intel.com>
Cc: Alexander Duyck <alexander.h.duyck(a)linux.intel.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Christian Borntraeger <borntraeger(a)de.ibm.com>
Cc: Christophe Leroy <christophe.leroy(a)c-s.fr>
Cc: Damian Tometzki <damian.tometzki(a)gmail.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Fenghua Yu <fenghua.yu(a)intel.com>
Cc: Gerald Schaefer <gerald.schaefer(a)de.ibm.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Halil Pasic <pasic(a)linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Ira Weiny <ira.weiny(a)intel.com>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: Jun Yao <yaojun8558363(a)gmail.com>
Cc: Logan Gunthorpe <logang(a)deltatee.com>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Mike Rapoport <rppt(a)linux.ibm.com>
Cc: Pankaj Gupta <pagupta(a)redhat.com>
Cc: Paul Mackerras <paulus(a)samba.org>
Cc: Pavel Tatashin <pavel.tatashin(a)microsoft.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Qian Cai <cai(a)lca.pw>
Cc: Rich Felker <dalias(a)libc.org>
Cc: Robin Murphy <robin.murphy(a)arm.com>
Cc: Steve Capper <steve.capper(a)arm.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Tom Lendacky <thomas.lendacky(a)amd.com>
Cc: Tony Luck <tony.luck(a)intel.com>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Wei Yang <richard.weiyang(a)gmail.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Yoshinori Sato <ysato(a)users.sourceforge.jp>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory_hotplug.c | 74 +++++++++---------------------------------
1 file changed, 16 insertions(+), 58 deletions(-)
--- a/mm/memory_hotplug.c~mm-memory_hotplug-dont-access-uninitialized-memmaps-in-shrink_pgdat_span
+++ a/mm/memory_hotplug.c
@@ -436,67 +436,25 @@ static void shrink_zone_span(struct zone
zone_span_writeunlock(zone);
}
-static void shrink_pgdat_span(struct pglist_data *pgdat,
- unsigned long start_pfn, unsigned long end_pfn)
+static void update_pgdat_span(struct pglist_data *pgdat)
{
- unsigned long pgdat_start_pfn = pgdat->node_start_pfn;
- unsigned long p = pgdat_end_pfn(pgdat); /* pgdat_end_pfn namespace clash */
- unsigned long pgdat_end_pfn = p;
- unsigned long pfn;
- int nid = pgdat->node_id;
-
- if (pgdat_start_pfn == start_pfn) {
- /*
- * If the section is smallest section in the pgdat, it need
- * shrink pgdat->node_start_pfn and pgdat->node_spanned_pages.
- * In this case, we find second smallest valid mem_section
- * for shrinking zone.
- */
- pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
- pgdat_end_pfn);
- if (pfn) {
- pgdat->node_start_pfn = pfn;
- pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
- }
- } else if (pgdat_end_pfn == end_pfn) {
- /*
- * If the section is biggest section in the pgdat, it need
- * shrink pgdat->node_spanned_pages.
- * In this case, we find second biggest valid mem_section for
- * shrinking zone.
- */
- pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
- start_pfn);
- if (pfn)
- pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
- }
-
- /*
- * If the section is not biggest or smallest mem_section in the pgdat,
- * it only creates a hole in the pgdat. So in this case, we need not
- * change the pgdat.
- * But perhaps, the pgdat has only hole data. Thus it check the pgdat
- * has only hole or not.
- */
- pfn = pgdat_start_pfn;
- for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SUBSECTION) {
- if (unlikely(!pfn_valid(pfn)))
- continue;
-
- if (pfn_to_nid(pfn) != nid)
- continue;
-
- /* Skip range to be removed */
- if (pfn >= start_pfn && pfn < end_pfn)
- continue;
+ unsigned long node_start_pfn = 0, node_end_pfn = 0;
+ struct zone *zone;
- /* If we find valid section, we have nothing to do */
- return;
+ for (zone = pgdat->node_zones;
+ zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
+ unsigned long zone_end_pfn = zone->zone_start_pfn +
+ zone->spanned_pages;
+
+ /* No need to lock the zones, they can't change. */
+ if (zone_end_pfn > node_end_pfn)
+ node_end_pfn = zone_end_pfn;
+ if (zone->zone_start_pfn < node_start_pfn)
+ node_start_pfn = zone->zone_start_pfn;
}
- /* The pgdat has no valid section */
- pgdat->node_start_pfn = 0;
- pgdat->node_spanned_pages = 0;
+ pgdat->node_start_pfn = node_start_pfn;
+ pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
}
static void __remove_zone(struct zone *zone, unsigned long start_pfn,
@@ -507,7 +465,7 @@ static void __remove_zone(struct zone *z
pgdat_resize_lock(zone->zone_pgdat, &flags);
shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
- shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
+ update_pgdat_span(pgdat);
pgdat_resize_unlock(zone->zone_pgdat, &flags);
}
_
Patches currently in -mm which might be from david(a)redhat.com are
mm-memory_hotplug-export-generic_online_page.patch
hv_balloon-use-generic_online_page.patch
mm-memory_hotplug-remove-__online_page_free-and-__online_page_increment_counters.patch
mm-memory_hotplug-dont-access-uninitialized-memmaps-in-shrink_zone_span.patch
mm-memory_hotplug-shrink-zones-when-offlining-memory.patch
mm-memory_hotplug-poison-memmap-in-remove_pfn_range_from_zone.patch
mm-memory_hotplug-we-always-have-a-zone-in-find_smallestbiggest_section_pfn.patch
mm-memory_hotplug-dont-check-for-all-holes-in-shrink_zone_span.patch
mm-memory_hotplug-drop-local-variables-in-shrink_zone_span.patch
mm-memory_hotplug-cleanup-__remove_pages.patch
The patch titled
Subject: mm/page_owner: don't access uninitialized memmaps when reading /proc/pagetypeinfo
has been removed from the -mm tree. Its filename was
mm-page_owner-dont-access-uninitialized-memmaps-when-reading-proc-pagetypeinfo.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Qian Cai <cai(a)lca.pw>
Subject: mm/page_owner: don't access uninitialized memmaps when reading /proc/pagetypeinfo
Uninitialized memmaps contain garbage and in the worst case trigger kernel
BUGs, especially with CONFIG_PAGE_POISONING. They should not get touched.
For example, when not onlining a memory block that is spanned by a zone
and reading /proc/pagetypeinfo with CONFIG_DEBUG_VM_PGFLAGS and
CONFIG_PAGE_POISONING, we can trigger a kernel BUG:
:/# echo 1 > /sys/devices/system/memory/memory40/online
:/# echo 1 > /sys/devices/system/memory/memory42/online
:/# cat /proc/pagetypeinfo > test.file
[ 42.489856] page:fffff2c585200000 is uninitialized and poisoned
[ 42.489861] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
[ 42.492235] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
[ 42.493501] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[ 42.494533] There is not page extension available.
[ 42.495358] ------------[ cut here ]------------
[ 42.496163] kernel BUG at include/linux/mm.h:1107!
[ 42.497069] invalid opcode: 0000 [#1] SMP NOPTI
Please note that this change does not affect ZONE_DEVICE, because
pagetypeinfo_showmixedcount_print() is called from
mm/vmstat.c:pagetypeinfo_showmixedcount() only for populated zones, and
ZONE_DEVICE is never populated (zone->present_pages always 0).
[david(a)redhat.com: move check to outer loop, add comment, rephrase description]
Link: http://lkml.kernel.org/r/20191011140638.8160-1-david@redhat.com
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319
Signed-off-by: Qian Cai <cai(a)lca.pw>
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: "Peter Zijlstra (Intel)" <peterz(a)infradead.org>
Cc: Miles Chen <miles.chen(a)mediatek.com>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Qian Cai <cai(a)lca.pw>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: <stable(a)vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_owner.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- a/mm/page_owner.c~mm-page_owner-dont-access-uninitialized-memmaps-when-reading-proc-pagetypeinfo
+++ a/mm/page_owner.c
@@ -271,7 +271,8 @@ void pagetypeinfo_showmixedcount_print(s
* not matter as the mixed block count will still be correct
*/
for (; pfn < end_pfn; ) {
- if (!pfn_valid(pfn)) {
+ page = pfn_to_online_page(pfn);
+ if (!page) {
pfn = ALIGN(pfn + 1, MAX_ORDER_NR_PAGES);
continue;
}
@@ -279,13 +280,13 @@ void pagetypeinfo_showmixedcount_print(s
block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
block_end_pfn = min(block_end_pfn, end_pfn);
- page = pfn_to_page(pfn);
pageblock_mt = get_pageblock_migratetype(page);
for (; pfn < block_end_pfn; pfn++) {
if (!pfn_valid_within(pfn))
continue;
+ /* The pageblock is online, no need to recheck. */
page = pfn_to_page(pfn);
if (page_zone(page) != zone)
_
Patches currently in -mm which might be from cai(a)lca.pw are
z3fold-add-inter-page-compaction-fix.patch
hugetlb-remove-unused-hstate-in-hugetlb_fault_mutex_hash-fix-fix.patch
The patch titled
Subject: mm/memory-failure.c: don't access uninitialized memmaps in memory_failure()
has been removed from the -mm tree. Its filename was
mm-memory-failurec-dont-access-uninitialized-memmaps-in-memory_failure.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: David Hildenbrand <david(a)redhat.com>
Subject: mm/memory-failure.c: don't access uninitialized memmaps in memory_failure()
We should use pfn_to_online_page() so that we do not access
uninitialized memmaps. Reshuffle the code so we don't have to
duplicate the error message.
Link: http://lkml.kernel.org/r/20191009142435.3975-3-david@redhat.com
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
Acked-by: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: <stable(a)vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory-failure.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
--- a/mm/memory-failure.c~mm-memory-failurec-dont-access-uninitialized-memmaps-in-memory_failure
+++ a/mm/memory-failure.c
@@ -1257,17 +1257,19 @@ int memory_failure(unsigned long pfn, in
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
- if (!pfn_valid(pfn)) {
+ p = pfn_to_online_page(pfn);
+ if (!p) {
+ if (pfn_valid(pfn)) {
+ pgmap = get_dev_pagemap(pfn, NULL);
+ if (pgmap)
+ return memory_failure_dev_pagemap(pfn, flags,
+ pgmap);
+ }
pr_err("Memory failure: %#lx: memory outside kernel control\n",
pfn);
return -ENXIO;
}
- pgmap = get_dev_pagemap(pfn, NULL);
- if (pgmap)
- return memory_failure_dev_pagemap(pfn, flags, pgmap);
-
- p = pfn_to_page(pfn);
if (PageHuge(p))
return memory_failure_hugetlb(pfn, flags);
if (TestSetPageHWPoison(p)) {
_
Patches currently in -mm which might be from david(a)redhat.com are
mm-memory_hotplug-export-generic_online_page.patch
hv_balloon-use-generic_online_page.patch
mm-memory_hotplug-remove-__online_page_free-and-__online_page_increment_counters.patch
mm-memory_hotplug-dont-access-uninitialized-memmaps-in-shrink_zone_span.patch
mm-memory_hotplug-shrink-zones-when-offlining-memory.patch
mm-memory_hotplug-poison-memmap-in-remove_pfn_range_from_zone.patch
mm-memory_hotplug-we-always-have-a-zone-in-find_smallestbiggest_section_pfn.patch
mm-memory_hotplug-dont-check-for-all-holes-in-shrink_zone_span.patch
mm-memory_hotplug-drop-local-variables-in-shrink_zone_span.patch
mm-memory_hotplug-cleanup-__remove_pages.patch