Hello,
We ran automated tests on a patchset that was proposed for merging into this
kernel tree. The patches were applied to:
Kernel repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Commit: 365dab61f74e - Linux 5.3.7
The results of these automated tests are provided below.
Overall result: PASSED
Merge: OK
Compile: OK
Tests: OK
All kernel binaries, config files, and logs are available for download here:
https://artifacts.cki-project.org/pipelines/234049
Please reply to this email if you have any questions about the tests that we
ran or if you have any suggestions on how to make future tests more effective.
,-. ,-.
( C ) ( K ) Continuous
`-',-.`-' Kernel
( I ) Integration
`-'
______________________________________________________________________________
Merge testing
-------------
We cloned this repository and checked out the following commit:
Repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Commit: 365dab61f74e - Linux 5.3.7
We grabbed the 8790a8b4e158 commit of the stable queue repository.
We then merged the patchset with `git am`:
drm-free-the-writeback_job-when-it-with-an-empty-fb.patch
drm-clear-the-fence-pointer-when-writeback-job-signa.patch
clk-ti-dra7-fix-mcasp8-clock-bits.patch
arm-dts-fix-wrong-clocks-for-dra7-mcasp.patch
nvme-pci-fix-a-race-in-controller-removal.patch
scsi-ufs-skip-shutdown-if-hba-is-not-powered.patch
scsi-megaraid-disable-device-when-probe-failed-after.patch
scsi-qla2xxx-silence-fwdump-template-message.patch
scsi-qla2xxx-fix-unbound-sleep-in-fcport-delete-path.patch
scsi-qla2xxx-fix-stale-mem-access-on-driver-unload.patch
scsi-qla2xxx-fix-n2n-link-reset.patch
scsi-qla2xxx-fix-n2n-link-up-fail.patch
arm-dts-fix-gpio0-flags-for-am335x-icev2.patch
arm-omap2-fix-missing-reset-done-flag-for-am3-and-am.patch
arm-omap2-add-missing-lcdc-midlemode-for-am335x.patch
arm-omap2-fix-warnings-with-broken-omap2_set_init_vo.patch
nvme-tcp-fix-wrong-stop-condition-in-io_work.patch
nvme-pci-save-pci-state-before-putting-drive-into-de.patch
nvme-fix-an-error-code-in-nvme_init_subsystem.patch
nvme-rdma-fix-max_hw_sectors-calculation.patch
added-quirks-for-adata-xpg-sx8200-pro-512gb.patch
nvme-add-quirk-for-kingston-nvme-ssd-running-fw-e8fk.patch
nvme-allow-64-bit-results-in-passthru-commands.patch
drm-komeda-prevent-memory-leak-in-komeda_wb_connecto.patch
nvme-rdma-fix-possible-use-after-free-in-connect-tim.patch
blk-mq-honor-io-scheduler-for-multiqueue-devices.patch
ieee802154-ca8210-prevent-memory-leak.patch
arm-dts-am4372-set-memory-bandwidth-limit-for-dispc.patch
net-dsa-qca8k-use-up-to-7-ports-for-all-operations.patch
mips-dts-ar9331-fix-interrupt-controller-size.patch
xen-efi-set-nonblocking-callbacks.patch
loop-change-queue-block-size-to-match-when-using-dio.patch
nl80211-fix-null-pointer-dereference.patch
mac80211-fix-txq-null-pointer-dereference.patch
netfilter-nft_connlimit-disable-bh-on-garbage-collec.patch
net-mscc-ocelot-add-missing-of_node_put-after-callin.patch
net-dsa-rtl8366rb-add-missing-of_node_put-after-call.patch
net-stmmac-xgmac-not-all-unicast-addresses-may-be-av.patch
net-stmmac-dwmac4-always-update-the-mac-hash-filter.patch
net-stmmac-correctly-take-timestamp-for-ptpv2.patch
net-stmmac-do-not-stop-phy-if-wol-is-enabled.patch
net-ag71xx-fix-mdio-subnode-support.patch
risc-v-clear-load-reservations-while-restoring-hart-.patch
riscv-fix-memblock-reservation-for-device-tree-blob.patch
drm-amdgpu-fix-multiple-memory-leaks-in-acp_hw_init.patch
drm-amd-display-memory-leak.patch
mips-loongson-fix-the-link-time-qualifier-of-serial_.patch
net-hisilicon-fix-usage-of-uninitialized-variable-in.patch
net-stmmac-avoid-deadlock-on-suspend-resume.patch
selftests-kvm-fix-libkvm-build-error.patch
lib-textsearch-fix-escapes-in-example-code.patch
s390-mm-fix-wunused-but-set-variable-warnings.patch
r8152-set-macpassthru-in-reset_resume-callback.patch
net-phy-allow-for-reset-line-to-be-tied-to-a-sleepy-.patch
net-phy-fix-write-to-mii-ctrl1000-register.patch
namespace-fix-namespace.pl-script-to-support-relativ.patch
convert-filldir-64-from-__put_user-to-unsafe_put_use.patch
elf-don-t-use-map_fixed_noreplace-for-elf-executable.patch
Compile testing
---------------
We compiled the kernel for 3 architectures:
aarch64:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
ppc64le:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
x86_64:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
Hardware testing
----------------
We booted each kernel and ran the following tests:
aarch64:
Host 1:
✅ Boot test
✅ xfstests: xfs
✅ selinux-policy: serge-testsuite
✅ lvm thinp sanity
✅ storage: software RAID testing
🚧 ✅ Storage blktests
Host 2:
✅ Boot test
✅ Podman system integration test (as root)
✅ Podman system integration test (as user)
✅ LTP lite
✅ Loopdev Sanity
✅ jvm test suite
✅ AMTU (Abstract Machine Test Utility)
✅ LTP: openposix test suite
✅ Ethernet drivers sanity
✅ Networking socket: fuzz
✅ audit: audit testsuite test
✅ httpd: mod_ssl smoke sanity
✅ iotop: sanity
✅ tuned: tune-processes-through-perf
✅ Usex - version 1.9-29
✅ storage: SCSI VPD
✅ stress: stress-ng
🚧 ✅ POSIX pjd-fstest suites
ppc64le:
Host 1:
✅ Boot test
✅ Podman system integration test (as root)
✅ Podman system integration test (as user)
✅ LTP lite
✅ Loopdev Sanity
✅ jvm test suite
✅ AMTU (Abstract Machine Test Utility)
✅ LTP: openposix test suite
✅ Ethernet drivers sanity
✅ Networking socket: fuzz
✅ audit: audit testsuite test
✅ httpd: mod_ssl smoke sanity
✅ iotop: sanity
✅ tuned: tune-processes-through-perf
✅ Usex - version 1.9-29
🚧 ✅ POSIX pjd-fstest suites
Host 2:
✅ Boot test
✅ xfstests: xfs
✅ selinux-policy: serge-testsuite
✅ lvm thinp sanity
✅ storage: software RAID testing
🚧 ✅ Storage blktests
x86_64:
Host 1:
✅ Boot test
✅ Podman system integration test (as root)
✅ Podman system integration test (as user)
✅ LTP lite
✅ Loopdev Sanity
✅ jvm test suite
✅ AMTU (Abstract Machine Test Utility)
✅ LTP: openposix test suite
✅ Ethernet drivers sanity
✅ Networking socket: fuzz
✅ audit: audit testsuite test
✅ httpd: mod_ssl smoke sanity
✅ iotop: sanity
✅ tuned: tune-processes-through-perf
✅ pciutils: sanity smoke test
✅ Usex - version 1.9-29
✅ storage: SCSI VPD
✅ stress: stress-ng
🚧 ✅ POSIX pjd-fstest suites
Host 2:
✅ Boot test
✅ xfstests: xfs
✅ selinux-policy: serge-testsuite
✅ lvm thinp sanity
✅ storage: software RAID testing
🚧 ✅ Storage blktests
Test sources: https://github.com/CKI-project/tests-beaker
💚 Pull requests are welcome for new tests or improvements to existing tests!
Waived tests
------------
If the test run included waived tests, they are marked with 🚧. Such tests are
executed but their results are not taken into account. Tests are waived when
their results are not reliable enough, e.g. when they're just introduced or are
being fixed.
Testing timeout
---------------
We aim to provide a report within reasonable timeframe. Tests that haven't
finished running are marked with ⏱. Reports for non-upstream kernels have
a Beaker recipe linked to next to each host.
From: David Hildenbrand <david(a)redhat.com>
Subject: hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic()
Uninitialized memmaps contain garbage and in the worst case trigger kernel
BUGs, especially with CONFIG_PAGE_POISONING. They should not get touched.
Let's make sure that we only consider online memory (managed by the buddy)
that has initialized memmaps. ZONE_DEVICE is not applicable.
page_zone() will call page_to_nid(), which will trigger
VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING and
CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps. This can be
the case when an offline memory block (e.g., never onlined) is spanned by
a zone.
Note: As explained by Michal in [1], alloc_contig_range() will verify the
range. So it boils down to the wrong access in this function.
[1] http://lkml.kernel.org/r/20180423000943.GO17484@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20191015120717.4858-1-david@redhat.com
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Reported-by: Michal Hocko <mhocko(a)kernel.org>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: <stable(a)vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
--- a/mm/hugetlb.c~hugetlbfs-dont-access-uninitialized-memmaps-in-pfn_range_valid_gigantic
+++ a/mm/hugetlb.c
@@ -1084,11 +1084,10 @@ static bool pfn_range_valid_gigantic(str
struct page *page;
for (i = start_pfn; i < end_pfn; i++) {
- if (!pfn_valid(i))
+ page = pfn_to_online_page(i);
+ if (!page)
return false;
- page = pfn_to_page(i);
-
if (page_zone(page) != z)
return false;
_
From: Honglei Wang <honglei.wang(a)oracle.com>
Subject: mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
1a61ab8038e72 ("mm: memcontrol: replace zone summing with
lruvec_page_state()") has made lruvec_page_state to use per-cpu counters
instead of calculating it directly from lru_zone_size with an idea that
this would be more effective. Tim has reported that this is not really
the case for their database benchmark which is showing an opposite results
where lruvec_page_state is taking up a huge chunk of CPU cycles (about 25%
of the system time which is roughly 7% of total cpu cycles) on 5.3
kernels. The workload is running on a larger machine (96cpus), it has
many cgroups (500) and it is heavily direct reclaim bound.
Tim Chen said:
: The problem can also be reproduced by running simple multi-threaded
: pmbench benchmark with a fast Optane SSD swap (see profile below).
:
:
: 6.15% 3.08% pmbench [kernel.vmlinux] [k] lruvec_lru_size
: |
: |--3.07%--lruvec_lru_size
: | |
: | |--2.11%--cpumask_next
: | | |
: | | --1.66%--find_next_bit
: | |
: | --0.57%--call_function_interrupt
: | |
: | --0.55%--smp_call_function_interrupt
: |
: |--1.59%--0x441f0fc3d009
: | _ops_rdtsc_init_base_freq
: | access_histogram
: | page_fault
: | __do_page_fault
: | handle_mm_fault
: | __handle_mm_fault
: | |
: | --1.54%--do_swap_page
: | swapin_readahead
: | swap_cluster_readahead
: | |
: | --1.53%--read_swap_cache_async
: | __read_swap_cache_async
: | alloc_pages_vma
: | __alloc_pages_nodemask
: | __alloc_pages_slowpath
: | try_to_free_pages
: | do_try_to_free_pages
: | shrink_node
: | shrink_node_memcg
: | |
: | |--0.77%--lruvec_lru_size
: | |
: | --0.76%--inactive_list_is_low
: | |
: | --0.76%--lruvec_lru_size
: |
: --1.50%--measure_read
: page_fault
: __do_page_fault
: handle_mm_fault
: __handle_mm_fault
: do_swap_page
: swapin_readahead
: swap_cluster_readahead
: |
: --1.48%--read_swap_cache_async
: __read_swap_cache_async
: alloc_pages_vma
: __alloc_pages_nodemask
: __alloc_pages_slowpath
: try_to_free_pages
: do_try_to_free_pages
: shrink_node
: shrink_node_memcg
: |
: |--0.75%--inactive_list_is_low
: | |
: | --0.75%--lruvec_lru_size
: |
: --0.73%--lruvec_lru_size
The likely culprit is the cache traffic the lruvec_page_state_local
generates. Dave Hansen says:
: I was thinking purely of the cache footprint. If it's reading
: pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192
: bytes of cache *96 CPUs = 18k of data, mostly read-only. 1 cgroup would
: be 18k of data for the whole system and the caching would be pretty
: efficient and all 18k would probably survive a tight page fault loop in
: the L1. 500 cgroups would be ~90k of data per CPU thread which doesn't
: fit in the L1 and probably wouldn't survive a tight page fault loop if
: both logical threads were banging on different cgroups.
:
: It's just a theory, but it's why I noted the number of cgroups when I
: initially saw this show up in profiles
Fix the regression by partially reverting the said commit and calculate
the lru size explicitly.
Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
Signed-off-by: Honglei Wang <honglei.wang(a)oracle.com>
Reported-by: Tim Chen <tim.c.chen(a)linux.intel.com>
Acked-by: Tim Chen <tim.c.chen(a)linux.intel.com>
Tested-by: Tim Chen <tim.c.chen(a)linux.intel.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: <stable(a)vger.kernel.org> [5.2+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmscan.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
--- a/mm/vmscan.c~mm-vmscan-get-number-of-pages-on-the-lru-list-in-memcgroup-base-on-lru_zone_size
+++ a/mm/vmscan.c
@@ -351,12 +351,13 @@ unsigned long zone_reclaimable_pages(str
*/
unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
{
- unsigned long lru_size;
+ unsigned long lru_size = 0;
int zid;
- if (!mem_cgroup_disabled())
- lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
- else
+ if (!mem_cgroup_disabled()) {
+ for (zid = 0; zid < MAX_NR_ZONES; zid++)
+ lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
+ } else
lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
_
From: Roman Gushchin <guro(a)fb.com>
Subject: mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release
Karsten reported the following panic in __free_slab() happening on a s390x
machine:
349.361168 Unable to handle kernel pointer dereference in virtual kernel address space
349.361210 Failing address: 0000000000000000 TEID: 0000000000000483
349.361223 Fault in home space mode while using kernel ASCE.
349.361240 AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
349.361340 Oops: 0004 ilc:3 Ý#1¨ PREEMPT SMP
349.361349 Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT \
nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle \
ip6table_raw ip6table_security iptable_at nf_nat
349.361436 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
349.361445 Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
349.361450 Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
349.361464 R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
349.361470 Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
349.361475 0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
349.361481 0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
349.361481 0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
349.361486 000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
349.361500 Krnl Code: 00000000003cada6: e310a1400004 lg %r1,320(%r10)
349.361500 00000000003cadac: c0e50046c286 brasl %r14,ca32b8
349.361500 #00000000003cadb2: a7f4fe36 brc 15,3caa1e
349.361500 >00000000003cadb6: e32060800024 stg %r2,128(%r6)
349.361500 00000000003cadbc: a7f4fd9e brc 15,3ca8f8
349.361500 00000000003cadc0: c0e50046790c brasl %r14,c99fd8
349.361500 00000000003cadc6: a7f4fe2c brc 15,3caa
349.361500 00000000003cadc6: a7f4fe2c brc 15,3caa1e
349.361500 00000000003cadca: ecb1ffff00d9 aghik %r11,%r1,-1
349.361619 Call Trace:
349.361627 (<00000000003cabcc> __free_slab+0x49c/0x6b0)
349.361634 <00000000001f5886> rcu_core+0x5a6/0x7e0
349.361643 <0000000000ca2dea> __do_softirq+0xf2/0x5c0
349.361652 <0000000000152644> irq_exit+0x104/0x130
349.361659 <000000000010d222> do_IRQ+0x9a/0xf0
349.361667 <0000000000ca2344> ext_int_handler+0x130/0x134
349.361674 <0000000000103648> enabled_wait+0x58/0x128
349.361681 (<0000000000103634> enabled_wait+0x44/0x128)
349.361688 <0000000000103b00> arch_cpu_idle+0x40/0x58
349.361695 <0000000000ca0544> default_idle_call+0x3c/0x68
349.361704 <000000000018eaa4> do_idle+0xec/0x1c0
349.361748 <000000000018ee0e> cpu_startup_entry+0x36/0x40
349.361756 <000000000122df34> arch_call_rest_init+0x5c/0x88
349.361761 <0000000000000000> 0x0
349.361765 INFO: lockdep is turned off.
349.361769 Last Breaking-Event-Address:
349.361774 <00000000003ca8f4> __free_slab+0x1c4/0x6b0
349.361781 Kernel panic - not syncing: Fatal exception in interrupt
The kernel panics on an attempt to dereference the NULL memcg pointer.
When shutdown_cache() is called from the kmem_cache_destroy() context, a
memcg kmem_cache might have empty slab pages in a partial list, which are
still charged to the memory cgroup. These pages are released by
free_partial() at the beginning of shutdown_cache(): either directly or by
scheduling a RCU-delayed work (if the kmem_cache has the
SLAB_TYPESAFE_BY_RCU flag). The latter case is when the reported panic
can happen: memcg_unlink_cache() is called immediately after shrinking
partial lists, without waiting for scheduled RCU works. It sets the
kmem_cache->memcg_params.memcg pointer to NULL, and the following attempt
to dereference it by __free_slab() from the RCU work context causes the
panic.
To fix the issue, let's postpone the release of the memcg pointer to
destroy_memcg_params(). It's called from a separate work context by
slab_caches_to_rcu_destroy_workfn(), which contains a full RCU barrier.
This guarantees that all scheduled page release RCU works will complete
before the memcg pointer will be zeroed.
Big thanks for Karsten for the perfect report containing all necessary
information, his help with the analysis of the problem and testing of the
fix.
Link: http://lkml.kernel.org/r/20191010160549.1584316-1-guro@fb.com
Fixes: fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
Signed-off-by: Roman Gushchin <guro(a)fb.com>
Reported-by: Karsten Graul <kgraul(a)linux.ibm.com>
Tested-by: Karsten Graul <kgraul(a)linux.ibm.com>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Shakeel Butt <shakeelb(a)google.com>
Cc: Karsten Graul <kgraul(a)linux.ibm.com>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slab_common.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
--- a/mm/slab_common.c~mm-memcg-slab-fix-panic-in-__free_slab-caused-by-premature-memcg-pointer-release
+++ a/mm/slab_common.c
@@ -178,10 +178,13 @@ static int init_memcg_params(struct kmem
static void destroy_memcg_params(struct kmem_cache *s)
{
- if (is_root_cache(s))
+ if (is_root_cache(s)) {
kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
- else
+ } else {
+ mem_cgroup_put(s->memcg_params.memcg);
+ WRITE_ONCE(s->memcg_params.memcg, NULL);
percpu_ref_exit(&s->memcg_params.refcnt);
+ }
}
static void free_memcg_params(struct rcu_head *rcu)
@@ -253,8 +256,6 @@ static void memcg_unlink_cache(struct km
} else {
list_del(&s->memcg_params.children_node);
list_del(&s->memcg_params.kmem_caches_node);
- mem_cgroup_put(s->memcg_params.memcg);
- WRITE_ONCE(s->memcg_params.memcg, NULL);
}
}
#else
_
From: Qian Cai <cai(a)lca.pw>
Subject: mm/page_owner: don't access uninitialized memmaps when reading /proc/pagetypeinfo
Uninitialized memmaps contain garbage and in the worst case trigger kernel
BUGs, especially with CONFIG_PAGE_POISONING. They should not get touched.
For example, when not onlining a memory block that is spanned by a zone
and reading /proc/pagetypeinfo with CONFIG_DEBUG_VM_PGFLAGS and
CONFIG_PAGE_POISONING, we can trigger a kernel BUG:
:/# echo 1 > /sys/devices/system/memory/memory40/online
:/# echo 1 > /sys/devices/system/memory/memory42/online
:/# cat /proc/pagetypeinfo > test.file
[ 42.489856] page:fffff2c585200000 is uninitialized and poisoned
[ 42.489861] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
[ 42.492235] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
[ 42.493501] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[ 42.494533] There is not page extension available.
[ 42.495358] ------------[ cut here ]------------
[ 42.496163] kernel BUG at include/linux/mm.h:1107!
[ 42.497069] invalid opcode: 0000 [#1] SMP NOPTI
Please note that this change does not affect ZONE_DEVICE, because
pagetypeinfo_showmixedcount_print() is called from
mm/vmstat.c:pagetypeinfo_showmixedcount() only for populated zones, and
ZONE_DEVICE is never populated (zone->present_pages always 0).
[david(a)redhat.com: move check to outer loop, add comment, rephrase description]
Link: http://lkml.kernel.org/r/20191011140638.8160-1-david@redhat.com
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319
Signed-off-by: Qian Cai <cai(a)lca.pw>
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: "Peter Zijlstra (Intel)" <peterz(a)infradead.org>
Cc: Miles Chen <miles.chen(a)mediatek.com>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Qian Cai <cai(a)lca.pw>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: <stable(a)vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_owner.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- a/mm/page_owner.c~mm-page_owner-dont-access-uninitialized-memmaps-when-reading-proc-pagetypeinfo
+++ a/mm/page_owner.c
@@ -271,7 +271,8 @@ void pagetypeinfo_showmixedcount_print(s
* not matter as the mixed block count will still be correct
*/
for (; pfn < end_pfn; ) {
- if (!pfn_valid(pfn)) {
+ page = pfn_to_online_page(pfn);
+ if (!page) {
pfn = ALIGN(pfn + 1, MAX_ORDER_NR_PAGES);
continue;
}
@@ -279,13 +280,13 @@ void pagetypeinfo_showmixedcount_print(s
block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
block_end_pfn = min(block_end_pfn, end_pfn);
- page = pfn_to_page(pfn);
pageblock_mt = get_pageblock_migratetype(page);
for (; pfn < block_end_pfn; pfn++) {
if (!pfn_valid_within(pfn))
continue;
+ /* The pageblock is online, no need to recheck. */
page = pfn_to_page(pfn);
if (page_zone(page) != zone)
_
From: David Hildenbrand <david(a)redhat.com>
Subject: mm/memory-failure.c: don't access uninitialized memmaps in memory_failure()
We should check for pfn_to_online_page() to not access uninitialized
memmaps. Reshuffle the code so we don't have to duplicate the error
message.
Link: http://lkml.kernel.org/r/20191009142435.3975-3-david@redhat.com
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
Acked-by: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: <stable(a)vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory-failure.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
--- a/mm/memory-failure.c~mm-memory-failurec-dont-access-uninitialized-memmaps-in-memory_failure
+++ a/mm/memory-failure.c
@@ -1257,17 +1257,19 @@ int memory_failure(unsigned long pfn, in
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
- if (!pfn_valid(pfn)) {
+ p = pfn_to_online_page(pfn);
+ if (!p) {
+ if (pfn_valid(pfn)) {
+ pgmap = get_dev_pagemap(pfn, NULL);
+ if (pgmap)
+ return memory_failure_dev_pagemap(pfn, flags,
+ pgmap);
+ }
pr_err("Memory failure: %#lx: memory outside kernel control\n",
pfn);
return -ENXIO;
}
- pgmap = get_dev_pagemap(pfn, NULL);
- if (pgmap)
- return memory_failure_dev_pagemap(pfn, flags, pgmap);
-
- p = pfn_to_page(pfn);
if (PageHuge(p))
return memory_failure_hugetlb(pfn, flags);
if (TestSetPageHWPoison(p)) {
_
From: David Hildenbrand <david(a)redhat.com>
Subject: fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.c
There are three places where we access uninitialized memmaps, namely:
- /proc/kpagecount
- /proc/kpageflags
- /proc/kpagecgroup
We have initialized memmaps either when the section is online or when the
page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
garbage and in the worst case trigger kernel BUGs, especially with
CONFIG_PAGE_POISONING.
For example, not onlining a DIMM during boot and calling /proc/kpagecount
with CONFIG_PAGE_POISONING:
:/# cat /proc/kpagecount > tmp.test
[ 95.600592] BUG: unable to handle page fault for address: fffffffffffffffe
[ 95.601238] #PF: supervisor read access in kernel mode
[ 95.601675] #PF: error_code(0x0000) - not-present page
[ 95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
[ 95.602596] Oops: 0000 [#1] SMP NOPTI
[ 95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
[ 95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
[ 95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
[ 95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
[ 95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
[ 95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
[ 95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
[ 95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
[ 95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
[ 95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
[ 95.609924] FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
[ 95.610599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
[ 95.611686] Call Trace:
[ 95.611906] proc_reg_read+0x3c/0x60
[ 95.612228] vfs_read+0xc5/0x180
[ 95.612505] ksys_read+0x68/0xe0
[ 95.612785] do_syscall_64+0x5c/0xa0
[ 95.613092] entry_SYSCALL_64_after_hwframe+0x49/0xbe
For now, let's drop support for ZONE_DEVICE from the three pseudo files in
order to fix this. To distinguish offline memory (with garbage memmap)
from ZONE_DEVICE memory with properly initialized memmaps, we would have
to check get_dev_pagemap() and pfn_zone_device_reserved() right now. The
usage of both (especially, special casing devmem) is frowned upon and
needs to be reworked. The fundamental issue we have is:
if (pfn_to_online_page(pfn)) {
/* memmap initialized */
} else if (pfn_valid(pfn)) {
/*
* ???
* a) offline memory. memmap garbage.
* b) devmem: memmap initialized to ZONE_DEVICE.
* c) devmem: reserved for driver. memmap garbage.
* (d) devmem: memmap currently initializing - garbage)
*/
}
We'll leave the pfn_zone_device_reserved() check in stable_page_flags() in
place as that function is also used from memory failure. We now no longer
dump information about pages that are not in use anymore - offline.
Link: http://lkml.kernel.org/r/20191009142435.3975-2-david@redhat.com
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Reported-by: Qian Cai <cai(a)lca.pw>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Alexey Dobriyan <adobriyan(a)gmail.com>
Cc: Stephen Rothwell <sfr(a)canb.auug.org.au>
Cc: Toshiki Fukasawa <t-fukasawa(a)vx.jp.nec.com>
Cc: Pankaj gupta <pagupta(a)redhat.com>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Anthony Yznaga <anthony.yznaga(a)oracle.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/proc/page.c | 28 ++++++++++++++++------------
1 file changed, 16 insertions(+), 12 deletions(-)
--- a/fs/proc/page.c~mm-dont-access-uninitialized-memmaps-in-fs-proc-pagec
+++ a/fs/proc/page.c
@@ -42,10 +42,12 @@ static ssize_t kpagecount_read(struct fi
return -EINVAL;
while (count > 0) {
- if (pfn_valid(pfn))
- ppage = pfn_to_page(pfn);
- else
- ppage = NULL;
+ /*
+ * TODO: ZONE_DEVICE support requires to identify
+ * memmaps that were actually initialized.
+ */
+ ppage = pfn_to_online_page(pfn);
+
if (!ppage || PageSlab(ppage) || page_has_type(ppage))
pcount = 0;
else
@@ -216,10 +218,11 @@ static ssize_t kpageflags_read(struct fi
return -EINVAL;
while (count > 0) {
- if (pfn_valid(pfn))
- ppage = pfn_to_page(pfn);
- else
- ppage = NULL;
+ /*
+ * TODO: ZONE_DEVICE support requires to identify
+ * memmaps that were actually initialized.
+ */
+ ppage = pfn_to_online_page(pfn);
if (put_user(stable_page_flags(ppage), out)) {
ret = -EFAULT;
@@ -261,10 +264,11 @@ static ssize_t kpagecgroup_read(struct f
return -EINVAL;
while (count > 0) {
- if (pfn_valid(pfn))
- ppage = pfn_to_page(pfn);
- else
- ppage = NULL;
+ /*
+ * TODO: ZONE_DEVICE support requires to identify
+ * memmaps that were actually initialized.
+ */
+ ppage = pfn_to_online_page(pfn);
if (ppage)
ino = page_cgroup_ino(ppage);
_