From: Guo Ziliang <guo.ziliang(a)zte.com.cn>
Subject: mm: swap: get rid of deadloop in swapin readahead
In our testing, a task stuck in a deadloop was found. Sysrq stack dumps showed the
same stack every time, as follows:
__swap_duplicate+0x58/0x1a0
swapcache_prepare+0x24/0x30
__read_swap_cache_async+0xac/0x220
read_swap_cache_async+0x58/0xa0
swapin_readahead+0x24c/0x628
do_swap_page+0x374/0x8a0
__handle_mm_fault+0x598/0xd60
handle_mm_fault+0x114/0x200
do_page_fault+0x148/0x4d0
do_translation_fault+0xb0/0xd4
do_mem_abort+0x50/0xb0
The reason for the deadloop is that swapcache_prepare() always returns
-EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so the loop
can never be exited. We suspected that the task responsible for clearing
the SWAP_HAS_CACHE flag never got a chance to run. To verify this, we
lowered the priority of the task stuck in the deadloop so that the task
clearing SWAP_HAS_CACHE could run; the system then returned to normal.
In our testing, multiple real-time tasks were bound to the same core, and
the task in the deadloop was the highest-priority task on that core, so
the deadloop task could not be preempted.
Although __read_swap_cache_async() calls cond_resched(), cond_resched() is
effectively an empty function on a fully preemptible kernel and does not
give up the CPU here. A high-priority task only releases the CPU when it
is preempted by an even higher-priority task, and since the looping task
is already the highest-priority task on its core, no other task on that
core can be scheduled. We therefore replace cond_resched() with
schedule_timeout_uninterruptible(1). schedule_timeout_uninterruptible()
calls set_current_state() first to change the task state, so the task is
removed from the run queue, giving up the CPU and preventing it from
running in kernel mode for too long.
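A minimal sketch of the retry loop in __read_swap_cache_async() with the
replacement applied (swap-cache lookup, page allocation and the error
paths are simplified here; this is not the exact kernel source):

	for (;;) {
		int err = swapcache_prepare(entry);

		if (!err)
			break;			/* we now own SWAP_HAS_CACHE */
		if (err != -EEXIST)
			return NULL;		/* the swap entry was freed */

		/*
		 * Another task set SWAP_HAS_CACHE but has not yet added its
		 * page to the swap cache.  cond_resched() is a no-op on a
		 * fully preemptible kernel and cannot hand the CPU to a
		 * lower-priority task; schedule_timeout_uninterruptible(1)
		 * sets the task state and removes the task from the
		 * runqueue, so the SWAP_HAS_CACHE owner can finally run.
		 */
		schedule_timeout_uninterruptible(1);
	}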
(akpm: ugly hack becomes uglier. But it fixes the issue in a
backportable-to-stable fashion while we hopefully work on something
better)
Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
Signed-off-by: Guo Ziliang <guo.ziliang(a)zte.com.cn>
Reported-by: Zeal Robot <zealci(a)zte.com.cn>
Reviewed-by: Ran Xiaokai <ran.xiaokai(a)zte.com.cn>
Reviewed-by: Jiang Xuexin <jiang.xuexin(a)zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29(a)zte.com.cn>
Acked-by: Hugh Dickins <hughd(a)google.com>
Cc: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Roger Quadros <rogerq(a)kernel.org>
Cc: Ziliang Guo <guo.ziliang(a)zte.com.cn>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/swap_state.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/swap_state.c~mm-swap-get-rid-of-deadloop-in-swapin-readahead
+++ a/mm/swap_state.c
@@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp
* __read_swap_cache_async(), which has set SWAP_HAS_CACHE
* in swap_map, but not yet added its page to swap cache.
*/
- cond_resched();
+ schedule_timeout_uninterruptible(1);
}
/*
_
The patch titled
Subject: mm: fix panic in __alloc_pages
has been removed from the -mm tree. Its filename was
mm-fix-panic-in-__alloc_pages.patch
This patch was dropped because an alternative patch was merged
------------------------------------------------------
From: Alexey Makhalov <amakhalov(a)vmware.com>
Subject: mm: fix panic in __alloc_pages
There is a kernel panic caused by pcpu_alloc_pages() passing an offlined
and uninitialized node to alloc_pages_node(), which then NULL-dereferences
the uninitialized NODE_DATA(nid).
CPU2 has been hot-added
BUG: unable to handle page fault for address: 0000000000001608
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 1 Comm: systemd Tainted: G E 5.15.0-rc7+ #11
Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
RIP: 0010:__alloc_pages+0x127/0x290
Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
FS: 00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
Call Trace:
pcpu_alloc_pages.constprop.0+0xe4/0x1c0
pcpu_populate_chunk+0x33/0xb0
pcpu_alloc+0x4d3/0x6f0
__alloc_percpu_gfp+0xd/0x10
alloc_mem_cgroup_per_node_info+0x54/0xb0
mem_cgroup_alloc+0xed/0x2f0
mem_cgroup_css_alloc+0x33/0x2f0
css_create+0x3a/0x1f0
cgroup_apply_control_enable+0x12b/0x150
cgroup_mkdir+0xdd/0x110
kernfs_iop_mkdir+0x4f/0x80
vfs_mkdir+0x178/0x230
do_mkdirat+0xfd/0x120
__x64_sys_mkdir+0x47/0x70
? syscall_exit_to_user_mode+0x21/0x50
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
The panic can easily be reproduced by disabling the udev rule that
automatically onlines hot-added CPUs and then hot-adding a CPU that
belongs to a memoryless node (a NUMA node with CPUs only).
Hot-adding a CPU together with its memoryless node does not bring the node
online; a memoryless node is onlined only when its CPU is onlined.
A node can be in one of the following states:
1. not present (nid == NUMA_NO_NODE)
2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
NODE_DATA(nid) == NULL)
3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
NODE_DATA(nid) != NULL)
The percpu code does allocations for all possible CPUs. The issue happens
when it serves a hot-added but not yet onlined CPU whose node is in the
second state. Such a node is not ready to use; fall back to numa_mem_id()
instead.
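For illustration only (the helper name below is made up; the actual fix
open-codes the same check in pcpu_alloc_pages(), see the hunk below), the
three node states can be collapsed into a usable node id like this:

	/* Return a node id that is safe to pass to alloc_pages_node(). */
	static int usable_nid(int nid)
	{
		if (nid == NUMA_NO_NODE)	/* state 1: not present */
			return numa_mem_id();
		if (!node_online(nid))		/* state 2: present but offline,
						 * NODE_DATA(nid) may be NULL
						 */
			return numa_mem_id();
		return nid;			/* state 3: present and online */
	}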
Link: https://lkml.kernel.org/r/20211108202325.20304-1-amakhalov@vmware.com
Signed-off-by: Alexey Makhalov <amakhalov(a)vmware.com>
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Dennis Zhou <dennis(a)kernel.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/percpu-vm.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
--- a/mm/percpu-vm.c~mm-fix-panic-in-__alloc_pages
+++ a/mm/percpu-vm.c
@@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_
gfp_t gfp)
{
unsigned int cpu, tcpu;
- int i;
+ int i, nid;
gfp |= __GFP_HIGHMEM;
for_each_possible_cpu(cpu) {
+ nid = cpu_to_node(cpu);
+ if (nid == NUMA_NO_NODE || !node_online(nid))
+ nid = numa_mem_id();
+
for (i = page_start; i < page_end; i++) {
struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
- *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+ *pagep = alloc_pages_node(nid, gfp, 0);
if (!*pagep)
goto err;
}
_
Patches currently in -mm which might be from amakhalov(a)vmware.com are