Hi,
there seems to be a subtle regression in the 6.1.y kernels. I had random crashes with pbuilder
running on 64-bit x86 (Intel hardware, but it also happens inside VMs) after Debian stable moved
to 6.1.115. At first glance, this looks like the usual GCC segfault caused by faulty hardware:
...
ENABLE_TRF_FOR_NS=0 -DENCRYPT_BL31=0 -DENCRYPT_BL32=0 -DERRATA_SPECULATIVE_AT=0 -DERROR
_DEPRECATED=0 -DFAULT_INJECTION_SUPPORT=0 -DGICV2_G0_FOR_EL3=1 -DHANDLE_EA_EL3_FIRST=0
-DHW_ASSISTED_COHERENCY=0 -DLOG_LEVEL=40 -DMEASURED_BOOT=0 -DNR_OF_FW_BANKS=2 -DNR_OF_I
MAGES_IN_FW_BANK=1 -DNS_TIMER_SWITCH=0 -DPL011_GENERIC_UART=0 -DPLAT_zynqmp -DPROGRAMMA
BLE_RESET_ADDRESS=1 -DPSA_FWU_SUPPORT=0 -DPSCI_EXTENDED_STATE_ID=1 -DRAS_EXTENSION=0 -D
RAS_TRAP_LOWER_EL_ERR_ACCESS=0 -DRECLAIM_INIT_CODE=0 -DRESET_TO_BL31=1 -DSDEI_IN_FCONF=
0 -DSEC_INT_DESC_IN_FCONF=0 -DSEPARATE_CODE_AND_RODATA=1 -DSEPARATE_NOBITS_REGION=0 -DS
PD_none -DSPIN_ON_BL1_EXIT=0 -DSPMD_SPM_AT_SEL2=1 -DSPM_MM=0 -DTRNG_SUPPORT=0 -DTRUSTED
_BOARD_BOOT=0 -DUSE_COHERENT_MEM=1 -DUSE_DEBUGFS=0 -DUSE_ROMLIB=0 -DUSE_SP804_TIMER=0 -
DUSE_SPINLOCK_CAS=0 -DUSE_TBBR_DEFS=1 -DWARMBOOT_ENABLE_DCACHE_EARLY=1 -Iinclude -Iincl
ude/arch/aarch64 -Iinclude/lib/cpus/aarch64 -Iinclude/lib/el3_runtime/aarch64 -Iinclude
/plat/arm/common/ -Iinclude/plat/arm/common/aarch64/ -Iplat/xilinx/common/include/ -Iplat/xilinx/common/ipi_mailbox_service/ -Iplat/xilinx/zynqmp/include/ -Iplat/xilinx/zynqmp/pm_service/ -Iinclude/lib/libfdt -Iinclude/lib/libc -Iinclude/lib/libc/aarch64 -nostdinc -Werror -Wall -Wmissing-include-dirs -Wunused -Wdisabled-optimization -Wvla -Wshadow -Wno-unused-parameter -Wredundant-decls -Wunused-but-set-variable -Wmaybe-uninitialized -Wpacked-bitfield-compat -Wshift-overflow=2 -Wlogical-op -Wno-error=deprecated-declarations -Wno-error=cpp -march=armv8-a -mgeneral-regs-only -mstrict-align -mbranch-protection=none -ffunction-sections -fdata-sections -ffreestanding -fno-builtin -fno-common -Os -std=gnu99 -fno-PIE -fno-stack-protector -fno-jump-tables -DIMAGE_AT_EL3 -DIMAGE_BL31 -Wp,-MD,/build/arm-trusted-firmware-kk-2.6-2022-2-kk/build/zynqmp/release/bl31/plat_psci.d -MT /build/arm-trusted-firmware-kk-2.6-2022-2-kk/build/zynqmp/release/bl31/plat_psci.o -MP -c plat/xilinx/zynqmp/plat_psci.c -o /build/arm-trusted-firmware-kk-2.6-2022-2-kk/build/zynqmp/release/bl31/plat_psci.o
make[2]: *** [Makefile:1251: /build/arm-trusted-firmware-kk-2.6-2022-2-kk/build/zynqmp/release/bl31/plat_psci.o] Segmentation fault
make[2]: *** Waiting for unfinished jobs....
...
(That's a pbuilder build of the ARM Trusted Firmware, but any other ARM64 application built with pbuilder
crashes sooner or later as well; not on the first or second run, though, usually after the third or fifth run.)
However, the crashes went away again when I switched back to 6.1.112 (the previous Debian stable kernel).
I git-bisected it down to this commit:
b0cde867b80a5e81fcbc0383e138f5845f2005ee is the first bad commit
commit b0cde867b80a5e81fcbc0383e138f5845f2005ee
Author: Kees Cook <keescook(a)chromium.org>
Date: Fri Feb 16 22:25:43 2024 -0800
x86: Increase brk randomness entropy for 64-bit systems
[ Upstream commit 44c76825d6eefee9eb7ce06c38e1a6632ac7eb7d ]
In commit c1d171a00294 ("x86: randomize brk"), arch_randomize_brk() was
defined to use a 32MB range (13 bits of entropy), but was never increased
when moving to 64-bit. The default arch_randomize_brk() uses 32MB for
32-bit tasks, and 1GB (18 bits of entropy) for 64-bit tasks.
Update x86_64 to match the entropy used by arm64 and other 64-bit
architectures.
Reported-by: y0un9n132(a)gmail.com
Signed-off-by: Kees Cook <keescook(a)chromium.org>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Acked-by: Jiri Kosina <jkosina(a)suse.com>
Closes: https://lore.kernel.org/linux-hardening/CA+2EKTVLvc8hDZc+2Yhwmus=dzOUG5E4gV…
Link: https://lore.kernel.org/r/20240217062545.1631668-1-keescook@chromium.org
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
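For reference, the entropy figures in that changelog follow from dividing the randomization range by the page size; a quick sketch of the arithmetic, assuming 4 KiB pages:
```c
#include <stdio.h>

/* illustration only: entropy bits = log2(range / PAGE_SIZE),
 * assuming 4 KiB pages as on typical x86-64 configurations */
int main(void)
{
	unsigned long page   = 4096;
	unsigned long sz_32m = 32UL << 20;	/* old x86-64 range */
	unsigned long sz_1g  = 1UL << 30;	/* new x86-64 range */

	printf("32M range: %d bits of entropy\n", __builtin_ctzl(sz_32m / page));
	printf("1G  range: %d bits of entropy\n", __builtin_ctzl(sz_1g / page));
	return 0;
}
```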
If I revert that commit like this:
-------------------------- arch/x86/kernel/process.c --------------------------
index acc83738bf5b..279b5e9be80f 100644
@@ -991,10 +991,7 @@ unsigned long arch_align_stack(unsigned long sp)
 unsigned long arch_randomize_brk(struct mm_struct *mm)
 {
-	if (mmap_is_ia32())
-		return randomize_page(mm->brk, SZ_32M);
-
-	return randomize_page(mm->brk, SZ_1G);
+	return randomize_page(mm->brk, 0x02000000);
 }
/*
With that revert, I can run pbuilder to compile ARM64 builds all day and it never crashes. I have no idea why
that change broke pbuilder; maybe it's related to the way qemu is used inside the ARM64 chroot
environment, but in my opinion it's a kernel regression.
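As a quick way to see the difference, the spread of the randomized program break can be sampled from userspace; an illustrative sketch only, to be run several times under 6.1.112 and 6.1.115 and the spreads compared:
```c
#include <stdio.h>
#include <unistd.h>

/* prints the current program break; across runs its placement varies
 * within the arch_randomize_brk() window (~32 MB before the patch,
 * ~1 GB after it on x86_64) */
int main(void)
{
	printf("brk = %p\n", sbrk(0));
	return 0;
}
```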
TIA,
Uli
The quilt patch titled
Subject: vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
has been removed from the -mm tree. Its filename was
vmstat-call-fold_vm_zone_numa_events-before-show-per-zone-numa-event.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: MengEn Sun <mengensun(a)tencent.com>
Subject: vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
Date: Fri, 1 Nov 2024 12:06:38 +0800
Since 5.14-rc1, NUMA events are only folded from per-CPU statistics into
per-zone and global statistics when the user actually needs them.
Currently, the kernel performs the fold operation when reading
/proc/vmstat, but does not perform it when reading /proc/zoneinfo.
This can lead to inaccuracies in the following statistics in zoneinfo:
- numa_hit
- numa_miss
- numa_foreign
- numa_interleave
- numa_local
- numa_other
Therefore, before printing per-zone vm_numa_event when reading
/proc/zoneinfo, we should also perform the fold operation.
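As a rough userspace model of the lazy-fold scheme (all names here are hypothetical, not the kernel API), the bug corresponds to one reader path skipping the fold:
```c
#include <stdio.h>

#define NCPU 4

static long percpu[NCPU];	/* hot-path increments land here */
static long total;		/* folded sum, only updated lazily */

static void fold(void)
{
	for (int cpu = 0; cpu < NCPU; cpu++) {
		total += percpu[cpu];
		percpu[cpu] = 0;
	}
}

int main(void)
{
	percpu[0] = 3;
	percpu[2] = 5;		/* pending per-CPU NUMA events */
	printf("reader without fold: %ld\n", total);	/* stale: 0 */
	fold();			/* what the /proc/vmstat path does, and
				 * what the /proc/zoneinfo path was missing */
	printf("reader with fold:    %ld\n", total);	/* exact: 8 */
	return 0;
}
```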
Link: https://lkml.kernel.org/r/1730433998-10461-1-git-send-email-mengensun@tence…
Fixes: f19298b9516c ("mm/vmstat: convert NUMA statistics to basic NUMA counters")
Signed-off-by: MengEn Sun <mengensun(a)tencent.com>
Reviewed-by: JinLiang Zheng <alexjlzheng(a)tencent.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmstat.c | 1 +
1 file changed, 1 insertion(+)
--- a/mm/vmstat.c~vmstat-call-fold_vm_zone_numa_events-before-show-per-zone-numa-event
+++ a/mm/vmstat.c
@@ -1780,6 +1780,7 @@ static void zoneinfo_show_print(struct s
zone_page_state(zone, i));
#ifdef CONFIG_NUMA
+ fold_vm_zone_numa_events(zone);
for (i = 0; i < NR_VM_NUMA_EVENT_ITEMS; i++)
seq_printf(m, "\n %-12s %lu", numa_stat_name(i),
zone_numa_event_state(zone, i));
_
Patches currently in -mm which might be from mengensun(a)tencent.com are
The quilt patch titled
Subject: ocfs2: uncache inode which has failed entering the group
has been removed from the -mm tree. Its filename was
ocfs2-uncache-inode-which-has-failed-entering-the-group.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Dmitry Antipov <dmantipov(a)yandex.ru>
Subject: ocfs2: uncache inode which has failed entering the group
Date: Thu, 14 Nov 2024 07:38:44 +0300
Syzbot has reported the following BUG:
kernel BUG at fs/ocfs2/uptodate.c:509!
...
Call Trace:
<TASK>
? __die_body+0x5f/0xb0
? die+0x9e/0xc0
? do_trap+0x15a/0x3a0
? ocfs2_set_new_buffer_uptodate+0x145/0x160
? do_error_trap+0x1dc/0x2c0
? ocfs2_set_new_buffer_uptodate+0x145/0x160
? __pfx_do_error_trap+0x10/0x10
? handle_invalid_op+0x34/0x40
? ocfs2_set_new_buffer_uptodate+0x145/0x160
? exc_invalid_op+0x38/0x50
? asm_exc_invalid_op+0x1a/0x20
? ocfs2_set_new_buffer_uptodate+0x2e/0x160
? ocfs2_set_new_buffer_uptodate+0x144/0x160
? ocfs2_set_new_buffer_uptodate+0x145/0x160
ocfs2_group_add+0x39f/0x15a0
? __pfx_ocfs2_group_add+0x10/0x10
? __pfx_lock_acquire+0x10/0x10
? mnt_get_write_access+0x68/0x2b0
? __pfx_lock_release+0x10/0x10
? rcu_read_lock_any_held+0xb7/0x160
? __pfx_rcu_read_lock_any_held+0x10/0x10
? smack_log+0x123/0x540
? mnt_get_write_access+0x68/0x2b0
? mnt_get_write_access+0x68/0x2b0
? mnt_get_write_access+0x226/0x2b0
ocfs2_ioctl+0x65e/0x7d0
? __pfx_ocfs2_ioctl+0x10/0x10
? smack_file_ioctl+0x29e/0x3a0
? __pfx_smack_file_ioctl+0x10/0x10
? lockdep_hardirqs_on_prepare+0x43d/0x780
? __pfx_lockdep_hardirqs_on_prepare+0x10/0x10
? __pfx_ocfs2_ioctl+0x10/0x10
__se_sys_ioctl+0xfb/0x170
do_syscall_64+0xf3/0x230
entry_SYSCALL_64_after_hwframe+0x77/0x7f
...
</TASK>
When 'ioctl(OCFS2_IOC_GROUP_ADD, ...)' has failed for the particular
inode in 'ocfs2_verify_group_and_input()', the corresponding buffer head
remains cached, and a subsequent call to the same 'ioctl()' for the same
inode triggers the BUG() in 'ocfs2_set_new_buffer_uptodate()' (when it
tries to cache the same buffer head of that inode). Fix this by uncaching
the buffer head with 'ocfs2_remove_from_cache()' on the error path in
'ocfs2_group_add()'.
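A toy model of the failure mode (hypothetical names, not the ocfs2 API): a cache that asserts on double insertion, where the fix amounts to removing the entry on the error path so a retry starts clean:
```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static bool cached;

static void cache_add(void)    { assert(!cached); cached = true; }  /* models the BUG() */
static void cache_remove(void) { cached = false; }

static int group_add(bool fail)
{
	cache_add();
	if (fail) {
		cache_remove();	/* models the added ocfs2_remove_from_cache() */
		return -1;
	}
	return 0;
}

int main(void)
{
	group_add(true);	/* first ioctl fails */
	group_add(false);	/* retry no longer trips the assertion */
	puts("no double-insert assertion tripped");
	return 0;
}
```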
Link: https://lkml.kernel.org/r/20241114043844.111847-1-dmantipov@yandex.ru
Fixes: 7909f2bf8353 ("[PATCH 2/2] ocfs2: Implement group add for online resize")
Signed-off-by: Dmitry Antipov <dmantipov(a)yandex.ru>
Reported-by: syzbot+453873f1588c2d75b447(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=453873f1588c2d75b447
Reviewed-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Dmitry Antipov <dmantipov(a)yandex.ru>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/resize.c | 2 ++
1 file changed, 2 insertions(+)
--- a/fs/ocfs2/resize.c~ocfs2-uncache-inode-which-has-failed-entering-the-group
+++ a/fs/ocfs2/resize.c
@@ -574,6 +574,8 @@ out_commit:
ocfs2_commit_trans(osb, handle);
out_free_group_bh:
+ if (ret < 0)
+ ocfs2_remove_from_cache(INODE_CACHE(inode), group_bh);
brelse(group_bh);
out_unlock:
_
Patches currently in -mm which might be from dmantipov(a)yandex.ru are
The quilt patch titled
Subject: mm: fix NULL pointer dereference in alloc_pages_bulk_noprof
has been removed from the -mm tree. Its filename was
mm-fix-null-pointer-dereference-in-alloc_pages_bulk_noprof.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Jinjiang Tu <tujinjiang(a)huawei.com>
Subject: mm: fix NULL pointer dereference in alloc_pages_bulk_noprof
Date: Wed, 13 Nov 2024 16:32:35 +0800
We triggered a NULL pointer dereference for ac.preferred_zoneref->zone in
alloc_pages_bulk_noprof() when the task is migrated between cpusets.
When cpuset is enabled, in prepare_alloc_pages(), ac->nodemask may be
&current->mems_allowed. When first_zones_zonelist() is called to find the
preferred_zoneref, ac->nodemask may be modified concurrently if the
task is migrated between different cpusets. Assuming we have 2 NUMA nodes,
the nodemask is 2 while traversing Node1 in ac->zonelist, and the nodemask
is 1 while traversing Node2, so ac->preferred_zoneref ends up pointing to
a NULL zone.
In alloc_pages_bulk_noprof(), for_each_zone_zonelist_nodemask() finds an
allowable zone and calls zonelist_node_idx(ac.preferred_zoneref), leading
to a NULL pointer dereference.
__alloc_pages_noprof() already handles this case by checking for a NULL
preferred zone, added in commit ea57485af8f4 ("mm, page_alloc: fix check
for NULL preferred_zone") and commit df76cee6bbeb ("mm, page_alloc:
remove redundant checks from alloc fastpath").
To fix it, check for a NULL preferred_zoneref->zone.
Link: https://lkml.kernel.org/r/20241113083235.166798-1-tujinjiang@huawei.com
Fixes: 387ba26fb1cb ("mm/page_alloc: add a bulk page allocator")
Signed-off-by: Jinjiang Tu <tujinjiang(a)huawei.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Alexander Lobakin <alobakin(a)pm.me>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Nanyong Sun <sunnanyong(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/page_alloc.c~mm-fix-null-pointer-dereference-in-alloc_pages_bulk_noprof
+++ a/mm/page_alloc.c
@@ -4607,7 +4607,8 @@ unsigned long alloc_pages_bulk_noprof(gf
gfp = alloc_gfp;
/* Find an allowed local zone that meets the low watermark. */
- for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
+ z = ac.preferred_zoneref;
+ for_next_zone_zonelist_nodemask(zone, z, ac.highest_zoneidx, ac.nodemask) {
unsigned long mark;
if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
_
Patches currently in -mm which might be from tujinjiang(a)huawei.com are
The quilt patch titled
Subject: fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args()
has been removed from the -mm tree. Its filename was
fs-proc-task_mmu-prevent-integer-overflow-in-pagemap_scan_get_args.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Dan Carpenter <dan.carpenter(a)linaro.org>
Subject: fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args()
Date: Thu, 14 Nov 2024 11:59:32 +0300
The "arg->vec_len" variable is a u64 that comes from the user at the start
of the function. The "arg->vec_len * sizeof(struct page_region))"
multiplication can lead to integer wrapping. Use size_mul() to avoid
that.
Also the size_add/mul() functions work on unsigned long so for 32bit
systems we need to ensure that "arg->vec_len" fits in an unsigned long.
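A minimal userspace sketch of the wrap; size_mul() here is a simplified stand-in for the helper in include/linux/overflow.h, and the element size is an assumed placeholder for sizeof(struct page_region):
```c
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* simplified stand-in for the kernel's size_mul(): saturate on overflow */
static size_t size_mul(size_t a, size_t b)
{
	size_t r;

	if (__builtin_mul_overflow(a, b, &r))
		return SIZE_MAX;
	return r;
}

int main(void)
{
	uint64_t vec_len = 0x2000000000000000ULL;	/* user-controlled u64 */
	size_t elem = 16;				/* assumed element size */

	/* the naive u64 product wraps around to a small value */
	printf("naive: %zu\n", (size_t)(vec_len * elem));
	/* saturating multiply makes access_ok() fail instead; note that on
	 * 32-bit the (size_t) cast itself truncates vec_len, which is why
	 * the fix also rejects vec_len > SIZE_MAX up front */
	printf("safe:  %zu\n", size_mul((size_t)vec_len, elem));
	return 0;
}
```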
Link: https://lkml.kernel.org/r/39d41335-dd4d-48ed-8a7f-402c57d8ea84@stanley.moun…
Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Dan Carpenter <dan.carpenter(a)linaro.org>
Cc: Andrei Vagin <avagin(a)google.com>
Cc: Andrii Nakryiko <andrii(a)kernel.org>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michał Mirosław <mirq-linux(a)rere.qmqm.pl>
Cc: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/proc/task_mmu.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--- a/fs/proc/task_mmu.c~fs-proc-task_mmu-prevent-integer-overflow-in-pagemap_scan_get_args
+++ a/fs/proc/task_mmu.c
@@ -2665,8 +2665,10 @@ static int pagemap_scan_get_args(struct
return -EFAULT;
if (!arg->vec && arg->vec_len)
return -EINVAL;
+ if (UINT_MAX == SIZE_MAX && arg->vec_len > SIZE_MAX)
+ return -EINVAL;
if (arg->vec && !access_ok((void __user *)(long)arg->vec,
- arg->vec_len * sizeof(struct page_region)))
+ size_mul(arg->vec_len, sizeof(struct page_region))))
return -EFAULT;
/* Fixup default values */
_
Patches currently in -mm which might be from dan.carpenter(a)linaro.org are
The quilt patch titled
Subject: sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers
has been removed from the -mm tree. Its filename was
sched-task_stack-fix-object_is_on_stack-for-kasan-tagged-pointers.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Qun-Wei Lin <qun-wei.lin(a)mediatek.com>
Subject: sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers
Date: Wed, 13 Nov 2024 12:25:43 +0800
When CONFIG_KASAN_SW_TAGS and CONFIG_KASAN_STACK are enabled, the
object_is_on_stack() function may produce incorrect results due to the
presence of tags in the obj pointer, while the stack pointer does not have
tags. This discrepancy can lead to incorrect stack object detection and
subsequently trigger warnings if CONFIG_DEBUG_OBJECTS is also enabled.
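A minimal userspace model of the mismatch, using the addresses from the warning below; the tag-reset operation and THREAD_SIZE value are illustrative assumptions (64-bit host assumed):
```c
#include <stdint.h>
#include <stdio.h>

#define THREAD_SIZE 0x8000ULL	/* assumed: 32 KiB arm64 stacks with KASAN */

/* rough model of kasan_reset_tag() for arm64 kernel pointers:
 * restore the canonical 0xff top byte that the software tag replaced */
static uint64_t reset_tag(uint64_t ptr)
{
	return ptr | (0xffULL << 56);
}

static int on_stack(uint64_t obj, uint64_t stack)
{
	return obj >= stack && obj < stack + THREAD_SIZE;
}

int main(void)
{
	uint64_t stack = 0xffff800082ea0000ULL;	/* untagged stack base */
	uint64_t obj   = 0x3eff800082ea7bb0ULL;	/* same object, tag 0x3e */

	printf("tagged compare: %s\n", on_stack(obj, stack)
	       ? "on stack" : "NOT on stack");	/* wrong: NOT on stack */
	printf("after reset:    %s\n", on_stack(reset_tag(obj), stack)
	       ? "on stack" : "NOT on stack");	/* correct: on stack */
	return 0;
}
```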
Example of the warning:
ODEBUG: object 3eff800082ea7bb0 is NOT on stack ffff800082ea0000, but annotated.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at lib/debugobjects.c:557 __debug_object_init+0x330/0x364
Modules linked in:
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc5 #4
Hardware name: linux,dummy-virt (DT)
pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __debug_object_init+0x330/0x364
lr : __debug_object_init+0x330/0x364
sp : ffff800082ea7b40
x29: ffff800082ea7b40 x28: 98ff0000c0164518 x27: 98ff0000c0164534
x26: ffff800082d93ec8 x25: 0000000000000001 x24: 1cff0000c00172a0
x23: 0000000000000000 x22: ffff800082d93ed0 x21: ffff800081a24418
x20: 3eff800082ea7bb0 x19: efff800000000000 x18: 0000000000000000
x17: 00000000000000ff x16: 0000000000000047 x15: 206b63617473206e
x14: 0000000000000018 x13: ffff800082ea7780 x12: 0ffff800082ea78e
x11: 0ffff800082ea790 x10: 0ffff800082ea79d x9 : 34d77febe173e800
x8 : 34d77febe173e800 x7 : 0000000000000001 x6 : 0000000000000001
x5 : feff800082ea74b8 x4 : ffff800082870a90 x3 : ffff80008018d3c4
x2 : 0000000000000001 x1 : ffff800082858810 x0 : 0000000000000050
Call trace:
__debug_object_init+0x330/0x364
debug_object_init_on_stack+0x30/0x3c
schedule_hrtimeout_range_clock+0xac/0x26c
schedule_hrtimeout+0x1c/0x30
wait_task_inactive+0x1d4/0x25c
kthread_bind_mask+0x28/0x98
init_rescuer+0x1e8/0x280
workqueue_init+0x1a0/0x3cc
kernel_init_freeable+0x118/0x200
kernel_init+0x28/0x1f0
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
ODEBUG: object 3eff800082ea7bb0 is NOT on stack ffff800082ea0000, but annotated.
------------[ cut here ]------------
Link: https://lkml.kernel.org/r/20241113042544.19095-1-qun-wei.lin@mediatek.com
Signed-off-by: Qun-Wei Lin <qun-wei.lin(a)mediatek.com>
Cc: Andrew Yang <andrew.yang(a)mediatek.com>
Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno(a)collabora.com>
Cc: Casper Li <casper.li(a)mediatek.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Chinwen Chang <chinwen.chang(a)mediatek.com>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Matthias Brugger <matthias.bgg(a)gmail.com>
Cc: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Cc: Shakeel Butt <shakeel.butt(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/sched/task_stack.h | 2 ++
1 file changed, 2 insertions(+)
--- a/include/linux/sched/task_stack.h~sched-task_stack-fix-object_is_on_stack-for-kasan-tagged-pointers
+++ a/include/linux/sched/task_stack.h
@@ -9,6 +9,7 @@
#include <linux/sched.h>
#include <linux/magic.h>
#include <linux/refcount.h>
+#include <linux/kasan.h>
#ifdef CONFIG_THREAD_INFO_IN_TASK
@@ -89,6 +90,7 @@ static inline int object_is_on_stack(con
{
void *stack = task_stack_page(current);
+ obj = kasan_reset_tag(obj);
return (obj >= stack) && (obj < (stack + THREAD_SIZE));
}
_
Patches currently in -mm which might be from qun-wei.lin(a)mediatek.com are
The quilt patch titled
Subject: crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32
has been removed from the -mm tree. Its filename was
crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Dave Vasilevsky <dave(a)vasilevsky.ca>
Subject: crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32
Date: Tue, 17 Sep 2024 12:37:20 -0400
Fixes boot failures on 6.9 on PPC_BOOK3S_32 machines using Open Firmware.
On these machines, the kernel refuses to boot from non-zero
PHYSICAL_START, which occurs when CRASH_DUMP is on.
Since most PPC_BOOK3S_32 machines boot via Open Firmware, it should
default to off for them. Users booting via some other mechanism can still
turn it on explicitly.
Does not change the default on any other architectures for the
time being.
Link: https://lkml.kernel.org/r/20240917163720.1644584-1-dave@vasilevsky.ca
Fixes: 75bc255a7444 ("crash: clean up kdump related config items")
Signed-off-by: Dave Vasilevsky <dave(a)vasilevsky.ca>
Reported-by: Reimar Döffinger <Reimar.Doeffinger(a)gmx.de>
Closes: https://lists.debian.org/debian-powerpc/2024/07/msg00001.html
Acked-by: Michael Ellerman <mpe(a)ellerman.id.au> [powerpc]
Acked-by: Baoquan He <bhe(a)redhat.com>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: John Paul Adrian Glaubitz <glaubitz(a)physik.fu-berlin.de>
Cc: Reimar Döffinger <Reimar.Doeffinger(a)gmx.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/arm/Kconfig | 3 +++
arch/arm64/Kconfig | 3 +++
arch/loongarch/Kconfig | 3 +++
arch/mips/Kconfig | 3 +++
arch/powerpc/Kconfig | 4 ++++
arch/riscv/Kconfig | 3 +++
arch/s390/Kconfig | 3 +++
arch/sh/Kconfig | 3 +++
arch/x86/Kconfig | 3 +++
kernel/Kconfig.kexec | 2 +-
10 files changed, 29 insertions(+), 1 deletion(-)
--- a/arch/arm64/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/arm64/Kconfig
@@ -1576,6 +1576,9 @@ config ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_S
config ARCH_SUPPORTS_CRASH_DUMP
def_bool y
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
config ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
def_bool CRASH_RESERVE
--- a/arch/arm/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/arm/Kconfig
@@ -1598,6 +1598,9 @@ config ATAGS_PROC
config ARCH_SUPPORTS_CRASH_DUMP
def_bool y
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
config AUTO_ZRELADDR
bool "Auto calculation of the decompressed kernel image address" if !ARCH_MULTIPLATFORM
default !(ARCH_FOOTBRIDGE || ARCH_RPC || ARCH_SA1100)
--- a/arch/loongarch/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/loongarch/Kconfig
@@ -604,6 +604,9 @@ config ARCH_SUPPORTS_KEXEC
config ARCH_SUPPORTS_CRASH_DUMP
def_bool y
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
config ARCH_SELECTS_CRASH_DUMP
def_bool y
depends on CRASH_DUMP
--- a/arch/mips/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/mips/Kconfig
@@ -2876,6 +2876,9 @@ config ARCH_SUPPORTS_KEXEC
config ARCH_SUPPORTS_CRASH_DUMP
def_bool y
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
config PHYSICAL_START
hex "Physical address where the kernel is loaded"
default "0xffffffff84000000"
--- a/arch/powerpc/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/powerpc/Kconfig
@@ -684,6 +684,10 @@ config RELOCATABLE_TEST
config ARCH_SUPPORTS_CRASH_DUMP
def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
+config ARCH_DEFAULT_CRASH_DUMP
+ bool
+ default y if !PPC_BOOK3S_32
+
config ARCH_SELECTS_CRASH_DUMP
def_bool y
depends on CRASH_DUMP
--- a/arch/riscv/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/riscv/Kconfig
@@ -898,6 +898,9 @@ config ARCH_SUPPORTS_KEXEC_PURGATORY
config ARCH_SUPPORTS_CRASH_DUMP
def_bool y
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
config ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
def_bool CRASH_RESERVE
--- a/arch/s390/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/s390/Kconfig
@@ -276,6 +276,9 @@ config ARCH_SUPPORTS_CRASH_DUMP
This option also enables s390 zfcpdump.
See also <file:Documentation/arch/s390/zfcpdump.rst>
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
menu "Processor type and features"
config HAVE_MARCH_Z10_FEATURES
--- a/arch/sh/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/sh/Kconfig
@@ -550,6 +550,9 @@ config ARCH_SUPPORTS_KEXEC
config ARCH_SUPPORTS_CRASH_DUMP
def_bool BROKEN_ON_SMP
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
config ARCH_SUPPORTS_KEXEC_JUMP
def_bool y
--- a/arch/x86/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/x86/Kconfig
@@ -2084,6 +2084,9 @@ config ARCH_SUPPORTS_KEXEC_JUMP
config ARCH_SUPPORTS_CRASH_DUMP
def_bool X86_64 || (X86_32 && HIGHMEM)
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
config ARCH_SUPPORTS_CRASH_HOTPLUG
def_bool y
--- a/kernel/Kconfig.kexec~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/kernel/Kconfig.kexec
@@ -97,7 +97,7 @@ config KEXEC_JUMP
config CRASH_DUMP
bool "kernel crash dumps"
- default y
+ default ARCH_DEFAULT_CRASH_DUMP
depends on ARCH_SUPPORTS_CRASH_DUMP
depends on KEXEC_CORE
select VMCORE_INFO
_
Patches currently in -mm which might be from dave(a)vasilevsky.ca are
The quilt patch titled
Subject: mm/mremap: fix address wraparound in move_page_tables()
has been removed from the -mm tree. Its filename was
mm-mremap-fix-address-wraparound-in-move_page_tables.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Jann Horn <jannh(a)google.com>
Subject: mm/mremap: fix address wraparound in move_page_tables()
Date: Mon, 11 Nov 2024 20:34:30 +0100
On 32-bit platforms, it is possible for the expression `len + old_addr <
old_end` to be false-positive if `len + old_addr` wraps around.
`old_addr` is the cursor in the old range up to which page table entries
have been moved; so if the operation succeeded, `old_addr` is the *end* of
the old region, and adding `len` to it can wrap.
The overflow causes mremap() to mistakenly believe that PTEs have been
copied; the consequence is that mremap() bails out, but doesn't move the
PTEs back before the new VMA is unmapped, causing anonymous pages in the
region to be lost. So basically if userspace tries to mremap() a
private-anon region and hits this bug, mremap() will return an error and
the private-anon region's contents appear to have been zeroed.
The idea of this check is that `old_end - len` is the original start
address, and writing the check that way also makes it easier to read; so
fix the check by rearranging the comparison accordingly.
(An alternate fix would be to refactor this function by introducing an
"orig_old_start" variable or such.)
Tested in a VM with a 32-bit X86 kernel; without the patch:
```
user@horn:~/big_mremap$ cat test.c
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <err.h>
#include <sys/mman.h>
#define ADDR1 ((void*)0x60000000)
#define ADDR2 ((void*)0x10000000)
#define SIZE 0x50000000uL
int main(void) {
  unsigned char *p1 = mmap(ADDR1, SIZE, PROT_READ|PROT_WRITE,
      MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED_NOREPLACE, -1, 0);
  if (p1 == MAP_FAILED)
    err(1, "mmap 1");
  unsigned char *p2 = mmap(ADDR2, SIZE, PROT_NONE,
      MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED_NOREPLACE, -1, 0);
  if (p2 == MAP_FAILED)
    err(1, "mmap 2");
  *p1 = 0x41;
  printf("first char is 0x%02hhx\n", *p1);
  unsigned char *p3 = mremap(p1, SIZE, SIZE,
      MREMAP_MAYMOVE|MREMAP_FIXED, p2);
  if (p3 == MAP_FAILED) {
    printf("mremap() failed; first char is 0x%02hhx\n", *p1);
  } else {
    printf("mremap() succeeded; first char is 0x%02hhx\n", *p3);
  }
}
user@horn:~/big_mremap$ gcc -static -o test test.c
user@horn:~/big_mremap$ setarch -R ./test
first char is 0x41
mremap() failed; first char is 0x00
```
With the patch:
```
user@horn:~/big_mremap$ setarch -R ./test
first char is 0x41
mremap() succeeded; first char is 0x41
```
Link: https://lkml.kernel.org/r/20241111-fix-mremap-32bit-wrap-v1-1-61d6be73b722@…
Fixes: af8ca1c14906 ("mm/mremap: optimize the start addresses in move_page_tables()")
Signed-off-by: Jann Horn <jannh(a)google.com>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Acked-by: Qi Zheng <zhengqi.arch(a)bytedance.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Cc: Joel Fernandes (Google) <joel(a)joelfernandes.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mremap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/mremap.c~mm-mremap-fix-address-wraparound-in-move_page_tables
+++ a/mm/mremap.c
@@ -648,7 +648,7 @@ again:
* Prevent negative return values when {old,new}_addr was realigned
* but we broke out of the above loop for the first PMD itself.
*/
- if (len + old_addr < old_end)
+ if (old_addr < old_end - len)
return 0;
return len + old_addr - old_end; /* how much done */
_
Patches currently in -mm which might be from jannh(a)google.com are
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x f8f931bba0f92052cf842b7e30917b1afcc77d5a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024111106-employer-bulgur-4f6d@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f8f931bba0f92052cf842b7e30917b1afcc77d5a Mon Sep 17 00:00:00 2001
From: Hugh Dickins <hughd(a)google.com>
Date: Sun, 27 Oct 2024 13:02:13 -0700
Subject: [PATCH] mm/thp: fix deferred split unqueue naming and locking
Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without). The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.
Before fixing locking: rename misleading folio_undo_large_rmappable(),
which does not undo large_rmappable, to folio_unqueue_deferred_split(),
which is what it does. But that and its out-of-line __callee are mm
internals of very limited usability: add comment and WARN_ON_ONCEs to
check usage; and return a bool to say if a deferred split was unqueued,
which can then be used in WARN_ON_ONCEs around safety checks (sparing
callers the arcane conditionals in __folio_unqueue_deferred_split()).
Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all
of whose callers now call it beforehand (and if any forget then bad_page()
will tell) - except for its caller put_pages_list(), which itself no
longer has any callers (and will be deleted separately).
Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
without checking and unqueueing a THP folio from deferred split list;
which is unfortunate, since the split_queue_lock depends on the memcg
(when memcg is enabled); so swapout has been unqueueing such THPs later,
when freeing the folio, using the pgdat's lock instead: potentially
corrupting the memcg's list. __remove_mapping() has frozen refcount to 0
here, so no problem with calling folio_unqueue_deferred_split() before
resetting memcg_data.
That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
shrinker memcg aware"): which included a check on swapcache before adding
to deferred queue, but no check on deferred queue before adding THP to
swapcache. That worked fine with the usual sequence of events in reclaim
(though there were a couple of rare ways in which a THP on deferred queue
could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split
underused THPs") avoids splitting underused THPs in reclaim, which makes
swapcache THPs on deferred queue commonplace.
Keep the check on swapcache before adding to deferred queue? Yes: it is
no longer essential, but preserves the existing behaviour, and is likely
to be a worthwhile optimization (vmstat showed much more traffic on the
queue under swapping load if the check was removed); update its comment.
Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
folio->memcg_data without checking and unqueueing a THP folio from the
deferred list, sometimes corrupting "from" memcg's list, like swapout.
Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
of course unqueues), or skip it if that fails. Not ideal, but moving
charge has been requested, and khugepaged should repair the THP later:
nobody wants new custom unqueueing code just for this deprecated case.
The 87eaceb3faa5 commit did have the code to move from one deferred list
to another (but was not conscious of its unsafety while refcount non-0);
but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care
deferred split queue in memcg charge move path"), which argued that the
existence of a PMD mapping guarantees that the THP cannot be on a deferred
list. As above, false in rare cases, and now commonly false.
Backport to 6.11 should be straightforward. Earlier backports must take
care that other _deferred_list fixes and dependencies are included. There
is not a strong case for backports, but they can fix cornercases.
Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com
Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
Fixes: dafff3f4c850 ("mm: split underused THPs")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Barry Song <baohua(a)kernel.org>
Cc: Chris Li <chrisl(a)kernel.org>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Nhat Pham <nphamcs(a)gmail.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Shakeel Butt <shakeel.butt(a)linux.dev>
Cc: Usama Arif <usamaarif642(a)gmail.com>
Cc: Wei Yang <richard.weiyang(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a1d345f1680c..03fd4bc39ea1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
return split_huge_page_to_list_to_order(&folio->page, list, ret);
}
-void __folio_undo_large_rmappable(struct folio *folio)
+/*
+ * __folio_unqueue_deferred_split() is not to be called directly:
+ * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
+ * limits its calls to those folios which may have a _deferred_list for
+ * queueing THP splits, and that list is (racily observed to be) non-empty.
+ *
+ * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
+ * zero: because even when split_queue_lock is held, a non-empty _deferred_list
+ * might be in use on deferred_split_scan()'s unlocked on-stack list.
+ *
+ * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
+ * therefore important to unqueue deferred split before changing folio memcg.
+ */
+bool __folio_unqueue_deferred_split(struct folio *folio)
{
struct deferred_split *ds_queue;
unsigned long flags;
+ bool unqueued = false;
+
+ WARN_ON_ONCE(folio_ref_count(folio));
+ WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
ds_queue = get_deferred_split_queue(folio);
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
@@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct folio *folio)
MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
}
list_del_init(&folio->_deferred_list);
+ unqueued = true;
}
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+
+ return unqueued; /* useful for debug warnings */
}
/* partially_mapped=false won't clear PG_partially_mapped folio flag */
@@ -3627,14 +3647,11 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
return;
/*
- * The try_to_unmap() in page reclaim path might reach here too,
- * this may cause a race condition to corrupt deferred split queue.
- * And, if page reclaim is already handling the same folio, it is
- * unnecessary to handle it again in shrinker.
- *
- * Check the swapcache flag to determine if the folio is being
- * handled by page reclaim since THP swap would add the folio into
- * swap cache before calling try_to_unmap().
+ * Exclude swapcache: originally to avoid a corrupt deferred split
+ * queue. Nowadays that is fully prevented by mem_cgroup_swapout();
+ * but if page reclaim is already handling the same folio, it is
+ * unnecessary to handle it again in the shrinker, so excluding
+ * swapcache here may still be a useful optimization.
*/
if (folio_test_swapcache(folio))
return;
diff --git a/mm/internal.h b/mm/internal.h
index 93083bbeeefa..16c1f3cd599e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
#endif
}
-void __folio_undo_large_rmappable(struct folio *folio);
-static inline void folio_undo_large_rmappable(struct folio *folio)
+bool __folio_unqueue_deferred_split(struct folio *folio);
+static inline bool folio_unqueue_deferred_split(struct folio *folio)
{
if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
- return;
+ return false;
/*
* At this point, there is no one trying to add the folio to
@@ -651,9 +651,9 @@ static inline void folio_undo_large_rmappable(struct folio *folio)
* to check without acquiring the split_queue_lock.
*/
if (data_race(list_empty(&folio->_deferred_list)))
- return;
+ return false;
- __folio_undo_large_rmappable(folio);
+ return __folio_unqueue_deferred_split(folio);
}
static inline struct folio *page_rmappable_folio(struct page *page)
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 81d8819f13cd..f8744f5630bb 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
css_get(&to->css);
css_put(&from->css);
+ /* Warning should never happen, so don't worry about refcount non-0 */
+ WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
folio->memcg_data = (unsigned long)to;
__folio_memcg_unlock(from);
@@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
enum mc_target_type target_type;
union mc_target target;
struct folio *folio;
+ bool tried_split_before = false;
+retry_pmd:
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
if (mc.precharge < HPAGE_PMD_NR) {
@@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
if (target_type == MC_TARGET_PAGE) {
folio = target.folio;
+ /*
+ * Deferred split queue locking depends on memcg,
+ * and unqueue is unsafe unless folio refcount is 0:
+ * split or skip if on the queue? first try to split.
+ */
+ if (!list_empty(&folio->_deferred_list)) {
+ spin_unlock(ptl);
+ if (!tried_split_before)
+ split_folio(folio);
+ folio_unlock(folio);
+ folio_put(folio);
+ if (tried_split_before)
+ return 0;
+ tried_split_before = true;
+ goto retry_pmd;
+ }
+ /*
+ * So long as that pmd lock is held, the folio cannot
+ * be racily added to the _deferred_list, because
+ * __folio_remove_rmap() will find !partially_mapped.
+ */
if (folio_isolate_lru(folio)) {
if (!mem_cgroup_move_account(folio, true,
mc.from, mc.to)) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2703227cce88..06df2af97415 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
struct obj_cgroup *objcg;
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
- VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
- !folio_test_hugetlb(folio) &&
- !list_empty(&folio->_deferred_list), folio);
/*
* Nobody should be changing or seriously looking at
@@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
ug->nr_memory += nr_pages;
ug->pgpgout++;
+ WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
folio->memcg_data = 0;
}
@@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
/* Transfer the charge and the css ref */
commit_charge(new, memcg);
+
+ /* Warning should never happen, so don't worry about refcount non-0 */
+ WARN_ON_ONCE(folio_unqueue_deferred_split(old));
old->memcg_data = 0;
}
@@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
VM_BUG_ON_FOLIO(oldid, folio);
mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
+ folio_unqueue_deferred_split(folio);
folio->memcg_data = 0;
if (!mem_cgroup_is_root(memcg))
diff --git a/mm/migrate.c b/mm/migrate.c
index fab84a776088..dfa24e41e8f9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -490,7 +490,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
folio_test_large_rmappable(folio)) {
if (!folio_ref_freeze(folio, expected_count))
return -EAGAIN;
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
folio_ref_unfreeze(folio, expected_count);
}
@@ -515,7 +515,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
}
/* Take off deferred split queue while frozen and memcg set */
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
/*
* Now we know that no one else is looking at the folio:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5e108ae755cc..8ad38cd5e574 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2681,7 +2681,6 @@ void free_unref_folios(struct folio_batch *folios)
unsigned long pfn = folio_pfn(folio);
unsigned int order = folio_order(folio);
- folio_undo_large_rmappable(folio);
if (!free_pages_prepare(&folio->page, order))
continue;
/*
diff --git a/mm/swap.c b/mm/swap.c
index 835bdf324b76..b8e3259ea2c4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -121,7 +121,7 @@ void __folio_put(struct folio *folio)
}
page_cache_release(folio);
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
mem_cgroup_uncharge(folio);
free_unref_page(&folio->page, folio_order(folio));
}
@@ -988,7 +988,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
free_huge_folio(folio);
continue;
}
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
__page_cache_release(folio, &lruvec, &flags);
if (j != i)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ddaaff67642e..28ba2b06fc7d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1476,7 +1476,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
*/
nr_reclaimed += nr_pages;
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
if (folio_batch_add(&free_folios, folio) == 0) {
mem_cgroup_uncharge_folios(&free_folios);
try_to_unmap_flush();
@@ -1864,7 +1864,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
if (unlikely(folio_put_testzero(folio))) {
__folio_clear_lru_flags(folio);
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
if (folio_batch_add(&free_folios, folio) == 0) {
spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_folios(&free_folios);
The following series is a backport of the CVE-2024-47674 fix "mm: avoid leaving
partial pfn mappings around in error case" to 5.10.
This required 3 extra commits to make sure all picks were clean. The
patchset shows no regression compared to the v5.4.285 tag.
Alex Zhang (1):
mm/memory.c: make remap_pfn_range() reject unaligned addr
Christoph Hellwig (1):
mm: add remap_pfn_range_notrack
WANG Wenhu (1):
mm: clarify a confusing comment for remap_pfn_range()
chenqiwu (1):
mm: fix ambiguous comments for better code readability
include/linux/mm.h | 2 ++
include/linux/mm_types.h | 4 +--
mm/memory.c | 54 +++++++++++++++++++++++++---------------
3 files changed, 38 insertions(+), 22 deletions(-)
--
2.46.0