The quilt patch titled
Subject: sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers
has been removed from the -mm tree. Its filename was
sched-task_stack-fix-object_is_on_stack-for-kasan-tagged-pointers.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Qun-Wei Lin <qun-wei.lin(a)mediatek.com>
Subject: sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers
Date: Wed, 13 Nov 2024 12:25:43 +0800
When CONFIG_KASAN_SW_TAGS and CONFIG_KASAN_STACK are enabled,
object_is_on_stack() may produce incorrect results because the obj pointer
carries a KASAN tag while the stack pointer it is compared against does not.
This discrepancy can lead to incorrect stack-object detection and
subsequently trigger warnings if CONFIG_DEBUG_OBJECTS is also enabled.
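For illustration only, a minimal user-space sketch of the failing comparison
(not the kernel implementation; the 32 KiB THREAD_SIZE, the reset_tag()
helper, and the 0xff native tag are assumptions mirroring what
kasan_reset_tag() does on arm64 with software tags):
```
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for kasan_reset_tag() under CONFIG_KASAN_SW_TAGS:
 * force the top byte (the tag) back to the native 0xff kernel-pointer tag. */
static uint64_t reset_tag(uint64_t addr)
{
	return addr | (0xffULL << 56);
}

int main(void)
{
	uint64_t stack = 0xffff800082ea0000ULL;   /* untagged task_stack_page() */
	uint64_t size  = 0x8000ULL;               /* assumed THREAD_SIZE (32 KiB) */
	uint64_t obj   = 0x3eff800082ea7bb0ULL;   /* tagged on-stack object (tag 0x3e) */

	/* Naive range check fails: the tag makes obj compare below the stack base. */
	printf("tagged:   on stack = %d\n", obj >= stack && obj < stack + size);

	/* After stripping the tag, the object is correctly reported as on the stack. */
	printf("untagged: on stack = %d\n",
	       reset_tag(obj) >= stack && reset_tag(obj) < stack + size);
	return 0;
}
```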
Example of the warning:
ODEBUG: object 3eff800082ea7bb0 is NOT on stack ffff800082ea0000, but annotated.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at lib/debugobjects.c:557 __debug_object_init+0x330/0x364
Modules linked in:
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc5 #4
Hardware name: linux,dummy-virt (DT)
pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __debug_object_init+0x330/0x364
lr : __debug_object_init+0x330/0x364
sp : ffff800082ea7b40
x29: ffff800082ea7b40 x28: 98ff0000c0164518 x27: 98ff0000c0164534
x26: ffff800082d93ec8 x25: 0000000000000001 x24: 1cff0000c00172a0
x23: 0000000000000000 x22: ffff800082d93ed0 x21: ffff800081a24418
x20: 3eff800082ea7bb0 x19: efff800000000000 x18: 0000000000000000
x17: 00000000000000ff x16: 0000000000000047 x15: 206b63617473206e
x14: 0000000000000018 x13: ffff800082ea7780 x12: 0ffff800082ea78e
x11: 0ffff800082ea790 x10: 0ffff800082ea79d x9 : 34d77febe173e800
x8 : 34d77febe173e800 x7 : 0000000000000001 x6 : 0000000000000001
x5 : feff800082ea74b8 x4 : ffff800082870a90 x3 : ffff80008018d3c4
x2 : 0000000000000001 x1 : ffff800082858810 x0 : 0000000000000050
Call trace:
__debug_object_init+0x330/0x364
debug_object_init_on_stack+0x30/0x3c
schedule_hrtimeout_range_clock+0xac/0x26c
schedule_hrtimeout+0x1c/0x30
wait_task_inactive+0x1d4/0x25c
kthread_bind_mask+0x28/0x98
init_rescuer+0x1e8/0x280
workqueue_init+0x1a0/0x3cc
kernel_init_freeable+0x118/0x200
kernel_init+0x28/0x1f0
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
ODEBUG: object 3eff800082ea7bb0 is NOT on stack ffff800082ea0000, but annotated.
------------[ cut here ]------------
Link: https://lkml.kernel.org/r/20241113042544.19095-1-qun-wei.lin@mediatek.com
Signed-off-by: Qun-Wei Lin <qun-wei.lin(a)mediatek.com>
Cc: Andrew Yang <andrew.yang(a)mediatek.com>
Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno(a)collabora.com>
Cc: Casper Li <casper.li(a)mediatek.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Chinwen Chang <chinwen.chang(a)mediatek.com>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Matthias Brugger <matthias.bgg(a)gmail.com>
Cc: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Cc: Shakeel Butt <shakeel.butt(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/sched/task_stack.h | 2 ++
1 file changed, 2 insertions(+)
--- a/include/linux/sched/task_stack.h~sched-task_stack-fix-object_is_on_stack-for-kasan-tagged-pointers
+++ a/include/linux/sched/task_stack.h
@@ -9,6 +9,7 @@
#include <linux/sched.h>
#include <linux/magic.h>
#include <linux/refcount.h>
+#include <linux/kasan.h>
#ifdef CONFIG_THREAD_INFO_IN_TASK
@@ -89,6 +90,7 @@ static inline int object_is_on_stack(con
{
void *stack = task_stack_page(current);
+ obj = kasan_reset_tag(obj);
return (obj >= stack) && (obj < (stack + THREAD_SIZE));
}
_
Patches currently in -mm which might be from qun-wei.lin(a)mediatek.com are
The quilt patch titled
Subject: crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32
has been removed from the -mm tree. Its filename was
crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Dave Vasilevsky <dave(a)vasilevsky.ca>
Subject: crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32
Date: Tue, 17 Sep 2024 12:37:20 -0400
Fixes boot failures on 6.9 on PPC_BOOK3S_32 machines using Open Firmware.
On these machines, the kernel refuses to boot from non-zero
PHYSICAL_START, which occurs when CRASH_DUMP is on.
Since most PPC_BOOK3S_32 machines boot via Open Firmware, it should
default to off for them. Users booting via some other mechanism can still
turn it on explicitly.
Does not change the default on any other architectures for the
time being.
Link: https://lkml.kernel.org/r/20240917163720.1644584-1-dave@vasilevsky.ca
Fixes: 75bc255a7444 ("crash: clean up kdump related config items")
Signed-off-by: Dave Vasilevsky <dave(a)vasilevsky.ca>
Reported-by: Reimar Döffinger <Reimar.Doeffinger(a)gmx.de>
Closes: https://lists.debian.org/debian-powerpc/2024/07/msg00001.html
Acked-by: Michael Ellerman <mpe(a)ellerman.id.au> [powerpc]
Acked-by: Baoquan He <bhe(a)redhat.com>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: John Paul Adrian Glaubitz <glaubitz(a)physik.fu-berlin.de>
Cc: Reimar Döffinger <Reimar.Doeffinger(a)gmx.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/arm/Kconfig | 3 +++
arch/arm64/Kconfig | 3 +++
arch/loongarch/Kconfig | 3 +++
arch/mips/Kconfig | 3 +++
arch/powerpc/Kconfig | 4 ++++
arch/riscv/Kconfig | 3 +++
arch/s390/Kconfig | 3 +++
arch/sh/Kconfig | 3 +++
arch/x86/Kconfig | 3 +++
kernel/Kconfig.kexec | 2 +-
10 files changed, 29 insertions(+), 1 deletion(-)
--- a/arch/arm64/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/arm64/Kconfig
@@ -1576,6 +1576,9 @@ config ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_S
config ARCH_SUPPORTS_CRASH_DUMP
def_bool y
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
config ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
def_bool CRASH_RESERVE
--- a/arch/arm/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/arm/Kconfig
@@ -1598,6 +1598,9 @@ config ATAGS_PROC
config ARCH_SUPPORTS_CRASH_DUMP
def_bool y
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
config AUTO_ZRELADDR
bool "Auto calculation of the decompressed kernel image address" if !ARCH_MULTIPLATFORM
default !(ARCH_FOOTBRIDGE || ARCH_RPC || ARCH_SA1100)
--- a/arch/loongarch/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/loongarch/Kconfig
@@ -604,6 +604,9 @@ config ARCH_SUPPORTS_KEXEC
config ARCH_SUPPORTS_CRASH_DUMP
def_bool y
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
config ARCH_SELECTS_CRASH_DUMP
def_bool y
depends on CRASH_DUMP
--- a/arch/mips/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/mips/Kconfig
@@ -2876,6 +2876,9 @@ config ARCH_SUPPORTS_KEXEC
config ARCH_SUPPORTS_CRASH_DUMP
def_bool y
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
config PHYSICAL_START
hex "Physical address where the kernel is loaded"
default "0xffffffff84000000"
--- a/arch/powerpc/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/powerpc/Kconfig
@@ -684,6 +684,10 @@ config RELOCATABLE_TEST
config ARCH_SUPPORTS_CRASH_DUMP
def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
+config ARCH_DEFAULT_CRASH_DUMP
+ bool
+ default y if !PPC_BOOK3S_32
+
config ARCH_SELECTS_CRASH_DUMP
def_bool y
depends on CRASH_DUMP
--- a/arch/riscv/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/riscv/Kconfig
@@ -898,6 +898,9 @@ config ARCH_SUPPORTS_KEXEC_PURGATORY
config ARCH_SUPPORTS_CRASH_DUMP
def_bool y
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
config ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
def_bool CRASH_RESERVE
--- a/arch/s390/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/s390/Kconfig
@@ -276,6 +276,9 @@ config ARCH_SUPPORTS_CRASH_DUMP
This option also enables s390 zfcpdump.
See also <file:Documentation/arch/s390/zfcpdump.rst>
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
menu "Processor type and features"
config HAVE_MARCH_Z10_FEATURES
--- a/arch/sh/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/sh/Kconfig
@@ -550,6 +550,9 @@ config ARCH_SUPPORTS_KEXEC
config ARCH_SUPPORTS_CRASH_DUMP
def_bool BROKEN_ON_SMP
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
config ARCH_SUPPORTS_KEXEC_JUMP
def_bool y
--- a/arch/x86/Kconfig~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/arch/x86/Kconfig
@@ -2084,6 +2084,9 @@ config ARCH_SUPPORTS_KEXEC_JUMP
config ARCH_SUPPORTS_CRASH_DUMP
def_bool X86_64 || (X86_32 && HIGHMEM)
+config ARCH_DEFAULT_CRASH_DUMP
+ def_bool y
+
config ARCH_SUPPORTS_CRASH_HOTPLUG
def_bool y
--- a/kernel/Kconfig.kexec~crash-powerpc-default-to-crash_dump=n-on-ppc_book3s_32
+++ a/kernel/Kconfig.kexec
@@ -97,7 +97,7 @@ config KEXEC_JUMP
config CRASH_DUMP
bool "kernel crash dumps"
- default y
+ default ARCH_DEFAULT_CRASH_DUMP
depends on ARCH_SUPPORTS_CRASH_DUMP
depends on KEXEC_CORE
select VMCORE_INFO
_
Patches currently in -mm which might be from dave(a)vasilevsky.ca are
The quilt patch titled
Subject: mm/mremap: fix address wraparound in move_page_tables()
has been removed from the -mm tree. Its filename was
mm-mremap-fix-address-wraparound-in-move_page_tables.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Jann Horn <jannh(a)google.com>
Subject: mm/mremap: fix address wraparound in move_page_tables()
Date: Mon, 11 Nov 2024 20:34:30 +0100
On 32-bit platforms, it is possible for the expression `len + old_addr <
old_end` to be false-positive if `len + old_addr` wraps around.
`old_addr` is the cursor in the old range up to which page table entries
have been moved; so if the operation succeeded, `old_addr` is the *end* of
the old region, and adding `len` to it can wrap.
The overflow causes mremap() to mistakenly believe that PTEs have been
copied; the consequence is that mremap() bails out, but doesn't move the
PTEs back before the new VMA is unmapped, causing anonymous pages in the
region to be lost. So basically if userspace tries to mremap() a
private-anon region and hits this bug, mremap() will return an error and
the private-anon region's contents appear to have been zeroed.
The idea of this check is that `old_end - len` is the original start
address, and writing the check that way also makes it easier to read; so
fix the check by rearranging the comparison accordingly.
(An alternate fix would be to refactor this function by introducing an
"orig_old_start" variable or such.)
Tested in a VM with a 32-bit X86 kernel; without the patch:
```
user@horn:~/big_mremap$ cat test.c
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <err.h>
#include <sys/mman.h>
#define ADDR1 ((void*)0x60000000)
#define ADDR2 ((void*)0x10000000)
#define SIZE 0x50000000uL
int main(void) {
  unsigned char *p1 = mmap(ADDR1, SIZE, PROT_READ|PROT_WRITE,
        MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED_NOREPLACE, -1, 0);
  if (p1 == MAP_FAILED)
    err(1, "mmap 1");
  unsigned char *p2 = mmap(ADDR2, SIZE, PROT_NONE,
        MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED_NOREPLACE, -1, 0);
  if (p2 == MAP_FAILED)
    err(1, "mmap 2");
  *p1 = 0x41;
  printf("first char is 0x%02hhx\n", *p1);
  unsigned char *p3 = mremap(p1, SIZE, SIZE,
        MREMAP_MAYMOVE|MREMAP_FIXED, p2);
  if (p3 == MAP_FAILED) {
    printf("mremap() failed; first char is 0x%02hhx\n", *p1);
  } else {
    printf("mremap() succeeded; first char is 0x%02hhx\n", *p3);
  }
}
user@horn:~/big_mremap$ gcc -static -o test test.c
user@horn:~/big_mremap$ setarch -R ./test
first char is 0x41
mremap() failed; first char is 0x00
```
With the patch:
```
user@horn:~/big_mremap$ setarch -R ./test
first char is 0x41
mremap() succeeded; first char is 0x41
```
Link: https://lkml.kernel.org/r/20241111-fix-mremap-32bit-wrap-v1-1-61d6be73b722@…
Fixes: af8ca1c14906 ("mm/mremap: optimize the start addresses in move_page_tables()")
Signed-off-by: Jann Horn <jannh(a)google.com>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Acked-by: Qi Zheng <zhengqi.arch(a)bytedance.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Cc: Joel Fernandes (Google) <joel(a)joelfernandes.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mremap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/mremap.c~mm-mremap-fix-address-wraparound-in-move_page_tables
+++ a/mm/mremap.c
@@ -648,7 +648,7 @@ again:
* Prevent negative return values when {old,new}_addr was realigned
* but we broke out of the above loop for the first PMD itself.
*/
- if (len + old_addr < old_end)
+ if (old_addr < old_end - len)
return 0;
return len + old_addr - old_end; /* how much done */
_
Patches currently in -mm which might be from jannh(a)google.com are
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x f8f931bba0f92052cf842b7e30917b1afcc77d5a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024111106-employer-bulgur-4f6d@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f8f931bba0f92052cf842b7e30917b1afcc77d5a Mon Sep 17 00:00:00 2001
From: Hugh Dickins <hughd(a)google.com>
Date: Sun, 27 Oct 2024 13:02:13 -0700
Subject: [PATCH] mm/thp: fix deferred split unqueue naming and locking
Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without). The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.
Before fixing locking: rename misleading folio_undo_large_rmappable(),
which does not undo large_rmappable, to folio_unqueue_deferred_split(),
which is what it does. But that and its out-of-line __callee are mm
internals of very limited usability: add comment and WARN_ON_ONCEs to
check usage; and return a bool to say if a deferred split was unqueued,
which can then be used in WARN_ON_ONCEs around safety checks (sparing
callers the arcane conditionals in __folio_unqueue_deferred_split()).
Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all
of whose callers now call it beforehand (and if any forget then bad_page()
will tell) - except for its caller put_pages_list(), which itself no
longer has any callers (and will be deleted separately).
Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
without checking and unqueueing a THP folio from deferred split list;
which is unfortunate, since the split_queue_lock depends on the memcg
(when memcg is enabled); so swapout has been unqueueing such THPs later,
when freeing the folio, using the pgdat's lock instead: potentially
corrupting the memcg's list. __remove_mapping() has frozen refcount to 0
here, so no problem with calling folio_unqueue_deferred_split() before
resetting memcg_data.
That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
shrinker memcg aware"): which included a check on swapcache before adding
to deferred queue, but no check on deferred queue before adding THP to
swapcache. That worked fine with the usual sequence of events in reclaim
(though there were a couple of rare ways in which a THP on deferred queue
could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split
underused THPs") avoids splitting underused THPs in reclaim, which makes
swapcache THPs on deferred queue commonplace.
Keep the check on swapcache before adding to deferred queue? Yes: it is
no longer essential, but preserves the existing behaviour, and is likely
to be a worthwhile optimization (vmstat showed much more traffic on the
queue under swapping load if the check was removed); update its comment.
Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
folio->memcg_data without checking and unqueueing a THP folio from the
deferred list, sometimes corrupting "from" memcg's list, like swapout.
Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
of course unqueues), or skip it if that fails. Not ideal, but moving
charge has been requested, and khugepaged should repair the THP later:
nobody wants new custom unqueueing code just for this deprecated case.
The 87eaceb3faa5 commit did have the code to move from one deferred list
to another (but was not conscious of its unsafety while refcount non-0);
but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care
deferred split queue in memcg charge move path"), which argued that the
existence of a PMD mapping guarantees that the THP cannot be on a deferred
list. As above, false in rare cases, and now commonly false.
Backport to 6.11 should be straightforward. Earlier backports must take
care that other _deferred_list fixes and dependencies are included. There
is not a strong case for backports, but they can fix cornercases.
Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com
Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
Fixes: dafff3f4c850 ("mm: split underused THPs")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Barry Song <baohua(a)kernel.org>
Cc: Chris Li <chrisl(a)kernel.org>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Nhat Pham <nphamcs(a)gmail.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Shakeel Butt <shakeel.butt(a)linux.dev>
Cc: Usama Arif <usamaarif642(a)gmail.com>
Cc: Wei Yang <richard.weiyang(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a1d345f1680c..03fd4bc39ea1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
return split_huge_page_to_list_to_order(&folio->page, list, ret);
}
-void __folio_undo_large_rmappable(struct folio *folio)
+/*
+ * __folio_unqueue_deferred_split() is not to be called directly:
+ * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
+ * limits its calls to those folios which may have a _deferred_list for
+ * queueing THP splits, and that list is (racily observed to be) non-empty.
+ *
+ * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
+ * zero: because even when split_queue_lock is held, a non-empty _deferred_list
+ * might be in use on deferred_split_scan()'s unlocked on-stack list.
+ *
+ * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
+ * therefore important to unqueue deferred split before changing folio memcg.
+ */
+bool __folio_unqueue_deferred_split(struct folio *folio)
{
struct deferred_split *ds_queue;
unsigned long flags;
+ bool unqueued = false;
+
+ WARN_ON_ONCE(folio_ref_count(folio));
+ WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
ds_queue = get_deferred_split_queue(folio);
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
@@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct folio *folio)
MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
}
list_del_init(&folio->_deferred_list);
+ unqueued = true;
}
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+
+ return unqueued; /* useful for debug warnings */
}
/* partially_mapped=false won't clear PG_partially_mapped folio flag */
@@ -3627,14 +3647,11 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
return;
/*
- * The try_to_unmap() in page reclaim path might reach here too,
- * this may cause a race condition to corrupt deferred split queue.
- * And, if page reclaim is already handling the same folio, it is
- * unnecessary to handle it again in shrinker.
- *
- * Check the swapcache flag to determine if the folio is being
- * handled by page reclaim since THP swap would add the folio into
- * swap cache before calling try_to_unmap().
+ * Exclude swapcache: originally to avoid a corrupt deferred split
+ * queue. Nowadays that is fully prevented by mem_cgroup_swapout();
+ * but if page reclaim is already handling the same folio, it is
+ * unnecessary to handle it again in the shrinker, so excluding
+ * swapcache here may still be a useful optimization.
*/
if (folio_test_swapcache(folio))
return;
diff --git a/mm/internal.h b/mm/internal.h
index 93083bbeeefa..16c1f3cd599e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
#endif
}
-void __folio_undo_large_rmappable(struct folio *folio);
-static inline void folio_undo_large_rmappable(struct folio *folio)
+bool __folio_unqueue_deferred_split(struct folio *folio);
+static inline bool folio_unqueue_deferred_split(struct folio *folio)
{
if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
- return;
+ return false;
/*
* At this point, there is no one trying to add the folio to
@@ -651,9 +651,9 @@ static inline void folio_undo_large_rmappable(struct folio *folio)
* to check without acquiring the split_queue_lock.
*/
if (data_race(list_empty(&folio->_deferred_list)))
- return;
+ return false;
- __folio_undo_large_rmappable(folio);
+ return __folio_unqueue_deferred_split(folio);
}
static inline struct folio *page_rmappable_folio(struct page *page)
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 81d8819f13cd..f8744f5630bb 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
css_get(&to->css);
css_put(&from->css);
+ /* Warning should never happen, so don't worry about refcount non-0 */
+ WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
folio->memcg_data = (unsigned long)to;
__folio_memcg_unlock(from);
@@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
enum mc_target_type target_type;
union mc_target target;
struct folio *folio;
+ bool tried_split_before = false;
+retry_pmd:
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
if (mc.precharge < HPAGE_PMD_NR) {
@@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
if (target_type == MC_TARGET_PAGE) {
folio = target.folio;
+ /*
+ * Deferred split queue locking depends on memcg,
+ * and unqueue is unsafe unless folio refcount is 0:
+ * split or skip if on the queue? first try to split.
+ */
+ if (!list_empty(&folio->_deferred_list)) {
+ spin_unlock(ptl);
+ if (!tried_split_before)
+ split_folio(folio);
+ folio_unlock(folio);
+ folio_put(folio);
+ if (tried_split_before)
+ return 0;
+ tried_split_before = true;
+ goto retry_pmd;
+ }
+ /*
+ * So long as that pmd lock is held, the folio cannot
+ * be racily added to the _deferred_list, because
+ * __folio_remove_rmap() will find !partially_mapped.
+ */
if (folio_isolate_lru(folio)) {
if (!mem_cgroup_move_account(folio, true,
mc.from, mc.to)) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2703227cce88..06df2af97415 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
struct obj_cgroup *objcg;
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
- VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
- !folio_test_hugetlb(folio) &&
- !list_empty(&folio->_deferred_list), folio);
/*
* Nobody should be changing or seriously looking at
@@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
ug->nr_memory += nr_pages;
ug->pgpgout++;
+ WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
folio->memcg_data = 0;
}
@@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
/* Transfer the charge and the css ref */
commit_charge(new, memcg);
+
+ /* Warning should never happen, so don't worry about refcount non-0 */
+ WARN_ON_ONCE(folio_unqueue_deferred_split(old));
old->memcg_data = 0;
}
@@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
VM_BUG_ON_FOLIO(oldid, folio);
mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
+ folio_unqueue_deferred_split(folio);
folio->memcg_data = 0;
if (!mem_cgroup_is_root(memcg))
diff --git a/mm/migrate.c b/mm/migrate.c
index fab84a776088..dfa24e41e8f9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -490,7 +490,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
folio_test_large_rmappable(folio)) {
if (!folio_ref_freeze(folio, expected_count))
return -EAGAIN;
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
folio_ref_unfreeze(folio, expected_count);
}
@@ -515,7 +515,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
}
/* Take off deferred split queue while frozen and memcg set */
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
/*
* Now we know that no one else is looking at the folio:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5e108ae755cc..8ad38cd5e574 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2681,7 +2681,6 @@ void free_unref_folios(struct folio_batch *folios)
unsigned long pfn = folio_pfn(folio);
unsigned int order = folio_order(folio);
- folio_undo_large_rmappable(folio);
if (!free_pages_prepare(&folio->page, order))
continue;
/*
diff --git a/mm/swap.c b/mm/swap.c
index 835bdf324b76..b8e3259ea2c4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -121,7 +121,7 @@ void __folio_put(struct folio *folio)
}
page_cache_release(folio);
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
mem_cgroup_uncharge(folio);
free_unref_page(&folio->page, folio_order(folio));
}
@@ -988,7 +988,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
free_huge_folio(folio);
continue;
}
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
__page_cache_release(folio, &lruvec, &flags);
if (j != i)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ddaaff67642e..28ba2b06fc7d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1476,7 +1476,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
*/
nr_reclaimed += nr_pages;
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
if (folio_batch_add(&free_folios, folio) == 0) {
mem_cgroup_uncharge_folios(&free_folios);
try_to_unmap_flush();
@@ -1864,7 +1864,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
if (unlikely(folio_put_testzero(folio))) {
__folio_clear_lru_flags(folio);
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
if (folio_batch_add(&free_folios, folio) == 0) {
spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_folios(&free_folios);
The following series is a backport of the CVE-2024-47674 fix "mm: avoid leaving
partial pfn mappings around in error case" to 5.10.
This required 3 extra commits to make sure all picks were clean. The
patchset shows no regression compared to the v5.4.285 tag.
Alex Zhang (1):
mm/memory.c: make remap_pfn_range() reject unaligned addr
Christoph Hellwig (1):
mm: add remap_pfn_range_notrack
WANG Wenhu (1):
mm: clarify a confusing comment for remap_pfn_range()
chenqiwu (1):
mm: fix ambiguous comments for better code readability
include/linux/mm.h | 2 ++
include/linux/mm_types.h | 4 +--
mm/memory.c | 54 +++++++++++++++++++++++++---------------
3 files changed, 38 insertions(+), 22 deletions(-)
--
2.46.0
I am running into this compile error with Linux kernel 5.15.171 in OpenWrt on 32-bit systems.
```
fs/udf/namei.c: In function 'udf_rename':
fs/udf/namei.c:878:1: error: the frame size of 1144 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
878 | }
| ^
cc1: all warnings being treated as errors
make[2]: *** [scripts/Makefile.build:289: fs/udf/namei.o] Error 1
make[1]: *** [scripts/Makefile.build:552: fs/udf] Error 2
```
This problem was introduced with kernel 5.15.169.
The first patch needs an extra linux/slab.h include on x86, which is the only modification I made to it compared to the upstream version.
These patches should go into 5.15. They were already backported to kernel 6.1.
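For context, the frame-size error comes from a large on-stack buffer in the
directory iterator; Jan Kara's first patch moves that name buffer to the
heap. A minimal sketch of that general pattern follows (the struct names,
buffer size, and helpers are illustrative assumptions, not the actual udf
code):
```
#include <linux/errno.h>
#include <linux/slab.h>

#define NAME_BUF_LEN 1024	/* illustrative size, not the real UDF limit */

/* Before: a buffer like this embedded in a local struct inflates the stack
 * frame and can trip -Wframe-larger-than= on 32-bit builds. */
struct dir_iter_stack {
	unsigned char name[NAME_BUF_LEN];
};

/* After: the buffer lives on the heap; only a pointer sits in the frame. */
struct dir_iter_heap {
	unsigned char *name;
};

static int dir_iter_init(struct dir_iter_heap *iter)
{
	iter->name = kmalloc(NAME_BUF_LEN, GFP_KERNEL);
	return iter->name ? 0 : -ENOMEM;
}

static void dir_iter_release(struct dir_iter_heap *iter)
{
	kfree(iter->name);
	iter->name = NULL;
}
```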
Jan Kara (2):
udf: Allocate name buffer in directory iterator on heap
udf: Avoid directory type conversion failure due to ENOMEM
fs/udf/directory.c | 27 +++++++++++++++++++--------
fs/udf/udfdecl.h | 2 +-
2 files changed, 20 insertions(+), 9 deletions(-)
--
2.47.0
From: Yuanzheng Song <songyuanzheng(a)huawei.com>
The vma->anon_vma of the child process may be NULL because
the entire vma does not contain anonymous pages. In this
case, a BUG will occur when copy_present_page() passes
a copy of a non-anonymous page of that vma to
page_add_new_anon_rmap() to set up a new anonymous rmap.
------------[ cut here ]------------
kernel BUG at mm/rmap.c:1052!
Internal error: Oops - BUG: 0 [#1] SMP
Modules linked in:
CPU: 4 PID: 4652 Comm: test Not tainted 5.15.75 #1
Hardware name: linux,dummy-virt (DT)
pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __page_set_anon_rmap+0xc0/0xe8
lr : __page_set_anon_rmap+0xc0/0xe8
sp : ffff80000e773860
x29: ffff80000e773860 x28: fffffc13cf006ec0 x27: ffff04f3ccd68000
x26: ffff04f3c5c33248 x25: 0000000010100073 x24: ffff04f3c53c0a80
x23: 0000000020000000 x22: 0000000000000001 x21: 0000000020000000
x20: fffffc13cf006ec0 x19: 0000000000000000 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : ffffdddc5581377c
x8 : 0000000000000000 x7 : 0000000000000011 x6 : ffff2717a8433000
x5 : ffff80000e773810 x4 : ffffdddc55400000 x3 : 0000000000000000
x2 : ffffdddc56b20000 x1 : ffff04f3c9a48040 x0 : 0000000000000000
Call trace:
__page_set_anon_rmap+0xc0/0xe8
page_add_new_anon_rmap+0x13c/0x200
copy_pte_range+0x6b8/0x1018
copy_page_range+0x3a8/0x5e0
dup_mmap+0x3a0/0x6e8
dup_mm+0x78/0x140
copy_process+0x1528/0x1b08
kernel_clone+0xac/0x610
__do_sys_clone+0x78/0xb0
__arm64_sys_clone+0x30/0x40
invoke_syscall+0x68/0x170
el0_svc_common.constprop.0+0x80/0x250
do_el0_svc+0x48/0xb8
el0_svc+0x48/0x1a8
el0t_64_sync_handler+0xb0/0xb8
el0t_64_sync+0x1a0/0x1a4
Code: 97f899f4 f9400273 17ffffeb 97f899f1 (d4210000)
---[ end trace dc65e5edd0f362fa ]---
Kernel panic - not syncing: Oops - BUG: Fatal exception
SMP: stopping secondary CPUs
Kernel Offset: 0x5ddc4d400000 from 0xffff800008000000
PHYS_OFFSET: 0xfffffb0c80000000
CPU features: 0x44000cf1,00000806
Memory Limit: none
---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]---
This problem has been fixed by commit fb3d824d1a46
("mm/rmap: split page_dup_rmap() into page_dup_file_rmap()
and page_try_dup_anon_rmap()"), but it still exists in the
linux-5.15.y branch.
That patch is not applicable to this version because
of the large version differences. Therefore, fix it by
adding a non-anonymous page check in copy_present_page().
Cc: stable(a)vger.kernel.org
Fixes: 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes")
Signed-off-by: Yuanzheng Song <songyuanzheng(a)huawei.com>
Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz>
---
Hi, this was posted in [1] but seems stable@ was not actually included
in the recipients.
The 5.10 version [2] was applied as 935a8b62021 but 5.15 is missing.
[1] https://lore.kernel.org/all/20221028075244.3112566-1-songyuanzheng@huawei.c…
[2] https://lore.kernel.org/all/20221028030705.2840539-1-songyuanzheng@huawei.c…
mm/memory.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 6d058973a97e..4785aecca9a8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -903,6 +903,17 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
if (likely(!page_needs_cow_for_dma(src_vma, page)))
return 1;
+ /*
+ * The vma->anon_vma of the child process may be NULL
+ * because the entire vma does not contain anonymous pages.
+ * A BUG will occur when the copy_present_page() passes
+ * a copy of a non-anonymous page of that vma to the
+ * page_add_new_anon_rmap() to set up new anonymous rmap.
+ * Return 1 if the page is not an anonymous page.
+ */
+ if (!PageAnon(page))
+ return 1;
+
new_page = *prealloc;
if (!new_page)
return -EAGAIN;
--
2.47.0
commit 73254a297c2dd094abec7c9efee32455ae875bdf upstream.
The io_register_iowq_max_workers() function calls io_put_sq_data(),
which acquires the sqd->lock without releasing the uring_lock.
Similar to the commit 009ad9f0c6ee ("io_uring: drop ctx->uring_lock
before acquiring sqd->lock"), this can lead to a potential deadlock
situation.
To resolve this issue, the uring_lock is released before calling
io_put_sq_data(), and then it is re-acquired after the function call.
This change ensures that the locks are acquired in the correct
order, preventing the possibility of a deadlock.
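A minimal user-space sketch of the pattern (pthread mutexes standing in for
uring_lock and sqd->lock; this is not the io_uring code, just an
illustration of the drop-and-reacquire ordering under those assumptions):
```
#include <pthread.h>

static pthread_mutex_t uring_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t sqd_lock   = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for io_put_sq_data(): takes sqd_lock internally. */
static void put_sq_data(void)
{
	pthread_mutex_lock(&sqd_lock);
	/* ... drop the sq_data reference ... */
	pthread_mutex_unlock(&sqd_lock);
}

static void register_iowq_max_workers(void)
{
	pthread_mutex_lock(&uring_lock);
	/* ... update the per-ctx worker counts ... */

	/* Deadlock-prone: calling put_sq_data() here would acquire sqd_lock
	 * while uring_lock is held, inverting the order used elsewhere.
	 * Instead, drop uring_lock around the call and retake it afterwards. */
	pthread_mutex_unlock(&uring_lock);
	put_sq_data();
	pthread_mutex_lock(&uring_lock);

	/* ... copy the results back to userspace ... */
	pthread_mutex_unlock(&uring_lock);
}

int main(void)
{
	register_iowq_max_workers();
	return 0;
}
```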
Suggested-by: Maximilian Heyne <mheyne(a)amazon.de>
Signed-off-by: Hagar Hemdan <hagarhem(a)amazon.com>
Link: https://lore.kernel.org/r/20240604130527.3597-1-hagarhem@amazon.com
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
[Hagar: Modified to apply on v6.1]
Signed-off-by: Hagar Hemdan <hagarhem(a)amazon.com>
---
io_uring/io_uring.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 92c1aa8f3501..4f0ae938b146 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3921,8 +3921,10 @@ static __cold int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
}
if (sqd) {
+ mutex_unlock(&ctx->uring_lock);
mutex_unlock(&sqd->lock);
io_put_sq_data(sqd);
+ mutex_lock(&ctx->uring_lock);
}
if (copy_to_user(arg, new_count, sizeof(new_count)))
@@ -3947,8 +3949,11 @@ static __cold int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
return 0;
err:
if (sqd) {
+ mutex_unlock(&ctx->uring_lock);
mutex_unlock(&sqd->lock);
io_put_sq_data(sqd);
+ mutex_lock(&ctx->uring_lock);
+
}
return ret;
}
--
2.40.1