In of_unittest_pci_node_verify(), when the add parameter is false,
device_find_any_child() obtains a reference to a child device. This
function implicitly calls get_device() to increment the device's
reference count before returning the pointer. However, the caller
fails to properly release this reference by calling put_device(),
leading to a device reference count leak. Add put_device() in the else
branch immediately after child_dev is no longer needed.
As the comment of device_find_any_child states: "NOTE: you will need
to drop the reference with put_device() after use".
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: 26409dd04589 ("of: unittest: Add pci_dt_testdrv pci driver")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
---
Changes in v2:
- modified the put_device() location as suggestions.
---
drivers/of/unittest.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index e3503ec20f6c..388e9ec2cccf 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -4300,6 +4300,7 @@ static int of_unittest_pci_node_verify(struct pci_dev *pdev, bool add)
unittest(!np, "Child device tree node is not removed\n");
child_dev = device_find_any_child(&pdev->dev);
unittest(!child_dev, "Child device is not removed\n");
+ put_device(child_dev);
}
failed:
--
2.17.1
Calling intotify_show_fdinfo() on fd watching an overlayfs inode, while
the overlayfs is being unmounted, can lead to dereferencing NULL ptr.
This issue was found by syzkaller.
Race Condition Diagram:
Thread 1 Thread 2
-------- --------
generic_shutdown_super()
shrink_dcache_for_umount
sb->s_root = NULL
|
| vfs_read()
| inotify_fdinfo()
| * inode get from mark *
| show_mark_fhandle(m, inode)
| exportfs_encode_fid(inode, ..)
| ovl_encode_fh(inode, ..)
| ovl_check_encode_origin(inode)
| * deref i_sb->s_root *
|
|
v
fsnotify_sb_delete(sb)
Which then leads to:
[ 32.133461] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[ 32.134438] KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
[ 32.135032] CPU: 1 UID: 0 PID: 4468 Comm: systemd-coredum Not tainted 6.17.0-rc6 #22 PREEMPT(none)
<snip registers, unreliable trace>
[ 32.143353] Call Trace:
[ 32.143732] ovl_encode_fh+0xd5/0x170
[ 32.144031] exportfs_encode_inode_fh+0x12f/0x300
[ 32.144425] show_mark_fhandle+0xbe/0x1f0
[ 32.145805] inotify_fdinfo+0x226/0x2d0
[ 32.146442] inotify_show_fdinfo+0x1c5/0x350
[ 32.147168] seq_show+0x530/0x6f0
[ 32.147449] seq_read_iter+0x503/0x12a0
[ 32.148419] seq_read+0x31f/0x410
[ 32.150714] vfs_read+0x1f0/0x9e0
[ 32.152297] ksys_read+0x125/0x240
IOW ovl_check_encode_origin derefs inode->i_sb->s_root, after it was set
to NULL in the unmount path.
Minimize the window of opportunity by adding explicit check.
Fixes: c45beebfde34 ("ovl: support encoding fid from inode with no alias")
Signed-off-by: Jakub Acs <acsjakub(a)amazon.de>
Cc: Miklos Szeredi <miklos(a)szeredi.hu>
Cc: Amir Goldstein <amir73il(a)gmail.com>
Cc: linux-unionfs(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
---
I'm happy to take suggestions for a better fix - I looked at taking
s_umount for reading, but it wasn't clear to me for how long would the
fdinfo path need to hold it. Hence the most primitive suggestion in this
v1.
I'm also not sure if ENOENT or EBUSY is better?.. or even something else?
fs/overlayfs/export.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c
index 83f80fdb1567..424c73188e06 100644
--- a/fs/overlayfs/export.c
+++ b/fs/overlayfs/export.c
@@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode)
if (!ovl_inode_lower(inode))
return 0;
+ if (!inode->i_sb->s_root)
+ return -ENOENT;
/*
* Root is never indexed, so if there's an upper layer, encode upper for
* root.
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
syzkaller discovered the following crash: (kernel BUG)
[ 44.607039] ------------[ cut here ]------------
[ 44.607422] kernel BUG at mm/userfaultfd.c:2067!
[ 44.608148] Oops: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[ 44.608814] CPU: 1 UID: 0 PID: 2475 Comm: reproducer Not tainted 6.16.0-rc6 #1 PREEMPT(none)
[ 44.609635] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 44.610695] RIP: 0010:userfaultfd_release_all+0x3a8/0x460
<snip other registers, drop unreliable trace>
[ 44.617726] Call Trace:
[ 44.617926] <TASK>
[ 44.619284] userfaultfd_release+0xef/0x1b0
[ 44.620976] __fput+0x3f9/0xb60
[ 44.621240] fput_close_sync+0x110/0x210
[ 44.622222] __x64_sys_close+0x8f/0x120
[ 44.622530] do_syscall_64+0x5b/0x2f0
[ 44.622840] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 44.623244] RIP: 0033:0x7f365bb3f227
Kernel panics because it detects UFFD inconsistency during
userfaultfd_release_all(). Specifically, a VMA which has a valid pointer
to vma->vm_userfaultfd_ctx, but no UFFD flags in vma->vm_flags.
The inconsistency is caused in ksm_madvise(): when user calls madvise()
with MADV_UNMEARGEABLE on a VMA that is registered for UFFD in MINOR
mode, it accidentally clears all flags stored in the upper 32 bits of
vma->vm_flags.
Assuming x86_64 kernel build, unsigned long is 64-bit and unsigned int
and int are 32-bit wide. This setup causes the following mishap during
the &= ~VM_MERGEABLE assignment.
VM_MERGEABLE is a 32-bit constant of type unsigned int, 0x8000'0000.
After ~ is applied, it becomes 0x7fff'ffff unsigned int, which is then
promoted to unsigned long before the & operation. This promotion fills
upper 32 bits with leading 0s, as we're doing unsigned conversion (and
even for a signed conversion, this wouldn't help as the leading bit is
0). & operation thus ends up AND-ing vm_flags with 0x0000'0000'7fff'ffff
instead of intended 0xffff'ffff'7fff'ffff and hence accidentally clears
the upper 32-bits of its value.
Fix it by changing `VM_MERGEABLE` constant to unsigned long. Modify all
other VM_* flags constants for consistency.
Note: other VM_* flags are not affected:
This only happens to the VM_MERGEABLE flag, as the other VM_* flags are
all constants of type int and after ~ operation, they end up with
leading 1 and are thus converted to unsigned long with leading 1s.
Note 2:
After commit 31defc3b01d9 ("userfaultfd: remove (VM_)BUG_ON()s"), this is
no longer a kernel BUG, but a WARNING at the same place:
[ 45.595973] WARNING: CPU: 1 PID: 2474 at mm/userfaultfd.c:2067
but the root-cause (flag-drop) remains the same.
Fixes: 7677f7fd8be76 ("userfaultfd: add minor fault registration mode")
Signed-off-by: Jakub Acs <acsjakub(a)amazon.de>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Xu Xin <xu.xin16(a)zte.com.cn>
Cc: Chengming Zhou <chengming.zhou(a)linux.dev>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: linux-mm(a)kvack.org
Cc: linux-kernel(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
---
v1 -> v2:
- fix by adding ul to flag constants instead of explicit cast.
- drop Mike Kravetz <mike.kravetz(a)oracle.com> from cc, as the mail
returned
v1:
https://lore.kernel.org/all/20250930063921.62354-1-acsjakub@amazon.de/
include/linux/mm.h | 72 +++++++++++++++++++++++-----------------------
1 file changed, 36 insertions(+), 36 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ae97a0b8ec7..26a5c0f78b36 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -246,57 +246,57 @@ extern unsigned int kobjsize(const void *objp);
* vm_flags in vm_area_struct, see mm_types.h.
* When changing, update also include/trace/events/mmflags.h
*/
-#define VM_NONE 0x00000000
+#define VM_NONE 0x00000000ul
-#define VM_READ 0x00000001 /* currently active flags */
-#define VM_WRITE 0x00000002
-#define VM_EXEC 0x00000004
-#define VM_SHARED 0x00000008
+#define VM_READ 0x00000001ul /* currently active flags */
+#define VM_WRITE 0x00000002ul
+#define VM_EXEC 0x00000004ul
+#define VM_SHARED 0x00000008ul
/* mprotect() hardcodes VM_MAYREAD >> 4 == VM_READ, and so for r/w/x bits. */
-#define VM_MAYREAD 0x00000010 /* limits for mprotect() etc */
-#define VM_MAYWRITE 0x00000020
-#define VM_MAYEXEC 0x00000040
-#define VM_MAYSHARE 0x00000080
+#define VM_MAYREAD 0x00000010ul /* limits for mprotect() etc */
+#define VM_MAYWRITE 0x00000020ul
+#define VM_MAYEXEC 0x00000040ul
+#define VM_MAYSHARE 0x00000080ul
-#define VM_GROWSDOWN 0x00000100 /* general info on the segment */
+#define VM_GROWSDOWN 0x00000100ul /* general info on the segment */
#ifdef CONFIG_MMU
-#define VM_UFFD_MISSING 0x00000200 /* missing pages tracking */
+#define VM_UFFD_MISSING 0x00000200ul /* missing pages tracking */
#else /* CONFIG_MMU */
-#define VM_MAYOVERLAY 0x00000200 /* nommu: R/O MAP_PRIVATE mapping that might overlay a file mapping */
-#define VM_UFFD_MISSING 0
+#define VM_MAYOVERLAY 0x00000200ul /* nommu: R/O MAP_PRIVATE mapping that might overlay a file mapping */
+#define VM_UFFD_MISSING 0ul
#endif /* CONFIG_MMU */
-#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
-#define VM_UFFD_WP 0x00001000 /* wrprotect pages tracking */
+#define VM_PFNMAP 0x00000400ul /* Page-ranges managed without "struct page", just pure PFN */
+#define VM_UFFD_WP 0x00001000ul /* wrprotect pages tracking */
-#define VM_LOCKED 0x00002000
-#define VM_IO 0x00004000 /* Memory mapped I/O or similar */
+#define VM_LOCKED 0x00002000ul
+#define VM_IO 0x00004000ul /* Memory mapped I/O or similar */
/* Used by sys_madvise() */
-#define VM_SEQ_READ 0x00008000 /* App will access data sequentially */
-#define VM_RAND_READ 0x00010000 /* App will not benefit from clustered reads */
-
-#define VM_DONTCOPY 0x00020000 /* Do not copy this vma on fork */
-#define VM_DONTEXPAND 0x00040000 /* Cannot expand with mremap() */
-#define VM_LOCKONFAULT 0x00080000 /* Lock the pages covered when they are faulted in */
-#define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */
-#define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */
-#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
-#define VM_SYNC 0x00800000 /* Synchronous page faults */
-#define VM_ARCH_1 0x01000000 /* Architecture-specific flag */
-#define VM_WIPEONFORK 0x02000000 /* Wipe VMA contents in child. */
-#define VM_DONTDUMP 0x04000000 /* Do not include in the core dump */
+#define VM_SEQ_READ 0x00008000ul /* App will access data sequentially */
+#define VM_RAND_READ 0x00010000ul /* App will not benefit from clustered reads */
+
+#define VM_DONTCOPY 0x00020000ul /* Do not copy this vma on fork */
+#define VM_DONTEXPAND 0x00040000ul /* Cannot expand with mremap() */
+#define VM_LOCKONFAULT 0x00080000ul /* Lock the pages covered when they are faulted in */
+#define VM_ACCOUNT 0x00100000ul /* Is a VM accounted object */
+#define VM_NORESERVE 0x00200000ul /* should the VM suppress accounting */
+#define VM_HUGETLB 0x00400000ul /* Huge TLB Page VM */
+#define VM_SYNC 0x00800000ul /* Synchronous page faults */
+#define VM_ARCH_1 0x01000000ul /* Architecture-specific flag */
+#define VM_WIPEONFORK 0x02000000ul /* Wipe VMA contents in child. */
+#define VM_DONTDUMP 0x04000000ul /* Do not include in the core dump */
#ifdef CONFIG_MEM_SOFT_DIRTY
-# define VM_SOFTDIRTY 0x08000000 /* Not soft dirty clean area */
+# define VM_SOFTDIRTY 0x08000000ul /* Not soft dirty clean area */
#else
-# define VM_SOFTDIRTY 0
+# define VM_SOFTDIRTY 0ul
#endif
-#define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */
-#define VM_HUGEPAGE 0x20000000 /* MADV_HUGEPAGE marked this vma */
-#define VM_NOHUGEPAGE 0x40000000 /* MADV_NOHUGEPAGE marked this vma */
-#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
+#define VM_MIXEDMAP 0x10000000ul /* Can contain "struct page" and pure PFN pages */
+#define VM_HUGEPAGE 0x20000000ul /* MADV_HUGEPAGE marked this vma */
+#define VM_NOHUGEPAGE 0x40000000ul /* MADV_NOHUGEPAGE marked this vma */
+#define VM_MERGEABLE 0x80000000ul /* KSM may merge identical pages */
#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
syzbot reported use-after-free bugs when accessing extent headers in
ext4_ext_insert_extent() and ext4_ext_correct_indexes(). These occur
when the extent path structure becomes invalid during operations.
The crashes show two patterns:
1. In ext4_ext_map_blocks(), the extent header can be corrupted after
ext4_find_extent() returns, particularly during concurrent writes
to the same file.
2. In ext4_ext_correct_indexes(), accessing path[depth] causes a
use-after-free, indicating the path structure itself is corrupted.
This is partially exposed by commit 665575cff098 ("filemap: move
prefaulting out of hot write path") which changed timing windows in
the write path, making these races more likely to occur.
Fix this by adding validation checks:
- In ext4_ext_map_blocks(): validate the extent header after getting
the path from ext4_find_extent()
- In ext4_ext_correct_indexes(): validate the path pointer before
dereferencing and check extent header magic
While these checks are defensive and don't address the root cause of
path corruption, they prevent kernel crashes from invalid memory access.
A more comprehensive fix to path lifetime management may be needed in
the future.
Reported-by: syzbot+9db318d6167044609878(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9db318d6167044609878
Fixes: 665575cff098 ("filemap: move prefaulting out of hot write path")
Cc: stable(a)vger.kernel.org
Signed-off-by: Deepanshu Kartikey <kartikey406(a)gmail.com>
---
fs/ext4/extents.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index ca5499e9412b..903578d5f68d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1708,7 +1708,9 @@ static int ext4_ext_correct_indexes(handle_t *handle, struct inode *inode,
struct ext4_extent *ex;
__le32 border;
int k, err = 0;
-
+ if (!path || depth < 0 || depth > EXT4_MAX_EXTENT_DEPTH) {
+ return -EFSCORRUPTED;
+ }
eh = path[depth].p_hdr;
ex = path[depth].p_ext;
@@ -4200,6 +4202,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
unsigned int allocated_clusters = 0;
struct ext4_allocation_request ar;
ext4_lblk_t cluster_offset;
+ struct ext4_extent_header *eh;
ext_debug(inode, "blocks %u/%u requested\n", map->m_lblk, map->m_len);
trace_ext4_ext_map_blocks_enter(inode, map->m_lblk, map->m_len, flags);
@@ -4212,7 +4215,12 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
}
depth = ext_depth(inode);
-
+ eh = path[depth].p_hdr;
+ if (!eh || le16_to_cpu(eh->eh_magic) != EXT4_EXT_MAGIC) {
+ EXT4_ERROR_INODE(inode, "invalid extent header after find_extent");
+ err = -EFSCORRUPTED;
+ goto out;
+ }
/*
* consistent leaf must not be empty;
* this situation is possible, though, _during_ tree modification;
--
2.43.0
The patch titled
Subject: mm/damon/vaddr: do not repeat pte_offset_map_lock() until success
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-damon-vaddr-do-not-repeat-pte_offset_map_lock-until-success.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: SeongJae Park <sj(a)kernel.org>
Subject: mm/damon/vaddr: do not repeat pte_offset_map_lock() until success
Date: Mon, 29 Sep 2025 17:44:09 -0700
DAMON's virtual address space operation set implementation (vaddr) calls
pte_offset_map_lock() inside the page table walk callback function. This
is for reading and writing page table accessed bits. If
pte_offset_map_lock() fails, it retries by returning the page table walk
callback function with ACTION_AGAIN.
pte_offset_map_lock() can continuously fail if the target is a pmd
migration entry, though. Hence it could cause an infinite page table walk
if the migration cannot be done until the page table walk is finished.
This indeed caused a soft lockup when CPU hotplugging and DAMON were
running in parallel.
Avoid the infinite loop by simply not retrying the page table walk. DAMON
is promising only a best-effort accuracy, so missing access to such pages
is no problem.
Link: https://lkml.kernel.org/r/20250930004410.55228-1-sj@kernel.org
Fixes: 7780d04046a2 ("mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails")
Signed-off-by: SeongJae Park <sj(a)kernel.org>
Reported-by: Xinyu Zheng <zhengxinyu6(a)huawei.com>
Closes: https://lore.kernel.org/20250918030029.2652607-1-zhengxinyu6@huawei.com
Acked-by: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org> [6.5+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/damon/vaddr.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
--- a/mm/damon/vaddr.c~mm-damon-vaddr-do-not-repeat-pte_offset_map_lock-until-success
+++ a/mm/damon/vaddr.c
@@ -328,10 +328,8 @@ static int damon_mkold_pmd_entry(pmd_t *
}
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
- if (!pte) {
- walk->action = ACTION_AGAIN;
+ if (!pte)
return 0;
- }
if (!pte_present(ptep_get(pte)))
goto out;
damon_ptep_mkold(pte, walk->vma, addr);
@@ -481,10 +479,8 @@ regular_page:
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
- if (!pte) {
- walk->action = ACTION_AGAIN;
+ if (!pte)
return 0;
- }
ptent = ptep_get(pte);
if (!pte_present(ptent))
goto out;
_
Patches currently in -mm which might be from sj(a)kernel.org are
mm-damon-vaddr-do-not-repeat-pte_offset_map_lock-until-success.patch
The patch titled
Subject: mm/rmap: fix soft-dirty and uffd-wp bit loss when remapping zero-filled mTHP subpage to shared zeropage
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-rmap-fix-soft-dirty-and-uffd-wp-bit-loss-when-remapping-zero-filled-mthp-subpage-to-shared-zeropage.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Lance Yang <lance.yang(a)linux.dev>
Subject: mm/rmap: fix soft-dirty and uffd-wp bit loss when remapping zero-filled mTHP subpage to shared zeropage
Date: Tue, 30 Sep 2025 16:10:40 +0800
When splitting an mTHP and replacing a zero-filled subpage with the shared
zeropage, try_to_map_unused_to_zeropage() currently drops several
important PTE bits.
For userspace tools like CRIU, which rely on the soft-dirty mechanism for
incremental snapshots, losing the soft-dirty bit means modified pages are
missed, leading to inconsistent memory state after restore.
As pointed out by David, the more critical uffd-wp bit is also dropped.
This breaks the userfaultfd write-protection mechanism, causing writes to
be silently missed by monitoring applications, which can lead to data
corruption.
Preserve both the soft-dirty and uffd-wp bits from the old PTE when
creating the new zeropage mapping to ensure they are correctly tracked.
Link: https://lkml.kernel.org/r/20250930081040.80926-1-lance.yang@linux.dev
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Signed-off-by: Lance Yang <lance.yang(a)linux.dev>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Suggested-by: Dev Jain <dev.jain(a)arm.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Dev Jain <dev.jain(a)arm.com>
Acked-by: Zi Yan <ziy(a)nvidia.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Barry Song <baohua(a)kernel.org>
Cc: Byungchul Park <byungchul(a)sk.com>
Cc: Gregory Price <gourry(a)gourry.net>
Cc: "Huang, Ying" <ying.huang(a)linux.alibaba.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Joshua Hahn <joshua.hahnjy(a)gmail.com>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Mariano Pache <npache(a)redhat.com>
Cc: Mathew Brost <matthew.brost(a)intel.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Rakie Kim <rakie.kim(a)sk.com>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Usama Arif <usamaarif642(a)gmail.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/migrate.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
--- a/mm/migrate.c~mm-rmap-fix-soft-dirty-and-uffd-wp-bit-loss-when-remapping-zero-filled-mthp-subpage-to-shared-zeropage
+++ a/mm/migrate.c
@@ -297,8 +297,7 @@ bool isolate_folio_to_list(struct folio
}
static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
- struct folio *folio,
- unsigned long idx)
+ struct folio *folio, pte_t old_pte, unsigned long idx)
{
struct page *page = folio_page(folio, idx);
pte_t newpte;
@@ -307,7 +306,7 @@ static bool try_to_map_unused_to_zeropag
return false;
VM_BUG_ON_PAGE(!PageAnon(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
- VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
+ VM_BUG_ON_PAGE(pte_present(old_pte), page);
if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags & VM_LOCKED) ||
mm_forbids_zeropage(pvmw->vma->vm_mm))
@@ -323,6 +322,12 @@ static bool try_to_map_unused_to_zeropag
newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
pvmw->vma->vm_page_prot));
+
+ if (pte_swp_soft_dirty(old_pte))
+ newpte = pte_mksoft_dirty(newpte);
+ if (pte_swp_uffd_wp(old_pte))
+ newpte = pte_mkuffd_wp(newpte);
+
set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
@@ -365,13 +370,13 @@ static bool remove_migration_pte(struct
continue;
}
#endif
+ old_pte = ptep_get(pvmw.pte);
if (rmap_walk_arg->map_unused_to_zeropage &&
- try_to_map_unused_to_zeropage(&pvmw, folio, idx))
+ try_to_map_unused_to_zeropage(&pvmw, folio, old_pte, idx))
continue;
folio_get(folio);
pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
- old_pte = ptep_get(pvmw.pte);
entry = pte_to_swp_entry(old_pte);
if (!is_migration_entry_young(entry))
_
Patches currently in -mm which might be from lance.yang(a)linux.dev are
hung_task-fix-warnings-caused-by-unaligned-lock-pointers.patch
mm-thp-fix-mte-tag-mismatch-when-replacing-zero-filled-subpages.patch
mm-rmap-fix-soft-dirty-and-uffd-wp-bit-loss-when-remapping-zero-filled-mthp-subpage-to-shared-zeropage.patch
mm-clean-up-is_guard_pte_marker.patch
#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
syzbot reported multiple use-after-free bugs when accessing extent headers
in various ext4 functions. These occur because extent headers can be freed
by concurrent operations while other threads still hold pointers to them.
The issue is triggered by racing threads performing concurrent writes to
the same file. After commit 665575cff098 ("filemap: move prefaulting out
of hot write path"), the write path no longer prefaults pages in the hot
path, creating a wider race window where:
1. Thread A calls ext4_find_extent() and gets a path with extent headers
2. Thread A's write attempt fails, entering the slow path
3. During the gap, Thread B modifies the extent tree, freeing nodes
4. Thread A continues using the now-freed extent headers, causing UAF
Fix this by validating the extent header in ext4_find_extent() before
returning the path. This ensures all callers receive a valid extent path,
fixing the race at a single point rather than adding checks throughout
the codebase.
This addresses crashes in ext4_ext_insert_extent(), ext4_ext_binsearch(),
and potentially other locations that use extent paths.
Reported-by: syzbot+9db318d6167044609878(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9db318d6167044609878
Fixes: 665575cff098 ("filemap: move prefaulting out of hot write path")
Cc: stable(a)vger.kernel.org
Signed-off-by: Deepanshu Kartikey <kartikey406(a)gmail.com>
---
fs/ext4/extents.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index ca5499e9412b..04ceae5b0a34 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4200,6 +4200,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
unsigned int allocated_clusters = 0;
struct ext4_allocation_request ar;
ext4_lblk_t cluster_offset;
+ struct ext4_extent_header *eh;
ext_debug(inode, "blocks %u/%u requested\n", map->m_lblk, map->m_len);
trace_ext4_ext_map_blocks_enter(inode, map->m_lblk, map->m_len, flags);
@@ -4212,7 +4213,12 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
}
depth = ext_depth(inode);
-
+ eh = path[depth].p_hdr;
+ if (!eh || le16_to_cpu(eh->eh_magic) != EXT4_EXT_MAGIC) {
+ EXT4_ERROR_INODE(inode, "invalid extent header after find_extent");
+ err = -EFSCORRUPTED;
+ goto out;
+ }
/*
* consistent leaf must not be empty;
* this situation is possible, though, _during_ tree modification;
--
2.43.0