kmsan_free_page() is called by the page allocator's free_pages_prepare()
during page freeing. It's job is to poison all the memory covered by the
page. It can be called with an order-0 page, a compound high-order page
or a non-compound high-order page. But page_size() only works for
order-0 and compound pages. For a non-compound high-order page it will
incorrectly return PAGE_SIZE.
The implication is that the tail pages of a high-order non-compound page
do not get poisoned at free, so any invalid access while they are free
could go unnoticed. It looks like the pages will be poisoned again at
allocaiton time, so that would bookend the window.
Fix this by using the order parameter to calculate the size.
Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations")
Cc: stable(a)vger.kernel.org
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
---
Hi,
I noticed this during code review, so perhaps I've just misunderstood the intent
of the code.
I don't have the means to compile and run on x86 with KMSAN enabled though, so
punting this out hoping someone might be able to validate/test. I guess there is
a small chance this could lead to KMSAN finding some new issues?
Applies against today's mm-unstable (344d3580dacd).
Thanks,
Ryan
mm/kmsan/shadow.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
index e7f554a31bb4..9e1c5f2b7a41 100644
--- a/mm/kmsan/shadow.c
+++ b/mm/kmsan/shadow.c
@@ -207,7 +207,7 @@ void kmsan_free_page(struct page *page, unsigned int order)
if (!kmsan_enabled || kmsan_in_runtime())
return;
kmsan_enter_runtime();
- kmsan_internal_poison_memory(page_address(page), page_size(page),
+ kmsan_internal_poison_memory(page_address(page), PAGE_SIZE << order,
GFP_KERNEL & ~(__GFP_RECLAIM),
KMSAN_POISON_CHECK | KMSAN_POISON_FREE);
kmsan_leave_runtime();
--
2.43.0
The patch titled
Subject: mm: kmsan: fix poisoning of high-order non-compound pages
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-kmsan-fix-poisoning-of-high-order-non-compound-pages.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via various
branches at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there most days
------------------------------------------------------
From: Ryan Roberts <ryan.roberts(a)arm.com>
Subject: mm: kmsan: fix poisoning of high-order non-compound pages
Date: Sun, 4 Jan 2026 13:43:47 +0000
kmsan_free_page() is called by the page allocator's free_pages_prepare()
during page freeing. Its job is to poison all the memory covered by the
page. It can be called with an order-0 page, a compound high-order page
or a non-compound high-order page. But page_size() only works for order-0
and compound pages. For a non-compound high-order page it will
incorrectly return PAGE_SIZE.
The implication is that the tail pages of a high-order non-compound page
do not get poisoned at free, so any invalid access while they are free
could go unnoticed. It looks like the pages will be poisoned again at
allocation time, so that would bookend the window.
Fix this by using the order parameter to calculate the size.
Link: https://lkml.kernel.org/r/20260104134348.3544298-1-ryan.roberts@arm.com
Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations")
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Dmitriy Vyukov <dvyukov(a)google.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kmsan/shadow.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/kmsan/shadow.c~mm-kmsan-fix-poisoning-of-high-order-non-compound-pages
+++ a/mm/kmsan/shadow.c
@@ -207,7 +207,7 @@ void kmsan_free_page(struct page *page,
if (!kmsan_enabled || kmsan_in_runtime())
return;
kmsan_enter_runtime();
- kmsan_internal_poison_memory(page_address(page), page_size(page),
+ kmsan_internal_poison_memory(page_address(page), PAGE_SIZE << order,
GFP_KERNEL & ~(__GFP_RECLAIM),
KMSAN_POISON_CHECK | KMSAN_POISON_FREE);
kmsan_leave_runtime();
_
Patches currently in -mm which might be from ryan.roberts(a)arm.com are
mm-kmsan-fix-poisoning-of-high-order-non-compound-pages.patch
From: Shigeru Yoshida <syoshida(a)redhat.com>
commit fc1092f51567277509563800a3c56732070b6aa4 upstream.
KMSAN reported uninit-value access in __ip_make_skb() [1]. __ip_make_skb()
tests HDRINCL to know if the skb has icmphdr. However, HDRINCL can cause a
race condition. If calling setsockopt(2) with IP_HDRINCL changes HDRINCL
while __ip_make_skb() is running, the function will access icmphdr in the
skb even if it is not included. This causes the issue reported by KMSAN.
Check FLOWI_FLAG_KNOWN_NH on fl4->flowi4_flags instead of testing HDRINCL
on the socket.
Also, fl4->fl4_icmp_type and fl4->fl4_icmp_code are not initialized. These
are union in struct flowi4 and are implicitly initialized by
flowi4_init_output(), but we should not rely on specific union layout.
Initialize these explicitly in raw_sendmsg().
[1]
BUG: KMSAN: uninit-value in __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
__ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
ip_finish_skb include/net/ip.h:243 [inline]
ip_push_pending_frames+0x4c/0x5c0 net/ipv4/ip_output.c:1508
raw_sendmsg+0x2381/0x2690 net/ipv4/raw.c:654
inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x274/0x3c0 net/socket.c:745
__sys_sendto+0x62c/0x7b0 net/socket.c:2191
__do_sys_sendto net/socket.c:2203 [inline]
__se_sys_sendto net/socket.c:2199 [inline]
__x64_sys_sendto+0x130/0x200 net/socket.c:2199
do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x6d/0x75
Uninit was created at:
slab_post_alloc_hook mm/slub.c:3804 [inline]
slab_alloc_node mm/slub.c:3845 [inline]
kmem_cache_alloc_node+0x5f6/0xc50 mm/slub.c:3888
kmalloc_reserve+0x13c/0x4a0 net/core/skbuff.c:577
__alloc_skb+0x35a/0x7c0 net/core/skbuff.c:668
alloc_skb include/linux/skbuff.h:1318 [inline]
__ip_append_data+0x49ab/0x68c0 net/ipv4/ip_output.c:1128
ip_append_data+0x1e7/0x260 net/ipv4/ip_output.c:1365
raw_sendmsg+0x22b1/0x2690 net/ipv4/raw.c:648
inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x274/0x3c0 net/socket.c:745
__sys_sendto+0x62c/0x7b0 net/socket.c:2191
__do_sys_sendto net/socket.c:2203 [inline]
__se_sys_sendto net/socket.c:2199 [inline]
__x64_sys_sendto+0x130/0x200 net/socket.c:2199
do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x6d/0x75
CPU: 1 PID: 15709 Comm: syz-executor.7 Not tainted 6.8.0-11567-gb3603fcb79b1 #25
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
Fixes: 99e5acae193e ("ipv4: Fix potential uninit variable access bug in __ip_make_skb()")
Reported-by: syzkaller <syzkaller(a)googlegroups.com>
Signed-off-by: Shigeru Yoshida <syoshida(a)redhat.com>
Link: https://lore.kernel.org/r/20240430123945.2057348-1-syoshida@redhat.com
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Signed-off-by: Shubham Kulkarni <skulkarni(a)mvista.com>
---
Referred stable v6.1.y version of the patch to generate this one
[ v6.1 link: https://github.com/gregkh/linux/commit/55bf541e018b76b3750cb6c6ea18c46e1ac5… ]
---
net/ipv4/ip_output.c | 3 ++-
net/ipv4/raw.c | 3 +++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 1e430e135aa6..70c316c0537f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1572,7 +1572,8 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
* so icmphdr does not in skb linear region and can not get icmp_type
* by icmp_hdr(skb)->type.
*/
- if (sk->sk_type == SOCK_RAW && !inet_sk(sk)->hdrincl)
+ if (sk->sk_type == SOCK_RAW &&
+ !(fl4->flowi4_flags & FLOWI_FLAG_KNOWN_NH))
icmp_type = fl4->fl4_icmp_type;
else
icmp_type = icmp_hdr(skb)->type;
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 650da4d8f7ad..5a8d5ebe6590 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -634,6 +634,9 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
(hdrincl ? FLOWI_FLAG_KNOWN_NH : 0),
daddr, saddr, 0, 0, sk->sk_uid);
+ fl4.fl4_icmp_type = 0;
+ fl4.fl4_icmp_code = 0;
+
if (!hdrincl) {
rfv.msg = msg;
rfv.hlen = 0;
--
2.25.1
From: Shigeru Yoshida <syoshida(a)redhat.com>
commit 4e13d3a9c25b7080f8a619f961e943fe08c2672c upstream.
As it was done in commit fc1092f51567 ("ipv4: Fix uninit-value access in
__ip_make_skb()") for IPv4, check FLOWI_FLAG_KNOWN_NH on fl6->flowi6_flags
instead of testing HDRINCL on the socket to avoid a race condition which
causes uninit-value access.
Fixes: ea30388baebc ("ipv6: Fix an uninit variable access bug in __ip6_make_skb()")
Signed-off-by: Shigeru Yoshida <syoshida(a)redhat.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Shubham Kulkarni <skulkarni(a)mvista.com>
---
Referred stable v6.1.y version of the patch to generate this one
[ v6.1 link: https://github.com/gregkh/linux/commit/a05c1ede50e9656f0752e523c7b54f3a3489… ]
---
net/ipv6/ip6_output.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 426330b8dfa4..99ee18b3a953 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1917,7 +1917,8 @@ struct sk_buff *__ip6_make_skb(struct sock *sk,
struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
u8 icmp6_type;
- if (sk->sk_socket->type == SOCK_RAW && !inet_sk(sk)->hdrincl)
+ if (sk->sk_socket->type == SOCK_RAW &&
+ !(fl6->flowi6_flags & FLOWI_FLAG_KNOWN_NH))
icmp6_type = fl6->fl6_icmp_type;
else
icmp6_type = icmp6_hdr(skb)->icmp6_type;
--
2.25.1
Commit ef0ff68351be ("driver core: Probe devices asynchronously instead of
the driver") speeds up the loading of large numbers of device drivers by
submitting asynchronous probe workers to an unbounded workqueue and binding
each worker to the CPU near the device’s NUMA node. These workers are not
scheduled on isolated CPUs because their cpumask is restricted to
housekeeping_cpumask(HK_TYPE_WQ) and housekeeping_cpumask(HK_TYPE_DOMAIN).
However, when PCI devices reside on the same NUMA node, all their
drivers’ probe workers are bound to the same CPU within that node, yet
the probes still run in parallel because pci_call_probe() invokes
work_on_cpu(). Introduced by commit 873392ca514f ("PCI: work_on_cpu: use
in drivers/pci/pci-driver.c"), work_on_cpu() queues a worker on
system_percpu_wq to bind the probe thread to the first CPU in the
device’s NUMA node (chosen via cpumask_any_and() in pci_call_probe()).
1. The function __driver_attach() submits an asynchronous worker with
callback __driver_attach_async_helper().
__driver_attach()
async_schedule_dev(__driver_attach_async_helper, dev)
async_schedule_node(func, dev, dev_to_node(dev))
async_schedule_node_domain(func, data, node, &async_dfl_domain)
__async_schedule_node_domain(func, data, node, domain, entry)
queue_work_node(node, async_wq, &entry->work)
2. The asynchronous probe worker ultimately calls work_on_cpu() in
pci_call_probe(), binding the worker to the same CPU within the
device’s NUMA node.
__driver_attach_async_helper()
driver_probe_device(drv, dev)
__driver_probe_device(drv, dev)
really_probe(dev, drv)
call_driver_probe(dev, drv)
dev->bus->probe(dev)
pci_device_probe(dev)
__pci_device_probe(drv, pci_dev)
pci_call_probe(drv, pci_dev, id)
cpu = cpumask_any_and(cpumask_of_node(node), wq_domain_mask)
error = work_on_cpu(cpu, local_pci_probe, &ddi)
schedule_work_on(cpu, &wfc.work);
queue_work_on(cpu, system_percpu_wq, work)
To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
already running inside an unbounded asynchronous worker. Because a driver
can be probed asynchronously either by probe_type or by the kernel command
line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test
the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is
executing within an unbounded workqueue worker and should skip the extra
work_on_cpu() call.
Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64
2.4 GHz processor shows a 35 % probe-time improvement with the patch:
Before (all on CPU 0):
nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns
After (spread across CPUs 1, 2, 5):
nvme 0000:01:00.0: CPU: 5, COMM: kworker/u1025:5, probe cost: 34765890 ns
nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:2, probe cost: 34696433 ns
nvme 0000:03:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 33233323 ns
The improvement grows with more PCI devices because fewer probes contend
for the same CPU.
Fixes: ef0ff68351be ("driver core: Probe devices asynchronously instead of the driver")
Cc: stable(a)vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam(a)bytedance.com>
---
drivers/pci/pci-driver.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 7c2d9d596258..4bc47a84d330 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -366,9 +366,11 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
/*
* Prevent nesting work_on_cpu() for the case where a Virtual Function
* device is probed from work_on_cpu() of the Physical device.
+ * Check PF_WQ_WORKER to prevent invoking work_on_cpu() in an asynchronous
+ * probe worker when the driver allows asynchronous probing.
*/
if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
- pci_physfn_is_probed(dev)) {
+ pci_physfn_is_probed(dev) || (current->flags & PF_WQ_WORKER)) {
cpu = nr_cpu_ids;
} else {
cpumask_var_t wq_domain_mask;
--
2.20.1
crypto_alloc_acomp_node() may return ERR_PTR(), but the fail path checks
only for NULL and can pass an error pointer to crypto_free_acomp().
Use IS_ERR_OR_NULL() to only free valid acomp instances.
Fixes: 779b9955f643 ("mm: zswap: move allocations during CPU init outside the lock")
Cc: stable(a)vger.kernel.org
Signed-off-by: Pavel Butsykin <pbutsykin(a)cloudlinux.com>
---
mm/zswap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 5d0f8b13a958..ac9b7a60736b 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -787,7 +787,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
return 0;
fail:
- if (acomp)
+ if (!IS_ERR_OR_NULL(acomp))
crypto_free_acomp(acomp);
kfree(buffer);
return ret;
--
2.52.0
On Fri, 2 Jan 2026, at 3:52 PM, Icenowy Zheng wrote:
> Currently the LS7A GMAC device tree node lacks a proper phy-handle
> property pointing to the PHY node.
>
> In addition, the phy-mode property specifies "rgmii" without any
> internal delay information, which means the board trace needs to add 2ns
> delay to the RGMII data lines; but that isn't known to happen on any
> Loongson board. The ACPI-based initialization codepath, which is used on
> LoongArch-based 3A5000 + 7A1000 hardwares, specifies "rgmii-id" phy
> mode, which should be the one we are using.
>
> Add the lacking phy-handle property and set proper phy-mode.
>
> Tested on a LS3A4000_7A1000_NUC_BOARD_V2.1 board with YT8521S PHY.
>
> Signed-off-by: Icenowy Zheng <zhengxingda(a)iscas.ac.cn>
Good catch! This with fine with realtek phy chips but YT8521S seems
to be picky.
Reviewed-by: Jiaxun Yang <jiaxun.yang(a)flygoat.com>
Also maybe:
Cc: stable(a)vger.kernel.org
Given those boards rely on built-in DT.
Thanks
> ---
> arch/mips/boot/dts/loongson/ls7a-pch.dtsi | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/mips/boot/dts/loongson/ls7a-pch.dtsi
> b/arch/mips/boot/dts/loongson/ls7a-pch.dtsi
> index ee71045883e7e..6dee85909f5a6 100644
> --- a/arch/mips/boot/dts/loongson/ls7a-pch.dtsi
> +++ b/arch/mips/boot/dts/loongson/ls7a-pch.dtsi
> @@ -199,7 +199,8 @@ gmac@3,0 {
> <13 IRQ_TYPE_LEVEL_HIGH>;
> interrupt-names = "macirq", "eth_lpi";
> interrupt-parent = <&pic>;
> - phy-mode = "rgmii";
> + phy-mode = "rgmii-id";
> + phy-handle = <&phy0>;
> mdio {
> #address-cells = <1>;
> #size-cells = <0>;
> @@ -222,7 +223,8 @@ gmac@3,1 {
> <15 IRQ_TYPE_LEVEL_HIGH>;
> interrupt-names = "macirq", "eth_lpi";
> interrupt-parent = <&pic>;
> - phy-mode = "rgmii";
> + phy-mode = "rgmii-id";
> + phy-handle = <&phy1>;
> mdio {
> #address-cells = <1>;
> #size-cells = <0>;
> --
> 2.52.0
--
- Jiaxun