The patch titled
Subject: mm, mmap: limit THP aligment of anonymous mappings to PMD-aligned sizes
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-mmap-limit-thp-aligment-of-anonymous-mappings-to-pmd-aligned-sizes.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Vlastimil Babka <vbabka(a)suse.cz>
Subject: mm, mmap: limit THP aligment of anonymous mappings to PMD-aligned sizes
Date: Thu, 24 Oct 2024 17:12:29 +0200
Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
boundaries") a mmap() of anonymous memory without a specific address hint
and of at least PMD_SIZE will be aligned to PMD so that it can benefit
from a THP backing page.
However this change has been shown to regress some workloads
significantly. [1] reports regressions in various spec benchmarks, with
up to 600% slowdown of the cactusBSSN benchmark on some platforms. The
benchmark seems to create many mappings of 4632kB, which would have merged
to a large THP-backed area before commit efa7df3e3bb5 and now they are
fragmented to multiple areas each aligned to PMD boundary with gaps
between. The regression then seems to be caused mainly due to the
benchmark's memory access pattern suffering from TLB or cache aliasing due
to the aligned boundaries of the individual areas.
Another known regression bisected to commit efa7df3e3bb5 is darktable [2]
[3] and early testing suggests this patch fixes the regression there as
well.
To fix the regression but still try to benefit from THP-friendly anonymous
mapping alignment, add a condition that the size of the mapping must be a
multiple of PMD size instead of at least PMD size. In case of many
odd-sized mapping like the cactusBSSN creates, those will stop being
aligned and with gaps between, and instead naturally merge again.
Link: https://lkml.kernel.org/r/20241024151228.101841-2-vbabka@suse.cz
Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz>
Reported-by: Michael Matz <matz(a)suse.de>
Debugged-by: Gabriel Krisman Bertazi <gabriel(a)krisman.be>
Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1]
Reported-by: Matthias Bodenbinder <matthias(a)bodenbinder.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2]
Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.i… [3]
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Yang Shi <yang(a)os.amperecomputing.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Petr Tesarik <ptesarik(a)suse.com>
Cc: Thorsten Leemhuis <regressions(a)leemhuis.info>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mmap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/mmap.c~mm-mmap-limit-thp-aligment-of-anonymous-mappings-to-pmd-aligned-sizes
+++ a/mm/mmap.c
@@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, u
if (get_area) {
addr = get_area(file, addr, len, pgoff, flags);
- } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+ } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
+ && IS_ALIGNED(len, PMD_SIZE)) {
/* Ensures that larger anonymous mappings are THP aligned. */
addr = thp_get_unmapped_area_vmflags(file, addr, len,
pgoff, flags, vm_flags);
_
Patches currently in -mm which might be from vbabka(a)suse.cz are
mm-mmap-limit-thp-aligment-of-anonymous-mappings-to-pmd-aligned-sizes.patch
The patch titled
Subject: mm: shrinker: avoid memleak in alloc_shrinker_info
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-shrinker-avoid-memleak-in-alloc_shrinker_info.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Chen Ridong <chenridong(a)huawei.com>
Subject: mm: shrinker: avoid memleak in alloc_shrinker_info
Date: Fri, 25 Oct 2024 06:09:42 +0000
A memleak was found as below:
unreferenced object 0xffff8881010d2a80 (size 32):
comm "mkdir", pid 1559, jiffies 4294932666
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 @...............
backtrace (crc 2e7ef6fa):
[<ffffffff81372754>] __kmalloc_node_noprof+0x394/0x470
[<ffffffff813024ab>] alloc_shrinker_info+0x7b/0x1a0
[<ffffffff813b526a>] mem_cgroup_css_online+0x11a/0x3b0
[<ffffffff81198dd9>] online_css+0x29/0xa0
[<ffffffff811a243d>] cgroup_apply_control_enable+0x20d/0x360
[<ffffffff811a5728>] cgroup_mkdir+0x168/0x5f0
[<ffffffff8148543e>] kernfs_iop_mkdir+0x5e/0x90
[<ffffffff813dbb24>] vfs_mkdir+0x144/0x220
[<ffffffff813e1c97>] do_mkdirat+0x87/0x130
[<ffffffff813e1de9>] __x64_sys_mkdir+0x49/0x70
[<ffffffff81f8c928>] do_syscall_64+0x68/0x140
[<ffffffff8200012f>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
alloc_shrinker_info(), when shrinker_unit_alloc() returns an errer, the
info won't be freed. Just fix it.
Link: https://lkml.kernel.org/r/20241025060942.1049263-1-chenridong@huaweicloud.c…
Fixes: 307bececcd12 ("mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}")
Signed-off-by: Chen Ridong <chenridong(a)huawei.com>
Acked-by: Qi Zheng <zhengqi.arch(a)bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin(a)linux.dev>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Dave Chinner <david(a)fromorbit.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Wang Weiyang <wangweiyang2(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/shrinker.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
--- a/mm/shrinker.c~mm-shrinker-avoid-memleak-in-alloc_shrinker_info
+++ a/mm/shrinker.c
@@ -76,19 +76,21 @@ void free_shrinker_info(struct mem_cgrou
int alloc_shrinker_info(struct mem_cgroup *memcg)
{
- struct shrinker_info *info;
int nid, ret = 0;
int array_size = 0;
mutex_lock(&shrinker_mutex);
array_size = shrinker_unit_size(shrinker_nr_max);
for_each_node(nid) {
- info = kvzalloc_node(sizeof(*info) + array_size, GFP_KERNEL, nid);
+ struct shrinker_info *info = kvzalloc_node(sizeof(*info) + array_size,
+ GFP_KERNEL, nid);
if (!info)
goto err;
info->map_nr_max = shrinker_nr_max;
- if (shrinker_unit_alloc(info, NULL, nid))
+ if (shrinker_unit_alloc(info, NULL, nid)) {
+ kvfree(info);
goto err;
+ }
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
}
mutex_unlock(&shrinker_mutex);
_
Patches currently in -mm which might be from chenridong(a)huawei.com are
mm-shrinker-avoid-memleak-in-alloc_shrinker_info.patch
We notice some platforms set "snps,dis_u3_susphy_quirk" and
"snps,dis_u2_susphy_quirk" when they should not need to. Just make sure that
the GUSB3PIPECTL.SUSPENDENABLE and GUSB2PHYCFG.SUSPHY are clear during
initialization. The host initialization involved xhci. So the dwc3 needs to
implement the xhci_plat_priv->plat_start() for xhci to re-enable the suspend
bits.
Since there's a prerequisite patch to drivers/usb/host/xhci-plat.h that's not a
fix patch, this series should go on Greg's usb-testing branch instead of
usb-linus.
Thinh Nguyen (2):
usb: xhci-plat: Don't include xhci.h
usb: dwc3: core: Prevent phy suspend during init
drivers/usb/dwc3/core.c | 90 +++++++++++++++---------------------
drivers/usb/dwc3/core.h | 1 +
drivers/usb/dwc3/gadget.c | 2 +
drivers/usb/dwc3/host.c | 27 +++++++++++
drivers/usb/host/xhci-plat.h | 4 +-
5 files changed, 71 insertions(+), 53 deletions(-)
base-commit: 3d122e6d27e417a9fa91181922743df26b2cd679
--
2.28.0
The patch titled
Subject: vmscan,migrate: fix double-decrement on node stats when demoting pages
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
vmscanmigrate-fix-double-decrement-on-node-stats-when-demoting-pages.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Gregory Price <gourry(a)gourry.net>
Subject: vmscan,migrate: fix double-decrement on node stats when demoting pages
Date: Fri, 25 Oct 2024 10:17:24 -0400
When numa balancing is enabled with demotion, vmscan will call
migrate_pages when shrinking LRUs. Successful demotions will cause node
vmstat numbers to double-decrement, leading to an imbalanced page count.
The result is dmesg output like such:
$ cat /proc/sys/vm/stat_refresh
[77383.088417] vmstat_refresh: nr_isolated_anon -103212
[77383.088417] vmstat_refresh: nr_isolated_file -899642
This negative value may impact compaction and reclaim throttling.
The double-decrement occurs in the migrate_pages path:
caller to shrink_folio_list decrements the count
shrink_folio_list
demote_folio_list
migrate_pages
migrate_pages_batch
migrate_folio_move
migrate_folio_done
mod_node_page_state(-ve) <- second decrement
This path happens for SUCCESSFUL migrations, not failures. Typically
callers to migrate_pages are required to handle putback/accounting for
failures, but this is already handled in the shrink code.
When accounting for migrations, instead do not decrement the count when
the migration reason is MR_DEMOTION. As of v6.11, this demotion logic is
the only source of MR_DEMOTION.
Link: https://lkml.kernel.org/r/20241025141724.17927-1-gourry@gourry.net
Fixes: 26aa2d199d6f2 ("mm/migrate: demote pages during reclaim")
Signed-off-by: Gregory Price <gourry(a)gourry.net>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Huang Ying <ying.huang(a)intel.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Wei Xu <weixugc(a)google.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/migrate.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/migrate.c~vmscanmigrate-fix-double-decrement-on-node-stats-when-demoting-pages
+++ a/mm/migrate.c
@@ -1178,7 +1178,7 @@ static void migrate_folio_done(struct fo
* not accounted to NR_ISOLATED_*. They can be recognized
* as __folio_test_movable
*/
- if (likely(!__folio_test_movable(src)))
+ if (likely(!__folio_test_movable(src)) && reason != MR_DEMOTION)
mod_node_page_state(folio_pgdat(src), NR_ISOLATED_ANON +
folio_is_file_lru(src), -folio_nr_pages(src));
_
Patches currently in -mm which might be from gourry(a)gourry.net are
vmscanmigrate-fix-double-decrement-on-node-stats-when-demoting-pages.patch
Returning an abort to the guest for an unsupported MMIO access is a
documented feature of the KVM UAPI. Nevertheless, it's clear that this
plumbing has seen limited testing, since userspace can trivially cause a
WARN in the MMIO return:
WARNING: CPU: 0 PID: 30558 at arch/arm64/include/asm/kvm_emulate.h:536 kvm_handle_mmio_return+0x46c/0x5c4 arch/arm64/include/asm/kvm_emulate.h:536
Call trace:
kvm_handle_mmio_return+0x46c/0x5c4 arch/arm64/include/asm/kvm_emulate.h:536
kvm_arch_vcpu_ioctl_run+0x98/0x15b4 arch/arm64/kvm/arm.c:1133
kvm_vcpu_ioctl+0x75c/0xa78 virt/kvm/kvm_main.c:4487
__do_sys_ioctl fs/ioctl.c:51 [inline]
__se_sys_ioctl fs/ioctl.c:893 [inline]
__arm64_sys_ioctl+0x14c/0x1c8 fs/ioctl.c:893
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
el0_svc_common+0x1e0/0x23c arch/arm64/kernel/syscall.c:132
do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
el0_svc+0x38/0x68 arch/arm64/kernel/entry-common.c:712
el0t_64_sync_handler+0x90/0xfc arch/arm64/kernel/entry-common.c:730
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
The splat is complaining that KVM is advancing PC while an exception is
pending, i.e. that KVM is retiring the MMIO instruction despite a
pending synchronous external abort. Womp womp.
Fix the glaring UAPI bug by skipping over all the MMIO emulation in
case there is a pending synchronous exception. Note that while userspace
is capable of pending an asynchronous exception (SError, IRQ, or FIQ),
it is still safe to retire the MMIO instruction in this case as (1) they
are by definition asynchronous, and (2) KVM relies on hardware support
for pending/delivering these exceptions instead of the software state
machine for advancing PC.
Cc: stable(a)vger.kernel.org
Fixes: da345174ceca ("KVM: arm/arm64: Allow user injection of external data aborts")
Reported-by: Alexander Potapenko <glider(a)google.com>
Signed-off-by: Oliver Upton <oliver.upton(a)linux.dev>
---
arch/arm64/kvm/mmio.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
index cd6b7b83e2c3..ab365e839874 100644
--- a/arch/arm64/kvm/mmio.c
+++ b/arch/arm64/kvm/mmio.c
@@ -72,6 +72,31 @@ unsigned long kvm_mmio_read_buf(const void *buf, unsigned int len)
return data;
}
+static bool kvm_pending_sync_exception(struct kvm_vcpu *vcpu)
+{
+ if (!vcpu_get_flag(vcpu, PENDING_EXCEPTION))
+ return false;
+
+ if (vcpu_el1_is_32bit(vcpu)) {
+ switch (vcpu_get_flag(vcpu, EXCEPT_MASK)) {
+ case unpack_vcpu_flag(EXCEPT_AA32_UND):
+ case unpack_vcpu_flag(EXCEPT_AA32_IABT):
+ case unpack_vcpu_flag(EXCEPT_AA32_DABT):
+ return true;
+ default:
+ return false;
+ }
+ } else {
+ switch (vcpu_get_flag(vcpu, EXCEPT_MASK)) {
+ case unpack_vcpu_flag(EXCEPT_AA64_EL1_SYNC):
+ case unpack_vcpu_flag(EXCEPT_AA64_EL2_SYNC):
+ return true;
+ default:
+ return false;
+ }
+ }
+}
+
/**
* kvm_handle_mmio_return -- Handle MMIO loads after user space emulation
* or in-kernel IO emulation
@@ -84,8 +109,11 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu)
unsigned int len;
int mask;
- /* Detect an already handled MMIO return */
- if (unlikely(!vcpu->mmio_needed))
+ /*
+ * Detect if the MMIO return was already handled or if userspace aborted
+ * the MMIO access.
+ */
+ if (unlikely(!vcpu->mmio_needed || kvm_pending_sync_exception(vcpu)))
return 1;
vcpu->mmio_needed = 0;
--
2.47.0.163.g1226f6d8fa-goog
The patch titled
Subject: sched/numa: fix the potential null pointer dereference in task_numa_work()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
sched-numa-fix-the-potential-null-pointer-dereference-in-task_numa_work.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Shawn Wang <shawnwang(a)linux.alibaba.com>
Subject: sched/numa: fix the potential null pointer dereference in task_numa_work()
Date: Fri, 25 Oct 2024 10:22:08 +0800
When running stress-ng-vm-segv test, we found a null pointer dereference
error in task_numa_work(). Here is the backtrace:
[323676.066985] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
......
[323676.067108] CPU: 35 PID: 2694524 Comm: stress-ng-vm-se
......
[323676.067113] pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
[323676.067115] pc : vma_migratable+0x1c/0xd0
[323676.067122] lr : task_numa_work+0x1ec/0x4e0
[323676.067127] sp : ffff8000ada73d20
[323676.067128] x29: ffff8000ada73d20 x28: 0000000000000000 x27: 000000003e89f010
[323676.067130] x26: 0000000000080000 x25: ffff800081b5c0d8 x24: ffff800081b27000
[323676.067133] x23: 0000000000010000 x22: 0000000104d18cc0 x21: ffff0009f7158000
[323676.067135] x20: 0000000000000000 x19: 0000000000000000 x18: ffff8000ada73db8
[323676.067138] x17: 0001400000000000 x16: ffff800080df40b0 x15: 0000000000000035
[323676.067140] x14: ffff8000ada73cc8 x13: 1fffe0017cc72001 x12: ffff8000ada73cc8
[323676.067142] x11: ffff80008001160c x10: ffff000be639000c x9 : ffff8000800f4ba4
[323676.067145] x8 : ffff000810375000 x7 : ffff8000ada73974 x6 : 0000000000000001
[323676.067147] x5 : 0068000b33e26707 x4 : 0000000000000001 x3 : ffff0009f7158000
[323676.067149] x2 : 0000000000000041 x1 : 0000000000004400 x0 : 0000000000000000
[323676.067152] Call trace:
[323676.067153] vma_migratable+0x1c/0xd0
[323676.067155] task_numa_work+0x1ec/0x4e0
[323676.067157] task_work_run+0x78/0xd8
[323676.067161] do_notify_resume+0x1ec/0x290
[323676.067163] el0_svc+0x150/0x160
[323676.067167] el0t_64_sync_handler+0xf8/0x128
[323676.067170] el0t_64_sync+0x17c/0x180
[323676.067173] Code: d2888001 910003fd f9000bf3 aa0003f3 (f9401000)
[323676.067177] SMP: stopping secondary CPUs
[323676.070184] Starting crashdump kernel...
stress-ng-vm-segv in stress-ng is used to stress test the SIGSEGV error
handling function of the system, which tries to cause a SIGSEGV error on
return from unmapping the whole address space of the child process.
Normally this program will not cause kernel crashes. But before the
munmap system call returns to user mode, a potential task_numa_work() for
numa balancing could be added and executed. In this scenario, since the
child process has no vma after munmap, the vma_next() in task_numa_work()
will return a null pointer even if the vma iterator restarts from 0.
Recheck the vma pointer before dereferencing it in task_numa_work().
Link: https://lkml.kernel.org/r/20241025022208.125527-1-shawnwang@linux.alibaba.c…
Fixes: 214dbc428137 ("sched: convert to vma iterator")
Signed-off-by: Shawn Wang <shawnwang(a)linux.alibaba.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Ben Segall <bsegall(a)google.com>
Cc: Dietmar Eggemann <dietmar.eggemann(a)arm.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Juri Lelli <juri.lelli(a)redhat.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Steven Rostedt (Google) <rostedt(a)goodmis.org>
Cc: Valentin Schneider <vschneid(a)redhat.com>
Cc: Vincent Guittot <vincent.guittot(a)linaro.org>
Cc: <stable(a)vger.kernel.org> [6.2+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/kernel/sched/fair.c~sched-numa-fix-the-potential-null-pointer-dereference-in-task_numa_work
+++ a/kernel/sched/fair.c
@@ -3369,7 +3369,7 @@ retry_pids:
vma = vma_next(&vmi);
}
- do {
+ for (; vma; vma = vma_next(&vmi)) {
if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
@@ -3491,7 +3491,7 @@ retry_pids:
*/
if (vma_pids_forced)
break;
- } for_each_vma(vmi, vma);
+ }
/*
* If no VMAs are remaining and VMAs were skipped due to the PID
_
Patches currently in -mm which might be from shawnwang(a)linux.alibaba.com are
sched-numa-fix-the-potential-null-pointer-dereference-in-task_numa_work.patch
Commit 723e8462a4fe ("pinctrl: qcom: spmi-gpio: Fix the GPIO strength
mapping") fixed a long-standing issue in the Qualcomm SPMI PMIC gpio
driver which had the 'low' and 'high' drive strength settings switched
but failed to update the debugfs interface which still gets this wrong.
Fix the debugfs code so that the exported values match the hardware
settings.
Note that this probably means that most devicetrees that try to describe
the firmware settings got this wrong if the settings were derived from
debugfs. Before the above mentioned commit the settings would have
actually matched the firmware settings even if they were described
incorrectly, but now they are inverted.
Fixes: 723e8462a4fe ("pinctrl: qcom: spmi-gpio: Fix the GPIO strength mapping")
Fixes: eadff3024472 ("pinctrl: Qualcomm SPMI PMIC GPIO pin controller driver")
Cc: Anjelique Melendez <quic_amelende(a)quicinc.com>
Cc: stable(a)vger.kernel.org # 3.19
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
drivers/pinctrl/qcom/pinctrl-spmi-gpio.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pinctrl/qcom/pinctrl-spmi-gpio.c b/drivers/pinctrl/qcom/pinctrl-spmi-gpio.c
index 3d03293f6320..3a12304e2b7d 100644
--- a/drivers/pinctrl/qcom/pinctrl-spmi-gpio.c
+++ b/drivers/pinctrl/qcom/pinctrl-spmi-gpio.c
@@ -667,7 +667,7 @@ static void pmic_gpio_config_dbg_show(struct pinctrl_dev *pctldev,
"push-pull", "open-drain", "open-source"
};
static const char *const strengths[] = {
- "no", "high", "medium", "low"
+ "no", "low", "medium", "high"
};
pad = pctldev->desc->pins[pin].drv_data;
--
2.45.2
The quilt patch titled
Subject: mm/page_alloc: fix NUMA stats update for cpu-less nodes
has been removed from the -mm tree. Its filename was
mm-page_alloc-fix-numa-stats-update-for-cpu-less-nodes.patch
This patch was dropped because an updated version will be issued
------------------------------------------------------
From: Dongjoo Seo <dongjoo.linux.dev(a)gmail.com>
Subject: mm/page_alloc: fix NUMA stats update for cpu-less nodes
Date: Wed, 23 Oct 2024 10:50:37 -0700
In the case of memoryless node, when a process prefers a node with no
memory(e.g., because it is running on a CPU local to that node), the
kernel treats a nearby node with memory as the preferred node. As a
result, such allocations do not increment the numa_foreign counter on the
memoryless node, leading to skewed NUMA_HIT, NUMA_MISS, and NUMA_FOREIGN
stats for the nearest node.
This patch corrects this issue by:
1. Checking if the zone or preferred zone is CPU-less before updating
the NUMA stats.
2. Ensuring NUMA_HIT is only updated if the zone is not CPU-less.
3. Ensuring NUMA_FOREIGN is only updated if the preferred zone is not
CPU-less.
Example Before and After Patch:
- Before Patch:
node0 node1 node2
numa_hit 86333181 114338269 5108
numa_miss 5199455 0 56844591
numa_foreign 32281033 29763013 0
interleave_hit 91 91 0
local_node 86326417 114288458 0
other_node 5206219 49768 56849702
- After Patch:
node0 node1 node2
numa_hit 2523058 9225528 0
numa_miss 150213 10226 21495942
numa_foreign 17144215 4501270 0
interleave_hit 91 94 0
local_node 2493918 9208226 0
other_node 179351 27528 21495942
Similarly, in the context of cpuless nodes, this patch ensures that NUMA
statistics are accurately updated by adding checks to prevent the
miscounting of memory allocations when the involved nodes have no CPUs.
This ensures more precise tracking of memory access patterns across all
nodes, regardless of whether they have CPUs or not, improving the overall
reliability of NUMA stat. The reason is that page allocation from
dev_dax, cpuset, memcg .. comes with preferred allocating zone in cpuless
node and it's hard to track the zone info for miss information.
Link: https://lkml.kernel.org/r/20241023175037.9125-1-dongjoo.linux.dev@gmail.com
Signed-off-by: Dongjoo Seo <dongjoo.linux.dev(a)gmail.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Fan Ni <nifan(a)outlook.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Adam Manzanares <a.manzanares(a)samsung.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
--- a/mm/page_alloc.c~mm-page_alloc-fix-numa-stats-update-for-cpu-less-nodes
+++ a/mm/page_alloc.c
@@ -2858,19 +2858,21 @@ static inline void zone_statistics(struc
{
#ifdef CONFIG_NUMA
enum numa_stat_item local_stat = NUMA_LOCAL;
+ bool z_is_cpuless = !node_state(zone_to_nid(z), N_CPU);
+ bool pref_is_cpuless = !node_state(zone_to_nid(preferred_zone), N_CPU);
- /* skip numa counters update if numa stats is disabled */
if (!static_branch_likely(&vm_numa_stat_key))
return;
- if (zone_to_nid(z) != numa_node_id())
+ if (zone_to_nid(z) != numa_node_id() || z_is_cpuless)
local_stat = NUMA_OTHER;
- if (zone_to_nid(z) == zone_to_nid(preferred_zone))
+ if (zone_to_nid(z) == zone_to_nid(preferred_zone) && !z_is_cpuless)
__count_numa_events(z, NUMA_HIT, nr_account);
else {
__count_numa_events(z, NUMA_MISS, nr_account);
- __count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
+ if (!pref_is_cpuless)
+ __count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
}
__count_numa_events(z, local_stat, nr_account);
#endif
_
Patches currently in -mm which might be from dongjoo.linux.dev(a)gmail.com are
Changes since v1 [1]:
- Fix some misspellings missed by checkpatch in changelogs (Jonathan)
- Add comments explaining the order of objects in drivers/cxl/Makefile
(Jonathan)
- Rename attach_device => cxl_rescan_attach (Jonathan)
- Fixup Zijun's email (Zijun)
[1]: http://lore.kernel.org/172862483180.2150669.5564474284074502692.stgit@dwill…
---
Original cover:
Gregory's modest proposal to fix CXL cxl_mem_probe() failures due to
delayed arrival of the CXL "root" infrastructure [1] prompted questions
of how the existing mechanism for retrying cxl_mem_probe() could be
failing.
The critical missing piece in the debug was that Gregory's setup had
almost all CXL modules built-in to the kernel.
On the way to that discovery several other bugs and init-order corner
cases were discovered.
The main fix is to make sure the drivers/cxl/Makefile object order
supports root CXL ports being fully initialized upon cxl_acpi_probe()
exit. The modular case has some similar potential holes that are fixed
with MODULE_SOFTDEP() and other fix ups. Finally, an attempt to update
cxl_test to reproduce the original report resulted in the discovery of a
separate long standing use after free bug in cxl_region_detach().
[2]: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net
---
Dan Williams (6):
cxl/port: Fix CXL port initialization order when the subsystem is built-in
cxl/port: Fix cxl_bus_rescan() vs bus_rescan_devices()
cxl/acpi: Ensure ports ready at cxl_acpi_probe() return
cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
cxl/port: Prevent out-of-order decoder allocation
cxl/test: Improve init-order fidelity relative to real-world systems
drivers/base/core.c | 35 +++++++
drivers/cxl/Kconfig | 1
drivers/cxl/Makefile | 20 +++-
drivers/cxl/acpi.c | 7 +
drivers/cxl/core/hdm.c | 50 +++++++++--
drivers/cxl/core/port.c | 13 ++-
drivers/cxl/core/region.c | 91 ++++++++++---------
drivers/cxl/cxl.h | 3 -
include/linux/device.h | 3 +
tools/testing/cxl/test/cxl.c | 200 +++++++++++++++++++++++-------------------
tools/testing/cxl/test/mem.c | 1
11 files changed, 269 insertions(+), 155 deletions(-)
base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
From: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
KASAN reports that the GPU metrics table allocated in
vangogh_tables_init() is not large enough for the memset done in
smu_cmn_init_soft_gpu_metrics(). Condensed report follows:
[ 33.861314] BUG: KASAN: slab-out-of-bounds in smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu]
[ 33.861799] Write of size 168 at addr ffff888129f59500 by task mangoapp/1067
...
[ 33.861808] CPU: 6 UID: 1000 PID: 1067 Comm: mangoapp Tainted: G W 6.12.0-rc4 #356 1a56f59a8b5182eeaf67eb7cb8b13594dd23b544
[ 33.861816] Tainted: [W]=WARN
[ 33.861818] Hardware name: Valve Galileo/Galileo, BIOS F7G0107 12/01/2023
[ 33.861822] Call Trace:
[ 33.861826] <TASK>
[ 33.861829] dump_stack_lvl+0x66/0x90
[ 33.861838] print_report+0xce/0x620
[ 33.861853] kasan_report+0xda/0x110
[ 33.862794] kasan_check_range+0xfd/0x1a0
[ 33.862799] __asan_memset+0x23/0x40
[ 33.862803] smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[ 33.863306] vangogh_get_gpu_metrics_v2_4+0x123/0xad0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[ 33.864257] vangogh_common_get_gpu_metrics+0xb0c/0xbc0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[ 33.865682] amdgpu_dpm_get_gpu_metrics+0xcc/0x110 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[ 33.866160] amdgpu_get_gpu_metrics+0x154/0x2d0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[ 33.867135] dev_attr_show+0x43/0xc0
[ 33.867147] sysfs_kf_seq_show+0x1f1/0x3b0
[ 33.867155] seq_read_iter+0x3f8/0x1140
[ 33.867173] vfs_read+0x76c/0xc50
[ 33.867198] ksys_read+0xfb/0x1d0
[ 33.867214] do_syscall_64+0x90/0x160
...
[ 33.867353] Allocated by task 378 on cpu 7 at 22.794876s:
[ 33.867358] kasan_save_stack+0x33/0x50
[ 33.867364] kasan_save_track+0x17/0x60
[ 33.867367] __kasan_kmalloc+0x87/0x90
[ 33.867371] vangogh_init_smc_tables+0x3f9/0x840 [amdgpu]
[ 33.867835] smu_sw_init+0xa32/0x1850 [amdgpu]
[ 33.868299] amdgpu_device_init+0x467b/0x8d90 [amdgpu]
[ 33.868733] amdgpu_driver_load_kms+0x19/0xf0 [amdgpu]
[ 33.869167] amdgpu_pci_probe+0x2d6/0xcd0 [amdgpu]
[ 33.869608] local_pci_probe+0xda/0x180
[ 33.869614] pci_device_probe+0x43f/0x6b0
Empirically we can confirm that the former allocates 152 bytes for the
table, while the latter memsets the 168 large block.
This is somewhat alleviated by the fact that allocation goes into a 192
SLAB bucket, but then for v3_0 metrics the table grows to 264 bytes which
would definitely be a problem.
Root cause appears that when GPU metrics tables for v2_4 parts were added
it was not considered to enlarge the table to fit.
The fix in this patch is rather "brute force" and perhaps later should be
done in a smarter way, by extracting and consolidating the part version to
size logic to a common helper, instead of brute forcing the largest
possible allocation. Nevertheless, for now this works and fixes the out of
bounds write.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
Fixes: 41cec40bc9ba ("drm/amd/pm: Vangogh: Add new gpu_metrics_v2_4 to acquire gpu_metrics")
Cc: Mario Limonciello <mario.limonciello(a)amd.com>
Cc: Evan Quan <evan.quan(a)amd.com>
Cc: Wenyou Yang <WenYou.Yang(a)amd.com>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: <stable(a)vger.kernel.org> # v6.6+
---
drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
index 22737b11b1bf..36f4a4651918 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
@@ -242,7 +242,10 @@ static int vangogh_tables_init(struct smu_context *smu)
goto err0_out;
smu_table->metrics_time = 0;
- smu_table->gpu_metrics_table_size = max(sizeof(struct gpu_metrics_v2_3), sizeof(struct gpu_metrics_v2_2));
+ smu_table->gpu_metrics_table_size = sizeof(struct gpu_metrics_v2_2);
+ smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_3));
+ smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_4));
+ smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v3_0));
smu_table->gpu_metrics_table = kzalloc(smu_table->gpu_metrics_table_size, GFP_KERNEL);
if (!smu_table->gpu_metrics_table)
goto err1_out;
--
2.46.0