Linux-stable-mirror August 2025

linux-stable-mirror@lists.linaro.org


[merged mm-hotfixes-stable] kho-warn-if-kho-is-disabled-due-to-an-error.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: kho: warn if KHO is disabled due to an error
has been removed from the -mm tree.  Its filename was
     kho-warn-if-kho-is-disabled-due-to-an-error.patch

This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Subject: kho: warn if KHO is disabled due to an error
Date: Fri, 8 Aug 2025 20:18:04 +0000

During boot, the scratch area is allocated based on command line
parameters or calculated automatically.  However, the scratch area may
fail to allocate, and in that case KHO is disabled.  Currently, no warning
is printed when KHO is disabled, which makes it hard for the end user to
figure out why KHO is not available.  Add the missing warning message.

Link: https://lkml.kernel.org/r/20250808201804.772010-4-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Acked-by: Mike Rapoport (Microsoft) <rppt(a)kernel.org>
Acked-by: Pratyush Yadav <pratyush(a)kernel.org>
Cc: Alexander Graf <graf(a)amazon.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Changyuan Lyu <changyuanl(a)google.com>
Cc: Coiby Xu <coxu(a)redhat.com>
Cc: Dave Vasilevsky <dave(a)vasilevsky.ca>
Cc: Eric Biggers <ebiggers(a)google.com>
Cc: Kees Cook <kees(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 kernel/kexec_handover.c |    1 +
 1 file changed, 1 insertion(+)

--- a/kernel/kexec_handover.c~kho-warn-if-kho-is-disabled-due-to-an-error
+++ a/kernel/kexec_handover.c
@@ -564,6 +564,7 @@ err_free_scratch_areas:
 err_free_scratch_desc:
 	memblock_free(kho_scratch, kho_scratch_cnt * sizeof(*kho_scratch));
 err_disable_kho:
+	pr_warn("Failed to reserve scratch area, disabling kexec handover\n");
 	kho_enable = false;
 }

_

Patches currently in -mm which might be from pasha.tatashin(a)soleen.com are


[merged mm-hotfixes-stable] kho-mm-dont-allow-deferred-struct-page-with-kho.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: kho: mm: don't allow deferred struct page with KHO
has been removed from the -mm tree.  Its filename was
     kho-mm-dont-allow-deferred-struct-page-with-kho.patch

This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Subject: kho: mm: don't allow deferred struct page with KHO
Date: Fri, 8 Aug 2025 20:18:03 +0000

KHO uses struct pages for the preserved memory early in boot.  However,
with deferred struct page initialization, only a small portion of memory
has properly initialized struct pages.

This problem was detected where vmemmap is poisoned, and illegal flag
combinations are detected.

Don't allow them to be enabled together; later we will have to teach KHO
to work properly with the deferred struct page init kernel feature.

Link: https://lkml.kernel.org/r/20250808201804.772010-3-pasha.tatashin@soleen.com
Fixes: 4e1d010e3bda ("kexec: add config option for KHO")
Signed-off-by: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Acked-by: Mike Rapoport (Microsoft) <rppt(a)kernel.org>
Acked-by: Pratyush Yadav <pratyush(a)kernel.org>
Cc: Alexander Graf <graf(a)amazon.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Changyuan Lyu <changyuanl(a)google.com>
Cc: Coiby Xu <coxu(a)redhat.com>
Cc: Dave Vasilevsky <dave(a)vasilevsky.ca>
Cc: Eric Biggers <ebiggers(a)google.com>
Cc: Kees Cook <kees(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 kernel/Kconfig.kexec |    1 +
 1 file changed, 1 insertion(+)

--- a/kernel/Kconfig.kexec~kho-mm-dont-allow-deferred-struct-page-with-kho
+++ a/kernel/Kconfig.kexec
@@ -97,6 +97,7 @@ config KEXEC_JUMP
 config KEXEC_HANDOVER
 	bool "kexec handover"
 	depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
+	depends on !DEFERRED_STRUCT_PAGE_INIT
 	select MEMBLOCK_KHO_SCRATCH
 	select KEXEC_FILE
 	select DEBUG_FS
_

Patches currently in -mm which might be from pasha.tatashin(a)soleen.com are


[merged mm-hotfixes-stable] kho-init-new_physxa-phys_bits-to-fix-lockdep.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: kho: init new_physxa->phys_bits to fix lockdep
has been removed from the -mm tree.  Its filename was
     kho-init-new_physxa-phys_bits-to-fix-lockdep.patch

This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Subject: kho: init new_physxa->phys_bits to fix lockdep
Date: Fri, 8 Aug 2025 20:18:02 +0000

Patch series "Several KHO Hotfixes".

Three unrelated fixes for Kexec Handover.


This patch (of 3):

Lockdep shows the following warning:

INFO: trying to register non-static key.  The code is fine but needs
lockdep annotation, or maybe you didn't initialize this object before use?
turning off the locking correctness validator.

[<ffffffff810133a6>] dump_stack_lvl+0x66/0xa0
[<ffffffff8136012c>] assign_lock_key+0x10c/0x120
[<ffffffff81358bb4>] register_lock_class+0xf4/0x2f0
[<ffffffff813597ff>] __lock_acquire+0x7f/0x2c40
[<ffffffff81360cb0>] ? __pfx_hlock_conflict+0x10/0x10
[<ffffffff811707be>] ? native_flush_tlb_global+0x8e/0xa0
[<ffffffff8117096e>] ? __flush_tlb_all+0x4e/0xa0
[<ffffffff81172fc2>] ? __kernel_map_pages+0x112/0x140
[<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
[<ffffffff81359556>] lock_acquire+0xe6/0x280
[<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
[<ffffffff8100b9e0>] _raw_spin_lock+0x30/0x40
[<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
[<ffffffff813ec327>] xa_load_or_alloc+0x67/0xe0
[<ffffffff813eb4c0>] kho_preserve_folio+0x90/0x100
[<ffffffff813ebb7f>] __kho_finalize+0xcf/0x400
[<ffffffff813ebef4>] kho_finalize+0x34/0x70

This is because the xarray has its own lock, which is not initialized in
xa_load_or_alloc().

Modify __kho_preserve_order() to properly call
xa_init(&new_physxa->phys_bits).

Link: https://lkml.kernel.org/r/20250808201804.772010-2-pasha.tatashin@soleen.com
Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Acked-by: Mike Rapoport (Microsoft) <rppt(a)kernel.org>
Cc: Alexander Graf <graf(a)amazon.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Changyuan Lyu <changyuanl(a)google.com>
Cc: Coiby Xu <coxu(a)redhat.com>
Cc: Dave Vasilevsky <dave(a)vasilevsky.ca>
Cc: Eric Biggers <ebiggers(a)google.com>
Cc: Kees Cook <kees(a)kernel.org>
Cc: Pratyush Yadav <pratyush(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 kernel/kexec_handover.c |   28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

--- a/kernel/kexec_handover.c~kho-init-new_physxa-phys_bits-to-fix-lockdep
+++ a/kernel/kexec_handover.c
@@ -144,14 +144,34 @@ static int __kho_preserve_order(struct k
 				unsigned int order)
 {
 	struct kho_mem_phys_bits *bits;
-	struct kho_mem_phys *physxa;
+	struct kho_mem_phys *physxa, *new_physxa;
 	const unsigned long pfn_high = pfn >> order;

 	might_sleep();

-	physxa = xa_load_or_alloc(&track->orders, order, sizeof(*physxa));
-	if (IS_ERR(physxa))
-		return PTR_ERR(physxa);
+	physxa = xa_load(&track->orders, order);
+	if (!physxa) {
+		int err;
+
+		new_physxa = kzalloc(sizeof(*physxa), GFP_KERNEL);
+		if (!new_physxa)
+			return -ENOMEM;
+
+		xa_init(&new_physxa->phys_bits);
+		physxa = xa_cmpxchg(&track->orders, order, NULL, new_physxa,
+				    GFP_KERNEL);
+
+		err = xa_err(physxa);
+		if (err || physxa) {
+			xa_destroy(&new_physxa->phys_bits);
+			kfree(new_physxa);
+
+			if (err)
+				return err;
+		} else {
+			physxa = new_physxa;
+		}
+	}

 	bits = xa_load_or_alloc(&physxa->phys_bits, pfn_high / PRESERVE_BITS,
 				sizeof(*bits));
_

Patches currently in -mm which might be from pasha.tatashin(a)soleen.com are
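
The fix in the patch above follows a common pattern: fully initialise a new object, publish it with a compare-and-exchange, and throw it away if another caller installed one first. The following is a minimal userspace sketch of that pattern, assuming C11 atomics in place of the kernel xarray. The names struct tracker, slots and tracker_get_or_create() are hypothetical and exist only for this illustration; they are not kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_ORDER_SLOTS 16	/* hypothetical table size for this sketch */

/*
 * Stand-in for struct kho_mem_phys: an object with state that must be
 * set up before any other caller can reach it.
 */
struct tracker {
	bool lock_ready;	/* models what xa_init() prepares */
};

/* Stand-in for the track->orders xarray: one atomic slot per order. */
static _Atomic(struct tracker *) slots[MAX_ORDER_SLOTS];

/* Return the tracker for @order, creating it on first use; NULL on OOM. */
static struct tracker *tracker_get_or_create(unsigned int order)
{
	struct tracker *cur, *fresh, *expected = NULL;

	assert(order < MAX_ORDER_SLOTS);

	cur = atomic_load(&slots[order]);
	if (cur)
		return cur;

	fresh = calloc(1, sizeof(*fresh));
	if (!fresh)
		return NULL;
	fresh->lock_ready = true;	/* initialise BEFORE publishing */

	/* Publish only if the slot is still empty. */
	if (atomic_compare_exchange_strong(&slots[order], &expected, fresh))
		return fresh;

	/* Lost the race: drop our copy and use the one already installed. */
	free(fresh);
	return expected;
}
```

Initialising before publishing matters because, once the pointer is visible in the shared slot, another caller may use the embedded lock immediately.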


+ x86-mm-64-define-arch_page_table_sync_mask-and-arch_sync_kernel_mappings.patch added to mm-hotfixes-unstable branch

by Andrew Morton

The patch titled
     Subject: x86/mm/64: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings()
has been added to the -mm mm-hotfixes-unstable branch.  Its filename is
     x86-mm-64-define-arch_page_table_sync_mask-and-arch_sync_kernel_mappings.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…

This patch will later appear in the mm-hotfixes-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Harry Yoo <harry.yoo(a)oracle.com>
Subject: x86/mm/64: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings()
Date: Mon, 18 Aug 2025 11:02:06 +0900

Define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() to ensure
page tables are properly synchronized when calling p*d_populate_kernel().

For 5-level paging, synchronization is performed via
pgd_populate_kernel().  In 4-level paging, pgd_populate() is a no-op, so
synchronization is instead performed at the P4D level via
p4d_populate_kernel().

This fixes intermittent boot failures on systems using 4-level paging and
a large amount of persistent memory:

  BUG: unable to handle page fault for address: ffffe70000000034
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 0 P4D 0
  Oops: 0002 [#1] SMP NOPTI
  RIP: 0010:__init_single_page+0x9/0x6d
  Call Trace:
   <TASK>
   __init_zone_device_page+0x17/0x5d
   memmap_init_zone_device+0x154/0x1bb
   pagemap_range+0x2e0/0x40f
   memremap_pages+0x10b/0x2f0
   devm_memremap_pages+0x1e/0x60
   dev_dax_probe+0xce/0x2ec [device_dax]
   dax_bus_probe+0x6d/0xc9
   [... snip ...]
   </TASK>

It also fixes a crash in vmemmap_set_pmd() caused by accessing vmemmap
before sync_global_pgds() [1]:

  BUG: unable to handle page fault for address: ffffeb3ff1200000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
  Tainted: [W]=WARN
  RIP: 0010:vmemmap_set_pmd+0xff/0x230
   <TASK>
   vmemmap_populate_hugepages+0x176/0x180
   vmemmap_populate+0x34/0x80
   __populate_section_memmap+0x41/0x90
   sparse_add_section+0x121/0x3e0
   __add_pages+0xba/0x150
   add_pages+0x1d/0x70
   memremap_pages+0x3dc/0x810
   devm_memremap_pages+0x1c/0x60
   xe_devm_add+0x8b/0x100 [xe]
   xe_tile_init_noalloc+0x6a/0x70 [xe]
   xe_device_probe+0x48c/0x740 [xe]
   [... snip ...]

Link: https://lkml.kernel.org/r/20250818020206.4517-4-harry.yoo@oracle.com
Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
Signed-off-by: Harry Yoo <harry.yoo(a)oracle.com>
Closes: https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@in… [1]
Suggested-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Acked-by: Kiryl Shutsemau <kas(a)kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt(a)kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Andrey Konovalov <andreyknvl(a)gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Ard Biesheuvel <ardb(a)kernel.org>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: bibo mao <maobibo(a)loongson.cn>
Cc: Borislav Betkov <bp(a)alien8.de>
Cc: Christoph Lameter (Ampere) <cl(a)gentwo.org>
Cc: Dennis Zhou <dennis(a)kernel.org>
Cc: Dev Jain <dev.jain(a)arm.com>
Cc: Dmitriy Vyukov <dvyukov(a)google.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Jane Chu <jane.chu(a)oracle.com>
Cc: Joao Martins <joao.m.martins(a)oracle.com>
Cc: Joerg Roedel <joro(a)8bytes.org>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Kevin Brodsky <kevin.brodsky(a)arm.com>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Thomas Gleinxer <tglx(a)linutronix.de>
Cc: Thomas Huth <thuth(a)redhat.com>
Cc: "Uladzislau Rezki (Sony)" <urezki(a)gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 arch/x86/include/asm/pgtable_64_types.h |    3 +++
 arch/x86/mm/init_64.c                   |   18 ++++++++++++++++++
 2 files changed, 21 insertions(+)

--- a/arch/x86/include/asm/pgtable_64_types.h~x86-mm-64-define-arch_page_table_sync_mask-and-arch_sync_kernel_mappings
+++ a/arch/x86/include/asm/pgtable_64_types.h
@@ -36,6 +36,9 @@ static inline bool pgtable_l5_enabled(vo
 #define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
 #endif /* USE_EARLY_PGTABLE_L5 */

+#define ARCH_PAGE_TABLE_SYNC_MASK \
+	(pgtable_l5_enabled() ? PGTBL_PGD_MODIFIED : PGTBL_P4D_MODIFIED)
+
 extern unsigned int pgdir_shift;
 extern unsigned int ptrs_per_p4d;

--- a/arch/x86/mm/init_64.c~x86-mm-64-define-arch_page_table_sync_mask-and-arch_sync_kernel_mappings
+++ a/arch/x86/mm/init_64.c
@@ -224,6 +224,24 @@ static void sync_global_pgds(unsigned lo
 }

 /*
+ * Make kernel mappings visible in all page tables in the system.
+ * This is necessary except when the init task populates kernel mappings
+ * during the boot process. In that case, all processes originating from
+ * the init task copies the kernel mappings, so there is no issue.
+ * Otherwise, missing synchronization could lead to kernel crashes due
+ * to missing page table entries for certain kernel mappings.
+ *
+ * Synchronization is performed at the top level, which is the PGD in
+ * 5-level paging systems. But in 4-level paging systems, however,
+ * pgd_populate() is a no-op, so synchronization is done at the P4D level.
+ * sync_global_pgds() handles this difference between paging levels.
+ */
+void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
+{
+	sync_global_pgds(start, end);
+}
+
+/*
  * NOTE: This function is marked __ref because it calls __init function
  * (alloc_bootmem_pages). It's safe to do it ONLY when after_bootmem == 0.
  */
_

Patches currently in -mm which might be from harry.yoo(a)oracle.com are

mm-move-page-table-sync-declarations-to-linux-pgtableh.patch
mm-introduce-and-use-pgdp4d_populate_kernel.patch
x86-mm-64-define-arch_page_table_sync_mask-and-arch_sync_kernel_mappings.patch
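
To make the mask selection in the patch above concrete, here is a small userspace sketch of how ARCH_PAGE_TABLE_SYNC_MASK picks the level that triggers synchronization. It assumes the usual BIT() encoding of the __PGTBL_*_MODIFIED positions shown elsewhere in this series. The l5_enabled toggle and the *_update_needs_sync() helpers are hypothetical stand-ins, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit positions as they appear in include/linux/pgtable.h. */
#define __PGTBL_PGD_MODIFIED	0
#define __PGTBL_P4D_MODIFIED	1
#define PGTBL_PGD_MODIFIED	(1U << __PGTBL_PGD_MODIFIED)
#define PGTBL_P4D_MODIFIED	(1U << __PGTBL_P4D_MODIFIED)

/* Hypothetical userspace stand-in for the CPU feature check. */
static bool l5_enabled;

static bool pgtable_l5_enabled(void)
{
	return l5_enabled;
}

/* Same shape as the x86 definition added by the patch. */
#define ARCH_PAGE_TABLE_SYNC_MASK \
	(pgtable_l5_enabled() ? PGTBL_PGD_MODIFIED : PGTBL_P4D_MODIFIED)

/* Would a PGD-level kernel update trigger a sync? */
static bool pgd_update_needs_sync(void)
{
	return (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED) != 0;
}

/* Would a P4D-level kernel update trigger a sync? */
static bool p4d_update_needs_sync(void)
{
	return (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_P4D_MODIFIED) != 0;
}
```

With 4-level paging only P4D updates request a sync, and with 5-level paging only PGD updates do, which matches the description in the commit message.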


+ mm-introduce-and-use-pgdp4d_populate_kernel.patch added to mm-hotfixes-unstable branch

by Andrew Morton

The patch titled
     Subject: mm: introduce and use {pgd,p4d}_populate_kernel()
has been added to the -mm mm-hotfixes-unstable branch.  Its filename is
     mm-introduce-and-use-pgdp4d_populate_kernel.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…

This patch will later appear in the mm-hotfixes-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Harry Yoo <harry.yoo(a)oracle.com>
Subject: mm: introduce and use {pgd,p4d}_populate_kernel()
Date: Mon, 18 Aug 2025 11:02:05 +0900

Introduce and use {pgd,p4d}_populate_kernel() in core MM code when
populating PGD and P4D entries for the kernel address space.  These
helpers ensure proper synchronization of page tables when updating the
kernel portion of top-level page tables.

Until now, the kernel has relied on each architecture to handle
synchronization of top-level page tables in an ad-hoc manner.  For
example, see commit 9b861528a801 ("x86-64, mem: Update all PGDs for direct
mapping and vmemmap mapping changes").

However, this approach has proven fragile for the following reasons:

  1) It is easy to forget to perform the necessary page table
     synchronization when introducing new changes.
     For instance, commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory
     savings for compound devmaps") overlooked the need to synchronize
     page tables for the vmemmap area.

  2) It is also easy to overlook that the vmemmap and direct mapping areas
     must not be accessed before explicit page table synchronization.
     For example, commit 8d400913c231 ("x86/vmemmap: handle unpopulated
     sub-pmd ranges") caused crashes by accessing the vmemmap area
     before calling sync_global_pgds().

To address this, as suggested by Dave Hansen, introduce _kernel() variants
of the page table population helpers, which invoke architecture-specific
hooks to properly synchronize page tables.  These are introduced in a new
header file, include/linux/pgalloc.h, so they can be called from common
code.

They reuse existing infrastructure for vmalloc and ioremap.
Synchronization requirements are determined by ARCH_PAGE_TABLE_SYNC_MASK,
and the actual synchronization is performed by
arch_sync_kernel_mappings().

This change currently targets only x86_64, so only PGD and P4D level
helpers are introduced.  Currently, these helpers are no-ops since no
architecture sets PGTBL_{PGD,P4D}_MODIFIED in ARCH_PAGE_TABLE_SYNC_MASK.

In theory, PUD and PMD level helpers can be added later if needed by other
architectures.  For now, 32-bit architectures (x86-32 and arm) only handle
PGTBL_PMD_MODIFIED, so p*d_populate_kernel() will never affect them unless
we introduce a PMD level helper.

Link: https://lkml.kernel.org/r/20250818020206.4517-3-harry.yoo@oracle.com
Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
Signed-off-by: Harry Yoo <harry.yoo(a)oracle.com>
Suggested-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Acked-by: Kiryl Shutsemau <kas(a)kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt(a)kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Andrey Konovalov <andreyknvl(a)gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Ard Biesheuvel <ardb(a)kernel.org>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: bibo mao <maobibo(a)loongson.cn>
Cc: Borislav Betkov <bp(a)alien8.de>
Cc: Christoph Lameter (Ampere) <cl(a)gentwo.org>
Cc: Dennis Zhou <dennis(a)kernel.org>
Cc: Dev Jain <dev.jain(a)arm.com>
Cc: Dmitriy Vyukov <dvyukov(a)google.com>
Cc: Gwan-gyeong Mun <gwan-gyeong.mun(a)intel.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Jane Chu <jane.chu(a)oracle.com>
Cc: Joao Martins <joao.m.martins(a)oracle.com>
Cc: Joerg Roedel <joro(a)8bytes.org>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Kevin Brodsky <kevin.brodsky(a)arm.com>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Thomas Gleinxer <tglx(a)linutronix.de>
Cc: Thomas Huth <thuth(a)redhat.com>
Cc: "Uladzislau Rezki (Sony)" <urezki(a)gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 include/linux/pgalloc.h |   24 ++++++++++++++++++++++++
 include/linux/pgtable.h |   13 +++++++------
 mm/kasan/init.c         |   12 ++++++------
 mm/percpu.c             |    6 +++---
 mm/sparse-vmemmap.c     |    6 +++---
 5 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/include/linux/pgalloc.h a/include/linux/pgalloc.h
new file mode 100644
--- /dev/null
+++ a/include/linux/pgalloc.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PGALLOC_H
+#define _LINUX_PGALLOC_H
+
+#include <linux/pgtable.h>
+#include <asm/pgalloc.h>
+
+static inline void pgd_populate_kernel(unsigned long addr, pgd_t *pgd,
+				       p4d_t *p4d)
+{
+	pgd_populate(&init_mm, pgd, p4d);
+	if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED)
+		arch_sync_kernel_mappings(addr, addr);
+}
+
+static inline void p4d_populate_kernel(unsigned long addr, p4d_t *p4d,
+				       pud_t *pud)
+{
+	p4d_populate(&init_mm, p4d, pud);
+	if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_P4D_MODIFIED)
+		arch_sync_kernel_mappings(addr, addr);
+}
+
+#endif /* _LINUX_PGALLOC_H */
--- a/include/linux/pgtable.h~mm-introduce-and-use-pgdp4d_populate_kernel
+++ a/include/linux/pgtable.h
@@ -1469,8 +1469,8 @@ static inline void modify_prot_commit_pt

 /*
  * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
- * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
- * needs to be called.
+ * and let generic vmalloc, ioremap and page table update code know when
+ * arch_sync_kernel_mappings() needs to be called.
  */
 #ifndef ARCH_PAGE_TABLE_SYNC_MASK
 #define ARCH_PAGE_TABLE_SYNC_MASK 0
@@ -1954,10 +1954,11 @@ static inline bool arch_has_pfn_modify_c
 /*
  * Page Table Modification bits for pgtbl_mod_mask.
  *
- * These are used by the p?d_alloc_track*() set of functions an in the generic
- * vmalloc/ioremap code to track at which page-table levels entries have been
- * modified. Based on that the code can better decide when vmalloc and ioremap
- * mapping changes need to be synchronized to other page-tables in the system.
+ * These are used by the p?d_alloc_track*() and p*d_populate_kernel()
+ * functions in the generic vmalloc, ioremap and page table update code
+ * to track at which page-table levels entries have been modified.
+ * Based on that the code can better decide when page table changes need
+ * to be synchronized to other page-tables in the system.
  */
 #define		__PGTBL_PGD_MODIFIED	0
 #define		__PGTBL_P4D_MODIFIED	1
--- a/mm/kasan/init.c~mm-introduce-and-use-pgdp4d_populate_kernel
+++ a/mm/kasan/init.c
@@ -13,9 +13,9 @@
 #include <linux/mm.h>
 #include <linux/pfn.h>
 #include <linux/slab.h>
+#include <linux/pgalloc.h>

 #include <asm/page.h>
-#include <asm/pgalloc.h>

 #include "kasan.h"

@@ -191,7 +191,7 @@ static int __ref zero_p4d_populate(pgd_t
 			pud_t *pud;
 			pmd_t *pmd;

-			p4d_populate(&init_mm, p4d,
+			p4d_populate_kernel(addr, p4d,
 					lm_alias(kasan_early_shadow_pud));
 			pud = pud_offset(p4d, addr);
 			pud_populate(&init_mm, pud,
@@ -212,7 +212,7 @@ static int __ref zero_p4d_populate(pgd_t
 			} else {
 				p = early_alloc(PAGE_SIZE, NUMA_NO_NODE);
 				pud_init(p);
-				p4d_populate(&init_mm, p4d, p);
+				p4d_populate_kernel(addr, p4d, p);
 			}
 		}
 		zero_pud_populate(p4d, addr, next);
@@ -251,10 +251,10 @@ int __ref kasan_populate_early_shadow(co
 			 * puds,pmds, so pgd_populate(), pud_populate()
 			 * is noops.
 			 */
-			pgd_populate(&init_mm, pgd,
+			pgd_populate_kernel(addr, pgd,
 					lm_alias(kasan_early_shadow_p4d));
 			p4d = p4d_offset(pgd, addr);
-			p4d_populate(&init_mm, p4d,
+			p4d_populate_kernel(addr, p4d,
 					lm_alias(kasan_early_shadow_pud));
 			pud = pud_offset(p4d, addr);
 			pud_populate(&init_mm, pud,
@@ -273,7 +273,7 @@ int __ref kasan_populate_early_shadow(co
 				if (!p)
 					return -ENOMEM;
 			} else {
-				pgd_populate(&init_mm, pgd,
+				pgd_populate_kernel(addr, pgd,
 					early_alloc(PAGE_SIZE, NUMA_NO_NODE));
 			}
 		}
--- a/mm/percpu.c~mm-introduce-and-use-pgdp4d_populate_kernel
+++ a/mm/percpu.c
@@ -3108,7 +3108,7 @@ out_free:
 #endif /* BUILD_EMBED_FIRST_CHUNK */

 #ifdef BUILD_PAGE_FIRST_CHUNK
-#include <asm/pgalloc.h>
+#include <linux/pgalloc.h>

 #ifndef P4D_TABLE_SIZE
 #define P4D_TABLE_SIZE PAGE_SIZE
@@ -3134,13 +3134,13 @@ void __init __weak pcpu_populate_pte(uns

 	if (pgd_none(*pgd)) {
 		p4d = memblock_alloc_or_panic(P4D_TABLE_SIZE, P4D_TABLE_SIZE);
-		pgd_populate(&init_mm, pgd, p4d);
+		pgd_populate_kernel(addr, pgd, p4d);
 	}

 	p4d = p4d_offset(pgd, addr);
 	if (p4d_none(*p4d)) {
 		pud = memblock_alloc_or_panic(PUD_TABLE_SIZE, PUD_TABLE_SIZE);
-		p4d_populate(&init_mm, p4d, pud);
+		p4d_populate_kernel(addr, p4d, pud);
 	}

 	pud = pud_offset(p4d, addr);
--- a/mm/sparse-vmemmap.c~mm-introduce-and-use-pgdp4d_populate_kernel
+++ a/mm/sparse-vmemmap.c
@@ -27,9 +27,9 @@
 #include <linux/spinlock.h>
 #include <linux/vmalloc.h>
 #include <linux/sched.h>
+#include <linux/pgalloc.h>

 #include <asm/dma.h>
-#include <asm/pgalloc.h>
 #include <asm/tlbflush.h>

 #include "hugetlb_vmemmap.h"
@@ -229,7 +229,7 @@ p4d_t * __meminit vmemmap_p4d_populate(p
 		if (!p)
 			return NULL;
 		pud_init(p);
-		p4d_populate(&init_mm, p4d, p);
+		p4d_populate_kernel(addr, p4d, p);
 	}
 	return p4d;
 }
@@ -241,7 +241,7 @@ pgd_t * __meminit vmemmap_pgd_populate(u
 		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
-		pgd_populate(&init_mm, pgd, p);
+		pgd_populate_kernel(addr, pgd, p);
 	}
 	return pgd;
 }
_

Patches currently in -mm which might be from harry.yoo(a)oracle.com are

mm-move-page-table-sync-declarations-to-linux-pgtableh.patch
mm-introduce-and-use-pgdp4d_populate_kernel.patch
x86-mm-64-define-arch_page_table_sync_mask-and-arch_sync_kernel_mappings.patch
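
The helper shape introduced by the patch above (populate the entry, then call the arch hook only if the mask asks for that level) can be exercised in userspace. The sketch below is an illustration under stated assumptions: page table entries are modelled as plain void pointers, sync_mask is a hypothetical runtime variable standing in for the compile-time ARCH_PAGE_TABLE_SYNC_MASK, and the hook only records its arguments instead of touching real page tables.

```c
#include <assert.h>

#define PGTBL_PGD_MODIFIED	(1U << 0)
#define PGTBL_P4D_MODIFIED	(1U << 1)

/* Hypothetical runtime stand-in for ARCH_PAGE_TABLE_SYNC_MASK. */
static unsigned int sync_mask;

/* Bookkeeping so callers can observe what the hook was asked to do. */
static unsigned int sync_calls;
static unsigned long last_start, last_end;

/* Stand-in for the arch hook: records the range it was given. */
static void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
{
	assert(start <= end);
	sync_calls++;
	last_start = start;
	last_end = end;
}

/* Install a PGD entry, then sync if the arch requested PGD-level syncs. */
static void pgd_populate_kernel(unsigned long addr, void **pgd, void *p4d)
{
	*pgd = p4d;	/* stands in for pgd_populate(&init_mm, pgd, p4d) */
	if (sync_mask & PGTBL_PGD_MODIFIED)
		arch_sync_kernel_mappings(addr, addr);
}

/* Install a P4D entry, then sync if the arch requested P4D-level syncs. */
static void p4d_populate_kernel(unsigned long addr, void **p4d, void *pud)
{
	*p4d = pud;	/* stands in for p4d_populate(&init_mm, p4d, pud) */
	if (sync_mask & PGTBL_P4D_MODIFIED)
		arch_sync_kernel_mappings(addr, addr);
}
```

With sync_mask left at 0 both helpers only populate the entry, which mirrors the commit message's note that the helpers are no-ops until an architecture sets one of the bits.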


+ mm-move-page-table-sync-declarations-to-linux-pgtableh.patch added to mm-hotfixes-unstable branch

by Andrew Morton

The patch titled Subject: mm: move page table sync declarations to linux/pgtable.h has been added to the -mm mm-hotfixes-unstable branch. Its filename is mm-move-page-table-sync-declarations-to-linux-pgtableh.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche… This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Harry Yoo <harry.yoo(a)oracle.com> Subject: mm: move page table sync declarations to linux/pgtable.h Date: Mon, 18 Aug 2025 11:02:04 +0900 During our internal testing, we started observing intermittent boot failures when the machine uses 4-level paging and has a large amount of persistent memory: BUG: unable to handle page fault for address: ffffe70000000034 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] SMP NOPTI RIP: 0010:__init_single_page+0x9/0x6d Call Trace: <TASK> __init_zone_device_page+0x17/0x5d memmap_init_zone_device+0x154/0x1bb pagemap_range+0x2e0/0x40f memremap_pages+0x10b/0x2f0 devm_memremap_pages+0x1e/0x60 dev_dax_probe+0xce/0x2ec [device_dax] dax_bus_probe+0x6d/0xc9 [... snip ...] 
</TASK> It turns out that the kernel panics while initializing vmemmap (struct page array) when the vmemmap region spans two PGD entries, because the new PGD entry is only installed in init_mm.pgd, but not in the page tables of other tasks. And looking at __populate_section_memmap(): if (vmemmap_can_optimize(altmap, pgmap)) // does not sync top level page tables r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap); else // sync top level page tables in x86 r = vmemmap_populate(start, end, nid, altmap); In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c synchronizes the top level page table (See commit 9b861528a801 ("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes")) so that all tasks in the system can see the new vmemmap area. However, when vmemmap_can_optimize() returns true, the optimized path skips synchronization of top-level page tables. This is because vmemmap_populate_compound_pages() is implemented in core MM code, which does not handle synchronization of the top-level page tables. Instead, the core MM has historically relied on each architecture to perform this synchronization manually. We're not the first party to encounter a crash caused by not-sync'd top level page tables: earlier this year, Gwan-gyeong Mun attempted to address the issue [1] [2] after hitting a kernel panic when x86 code accessed the vmemmap area before the corresponding top-level entries were synced. At that time, the issue was believed to be triggered only when struct page was enlarged for debugging purposes, and the patch did not get further updates. It turns out that current approach of relying on each arch to handle the page table sync manually is fragile because 1) it's easy to forget to sync the top level page table, and 2) it's also easy to overlook that the kernel should not access the vmemmap and direct mapping areas before the sync. 
# The solution: Make page table sync more code robust and harder to miss To address this, Dave Hansen suggested [3] [4] introducing {pgd,p4d}_populate_kernel() for updating kernel portion of the page tables and allow each architecture to explicitly perform synchronization when installing top-level entries. With this approach, we no longer need to worry about missing the sync step, reducing the risk of future regressions. The new interface reuses existing ARCH_PAGE_TABLE_SYNC_MASK, PGTBL_P*D_MODIFIED and arch_sync_kernel_mappings() facility used by vmalloc and ioremap to synchronize page tables. pgd_populate_kernel() looks like this: static inline void pgd_populate_kernel(unsigned long addr, pgd_t *pgd, p4d_t *p4d) { pgd_populate(&init_mm, pgd, p4d); if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED) arch_sync_kernel_mappings(addr, addr); } It is worth noting that vmalloc() and apply_to_range() carefully synchronizes page tables by calling p*d_alloc_track() and arch_sync_kernel_mappings(), and thus they are not affected by this patch series. This series was hugely inspired by Dave Hansen's suggestion and hence added Suggested-by: Dave Hansen. Cc stable because lack of this series opens the door to intermittent boot failures. This patch (of 3): Move ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() to linux/pgtable.h so that they can be used outside of vmalloc and ioremap. 
Link: https://lkml.kernel.org/r/20250818020206.4517-1-harry.yoo@oracle.com Link: https://lkml.kernel.org/r/20250818020206.4517-2-harry.yoo@oracle.com Link: https://lore.kernel.org/linux-mm/20250220064105.808339-1-gwan-gyeong.mun@in… [1] Link: https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@in… [2] Link: https://lore.kernel.org/linux-mm/d1da214c-53d3-45ac-a8b6-51821c5416e4@intel… [3] Link: https://lore.kernel.org/linux-mm/4d800744-7b88-41aa-9979-b245e8bf794b@intel… [4] Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges") Signed-off-by: Harry Yoo <harry.yoo(a)oracle.com> Acked-by: Kiryl Shutsemau <kas(a)kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt(a)kernel.org> Reviewed-by: "Uladzislau Rezki (Sony)" <urezki(a)gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> Acked-by: David Hildenbrand <david(a)redhat.com> Cc: Alexander Potapenko <glider(a)google.com> Cc: Alistair Popple <apopple(a)nvidia.com> Cc: Andrey Konovalov <andreyknvl(a)gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com> Cc: Andy Lutomirski <luto(a)kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual(a)arm.com> Cc: Ard Biesheuvel <ardb(a)kernel.org> Cc: Arnd Bergmann <arnd(a)arndb.de> Cc: bibo mao <maobibo(a)loongson.cn> Cc: Borislav Betkov <bp(a)alien8.de> Cc: Christoph Lameter (Ampere) <cl(a)gentwo.org> Cc: Dennis Zhou <dennis(a)kernel.org> Cc: Dev Jain <dev.jain(a)arm.com> Cc: Dmitriy Vyukov <dvyukov(a)google.com> Cc: Gwan-gyeong Mun <gwan-gyeong.mun(a)intel.com> Cc: Ingo Molnar <mingo(a)redhat.com> Cc: Jane Chu <jane.chu(a)oracle.com> Cc: Joao Martins <joao.m.martins(a)oracle.com> Cc: Joerg Roedel <joro(a)8bytes.org> Cc: John Hubbard <jhubbard(a)nvidia.com> Cc: Kevin Brodsky <kevin.brodsky(a)arm.com> Cc: Liam Howlett <liam.howlett(a)oracle.com> Cc: Michal Hocko <mhocko(a)suse.com> Cc: Oscar Salvador <osalvador(a)suse.de> Cc: Peter Xu <peterx(a)redhat.com> Cc: Peter 
Zijlstra <peterz(a)infradead.org> Cc: Qi Zheng <zhengqi.arch(a)bytedance.com> Cc: Ryan Roberts <ryan.roberts(a)arm.com> Cc: Suren Baghdasaryan <surenb(a)google.com> Cc: Tejun Heo <tj(a)kernel.org> Cc: Thomas Gleinxer <tglx(a)linutronix.de> Cc: Thomas Huth <thuth(a)redhat.com> Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com> Cc: Vlastimil Babka <vbabka(a)suse.cz> Cc: Dave Hansen <dave.hansen(a)linux.intel.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- include/linux/pgtable.h | 16 ++++++++++++++++ include/linux/vmalloc.h | 16 ---------------- 2 files changed, 16 insertions(+), 16 deletions(-) --- a/include/linux/pgtable.h~mm-move-page-table-sync-declarations-to-linux-pgtableh +++ a/include/linux/pgtable.h @@ -1467,6 +1467,22 @@ static inline void modify_prot_commit_pt } #endif +/* + * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values + * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings() + * needs to be called. + */ +#ifndef ARCH_PAGE_TABLE_SYNC_MASK +#define ARCH_PAGE_TABLE_SYNC_MASK 0 +#endif + +/* + * There is no default implementation for arch_sync_kernel_mappings(). It is + * relied upon the compiler to optimize calls out if ARCH_PAGE_TABLE_SYNC_MASK + * is 0. + */ +void arch_sync_kernel_mappings(unsigned long start, unsigned long end); + #endif /* CONFIG_MMU */ /* --- a/include/linux/vmalloc.h~mm-move-page-table-sync-declarations-to-linux-pgtableh +++ a/include/linux/vmalloc.h @@ -220,22 +220,6 @@ int vmap_pages_range(unsigned long addr, struct page **pages, unsigned int page_shift); /* - * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values - * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings() - * needs to be called. - */ -#ifndef ARCH_PAGE_TABLE_SYNC_MASK -#define ARCH_PAGE_TABLE_SYNC_MASK 0 -#endif - -/* - * There is no default implementation for arch_sync_kernel_mappings(). 
It is - * relied upon the compiler to optimize calls out if ARCH_PAGE_TABLE_SYNC_MASK - * is 0. - */ -void arch_sync_kernel_mappings(unsigned long start, unsigned long end); - -/* * Lowlevel-APIs (not for driver use!) */ _ Patches currently in -mm which might be from harry.yoo(a)oracle.com are mm-move-page-table-sync-declarations-to-linux-pgtableh.patch mm-introduce-and-use-pgdp4d_populate_kernel.patch x86-mm-64-define-arch_page_table_sync_mask-and-arch_sync_kernel_mappings.patch
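The sync-on-populate pattern from the cover letter above can be sketched in user space. This is a minimal model, not kernel code: the mask values, the `_model` suffixed names, and the call counters are assumptions made for illustration, and the real `pgd_populate()` is replaced by a counter.

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel flag; the value is an assumption. */
#define PGTBL_PGD_MODIFIED (1u << 0)

/*
 * Model of an architecture that opts in to PGD-level syncing.
 * The generic default in the patch is 0, which lets the compiler
 * drop the arch_sync_kernel_mappings() call entirely.
 */
#ifndef ARCH_PAGE_TABLE_SYNC_MASK
#define ARCH_PAGE_TABLE_SYNC_MASK PGTBL_PGD_MODIFIED
#endif

static int populate_calls; /* how often the top-level entry was written */
static int sync_calls;     /* how often the arch sync hook ran */

/* Hypothetical stand-in for the arch hook: only counts invocations. */
static void arch_sync_kernel_mappings_model(unsigned long start,
					    unsigned long end)
{
	(void)start;
	(void)end;
	sync_calls++;
}

/* Model of pgd_populate_kernel(): populate, then sync if the mask asks. */
static void pgd_populate_kernel_model(unsigned long addr)
{
	populate_calls++; /* stands in for pgd_populate(&init_mm, ...) */
	if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED)
		arch_sync_kernel_mappings_model(addr, addr);
}

/* Every populate through the helper is paired with exactly one sync. */
static void check_populate_always_syncs(void)
{
	populate_calls = 0;
	sync_calls = 0;
	pgd_populate_kernel_model(0x1000UL);
	pgd_populate_kernel_model(0x2000UL);
	assert(populate_calls == 2);
	assert(sync_calls == 2);
}
```

The point of the design is that a caller cannot install a top-level entry through the helper and forget the sync step, since both live in one function.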

4 months, 2 weeks

+ of_numa-fix-uninitialized-memory-nodes-causing-kernel-panic.patch added to mm-hotfixes-unstable branch

by Andrew Morton

The patch titled Subject: of_numa: fix uninitialized memory nodes causing kernel panic has been added to the -mm mm-hotfixes-unstable branch. Its filename is of_numa-fix-uninitialized-memory-nodes-causing-kernel-panic.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche… This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Yin Tirui <yintirui(a)huawei.com> Subject: of_numa: fix uninitialized memory nodes causing kernel panic Date: Tue, 19 Aug 2025 15:55:10 +0800 When there are memory-only nodes (nodes without CPUs), these nodes are not properly initialized, causing kernel panic during boot. of_numa_init of_numa_parse_cpu_nodes node_set(nid, numa_nodes_parsed); of_numa_parse_memory_nodes In of_numa_parse_cpu_nodes, numa_nodes_parsed gets updated only for nodes containing CPUs. Memory-only nodes should have been updated in of_numa_parse_memory_nodes, but they weren't. Subsequently, when free_area_init() attempts to access NODE_DATA() for these uninitialized memory nodes, the kernel panics due to NULL pointer dereference. 
This can be reproduced on ARM64 QEMU with 1 CPU and 2 memory nodes: qemu-system-aarch64 \ -cpu host -nographic \ -m 4G -smp 1 \ -machine virt,accel=kvm,gic-version=3,iommu=smmuv3 \ -object memory-backend-ram,size=2G,id=mem0 \ -object memory-backend-ram,size=2G,id=mem1 \ -numa node,nodeid=0,memdev=mem0 \ -numa node,nodeid=1,memdev=mem1 \ -kernel $IMAGE \ -hda $DISK \ -append "console=ttyAMA0 root=/dev/vda rw earlycon" [ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x481fd010] [ 0.000000] Linux version 6.17.0-rc1-00001-gabb4b3daf18c-dirty (yintirui@local) (gcc (GCC) 12.3.1, GNU ld (GNU Binutils) 2.41) #52 SMP PREEMPT Mon Aug 18 09:49:40 CST 2025 [ 0.000000] KASLR enabled [ 0.000000] random: crng init done [ 0.000000] Machine model: linux,dummy-virt [ 0.000000] efi: UEFI not found. [ 0.000000] earlycon: pl11 at MMIO 0x0000000009000000 (options '') [ 0.000000] printk: legacy bootconsole [pl11] enabled [ 0.000000] OF: reserved mem: Reserved memory: No reserved-memory node in the DT [ 0.000000] NODE_DATA(0) allocated [mem 0xbfffd9c0-0xbfffffff] [ 0.000000] node 1 must be removed before remove section 23 [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff] [ 0.000000] DMA32 empty [ 0.000000] Normal [mem 0x0000000100000000-0x000000013fffffff] [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x0000000040000000-0x00000000bfffffff] [ 0.000000] node 1: [mem 0x00000000c0000000-0x000000013fffffff] [ 0.000000] Initmem setup node 0 [mem 0x0000000040000000-0x00000000bfffffff] [ 0.000000] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000a0 [ 0.000000] Mem abort info: [ 0.000000] ESR = 0x0000000096000004 [ 0.000000] EC = 0x25: DABT (current EL), IL = 32 bits [ 0.000000] SET = 0, FnV = 0 [ 0.000000] EA = 0, S1PTW = 0 [ 0.000000] FSC = 0x04: level 0 translation fault [ 0.000000] Data abort info: [ 0.000000] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 
0.000000] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 0.000000] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 0.000000] [00000000000000a0] user address but active_mm is swapper [ 0.000000] Internal error: Oops: 0000000096000004 [#1] SMP [ 0.000000] Modules linked in: [ 0.000000] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.17.0-rc1-00001-g760c6dabf762-dirty #54 PREEMPT [ 0.000000] Hardware name: linux,dummy-virt (DT) [ 0.000000] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 0.000000] pc : free_area_init+0x50c/0xf9c [ 0.000000] lr : free_area_init+0x5c0/0xf9c [ 0.000000] sp : ffffa02ca0f33c00 [ 0.000000] x29: ffffa02ca0f33cb0 x28: 0000000000000000 x27: 0000000000000000 [ 0.000000] x26: 4ec4ec4ec4ec4ec5 x25: 00000000000c0000 x24: 00000000000c0000 [ 0.000000] x23: 0000000000040000 x22: 0000000000000000 x21: ffffa02ca0f3b368 [ 0.000000] x20: ffffa02ca14c7b98 x19: 0000000000000000 x18: 0000000000000002 [ 0.000000] x17: 000000000000cacc x16: 0000000000000001 x15: 0000000000000001 [ 0.000000] x14: 0000000080000000 x13: 0000000000000018 x12: 0000000000000002 [ 0.000000] x11: ffffa02ca0fd4f00 x10: ffffa02ca14bab20 x9 : ffffa02ca14bab38 [ 0.000000] x8 : 00000000000c0000 x7 : 0000000000000001 x6 : 0000000000000002 [ 0.000000] x5 : 0000000140000000 x4 : ffffa02ca0f33c90 x3 : ffffa02ca0f33ca0 [ 0.000000] x2 : ffffa02ca0f33c98 x1 : 0000000080000000 x0 : 0000000000000001 [ 0.000000] Call trace: [ 0.000000] free_area_init+0x50c/0xf9c (P) [ 0.000000] bootmem_init+0x110/0x1dc [ 0.000000] setup_arch+0x278/0x60c [ 0.000000] start_kernel+0x70/0x748 [ 0.000000] __primary_switched+0x88/0x90 [ 0.000000] Code: d503201f b98093e0 52800016 f8607a93 (f9405260) [ 0.000000] ---[ end trace 0000000000000000 ]--- [ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task! [ 0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! 
]---

Link: https://lkml.kernel.org/r/20250819075510.2079961-1-yintirui@huawei.com
Fixes: 767507654c22 ("arch_numa: switch over to numa_memblks")
Signed-off-by: Yin Tirui <yintirui(a)huawei.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt(a)kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Chen Jun <chenjun102(a)huawei.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Cc: Rob Herring <robh(a)kernel.org>
Cc: Saravana Kannan <saravanak(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 drivers/of/of_numa.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/drivers/of/of_numa.c~of_numa-fix-uninitialized-memory-nodes-causing-kernel-panic
+++ a/drivers/of/of_numa.c
@@ -59,8 +59,11 @@ static int __init of_numa_parse_memory_n
 			r = -EINVAL;
 		}
 
-		for (i = 0; !r && !of_address_to_resource(np, i, &rsrc); i++)
+		for (i = 0; !r && !of_address_to_resource(np, i, &rsrc); i++) {
 			r = numa_add_memblk(nid, rsrc.start, rsrc.end + 1);
+			if (!r)
+				node_set(nid, numa_nodes_parsed);
+		}
 
 		if (!i || r) {
 			of_node_put(np);
_

Patches currently in -mm which might be from yintirui(a)huawei.com are

of_numa-fix-uninitialized-memory-nodes-causing-kernel-panic.patch
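The effect of the fix on the parsed-node mask can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct node_desc`, `parse_nodes_model()` and the plain bitmask are hypothetical stand-ins for the device-tree walk and for `numa_nodes_parsed`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of one NUMA node in the device tree. */
struct node_desc {
	bool has_cpu;
	bool has_memory;
};

/*
 * Returns a bitmask modelling numa_nodes_parsed.
 * with_fix == false models the old behaviour, where only the CPU
 * pass marked nodes as parsed.
 */
static unsigned int parse_nodes_model(const struct node_desc *nodes,
				      int count, bool with_fix)
{
	unsigned int parsed = 0;
	int nid;

	/* Models of_numa_parse_cpu_nodes(): marks nodes that own CPUs. */
	for (nid = 0; nid < count; nid++)
		if (nodes[nid].has_cpu)
			parsed |= 1u << nid;

	/* Models of_numa_parse_memory_nodes(): marks nodes only with the fix. */
	for (nid = 0; nid < count; nid++)
		if (nodes[nid].has_memory && with_fix)
			parsed |= 1u << nid;

	return parsed;
}

/* Same topology as the QEMU reproducer: 1 CPU, 2 memory nodes. */
static void check_memory_only_node(void)
{
	const struct node_desc nodes[2] = {
		{ .has_cpu = true,  .has_memory = true },
		{ .has_cpu = false, .has_memory = true },
	};

	/* Without the fix, node 1 is never marked as parsed. */
	assert(parse_nodes_model(nodes, 2, false) == 0x1u);
	/* With the fix, both nodes are marked. */
	assert(parse_nodes_model(nodes, 2, true) == 0x3u);
}
```

In the model, the unmarked node 1 corresponds to the node whose NODE_DATA() is later dereferenced as NULL in free_area_init().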

4 months, 2 weeks

[PATCH v5 0/5] i2c: rtl9300: Fix multi-byte I2C operations

by Sven Eckelmann

During the integration of the RTL8239 POE chip + its frontend MCU, it was noticed that multi-byte operations were basically broken in the current driver.

Tests using SMBus Block Writes showed that the data (after the Wr marker + Ack) was mixed up on the wire. At first glance, it looked like an endianness problem. But for transfers where the number of count + data bytes was not divisible by 4, the last bytes did not look like an endianness problem because they were in the wrong order but not for example 0 - which would be the case for an endianness problem with 32 bit registers. In the end, it turned out to be the way i2c_write tried to add the bytes to the send registers.

Each 32 bit register was used similar to a shift register - shifting the various bytes up the register while the next one is added to the least significant byte. But the I2C controller expects the first byte of the transmission in the least significant byte of the first register. And the last byte (assuming it is a 16 byte transfer) is expected in the most significant byte of the fourth register.

While doing these tests, it was also observed that the count byte was missing from the SMBus Block Writes. The driver just removed it from the data->block (from the I2C subsystem). But the I2C controller DOES NOT automatically add this byte - for example by using the configured transmission length.

The RTL8239 MCU is not actually an SMBus compliant device. Instead, it expects I2C Block Reads + I2C Block Writes. But according to the already identified bugs in the driver, it was clear that the I2C controller can simply be modified to not send the count byte for I2C_SMBUS_I2C_BLOCK_DATA. The receive part just needs to write the content of the receive buffer to the correct position in data->block.

While the on-wire format was now correct, reads were still not possible against the MCU (for the RTL8239 POE chip).
It was always timing out because the 2ms were not enough for sending the read request and then receiving the 12 byte answer. These changes were originally submitted to OpenWrt. But there are plans to migrate OpenWrt to the upstream Linux driver. As a result, the pull request was stopped and the changes were redone against this driver. For reasons of transparency: The work on I2C_SMBUS_I2C_BLOCK_DATA support for the RTL8239-MCU was done on RTL931xx. All problems were therefore detected with the patches from Jonas Jelonek [1] and not the vanilla Linux driver. But looking through the code, it seems like these are NOT regressions introduced by the RTL931x patchset. I've picked up Alex Guo's patch [2] to reduce conflicts between pending fixes. [1] https://patchwork.ozlabs.org/project/linux-i2c/cover/20250727114800.3046-1-… [2] https://lore.kernel.org/r/20250615235248.529019-1-alexguo1023@gmail.com Signed-off-by: Sven Eckelmann <sven(a)narfation.org> --- Changes in v5: - Simplify function/capability registration by using I2C_FUNC_SMBUS_I2C_BLOCK, thanks Jonas Jelonek - Link to v4: https://lore.kernel.org/r/20250809-i2c-rtl9300-multi-byte-v4-0-d71dd5eb6121… Changes in v4: - Provide only "write" examples for "i2c: rtl9300: Fix multi-byte I2C write" - drop the second initialization of vals in rtl9300_i2c_write() directly in the "Fix multi-byte I2C write" fix - indicate in target branch for each patch in PATCH prefix - minor commit message cleanups - Link to v3: https://lore.kernel.org/r/20250804-i2c-rtl9300-multi-byte-v3-0-e20607e1b28c… Changes in v3: - integrated patch https://lore.kernel.org/r/20250615235248.529019-1-alexguo1023@gmail.com to avoid conflicts in the I2C_SMBUS_BLOCK_DATA code - added Fixes and stable(a)vger.kernel.org to Alex Guo's patch - added Chris Packham's Reviewed-by/Acked-by - Link to v2: https://lore.kernel.org/r/20250803-i2c-rtl9300-multi-byte-v2-0-9b7b759fe2b6… Changes in v2: - add the missing transfer width and read length increase for the SMBus 
Write/Read - Link to v1: https://lore.kernel.org/r/20250802-i2c-rtl9300-multi-byte-v1-0-5f687e0098e2… --- Alex Guo (1): i2c: rtl9300: Fix out-of-bounds bug in rtl9300_i2c_smbus_xfer Harshal Gohel (2): [i2c-host-fixes] i2c: rtl9300: Fix multi-byte I2C write [i2c-host] i2c: rtl9300: Implement I2C block read and write Sven Eckelmann (2): [i2c-host-fixes] i2c: rtl9300: Increase timeout for transfer polling [i2c-host-fixes] i2c: rtl9300: Add missing count byte for SMBus Block Ops drivers/i2c/busses/i2c-rtl9300.c | 51 +++++++++++++++++++++++++++++++++------- 1 file changed, 42 insertions(+), 9 deletions(-) --- base-commit: 7e161a991ea71e6ec526abc8f40c6852ebe3d946 change-id: 20250802-i2c-rtl9300-multi-byte-edaa1fb0872c Best regards, -- Sven Eckelmann <sven(a)narfation.org>
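The register packing problem described in the cover letter can be reproduced with a short user-space sketch. The function names and the fixed four-register layout are assumptions for illustration; this is not the driver's code, only a model of the two packing schemes the text contrasts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_DATA_REGS 4 /* four 32 bit registers, 16 bytes at most */

/*
 * Shift-register style packing, as the cover letter describes the
 * broken behaviour: earlier bytes move up, the newest byte lands in
 * the least significant byte.
 */
static void pack_shift_style(const uint8_t *data, size_t len,
			     uint32_t vals[NUM_DATA_REGS])
{
	size_t i;

	assert(len <= 4 * NUM_DATA_REGS);
	for (i = 0; i < NUM_DATA_REGS; i++)
		vals[i] = 0;
	for (i = 0; i < len; i++) {
		vals[i / 4] <<= 8;
		vals[i / 4] |= data[i];
	}
}

/*
 * Packing the controller is described as expecting: the first byte in
 * the least significant byte of the first register, the 16th byte in
 * the most significant byte of the fourth register.
 */
static void pack_lsb_first(const uint8_t *data, size_t len,
			   uint32_t vals[NUM_DATA_REGS])
{
	size_t i;

	assert(len <= 4 * NUM_DATA_REGS);
	for (i = 0; i < NUM_DATA_REGS; i++)
		vals[i] = 0;
	for (i = 0; i < len; i++)
		vals[i / 4] |= (uint32_t)data[i] << ((i % 4) * 8);
}

/* Six bytes: a length that is not divisible by 4. */
static void check_six_byte_transfer(void)
{
	const uint8_t data[6] = { 0x01, 0x02, 0x03, 0x04, 0x05, 0x06 };
	uint32_t vals[NUM_DATA_REGS];

	pack_shift_style(data, 6, vals);
	assert(vals[0] == 0x01020304u);
	assert(vals[1] == 0x00000506u); /* swapped, but not zero-padded low */

	pack_lsb_first(data, 6, vals);
	assert(vals[0] == 0x04030201u);
	assert(vals[1] == 0x00000605u);
}
```

For the partial second register, the shift-style result puts 0x06 in the least significant byte, so the trailing bytes go out swapped rather than as zeros, which matches the observation that the symptom did not look like a pure 32 bit endianness problem.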

4 months, 2 weeks

+ mm-damon-core-set-quota-charged_from-to-jiffies-at-first-charge-window.patch added to mm-hotfixes-unstable branch

by Andrew Morton

The patch titled
     Subject: mm/damon/core: set quota->charged_from to jiffies at first charge window
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
     mm-damon-core-set-quota-charged_from-to-jiffies-at-first-charge-window.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…

This patch will later appear in the mm-hotfixes-unstable branch at
     git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days

------------------------------------------------------
From: Sang-Heon Jeon <ekffu200098(a)gmail.com>
Subject: mm/damon/core: set quota->charged_from to jiffies at first charge window
Date: Wed, 20 Aug 2025 00:01:23 +0900

The kernel initializes the "jiffies" timer to 5 minutes below zero, as shown in include/linux/jiffies.h

 /*
  * Have the 32 bit jiffies value wrap 5 minutes after boot
  * so jiffies wrap bugs show up earlier.
  */
 #define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))

And the comparison macros cast the unsigned value to signed to cover wraparound

 #define time_after_eq(a,b) \
  (typecheck(unsigned long, a) && \
  typecheck(unsigned long, b) && \
  ((long)((a) - (b)) >= 0))

On 64-bit systems, this is unlikely to be a problem because wraparound occurs 300 million years after boot, assuming an HZ value of 1000.

Under the same assumption, on 32-bit systems, wraparound occurs 5 minutes after the initial boot and every 49 days after the first wraparound.
Also, about 25 days after the first wraparound, the same quota charging window continues for up to the next 25 days.

Example 1: initial boot
jiffies=0xFFFB6C20, charged_from+interval=0x000003E8
time_after_eq(jiffies, charged_from+interval)=(long)0xFFFB6838; as a signed value it is negative, so the result is false.

Example 2: about 25 days after the first wraparound
jiffies=0x800004E8, charged_from+interval=0x000003E8
time_after_eq(jiffies, charged_from+interval)=(long)0x80000100; as a signed value it is negative, so the result is false.

So, set quota->charged_from to jiffies in damos_adjust_quota() when it is considered to be the first charge window.

In theory, though it is almost impossible, quota->total_charged_sz and quota->charged_from could both be zero even when it is not the first charge window. But that would only delay the reset by one reset_interval, so it is not a big problem.

Link: https://lkml.kernel.org/r/20250819150123.1532458-1-ekffu200098@gmail.com
Fixes: 2b8a248d5873 ("mm/damon/schemes: implement size quota for schemes application speed control") [5.16]
Signed-off-by: Sang-Heon Jeon <ekffu200098(a)gmail.com>
Reviewed-by: SeongJae Park <sj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/damon/core.c | 4 ++++
 1 file changed, 4 insertions(+)

--- a/mm/damon/core.c~mm-damon-core-set-quota-charged_from-to-jiffies-at-first-charge-window
+++ a/mm/damon/core.c
@@ -2111,6 +2111,10 @@ static void damos_adjust_quota(struct da
 	if (!quota->ms && !quota->sz && list_empty(&quota->goals))
 		return;
 
+	/* First charge window */
+	if (!quota->total_charged_sz && !quota->charged_from)
+		quota->charged_from = jiffies;
+
 	/* New charge window starts */
 	if (time_after_eq(jiffies, quota->charged_from +
 				msecs_to_jiffies(quota->reset_interval))) {
_

Patches currently in -mm which might be from ekffu200098(a)gmail.com are

mm-damon-core-fix-commit_ops_filters-by-using-correct-nth-function.patch
selftests-damon-fix-selftests-by-installing-drgn-related-script.patch
mm-damon-core-fix-damos_commit_filter-not-changing-allow.patch mm-damon-core-set-quota-charged_from-to-jiffies-at-first-charge-window.patch mm-damon-update-expired-description-of-damos_action.patch docs-mm-damon-design-fix-typo-s-sz_trtied-sz_tried.patch selftests-damon-test-no-op-commit-broke-damon-status.patch selftests-damon-test-no-op-commit-broke-damon-status-fix.patch mm-damon-tests-core-kunit-add-damos_commit_filter-test.patch
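The two examples from the commit message can be checked with fixed-width arithmetic in user space. This is a sketch that models a 32-bit `unsigned long` with `uint32_t`; `HZ_ASSUMED`, `time_after_eq32()` and `window_expired_model()` are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HZ_ASSUMED 1000 /* the commit message assumes HZ == 1000 */

/* 32-bit model of INITIAL_JIFFIES: 5 minutes below zero. */
#define INITIAL_JIFFIES_32 ((uint32_t)(-300 * HZ_ASSUMED))

/* 32-bit model of time_after_eq(): signed view of the wrapped difference. */
static bool time_after_eq32(uint32_t a, uint32_t b)
{
	return (int32_t)(uint32_t)(a - b) >= 0;
}

/* Models the "new charge window starts" test in damos_adjust_quota(). */
static bool window_expired_model(uint32_t now, uint32_t charged_from,
				 uint32_t interval)
{
	return time_after_eq32(now, (uint32_t)(charged_from + interval));
}

static void check_commit_message_examples(void)
{
	const uint32_t interval = 1000; /* 0x3E8 jiffies */

	assert(INITIAL_JIFFIES_32 == 0xFFFB6C20u);

	/* Example 1: right after boot, charged_from still zero. */
	assert((uint32_t)(0xFFFB6C20u - 0x000003E8u) == 0xFFFB6838u);
	assert(!window_expired_model(0xFFFB6C20u, 0, interval));

	/* Example 2: about 25 days after the first wraparound. */
	assert((uint32_t)(0x800004E8u - 0x000003E8u) == 0x80000100u);
	assert(!window_expired_model(0x800004E8u, 0, interval));

	/* With charged_from set to jiffies first, the window ends on time. */
	assert(!window_expired_model(INITIAL_JIFFIES_32 + 999u,
				     INITIAL_JIFFIES_32, interval));
	assert(window_expired_model(INITIAL_JIFFIES_32 + 1000u,
				    INITIAL_JIFFIES_32, interval));
}
```

The last two assertions model the patched behaviour: once `charged_from` starts at the current jiffies value, the comparison works across the wraparound because only the difference is interpreted as signed.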

4 months, 2 weeks

[PATCH AUTOSEL 6.16] io_uring/io-wq: add check free worker before create new worker

by Sasha Levin

From: Fengnan Chang <changfengnan(a)bytedance.com>

[ Upstream commit 9d83e1f05c98bab5de350bef89177e2be8b34db0 ]

After commit 0b2b066f8a85 ("io_uring/io-wq: only create a new worker if it can make progress"), in our production environment, we still observe that some io_worker threads keep being created and destroyed. After analysis, it was confirmed that this was due to a more complex scenario involving a large number of fsync operations, which can be abstracted as frequent write + fsync operations on multiple files in a single uring instance. Since write is a hash operation while fsync is not, and fsync is likely to be suspended during execution, the action of checking the hash value in io_wqe_dec_running cannot handle such scenarios. Similarly, if hash-based work and non-hash-based work are sent at the same time, similar issues are likely to occur.

Returning to the starting point of the issue, when a new work arrives, io_wq_enqueue may wake up free worker A, while io_wq_dec_running may create worker B. Ultimately, only one of A and B can obtain and process the task, leaving the other in an idle state. In the end, the issue is caused by inconsistent logic in the checks performed by io_wq_enqueue and io_wq_dec_running. Therefore, the problem can be resolved by checking for available workers in io_wq_dec_running.

Signed-off-by: Fengnan Chang <changfengnan(a)bytedance.com>
Reviewed-by: Diangang Li <lidiangang(a)bytedance.com>
Link: https://lore.kernel.org/r/20250813120214.18729-1-changfengnan@bytedance.com
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now let me analyze the nature of the fix to determine if it's appropriate for stable:

**Backport Status: YES**

This commit should be backported to stable kernel trees for the following reasons:

## Bug Fix Nature

The commit fixes a **real performance bug** that occurs in production environments.
The issue causes unnecessary creation and destruction of io_worker threads, leading to: 1. CPU overhead from constant thread creation/destruction 2. Resource wastage from redundant workers 3. Performance degradation in production workloads with mixed hash/non- hash operations ## Root Cause Analysis The patch addresses a **race condition** between `io_wq_enqueue()` and `io_wq_dec_running()`: - When new work arrives, `io_wq_enqueue()` may wake up a free worker A - Simultaneously, `io_wq_dec_running()` may create a new worker B - Only one worker can actually process the task, leaving the other idle - This creates a worker "churn" pattern that wastes resources ## Small and Contained Fix The fix is minimal and surgical: 1. Adds a check in `create_worker_cb()` to verify if a free worker is available before creating a new one 2. Uses the existing `io_acct_activate_free_worker()` function (lines 361-365) 3. Only adds 8 lines of code with proper RCU locking 4. Introduces a single goto label for clean error handling ## Production Impact The commit message explicitly states this was observed in **production environments** with real workloads involving: - Frequent write + fsync operations - Mixed hash and non-hash operations - Multiple files in a single io_uring instance ## Regression Potential The fix has **low regression risk** because: 1. It reuses existing, well-tested infrastructure (`io_acct_activate_free_worker()`) 2. The logic is consistent with what `io_wq_enqueue()` already does 3. It adds a defensive check rather than changing core logic 4. 
The RCU locking is properly scoped and follows existing patterns ## Stable Kernel Rules Compliance The patch follows stable kernel criteria: - ✓ Fixes a real bug affecting users (worker thread churn in production) - ✓ Small change (8 lines added) - ✓ Obviously correct (aligns logic between enqueue and dec_running paths) - ✓ Already tested in production (reported by ByteDance engineers) - ✓ Not a new feature or optimization ## Affected Versions This should be backported to kernels containing commit 0b2b066f8a85 ("io_uring/io-wq: only create a new worker if it can make progress") which introduced incomplete handling of the worker creation logic. The issue affects mixed workload scenarios that are common in production environments. The fact that this was discovered and reported by engineers from a major cloud provider (ByteDance) running production workloads further validates its importance for stable backporting. io_uring/io-wq.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c index be91edf34f01..17dfaa0395c4 100644 --- a/io_uring/io-wq.c +++ b/io_uring/io-wq.c @@ -357,6 +357,13 @@ static void create_worker_cb(struct callback_head *cb) worker = container_of(cb, struct io_worker, create_work); wq = worker->wq; acct = worker->acct; + + rcu_read_lock(); + do_create = !io_acct_activate_free_worker(acct); + rcu_read_unlock(); + if (!do_create) + goto no_need_create; + raw_spin_lock(&acct->workers_lock); if (acct->nr_workers < acct->max_workers) { @@ -367,6 +374,7 @@ static void create_worker_cb(struct callback_head *cb) if (do_create) { create_io_worker(wq, acct); } else { +no_need_create: atomic_dec(&acct->nr_running); io_worker_ref_put(wq); } -- 2.50.1
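The ordering that the patch introduces in create_worker_cb() can be modelled with a toy accounting structure. Everything here is a hypothetical simplification: `struct acct_model` and both functions are illustrative names, and locking, RCU and the real worker lists are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the io-wq accounting state. */
struct acct_model {
	int free_workers; /* idle workers that could be woken */
	int nr_workers;   /* workers that currently exist */
	int max_workers;  /* upper bound on workers */
};

/* Models waking an idle worker; returns true if one was available. */
static bool activate_free_worker_model(struct acct_model *acct)
{
	if (acct->free_workers > 0) {
		acct->free_workers--;
		return true;
	}
	return false;
}

/*
 * Models the patched callback: first try to reuse a free worker and
 * only create a new one when none is available and the limit allows.
 * Returns true if a new worker was created.
 */
static bool create_worker_cb_model(struct acct_model *acct)
{
	if (activate_free_worker_model(acct))
		return false; /* corresponds to the no_need_create path */

	if (acct->nr_workers < acct->max_workers) {
		acct->nr_workers++;
		return true;
	}
	return false;
}

static void check_free_worker_is_preferred(void)
{
	struct acct_model acct = {
		.free_workers = 1, .nr_workers = 1, .max_workers = 4,
	};

	/* A free worker exists: it is woken and nothing is created. */
	assert(!create_worker_cb_model(&acct));
	assert(acct.nr_workers == 1);

	/* No free worker left: now a new worker is created. */
	assert(create_worker_cb_model(&acct));
	assert(acct.nr_workers == 2);
}
```

In this model the first call leaves `nr_workers` unchanged, which is the create-then-idle churn the commit message says the extra check avoids.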

4 months, 2 weeks
