The quilt patch titled
Subject: mm/hugetlb: unshare page tables during VMA split, not before
has been removed from the -mm tree. Its filename was
mm-hugetlb-unshare-page-tables-during-vma-split-not-before.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Jann Horn <jannh(a)google.com>
Subject: mm/hugetlb: unshare page tables during VMA split, not before
Date: Tue, 27 May 2025 23:23:53 +0200
Currently, __split_vma() triggers hugetlb page table unsharing through
vm_ops->may_split(). This happens before the VMA lock and rmap locks are
taken - which is too early; it allows racing VMA-locked page faults in our
process and racing rmap walks from other processes to cause page tables to
be shared again before we actually perform the split.
Fix it by explicitly calling into the hugetlb unshare logic from
__split_vma() in the same place where THP splitting also happens. At that
point, both the VMA and the rmap(s) are write-locked.
An annoying detail is that we can now call into the helper
hugetlb_unshare_pmds() from two different locking contexts:
1. from hugetlb_split(), holding:
- mmap lock (exclusively)
- VMA lock
- file rmap lock (exclusively)
2. hugetlb_unshare_all_pmds(), which I think is designed to be able to
call us with only the mmap lock held (in shared mode), but currently
only runs while holding mmap lock (exclusively) and VMA lock
Backporting note:
This commit fixes a racy protection that was introduced in commit
b30c14cd6102 ("hugetlb: unshare some PMDs when splitting VMAs"); that
commit claimed to fix an issue introduced in 5.13, but it should actually
also go all the way back.
[jannh(a)google.com: v2]
Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-1-1329349bad1…
Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-0-1329349bad1…
Link: https://lkml.kernel.org/r/20250527-hugetlb-fixes-splitrace-v1-1-f4136f5ec58…
Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page")
Signed-off-by: Jann Horn <jannh(a)google.com>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org> [b30c14cd6102: hugetlb: unshare some PMDs when splitting VMAs]
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/hugetlb.h | 3 +
mm/hugetlb.c | 60 +++++++++++++++++++++--------
mm/vma.c | 7 +++
tools/testing/vma/vma_internal.h | 2
4 files changed, 56 insertions(+), 16 deletions(-)
--- a/include/linux/hugetlb.h~mm-hugetlb-unshare-page-tables-during-vma-split-not-before
+++ a/include/linux/hugetlb.h
@@ -279,6 +279,7 @@ bool is_hugetlb_entry_migration(pte_t pt
bool is_hugetlb_entry_hwpoisoned(pte_t pte);
void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
void fixup_hugetlb_reservations(struct vm_area_struct *vma);
+void hugetlb_split(struct vm_area_struct *vma, unsigned long addr);
#else /* !CONFIG_HUGETLB_PAGE */
@@ -476,6 +477,8 @@ static inline void fixup_hugetlb_reserva
{
}
+static inline void hugetlb_split(struct vm_area_struct *vma, unsigned long addr) {}
+
#endif /* !CONFIG_HUGETLB_PAGE */
#ifndef pgd_write
--- a/mm/hugetlb.c~mm-hugetlb-unshare-page-tables-during-vma-split-not-before
+++ a/mm/hugetlb.c
@@ -121,7 +121,7 @@ static void hugetlb_vma_lock_free(struct
static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
- unsigned long start, unsigned long end);
+ unsigned long start, unsigned long end, bool take_locks);
static struct resv_map *vma_resv_map(struct vm_area_struct *vma);
static void hugetlb_free_folio(struct folio *folio)
@@ -5426,26 +5426,40 @@ static int hugetlb_vm_op_split(struct vm
{
if (addr & ~(huge_page_mask(hstate_vma(vma))))
return -EINVAL;
+ return 0;
+}
+void hugetlb_split(struct vm_area_struct *vma, unsigned long addr)
+{
/*
* PMD sharing is only possible for PUD_SIZE-aligned address ranges
* in HugeTLB VMAs. If we will lose PUD_SIZE alignment due to this
* split, unshare PMDs in the PUD_SIZE interval surrounding addr now.
+ * This function is called in the middle of a VMA split operation, with
+ * MM, VMA and rmap all write-locked to prevent concurrent page table
+ * walks (except hardware and gup_fast()).
*/
+ vma_assert_write_locked(vma);
+ i_mmap_assert_write_locked(vma->vm_file->f_mapping);
+
if (addr & ~PUD_MASK) {
- /*
- * hugetlb_vm_op_split is called right before we attempt to
- * split the VMA. We will need to unshare PMDs in the old and
- * new VMAs, so let's unshare before we split.
- */
unsigned long floor = addr & PUD_MASK;
unsigned long ceil = floor + PUD_SIZE;
- if (floor >= vma->vm_start && ceil <= vma->vm_end)
- hugetlb_unshare_pmds(vma, floor, ceil);
+ if (floor >= vma->vm_start && ceil <= vma->vm_end) {
+ /*
+ * Locking:
+ * Use take_locks=false here.
+ * The file rmap lock is already held.
+ * The hugetlb VMA lock can't be taken when we already
+ * hold the file rmap lock, and we don't need it because
+ * its purpose is to synchronize against concurrent page
+ * table walks, which are not possible thanks to the
+ * locks held by our caller.
+ */
+ hugetlb_unshare_pmds(vma, floor, ceil, /* take_locks = */ false);
+ }
}
-
- return 0;
}
static unsigned long hugetlb_vm_op_pagesize(struct vm_area_struct *vma)
@@ -7885,9 +7899,16 @@ void move_hugetlb_state(struct folio *ol
spin_unlock_irq(&hugetlb_lock);
}
+/*
+ * If @take_locks is false, the caller must ensure that no concurrent page table
+ * access can happen (except for gup_fast() and hardware page walks).
+ * If @take_locks is true, we take the hugetlb VMA lock (to lock out things like
+ * concurrent page fault handling) and the file rmap lock.
+ */
static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ bool take_locks)
{
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
@@ -7911,8 +7932,12 @@ static void hugetlb_unshare_pmds(struct
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
start, end);
mmu_notifier_invalidate_range_start(&range);
- hugetlb_vma_lock_write(vma);
- i_mmap_lock_write(vma->vm_file->f_mapping);
+ if (take_locks) {
+ hugetlb_vma_lock_write(vma);
+ i_mmap_lock_write(vma->vm_file->f_mapping);
+ } else {
+ i_mmap_assert_write_locked(vma->vm_file->f_mapping);
+ }
for (address = start; address < end; address += PUD_SIZE) {
ptep = hugetlb_walk(vma, address, sz);
if (!ptep)
@@ -7922,8 +7947,10 @@ static void hugetlb_unshare_pmds(struct
spin_unlock(ptl);
}
flush_hugetlb_tlb_range(vma, start, end);
- i_mmap_unlock_write(vma->vm_file->f_mapping);
- hugetlb_vma_unlock_write(vma);
+ if (take_locks) {
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
+ hugetlb_vma_unlock_write(vma);
+ }
/*
* No need to call mmu_notifier_arch_invalidate_secondary_tlbs(), see
* Documentation/mm/mmu_notifier.rst.
@@ -7938,7 +7965,8 @@ static void hugetlb_unshare_pmds(struct
void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
{
hugetlb_unshare_pmds(vma, ALIGN(vma->vm_start, PUD_SIZE),
- ALIGN_DOWN(vma->vm_end, PUD_SIZE));
+ ALIGN_DOWN(vma->vm_end, PUD_SIZE),
+ /* take_locks = */ true);
}
/*
--- a/mm/vma.c~mm-hugetlb-unshare-page-tables-during-vma-split-not-before
+++ a/mm/vma.c
@@ -539,7 +539,14 @@ __split_vma(struct vma_iterator *vmi, st
init_vma_prep(&vp, vma);
vp.insert = new;
vma_prepare(&vp);
+
+ /*
+ * Get rid of huge pages and shared page tables straddling the split
+ * boundary.
+ */
vma_adjust_trans_huge(vma, vma->vm_start, addr, NULL);
+ if (is_vm_hugetlb_page(vma))
+ hugetlb_split(vma, addr);
if (new_below) {
vma->vm_start = addr;
--- a/tools/testing/vma/vma_internal.h~mm-hugetlb-unshare-page-tables-during-vma-split-not-before
+++ a/tools/testing/vma/vma_internal.h
@@ -932,6 +932,8 @@ static inline void vma_adjust_trans_huge
(void)next;
}
+static inline void hugetlb_split(struct vm_area_struct *, unsigned long) {}
+
static inline void vma_iter_free(struct vma_iterator *vmi)
{
mas_destroy(&vmi->mas);
_
Patches currently in -mm which might be from jannh(a)google.com are
hugetlb-block-hugetlb-file-creation-if-hugetlb-is-not-set-up.patch
The quilt patch titled
Subject: mm/madvise: handle madvise_lock() failure during race unwinding
has been removed from the -mm tree. Its filename was
mm-madvise-handle-madvise_lock-failure-during-race-unwinding.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: SeongJae Park <sj(a)kernel.org>
Subject: mm/madvise: handle madvise_lock() failure during race unwinding
Date: Mon, 2 Jun 2025 10:49:26 -0700
When unwinding the race on -ERESTARTNOINTR handling of process_madvise(),
a madvise_lock() failure is ignored. Check for the failure and abort the
remaining work in that case.
Link: https://lkml.kernel.org/r/20250602174926.1074-1-sj@kernel.org
Fixes: 4000e3d0a367 ("mm/madvise: remove redundant mmap_lock operations from process_madvise()")
Signed-off-by: SeongJae Park <sj(a)kernel.org>
Reported-by: Barry Song <21cnbao(a)gmail.com>
Closes: https://lore.kernel.org/CAGsJ_4xJXXO0G+4BizhohSZ4yDteziPw43_uF8nPXPWxUVChzw…
Reviewed-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Reviewed-by: Barry Song <baohua(a)kernel.org>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/madvise.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- a/mm/madvise.c~mm-madvise-handle-madvise_lock-failure-during-race-unwinding
+++ a/mm/madvise.c
@@ -1881,7 +1881,9 @@ static ssize_t vector_madvise(struct mm_
/* Drop and reacquire lock to unwind race. */
madvise_finish_tlb(&madv_behavior);
madvise_unlock(mm, behavior);
- madvise_lock(mm, behavior);
+ ret = madvise_lock(mm, behavior);
+ if (ret)
+ goto out;
madvise_init_tlb(&madv_behavior, mm);
continue;
}
@@ -1892,6 +1894,7 @@ static ssize_t vector_madvise(struct mm_
madvise_finish_tlb(&madv_behavior);
madvise_unlock(mm, behavior);
+out:
ret = (total_len - iov_iter_count(iter)) ? : ret;
return ret;
_
Patches currently in -mm which might be from sj(a)kernel.org are
mm-damon-introduce-damon_stat-module.patch
mm-damon-introduce-damon_stat-module-fix.patch
mm-damon-stat-calculate-and-expose-estimated-memory-bandwidth.patch
mm-damon-stat-calculate-and-expose-idle-time-percentiles.patch
docs-admin-guide-mm-damon-add-damon_stat-usage-document.patch
The quilt patch titled
Subject: KVM: s390: rename PROT_NONE to PROT_TYPE_DUMMY
has been removed from the -mm tree. Its filename was
kvm-s390-rename-prot_none-to-prot_type_dummy.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: KVM: s390: rename PROT_NONE to PROT_TYPE_DUMMY
Date: Mon, 19 May 2025 15:56:57 +0100
The enum type prot_type declared in arch/s390/kvm/gaccess.c declares an
unfortunate identifier within it - PROT_NONE.
This clashes with the protection bit define from the uapi for mmap()
declared in include/uapi/asm-generic/mman-common.h, which is indeed what
those casually reading this code would assume this to refer to.
This means that any changes which subsequently alter headers in any way
which results in the uapi header being imported here will cause build
errors.
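As a hedged userspace illustration of the clash (a stripped-down enum modelled
on the one above; only PROT_NONE and PROT_TYPE_DUMMY come from the patch, the
rest is demo scaffolding): PROT_NONE is a macro expanding to 0x0 in the mman
uapi header, so reusing the name as an enumerator stops compiling as soon as
that header is pulled in, while a distinct name is unaffected. Build with
-DDEMO_CLASH to see the failure; the default build compiles and runs.

#include <sys/mman.h>           /* #define PROT_NONE 0x0 */

enum prot_type_demo {
        PROT_TYPE_LA = 0,       /* placeholder enumerator for the demo */
#ifdef DEMO_CLASH
        PROT_NONE,              /* preprocesses to "0x0," -> build error */
#else
        PROT_TYPE_DUMMY,        /* a distinct identifier compiles fine */
#endif
};

int main(void)
{
        return 0;
}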
Resolve the issue by renaming PROT_NONE to PROT_TYPE_DUMMY.
Link: https://lkml.kernel.org/r/20250519145657.178365-1-lorenzo.stoakes@oracle.com
Fixes: b3cefd6bf16e ("KVM: s390: Pass initialized arg even if unused")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Suggested-by: Ignacio Moreno Gonzalez <Ignacio.MorenoGonzalez(a)kuka.com>
Reported-by: kernel test robot <lkp(a)intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202505140943.IgHDa9s7-lkp@intel.com/
Acked-by: Christian Borntraeger <borntraeger(a)linux.ibm.com>
Acked-by: Ignacio Moreno Gonzalez <Ignacio.MorenoGonzalez(a)kuka.com>
Acked-by: Yang Shi <yang(a)os.amperecomputing.com>
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Reviewed-by: Claudio Imbrenda <imbrenda(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org>
Cc: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Heiko Carstens <hca(a)linux.ibm.com>
Cc: James Houghton <jthoughton(a)google.com>
Cc: Janosch Frank <frankja(a)linux.ibm.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Sven Schnelle <svens(a)linux.ibm.com>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/s390/kvm/gaccess.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
--- a/arch/s390/kvm/gaccess.c~kvm-s390-rename-prot_none-to-prot_type_dummy
+++ a/arch/s390/kvm/gaccess.c
@@ -319,7 +319,7 @@ enum prot_type {
PROT_TYPE_DAT = 3,
PROT_TYPE_IEP = 4,
/* Dummy value for passing an initialized value when code != PGM_PROTECTION */
- PROT_NONE,
+ PROT_TYPE_DUMMY,
};
static int trans_exc_ending(struct kvm_vcpu *vcpu, int code, unsigned long gva, u8 ar,
@@ -335,7 +335,7 @@ static int trans_exc_ending(struct kvm_v
switch (code) {
case PGM_PROTECTION:
switch (prot) {
- case PROT_NONE:
+ case PROT_TYPE_DUMMY:
/* We should never get here, acts like termination */
WARN_ON_ONCE(1);
break;
@@ -805,7 +805,7 @@ static int guest_range_to_gpas(struct kv
gpa = kvm_s390_real_to_abs(vcpu, ga);
if (!kvm_is_gpa_in_memslot(vcpu->kvm, gpa)) {
rc = PGM_ADDRESSING;
- prot = PROT_NONE;
+ prot = PROT_TYPE_DUMMY;
}
}
if (rc)
@@ -963,7 +963,7 @@ int access_guest_with_key(struct kvm_vcp
if (rc == PGM_PROTECTION)
prot = PROT_TYPE_KEYC;
else
- prot = PROT_NONE;
+ prot = PROT_TYPE_DUMMY;
rc = trans_exc_ending(vcpu, rc, ga, ar, mode, prot, terminate);
}
out_unlock:
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
docs-mm-expand-vma-doc-to-highlight-pte-freeing-non-vma-traversal.patch
mm-ksm-have-ksm-vma-checks-not-require-a-vma-pointer.patch
mm-ksm-refer-to-special-vmas-via-vm_special-in-ksm_compatible.patch
mm-prevent-ksm-from-breaking-vma-merging-for-new-vmas.patch
tools-testing-selftests-add-vma-merge-tests-for-ksm-merge.patch
mm-pagewalk-split-walk_page_range_novma-into-kernel-user-parts.patch
The quilt patch titled
Subject: mm: fix uprobe pte be overwritten when expanding vma
has been removed from the -mm tree. Its filename was
mm-fix-uprobe-pte-be-overwritten-when-expanding-vma.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Pu Lehui <pulehui(a)huawei.com>
Subject: mm: fix uprobe pte be overwritten when expanding vma
Date: Thu, 29 May 2025 15:56:47 +0000
Patch series "Fix uprobe pte be overwritten when expanding vma".
This patch (of 4):
We encountered a BUG alert triggered by Syzkaller as follows:
BUG: Bad rss-counter state mm:00000000b4a60fca type:MM_ANONPAGES val:1
And we can reproduce it with the following steps:
1. register uprobe on file at zero offset
2. mmap the file at zero offset:
addr1 = mmap(NULL, 2 * 4096, PROT_NONE, MAP_PRIVATE, fd, 0);
3. mremap part of vma1 to new vma2:
addr2 = mremap(addr1, 4096, 2 * 4096, MREMAP_MAYMOVE);
4. mremap back to orig addr1:
mremap(addr2, 4096, 4096, MREMAP_MAYMOVE | MREMAP_FIXED, addr1);
In step 3, the vma1 range [addr1, addr1 + 4096] will be remapped to a new
vma2 with range [addr2, addr2 + 8192], the uprobe anon page will be remapped
from vma1 to vma2, and then the vma1 range [addr1, addr1 + 4096] is unmapped.
In step 4, the vma2 range [addr2, addr2 + 4096] will be remapped back to the
addr range [addr1, addr1 + 4096]. Since the addr range [addr1 + 4096,
addr1 + 8192] still maps the file, it will take vma_merge_new_range to
expand the range, and then do uprobe_mmap in vma_complete. Since the
merged vma pgoff is also zero offset, it will install a uprobe anon page
into the merged vma. However, the upcoming move_page_tables step, which
uses set_pte_at to remap the vma2 uprobe pte to the merged vma, will
overwrite the newly installed uprobe pte in the merged vma and leave that
pte orphaned.
Since the uprobe pte will be remapped to the merged vma, we can remove the
unnecessary uprobe_mmap call on the merged vma.
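For reference, reproducer steps 2-4 above can be packaged as a small program.
This is a hedged sketch only: step 1 (registering a uprobe at file offset 0)
is assumed to be done separately, e.g. via /sys/kernel/tracing/uprobe_events,
and error handling is minimal.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc < 2)
                return 1;
        int fd = open(argv[1], O_RDONLY);   /* file with a uprobe at offset 0 */
        if (fd < 0)
                return 1;

        /* step 2: map two pages of the file at offset 0 */
        char *addr1 = mmap(NULL, 2 * 4096, PROT_NONE, MAP_PRIVATE, fd, 0);

        /* step 3: move the first page away, growing it into a new vma2 */
        char *addr2 = mremap(addr1, 4096, 2 * 4096, MREMAP_MAYMOVE);

        /* step 4: move one page of vma2 back on top of the original addr1 */
        mremap(addr2, 4096, 4096, MREMAP_MAYMOVE | MREMAP_FIXED, addr1);

        pause();        /* keep the mappings alive for inspection */
        return 0;
}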
This problem was first found in linux-6.6.y and also exists in the
community syzkaller:
https://lore.kernel.org/all/000000000000ada39605a5e71711@google.com/T/
Link: https://lkml.kernel.org/r/20250529155650.4017699-1-pulehui@huaweicloud.com
Link: https://lkml.kernel.org/r/20250529155650.4017699-2-pulehui@huaweicloud.com
Fixes: 2b1444983508 ("uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints")
Signed-off-by: Pu Lehui <pulehui(a)huawei.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat(a)kernel.org>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vma.c | 20 +++++++++++++++++---
mm/vma.h | 7 +++++++
2 files changed, 24 insertions(+), 3 deletions(-)
--- a/mm/vma.c~mm-fix-uprobe-pte-be-overwritten-when-expanding-vma
+++ a/mm/vma.c
@@ -169,6 +169,9 @@ static void init_multi_vma_prep(struct v
vp->file = vma->vm_file;
if (vp->file)
vp->mapping = vma->vm_file->f_mapping;
+
+ if (vmg && vmg->skip_vma_uprobe)
+ vp->skip_vma_uprobe = true;
}
/*
@@ -358,10 +361,13 @@ static void vma_complete(struct vma_prep
if (vp->file) {
i_mmap_unlock_write(vp->mapping);
- uprobe_mmap(vp->vma);
- if (vp->adj_next)
- uprobe_mmap(vp->adj_next);
+ if (!vp->skip_vma_uprobe) {
+ uprobe_mmap(vp->vma);
+
+ if (vp->adj_next)
+ uprobe_mmap(vp->adj_next);
+ }
}
if (vp->remove) {
@@ -1823,6 +1829,14 @@ struct vm_area_struct *copy_vma(struct v
faulted_in_anon_vma = false;
}
+ /*
+ * If the VMA we are copying might contain a uprobe PTE, ensure
+ * that we do not establish one upon merge. Otherwise, when mremap()
+ * moves page tables, it will orphan the newly created PTE.
+ */
+ if (vma->vm_file)
+ vmg.skip_vma_uprobe = true;
+
new_vma = find_vma_prev(mm, addr, &vmg.prev);
if (new_vma && new_vma->vm_start < addr + len)
return NULL; /* should never get here */
--- a/mm/vma.h~mm-fix-uprobe-pte-be-overwritten-when-expanding-vma
+++ a/mm/vma.h
@@ -19,6 +19,8 @@ struct vma_prepare {
struct vm_area_struct *insert;
struct vm_area_struct *remove;
struct vm_area_struct *remove2;
+
+ bool skip_vma_uprobe :1;
};
struct unlink_vma_file_batch {
@@ -120,6 +122,11 @@ struct vma_merge_struct {
*/
bool give_up_on_oom :1;
+ /*
+ * If set, skip uprobe_mmap upon merged vma.
+ */
+ bool skip_vma_uprobe :1;
+
/* Internal flags set during merge process: */
/*
_
Patches currently in -mm which might be from pulehui(a)huawei.com are
In the interrupt handler rain_interrupt(), the buffer full check on
rain->buf_len is performed before acquiring rain->buf_lock. This
creates a Time-of-Check to Time-of-Use (TOCTOU) race condition, as
rain->buf_len is concurrently accessed and modified in the work
handler rain_irq_work_handler() under the same lock.
Multiple interrupt invocations can race, with each reading buf_len
before it becomes full and then proceeding. This can lead to both
interrupts attempting to write to the buffer, incrementing buf_len
beyond its capacity (DATA_SIZE) and causing a buffer overflow.
Fix this bug by moving the spin_lock() to before the buffer full
check. This ensures that the check and the subsequent buffer modification
are performed atomically, preventing the race condition. A corresponding
spin_unlock() is added to the overflow path to correctly release the
lock.
This possible bug was found by an experimental static analysis tool
developed by our team.
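The shape of the fix can be seen in a hedged userspace analogue (pthreads
standing in for the spinlock; names mirror the driver but the program is
purely illustrative): the capacity check and the update must sit in the same
critical section, otherwise two producers can both pass the check and
overflow the buffer.

#include <pthread.h>
#include <stdio.h>

#define DATA_SIZE 256

static unsigned char buf[DATA_SIZE];
static int buf_len;
static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;

/* Fixed pattern: check and update under one lock acquisition. */
static int push(unsigned char data)
{
        pthread_mutex_lock(&buf_lock);
        if (buf_len == DATA_SIZE) {
                pthread_mutex_unlock(&buf_lock);
                return -1;              /* full: report overflow, write nothing */
        }
        buf[buf_len++] = data;
        pthread_mutex_unlock(&buf_lock);
        return 0;
}

int main(void)
{
        for (int i = 0; i < 300; i++)
                if (push((unsigned char)i))
                        printf("buffer full at %d\n", i);
        return 0;
}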
Fixes: 0f314f6c2e77 ("[media] rainshadow-cec: new RainShadow Tech HDMI CEC driver")
Cc: stable(a)vger.kernel.org
Signed-off-by: Gui-Dong Han <hanguidong02(a)gmail.com>
---
drivers/media/cec/usb/rainshadow/rainshadow-cec.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/media/cec/usb/rainshadow/rainshadow-cec.c b/drivers/media/cec/usb/rainshadow/rainshadow-cec.c
index ee870ea1a886..6f8d6797c614 100644
--- a/drivers/media/cec/usb/rainshadow/rainshadow-cec.c
+++ b/drivers/media/cec/usb/rainshadow/rainshadow-cec.c
@@ -171,11 +171,12 @@ static irqreturn_t rain_interrupt(struct serio *serio, unsigned char data,
{
struct rain *rain = serio_get_drvdata(serio);
+ spin_lock(&rain->buf_lock);
if (rain->buf_len == DATA_SIZE) {
+ spin_unlock(&rain->buf_lock);
dev_warn_once(rain->dev, "buffer overflow\n");
return IRQ_HANDLED;
}
- spin_lock(&rain->buf_lock);
rain->buf_len++;
rain->buf[rain->buf_wr_idx] = data;
rain->buf_wr_idx = (rain->buf_wr_idx + 1) & 0xff;
--
2.25.1
From: Ajay Agarwal <ajayagarwal(a)google.com>
[ Upstream commit 7447990137bf06b2aeecad9c6081e01a9f47f2aa ]
PCIe r6.2, sec 5.5.4, requires that:
If setting either or both of the enable bits for ASPM L1 PM Substates,
both ports must be configured as described in this section while ASPM L1
is disabled.
Previously, pcie_config_aspm_l1ss() assumed that "setting enable bits"
meant "setting them to 1", and it configured L1SS as follows:
- Clear L1SS enable bits
- Disable L1
- Configure L1SS enable bits as required
- Enable L1 if required
With this sequence, when disabling L1SS on an ARM A-core with a Synopsys
DesignWare PCIe core, the CPU occasionally hangs when reading
PCI_L1SS_CTL1, leading to a reboot when the CPU watchdog expires.
Move the L1 disable to the caller (pcie_config_aspm_link(), where L1 was
already enabled) so L1 is always disabled while updating the L1SS bits:
- Disable L1
- Clear L1SS enable bits
- Configure L1SS enable bits as required
- Enable L1 if required
Change pcie_aspm_cap_init() similarly.
Link: https://lore.kernel.org/r/20241007032917.872262-1-ajayagarwal@google.com
Signed-off-by: Ajay Agarwal <ajayagarwal(a)google.com>
[bhelgaas: comments, commit log, compute L1SS setting before config access]
Signed-off-by: Bjorn Helgaas <bhelgaas(a)google.com>
Tested-by: Johnny-CC Chang <Johnny-CC.Chang(a)mediatek.com>
Signed-off-by: Macpaul Lin <macpaul.lin(a)mediatek.com>
---
drivers/pci/pcie/aspm.c | 92 ++++++++++++++++++++++-------------------
1 file changed, 50 insertions(+), 42 deletions(-)
diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index cee2365e54b8..e943691bc931 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -805,6 +805,15 @@ static void pcie_aspm_cap_init(struct pcie_link_state *link, int blacklist)
pcie_capability_read_word(parent, PCI_EXP_LNKCTL, &parent_lnkctl);
pcie_capability_read_word(child, PCI_EXP_LNKCTL, &child_lnkctl);
+ /* Disable L0s/L1 before updating L1SS config */
+ if (FIELD_GET(PCI_EXP_LNKCTL_ASPMC, child_lnkctl) ||
+ FIELD_GET(PCI_EXP_LNKCTL_ASPMC, parent_lnkctl)) {
+ pcie_capability_write_word(child, PCI_EXP_LNKCTL,
+ child_lnkctl & ~PCI_EXP_LNKCTL_ASPMC);
+ pcie_capability_write_word(parent, PCI_EXP_LNKCTL,
+ parent_lnkctl & ~PCI_EXP_LNKCTL_ASPMC);
+ }
+
/*
* Setup L0s state
*
@@ -829,6 +838,13 @@ static void pcie_aspm_cap_init(struct pcie_link_state *link, int blacklist)
aspm_l1ss_init(link);
+ /* Restore L0s/L1 if they were enabled */
+ if (FIELD_GET(PCI_EXP_LNKCTL_ASPMC, child_lnkctl) ||
+ FIELD_GET(PCI_EXP_LNKCTL_ASPMC, parent_lnkctl)) {
+ pcie_capability_write_word(parent, PCI_EXP_LNKCTL, parent_lnkctl);
+ pcie_capability_write_word(child, PCI_EXP_LNKCTL, child_lnkctl);
+ }
+
/* Save default state */
link->aspm_default = link->aspm_enabled;
@@ -845,25 +861,28 @@ static void pcie_aspm_cap_init(struct pcie_link_state *link, int blacklist)
}
}
-/* Configure the ASPM L1 substates */
+/* Configure the ASPM L1 substates. Caller must disable L1 first. */
static void pcie_config_aspm_l1ss(struct pcie_link_state *link, u32 state)
{
- u32 val, enable_req;
+ u32 val;
struct pci_dev *child = link->downstream, *parent = link->pdev;
- enable_req = (link->aspm_enabled ^ state) & state;
+ val = 0;
+ if (state & PCIE_LINK_STATE_L1_1)
+ val |= PCI_L1SS_CTL1_ASPM_L1_1;
+ if (state & PCIE_LINK_STATE_L1_2)
+ val |= PCI_L1SS_CTL1_ASPM_L1_2;
+ if (state & PCIE_LINK_STATE_L1_1_PCIPM)
+ val |= PCI_L1SS_CTL1_PCIPM_L1_1;
+ if (state & PCIE_LINK_STATE_L1_2_PCIPM)
+ val |= PCI_L1SS_CTL1_PCIPM_L1_2;
/*
- * Here are the rules specified in the PCIe spec for enabling L1SS:
- * - When enabling L1.x, enable bit at parent first, then at child
- * - When disabling L1.x, disable bit at child first, then at parent
- * - When enabling ASPM L1.x, need to disable L1
- * (at child followed by parent).
- * - The ASPM/PCIPM L1.2 must be disabled while programming timing
+ * PCIe r6.2, sec 5.5.4, rules for enabling L1 PM Substates:
+ * - Clear L1.x enable bits at child first, then at parent
+ * - Set L1.x enable bits at parent first, then at child
+ * - ASPM/PCIPM L1.2 must be disabled while programming timing
* parameters
- *
- * To keep it simple, disable all L1SS bits first, and later enable
- * what is needed.
*/
/* Disable all L1 substates */
@@ -871,26 +890,6 @@ static void pcie_config_aspm_l1ss(struct pcie_link_state *link, u32 state)
PCI_L1SS_CTL1_L1SS_MASK, 0);
pci_clear_and_set_config_dword(parent, parent->l1ss + PCI_L1SS_CTL1,
PCI_L1SS_CTL1_L1SS_MASK, 0);
- /*
- * If needed, disable L1, and it gets enabled later
- * in pcie_config_aspm_link().
- */
- if (enable_req & (PCIE_LINK_STATE_L1_1 | PCIE_LINK_STATE_L1_2)) {
- pcie_capability_clear_word(child, PCI_EXP_LNKCTL,
- PCI_EXP_LNKCTL_ASPM_L1);
- pcie_capability_clear_word(parent, PCI_EXP_LNKCTL,
- PCI_EXP_LNKCTL_ASPM_L1);
- }
-
- val = 0;
- if (state & PCIE_LINK_STATE_L1_1)
- val |= PCI_L1SS_CTL1_ASPM_L1_1;
- if (state & PCIE_LINK_STATE_L1_2)
- val |= PCI_L1SS_CTL1_ASPM_L1_2;
- if (state & PCIE_LINK_STATE_L1_1_PCIPM)
- val |= PCI_L1SS_CTL1_PCIPM_L1_1;
- if (state & PCIE_LINK_STATE_L1_2_PCIPM)
- val |= PCI_L1SS_CTL1_PCIPM_L1_2;
/* Enable what we need to enable */
pci_clear_and_set_config_dword(parent, parent->l1ss + PCI_L1SS_CTL1,
@@ -937,21 +936,30 @@ static void pcie_config_aspm_link(struct pcie_link_state *link, u32 state)
dwstream |= PCI_EXP_LNKCTL_ASPM_L1;
}
+ /*
+ * Per PCIe r6.2, sec 5.5.4, setting either or both of the enable
+ * bits for ASPM L1 PM Substates must be done while ASPM L1 is
+ * disabled. Disable L1 here and apply new configuration after L1SS
+ * configuration has been completed.
+ *
+ * Per sec 7.5.3.7, when disabling ASPM L1, software must disable
+ * it in the Downstream component prior to disabling it in the
+ * Upstream component, and ASPM L1 must be enabled in the Upstream
+ * component prior to enabling it in the Downstream component.
+ *
+ * Sec 7.5.3.7 also recommends programming the same ASPM Control
+ * value for all functions of a multi-function device.
+ */
+ list_for_each_entry(child, &linkbus->devices, bus_list)
+ pcie_config_aspm_dev(child, 0);
+ pcie_config_aspm_dev(parent, 0);
+
if (link->aspm_capable & PCIE_LINK_STATE_L1SS)
pcie_config_aspm_l1ss(link, state);
- /*
- * Spec 2.0 suggests all functions should be configured the
- * same setting for ASPM. Enabling ASPM L1 should be done in
- * upstream component first and then downstream, and vice
- * versa for disabling ASPM L1. Spec doesn't mention L0S.
- */
- if (state & PCIE_LINK_STATE_L1)
- pcie_config_aspm_dev(parent, upstream);
+ pcie_config_aspm_dev(parent, upstream);
list_for_each_entry(child, &linkbus->devices, bus_list)
pcie_config_aspm_dev(child, dwstream);
- if (!(state & PCIE_LINK_STATE_L1))
- pcie_config_aspm_dev(parent, upstream);
link->aspm_enabled = state;
--
2.45.2
From: Ming Lei <ming.lei(a)redhat.com>
[ Upstream commit 26064d3e2b4d9a14df1072980e558c636fb023ea ]
>4GB folios are possible on some ARCHs; aarch64, for example, supports 16GB
hugepages. In that case the folio 'offset' can't be held in an 'unsigned int',
which causes a warning in bio_add_folio_nofail() and I/O failure.
Fix it by adjusting 'page' and trimming 'offset' so that `->bi_offset` won't
overflow and the folio can be added to the bio successfully.
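As a hedged userspace illustration of the arithmetic (PAGE_SIZE assumed to be
4 KiB; this is not kernel code): a 5 GiB offset into a 16 GiB folio does not
fit in an unsigned int, but splitting it into a page index plus an in-page
remainder always keeps the remainder below PAGE_SIZE.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

int main(void)
{
        uint64_t off = 5ULL << 30;              /* 5 GiB offset into a 16 GiB folio */
        uint64_t nr = off / PAGE_SIZE;          /* page index within the folio */
        uint32_t in_page = off % PAGE_SIZE;     /* always < PAGE_SIZE, fits in u32 */

        printf("off=%llu (> UINT_MAX), page index=%llu, in-page offset=%u\n",
               (unsigned long long)off, (unsigned long long)nr, in_page);
        return 0;
}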
Fixes: ed9832bc08db ("block: introduce folio awareness and add a bigger size from folio")
Cc: Kundan Kumar <kundan.kumar(a)samsung.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Luis Chamberlain <mcgrof(a)kernel.org>
Cc: Gavin Shan <gshan(a)redhat.com>
Signed-off-by: Ming Lei <ming.lei(a)redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Link: https://lore.kernel.org/r/20250312145136.2891229-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
Signed-off-by: Alva Lan <alvalan9(a)foxmail.com>
---
The follow-up fix fbecd731de05 ("xfs: fix zoned GC data corruption due
to wrong bv_offset") addresses issues in the file fs/xfs/xfs_zone_gc.c.
This file was first introduced in version v6.15-rc1. So don't backport
the follow-up fix to 6.12.y.
---
block/bio.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 20c74696bf23..094a5adf79d2 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1156,9 +1156,10 @@ EXPORT_SYMBOL(bio_add_page);
void bio_add_folio_nofail(struct bio *bio, struct folio *folio, size_t len,
size_t off)
{
+ unsigned long nr = off / PAGE_SIZE;
+
WARN_ON_ONCE(len > UINT_MAX);
- WARN_ON_ONCE(off > UINT_MAX);
- __bio_add_page(bio, &folio->page, len, off);
+ __bio_add_page(bio, folio_page(folio, nr), len, off % PAGE_SIZE);
}
EXPORT_SYMBOL_GPL(bio_add_folio_nofail);
@@ -1179,9 +1180,11 @@ EXPORT_SYMBOL_GPL(bio_add_folio_nofail);
bool bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
size_t off)
{
- if (len > UINT_MAX || off > UINT_MAX)
+ unsigned long nr = off / PAGE_SIZE;
+
+ if (len > UINT_MAX)
return false;
- return bio_add_page(bio, &folio->page, len, off) > 0;
+ return bio_add_page(bio, folio_page(folio, nr), len, off % PAGE_SIZE) > 0;
}
EXPORT_SYMBOL(bio_add_folio);
--
2.34.1
From: Kairui Song <kasong(a)tencent.com>
On seeing a swap entry PTE, userfaultfd_move does a lockless swap
cache lookup, and tries to move the found folio to the faulting vma.
Currently, it relies on checking the PTE value to ensure that the moved
folio still belongs to the src swap entry and that no new folio has
been added to the swap cache, which turns out to be unreliable.
While working on and reviewing the swap table series with Barry, the
following existing races were observed and reproduced [1]:
In the example below, move_pages_pte is moving src_pte to dst_pte,
where src_pte is a swap entry PTE holding swap entry S1, and S1
is not in the swap cache:
CPU1                               CPU2
userfaultfd_move
move_pages_pte()
  entry = pte_to_swp_entry(orig_src_pte);
  // Here it got entry = S1
  ... < interrupted> ...
                                   <swapin src_pte, alloc and use folio A>
                                   // folio A is a new allocated folio
                                   // and get installed into src_pte
                                   <frees swap entry S1>
                                   // src_pte now points to folio A, S1
                                   // has swap count == 0, it can be freed
                                   // by folio_swap_swap or swap
                                   // allocator's reclaim.
                                   <try to swap out another folio B>
                                   // folio B is a folio in another VMA.
                                   <put folio B to swap cache using S1 >
                                   // S1 is freed, folio B can use it
                                   // for swap out with no problem.
                                   ...
  folio = filemap_get_folio(S1)
  // Got folio B here !!!
  ... < interrupted again> ...
                                   <swapin folio B and free S1>
                                   // Now S1 is free to be used again.
                                   <swapout src_pte & folio A using S1>
                                   // Now src_pte is a swap entry PTE
                                   // holding S1 again.
  folio_trylock(folio)
  move_swap_pte
    double_pt_lock
    is_pte_pages_stable
    // Check passed because src_pte == S1
    folio_move_anon_rmap(...)
  // Moved invalid folio B here !!!
The race window is very short and requires multiple collisions of
multiple rare events, so it's very unlikely to happen, but with a
deliberately constructed reproducer and increased time window, it
can be reproduced easily.
This can be fixed by checking if the folio returned by filemap is the
valid swap cache folio after acquiring the folio lock.
Another similar race is possible: filemap_get_folio may return NULL, but
folio (A) could be swapped in and then swapped out again using the same
swap entry after the lookup. In such a case, folio (A) may remain in the
swap cache, so it must be moved too:
CPU1                               CPU2
userfaultfd_move
move_pages_pte()
  entry = pte_to_swp_entry(orig_src_pte);
  // Here it got entry = S1, and S1 is not in swap cache
  folio = filemap_get_folio(S1)
  // Got NULL
  ... < interrupted again> ...
                                   <swapin folio A and free S1>
                                   <swapout folio A re-using S1>
  move_swap_pte
    double_pt_lock
    is_pte_pages_stable
    // Check passed because src_pte == S1
    folio_move_anon_rmap(...)
  // folio A is ignored !!!
Fix this by checking the swap cache again after acquiring the src_pte
lock. And to avoid the filemap overhead, we check swap_map directly [2].
The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
far we don't need to worry about that, since folios can only be exposed
to the swap cache in the swap out path, and this is covered in this
patch by checking the swap cache again after acquiring the src_pte lock.
Testing with a simple C program that allocates and moves several GB of
memory did not show any observable performance change.
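For context, a hedged sketch of that kind of test program is below
(assumptions: a 6.8+ kernel with userfaultfd and UFFDIO_MOVE available,
sufficient privileges or unprivileged_userfaultfd enabled; sizes are
illustrative and error handling is omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        size_t len = 1UL << 30;                         /* 1 GiB per pass */
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

        struct uffdio_api api = { .api = UFFD_API, .features = UFFD_FEATURE_MOVE };
        ioctl(uffd, UFFDIO_API, &api);

        char *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memset(src, 0x5a, len);                         /* fault the source pages in */

        /* UFFDIO_MOVE requires the destination range to be registered. */
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)dst, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        struct uffdio_move mv = {
                .dst = (unsigned long)dst,
                .src = (unsigned long)src,
                .len = len,
                .mode = 0,
        };
        ioctl(uffd, UFFDIO_MOVE, &mv);                  /* move src pages into dst */
        return 0;
}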
Cc: <stable(a)vger.kernel.org>
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoS… [1]
Link: https://lore.kernel.org/all/CAGsJ_4yJhJBo16XhiC-nUzSheyX-V3-nFE+tAi=8Y560K8… [2]
Signed-off-by: Kairui Song <kasong(a)tencent.com>
Reviewed-by: Lokesh Gidra <lokeshgidra(a)google.com>
---
V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
Changes:
- Check swap_map instead of doing a filemap lookup after acquiring the
PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
Changes:
- Move the folio and swap check inside move_swap_pte to avoid skipping
the check and potential overhead [ Lokesh Gidra ]
- Add a READ_ONCE for the swap_map read to ensure it reads an up-to-date
value.
V3: https://lore.kernel.org/all/20250602181419.20478-1-ryncsn@gmail.com/
Changes:
- Add more comments and more context in commit message.
mm/userfaultfd.c | 33 +++++++++++++++++++++++++++++++--
1 file changed, 31 insertions(+), 2 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..8253978ee0fb 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
pte_t orig_dst_pte, pte_t orig_src_pte,
pmd_t *dst_pmd, pmd_t dst_pmdval,
spinlock_t *dst_ptl, spinlock_t *src_ptl,
- struct folio *src_folio)
+ struct folio *src_folio,
+ struct swap_info_struct *si, swp_entry_t entry)
{
+ /*
+ * Check if the folio still belongs to the target swap entry after
+ * acquiring the lock. Folio can be freed in the swap cache while
+ * not locked.
+ */
+ if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
+ entry.val != src_folio->swap.val))
+ return -EAGAIN;
+
double_pt_lock(dst_ptl, src_ptl);
if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1102,6 +1112,25 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
if (src_folio) {
folio_move_anon_rmap(src_folio, dst_vma);
src_folio->index = linear_page_index(dst_vma, dst_addr);
+ } else {
+ /*
+ * Check if the swap entry is cached after acquiring the src_pte
+ * lock. Otherwise, we might miss a newly loaded swap cache folio.
+ *
+ * Check swap_map directly to minimize overhead, READ_ONCE is sufficient.
+ * We are trying to catch newly added swap cache, the only possible case is
+ * when a folio is swapped in and out again staying in swap cache, using the
+ * same entry before the PTE check above. The PTL is acquired and released
+ * twice, each time after updating the swap_map's flag. So holding
+ * the PTL here ensures we see the updated value. False positive is possible,
+ * e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the
+ * cache, or during the tiny synchronization window between swap cache and
+ * swap_map, but it will be gone very quickly, worst result is retry jitters.
+ */
+ if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
+ double_pt_unlock(dst_ptl, src_ptl);
+ return -EAGAIN;
+ }
}
orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
@@ -1412,7 +1441,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
}
err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
- dst_ptl, src_ptl, src_folio);
+ dst_ptl, src_ptl, src_folio, si, entry);
}
out:
--
2.49.0