The quilt patch titled
Subject: mm: memory-failure: update ttu flag inside unmap_poisoned_folio
has been removed from the -mm tree. Its filename was
mm-memory-failure-update-ttu-flag-inside-unmap_poisoned_folio.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Ma Wupeng <mawupeng1(a)huawei.com>
Subject: mm: memory-failure: update ttu flag inside unmap_poisoned_folio
Date: Mon, 17 Feb 2025 09:43:27 +0800
Patch series "mm: memory_failure: unmap poisoned folio during migrate
properly", v3.
Fix two bugs during folio migration if the folio is poisoned.
This patch (of 3):
Commit 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to
TTU_HWPOISON") introduced TTU_HWPOISON to replace TTU_IGNORE_HWPOISON in
order to stop sending a SIGBUS signal when accessing an error page after a
memory error on a clean folio. However, during page migration an anon
folio must be unmapped with TTU_HWPOISON set during unmap_*(). For
pagecache folios we need a policy, like the one in hwpoison_user_mappings,
to decide whether to set this flag. So move that policy from
hwpoison_user_mappings to unmap_poisoned_folio to handle this warning
properly.
The following warning is produced when unmapping a poisoned folio:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 365 at mm/rmap.c:1847 try_to_unmap_one+0x8fc/0xd3c
Modules linked in:
CPU: 1 UID: 0 PID: 365 Comm: bash Tainted: G W 6.13.0-rc1-00018-gacdb4bbda7ab #42
Tainted: [W]=WARN
Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : try_to_unmap_one+0x8fc/0xd3c
lr : try_to_unmap_one+0x3dc/0xd3c
Call trace:
try_to_unmap_one+0x8fc/0xd3c (P)
try_to_unmap_one+0x3dc/0xd3c (L)
rmap_walk_anon+0xdc/0x1f8
rmap_walk+0x3c/0x58
try_to_unmap+0x88/0x90
unmap_poisoned_folio+0x30/0xa8
do_migrate_range+0x4a0/0x568
offline_pages+0x5a4/0x670
memory_block_action+0x17c/0x374
memory_subsys_offline+0x3c/0x78
device_offline+0xa4/0xd0
state_store+0x8c/0xf0
dev_attr_store+0x18/0x2c
sysfs_kf_write+0x44/0x54
kernfs_fop_write_iter+0x118/0x1a8
vfs_write+0x3a8/0x4bc
ksys_write+0x6c/0xf8
__arm64_sys_write+0x1c/0x28
invoke_syscall+0x44/0x100
el0_svc_common.constprop.0+0x40/0xe0
do_el0_svc+0x1c/0x28
el0_svc+0x30/0xd0
el0t_64_sync_handler+0xc8/0xcc
el0t_64_sync+0x198/0x19c
---[ end trace 0000000000000000 ]---
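For illustration only, here is a minimal caller sketch of the new interface
(a hypothetical helper assuming only the declarations this patch adds to
mm/internal.h; try_unmap_hwpoison() is a made-up name, not code from the
patch): unmap_poisoned_folio() now takes the pfn and a must_kill hint,
applies the TTU flag policy internally, and returns 0 on success or -EBUSY
if the folio is still mapped.
	/* Hypothetical sketch only, not part of the patch. */
	static bool try_unmap_hwpoison(struct folio *folio, unsigned long pfn,
				       bool must_kill)
	{
		if (!folio_mapped(folio))
			return true;	/* nothing to unmap */
		/* TTU flag selection now happens inside unmap_poisoned_folio(). */
		return unmap_poisoned_folio(folio, pfn, must_kill) == 0;
	}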
[mawupeng1(a)huawei.com: unmap_poisoned_folio(): remove shadowed local `mapping', per Miaohe]
Link: https://lkml.kernel.org/r/20250219060653.3849083-1-mawupeng1@huawei.com
Link: https://lkml.kernel.org/r/20250217014329.3610326-1-mawupeng1@huawei.com
Link: https://lkml.kernel.org/r/20250217014329.3610326-2-mawupeng1@huawei.com
Fixes: 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON")
Signed-off-by: Ma Wupeng <mawupeng1(a)huawei.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Ma Wupeng <mawupeng1(a)huawei.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Naoya Horiguchi <nao.horiguchi(a)gmail.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/internal.h | 5 ++-
mm/memory-failure.c | 63 ++++++++++++++++++++----------------------
mm/memory_hotplug.c | 3 +-
3 files changed, 36 insertions(+), 35 deletions(-)
--- a/mm/internal.h~mm-memory-failure-update-ttu-flag-inside-unmap_poisoned_folio
+++ a/mm/internal.h
@@ -1115,7 +1115,7 @@ static inline int find_next_best_node(in
* mm/memory-failure.c
*/
#ifdef CONFIG_MEMORY_FAILURE
-void unmap_poisoned_folio(struct folio *folio, enum ttu_flags ttu);
+int unmap_poisoned_folio(struct folio *folio, unsigned long pfn, bool must_kill);
void shake_folio(struct folio *folio);
extern int hwpoison_filter(struct page *p);
@@ -1138,8 +1138,9 @@ unsigned long page_mapped_in_vma(const s
struct vm_area_struct *vma);
#else
-static inline void unmap_poisoned_folio(struct folio *folio, enum ttu_flags ttu)
+static inline int unmap_poisoned_folio(struct folio *folio, unsigned long pfn, bool must_kill)
{
+ return -EBUSY;
}
#endif
--- a/mm/memory-failure.c~mm-memory-failure-update-ttu-flag-inside-unmap_poisoned_folio
+++ a/mm/memory-failure.c
@@ -1556,11 +1556,35 @@ static int get_hwpoison_page(struct page
return ret;
}
-void unmap_poisoned_folio(struct folio *folio, enum ttu_flags ttu)
+int unmap_poisoned_folio(struct folio *folio, unsigned long pfn, bool must_kill)
{
- if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) {
- struct address_space *mapping;
+ enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC | TTU_HWPOISON;
+ struct address_space *mapping;
+
+ if (folio_test_swapcache(folio)) {
+ pr_err("%#lx: keeping poisoned page in swap cache\n", pfn);
+ ttu &= ~TTU_HWPOISON;
+ }
+
+ /*
+ * Propagate the dirty bit from PTEs to struct page first, because we
+ * need this to decide if we should kill or just drop the page.
+ * XXX: the dirty test could be racy: set_page_dirty() may not always
+ * be called inside page lock (it's recommended but not enforced).
+ */
+ mapping = folio_mapping(folio);
+ if (!must_kill && !folio_test_dirty(folio) && mapping &&
+ mapping_can_writeback(mapping)) {
+ if (folio_mkclean(folio)) {
+ folio_set_dirty(folio);
+ } else {
+ ttu &= ~TTU_HWPOISON;
+ pr_info("%#lx: corrupted page was clean: dropped without side effects\n",
+ pfn);
+ }
+ }
+ if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) {
/*
* For hugetlb folios in shared mappings, try_to_unmap
* could potentially call huge_pmd_unshare. Because of
@@ -1572,7 +1596,7 @@ void unmap_poisoned_folio(struct folio *
if (!mapping) {
pr_info("%#lx: could not lock mapping for mapped hugetlb folio\n",
folio_pfn(folio));
- return;
+ return -EBUSY;
}
try_to_unmap(folio, ttu|TTU_RMAP_LOCKED);
@@ -1580,6 +1604,8 @@ void unmap_poisoned_folio(struct folio *
} else {
try_to_unmap(folio, ttu);
}
+
+ return folio_mapped(folio) ? -EBUSY : 0;
}
/*
@@ -1589,8 +1615,6 @@ void unmap_poisoned_folio(struct folio *
static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
unsigned long pfn, int flags)
{
- enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC | TTU_HWPOISON;
- struct address_space *mapping;
LIST_HEAD(tokill);
bool unmap_success;
int forcekill;
@@ -1613,29 +1637,6 @@ static bool hwpoison_user_mappings(struc
if (!folio_mapped(folio))
return true;
- if (folio_test_swapcache(folio)) {
- pr_err("%#lx: keeping poisoned page in swap cache\n", pfn);
- ttu &= ~TTU_HWPOISON;
- }
-
- /*
- * Propagate the dirty bit from PTEs to struct page first, because we
- * need this to decide if we should kill or just drop the page.
- * XXX: the dirty test could be racy: set_page_dirty() may not always
- * be called inside page lock (it's recommended but not enforced).
- */
- mapping = folio_mapping(folio);
- if (!(flags & MF_MUST_KILL) && !folio_test_dirty(folio) && mapping &&
- mapping_can_writeback(mapping)) {
- if (folio_mkclean(folio)) {
- folio_set_dirty(folio);
- } else {
- ttu &= ~TTU_HWPOISON;
- pr_info("%#lx: corrupted page was clean: dropped without side effects\n",
- pfn);
- }
- }
-
/*
* First collect all the processes that have the page
* mapped in dirty form. This has to be done before try_to_unmap,
@@ -1643,9 +1644,7 @@ static bool hwpoison_user_mappings(struc
*/
collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
- unmap_poisoned_folio(folio, ttu);
-
- unmap_success = !folio_mapped(folio);
+ unmap_success = !unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
if (!unmap_success)
pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n",
pfn, folio_mapcount(folio));
--- a/mm/memory_hotplug.c~mm-memory-failure-update-ttu-flag-inside-unmap_poisoned_folio
+++ a/mm/memory_hotplug.c
@@ -1833,7 +1833,8 @@ static void do_migrate_range(unsigned lo
if (WARN_ON(folio_test_lru(folio)))
folio_isolate_lru(folio);
if (folio_mapped(folio))
- unmap_poisoned_folio(folio, TTU_IGNORE_MLOCK);
+ unmap_poisoned_folio(folio, pfn, false);
+
continue;
}
_
Patches currently in -mm which might be from mawupeng1(a)huawei.com are
The quilt patch titled
Subject: arm: pgtable: fix NULL pointer dereference issue
has been removed from the -mm tree. Its filename was
arm-pgtable-fix-null-pointer-dereference-issue.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Qi Zheng <zhengqi.arch(a)bytedance.com>
Subject: arm: pgtable: fix NULL pointer dereference issue
Date: Mon, 17 Feb 2025 10:49:24 +0800
When update_mmu_cache_range() is called by update_mmu_cache(), the vmf
parameter is NULL, which will cause a NULL pointer dereference issue in
adjust_pte():
Unable to handle kernel NULL pointer dereference at virtual address 00000030 when read
Hardware name: Atmel AT91SAM9
PC is at update_mmu_cache_range+0x1e0/0x278
LR is at pte_offset_map_rw_nolock+0x18/0x2c
Call trace:
update_mmu_cache_range from remove_migration_pte+0x29c/0x2ec
remove_migration_pte from rmap_walk_file+0xcc/0x130
rmap_walk_file from remove_migration_ptes+0x90/0xa4
remove_migration_ptes from migrate_pages_batch+0x6d4/0x858
migrate_pages_batch from migrate_pages+0x188/0x488
migrate_pages from compact_zone+0x56c/0x954
compact_zone from compact_node+0x90/0xf0
compact_node from kcompactd+0x1d4/0x204
kcompactd from kthread+0x120/0x12c
kthread from ret_from_fork+0x14/0x38
Exception stack(0xc0d8bfb0 to 0xc0d8bff8)
To fix it, do not rely on whether 'ptl' is equal to vmf->ptl to decide
whether to take the pte lock; decide it by whether CONFIG_SPLIT_PTE_PTLOCKS
is enabled instead. In addition, if two VMAs map the same PTE page, there
is no need to take the pte lock again, otherwise a deadlock will occur.
Add a need_lock parameter so that adjust_pte() knows this.
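As a hedged illustration (a hypothetical helper, not part of the patch;
adjust_pte_need_lock() is a made-up name, using the same macros as the hunk
below), the locking decision in make_coherent() reduces to:
	/* Hypothetical helper capturing the rule applied in the hunk below. */
	static bool adjust_pte_need_lock(unsigned long addr, unsigned long mpnt_addr)
	{
		const unsigned long pmd_start_addr = ALIGN_DOWN(addr, PMD_SIZE);
		const unsigned long pmd_end_addr = pmd_start_addr + PMD_SIZE;

		if (!IS_ENABLED(CONFIG_SPLIT_PTE_PTLOCKS))
			return false;	/* shared mm->page_table_lock is already held */
		/* Same PTE page: its lock is already held, avoid deadlock. */
		return mpnt_addr < pmd_start_addr || mpnt_addr >= pmd_end_addr;
	}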
Link: https://lkml.kernel.org/r/20250217024924.57996-1-zhengqi.arch@bytedance.com
Fixes: fc9c45b71f43 ("arm: adjust_pte() use pte_offset_map_rw_nolock()")
Signed-off-by: Qi Zheng <zhengqi.arch(a)bytedance.com>
Reported-by: Ezra Buehler <ezra.buehler(a)husqvarnagroup.com>
Closes: https://lore.kernel.org/lkml/CAM1KZSmZ2T_riHvay+7cKEFxoPgeVpHkVFTzVVEQ1BO0c…
Acked-by: David Hildenbrand <david(a)redhat.com>
Tested-by: Ezra Buehler <ezra.buehler(a)husqvarnagroup.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: Russell King <linux(a)armlinux.org.uk>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/arm/mm/fault-armv.c | 37 +++++++++++++++++++++++++------------
1 file changed, 25 insertions(+), 12 deletions(-)
--- a/arch/arm/mm/fault-armv.c~arm-pgtable-fix-null-pointer-dereference-issue
+++ a/arch/arm/mm/fault-armv.c
@@ -62,7 +62,7 @@ static int do_adjust_pte(struct vm_area_
}
static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
- unsigned long pfn, struct vm_fault *vmf)
+ unsigned long pfn, bool need_lock)
{
spinlock_t *ptl;
pgd_t *pgd;
@@ -99,12 +99,11 @@ again:
if (!pte)
return 0;
- /*
- * If we are using split PTE locks, then we need to take the page
- * lock here. Otherwise we are using shared mm->page_table_lock
- * which is already locked, thus cannot take it.
- */
- if (ptl != vmf->ptl) {
+ if (need_lock) {
+ /*
+ * Use nested version here to indicate that we are already
+ * holding one similar spinlock.
+ */
spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
pte_unmap_unlock(pte, ptl);
@@ -114,7 +113,7 @@ again:
ret = do_adjust_pte(vma, address, pfn, pte);
- if (ptl != vmf->ptl)
+ if (need_lock)
spin_unlock(ptl);
pte_unmap(pte);
@@ -123,9 +122,10 @@ again:
static void
make_coherent(struct address_space *mapping, struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep, unsigned long pfn,
- struct vm_fault *vmf)
+ unsigned long addr, pte_t *ptep, unsigned long pfn)
{
+ const unsigned long pmd_start_addr = ALIGN_DOWN(addr, PMD_SIZE);
+ const unsigned long pmd_end_addr = pmd_start_addr + PMD_SIZE;
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *mpnt;
unsigned long offset;
@@ -142,6 +142,14 @@ make_coherent(struct address_space *mapp
flush_dcache_mmap_lock(mapping);
vma_interval_tree_foreach(mpnt, &mapping->i_mmap, pgoff, pgoff) {
/*
+ * If we are using split PTE locks, then we need to take the pte
+ * lock. Otherwise we are using shared mm->page_table_lock which
+ * is already locked, thus cannot take it.
+ */
+ bool need_lock = IS_ENABLED(CONFIG_SPLIT_PTE_PTLOCKS);
+ unsigned long mpnt_addr;
+
+ /*
* If this VMA is not in our MM, we can ignore it.
* Note that we intentionally mask out the VMA
* that we are fixing up.
@@ -151,7 +159,12 @@ make_coherent(struct address_space *mapp
if (!(mpnt->vm_flags & VM_MAYSHARE))
continue;
offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
- aliases += adjust_pte(mpnt, mpnt->vm_start + offset, pfn, vmf);
+ mpnt_addr = mpnt->vm_start + offset;
+
+ /* Avoid deadlocks by not grabbing the same PTE lock again. */
+ if (mpnt_addr >= pmd_start_addr && mpnt_addr < pmd_end_addr)
+ need_lock = false;
+ aliases += adjust_pte(mpnt, mpnt_addr, pfn, need_lock);
}
flush_dcache_mmap_unlock(mapping);
if (aliases)
@@ -194,7 +207,7 @@ void update_mmu_cache_range(struct vm_fa
__flush_dcache_folio(mapping, folio);
if (mapping) {
if (cache_is_vivt())
- make_coherent(mapping, vma, addr, ptep, pfn, vmf);
+ make_coherent(mapping, vma, addr, ptep, pfn);
else if (vma->vm_flags & VM_EXEC)
__flush_icache_all();
}
_
Patches currently in -mm which might be from zhengqi.arch(a)bytedance.com are
mm-pgtable-make-generic-tlb_remove_table-use-struct-ptdesc.patch
mm-pgtable-change-pt-parameter-of-tlb_remove_ptdesc-to-struct-ptdesc.patch
mm-pgtable-convert-some-architectures-to-use-tlb_remove_ptdesc.patch
mm-pgtable-convert-some-architectures-to-use-tlb_remove_ptdesc-v2.patch
riscv-pgtable-unconditionally-use-tlb_remove_ptdesc.patch
x86-pgtable-convert-to-use-tlb_remove_ptdesc.patch
mm-pgtable-remove-tlb_remove_page_ptdesc.patch
The quilt patch titled
Subject: m68k: sun3: add check for __pgd_alloc()
has been removed from the -mm tree. Its filename was
m68k-sun3-add-check-for-__pgd_alloc.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Haoxiang Li <haoxiang_li2024(a)163.com>
Subject: m68k: sun3: add check for __pgd_alloc()
Date: Tue, 18 Feb 2025 00:00:17 +0800
Add a check for the return value of __pgd_alloc() in pgd_alloc() to prevent
a NULL pointer dereference.
Link: https://lkml.kernel.org/r/20250217160017.2375536-1-haoxiang_li2024@163.com
Fixes: a9b3c355c2e6 ("asm-generic: pgalloc: provide generic __pgd_{alloc,free}")
Signed-off-by: Haoxiang Li <haoxiang_li2024(a)163.com>
Reviewed-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
Acked-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Kevin Brodsky <kevin.brodsky(a)arm.com>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: Sam Creasey <sammy(a)sammy.net>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/m68k/include/asm/sun3_pgalloc.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- a/arch/m68k/include/asm/sun3_pgalloc.h~m68k-sun3-add-check-for-__pgd_alloc
+++ a/arch/m68k/include/asm/sun3_pgalloc.h
@@ -44,8 +44,10 @@ static inline pgd_t * pgd_alloc(struct m
pgd_t *new_pgd;
new_pgd = __pgd_alloc(mm, 0);
- memcpy(new_pgd, swapper_pg_dir, PAGE_SIZE);
- memset(new_pgd, 0, (PAGE_OFFSET >> PGDIR_SHIFT));
+ if (likely(new_pgd != NULL)) {
+ memcpy(new_pgd, swapper_pg_dir, PAGE_SIZE);
+ memset(new_pgd, 0, (PAGE_OFFSET >> PGDIR_SHIFT));
+ }
return new_pgd;
}
_
Patches currently in -mm which might be from haoxiang_li2024(a)163.com are
The quilt patch titled
Subject: selftests/damon/damos_quota_goal: handle minimum quota that cannot be further reduced
has been removed from the -mm tree. Its filename was
selftests-damon-damos_quota_goal-handle-minimum-quota-that-cannot-be-further-reduced.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: SeongJae Park <sj(a)kernel.org>
Subject: selftests/damon/damos_quota_goal: handle minimum quota that cannot be further reduced
Date: Mon, 17 Feb 2025 10:23:04 -0800
The damos_quota_goal.py selftest checks whether the DAMOS quota goals
tuning feature increases or reduces the effective size quota for a given
score as expected. The tuning feature sets the minimum quota size to one
byte, so if the effective size quota is already one byte, we cannot expect
it to be reduced further. However, the test is not aware of this edge
case, and fails since it sees no change to the effective quota. Handle the
case by updating the no-change failure logic to detect this situation and
simply skip to the next test input.
Link: https://lkml.kernel.org/r/20250217182304.45215-1-sj@kernel.org
Fixes: f1c07c0a1662 ("selftests/damon: add a test for DAMOS quota goal")
Signed-off-by: SeongJae Park <sj(a)kernel.org>
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Closes: https://lore.kernel.org/oe-lkp/202502171423.b28a918d-lkp@intel.com
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org> [6.10.x]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
tools/testing/selftests/damon/damos_quota_goal.py | 3 +++
1 file changed, 3 insertions(+)
--- a/tools/testing/selftests/damon/damos_quota_goal.py~selftests-damon-damos_quota_goal-handle-minimum-quota-that-cannot-be-further-reduced
+++ a/tools/testing/selftests/damon/damos_quota_goal.py
@@ -63,6 +63,9 @@ def main():
if last_effective_bytes != 0 else -1.0))
if last_effective_bytes == goal.effective_bytes:
+ # effective quota was already minimum that cannot be more reduced
+ if expect_increase is False and last_effective_bytes == 1:
+ continue
print('efective bytes not changed: %d' % goal.effective_bytes)
exit(1)
_
Patches currently in -mm which might be from sj(a)kernel.org are
mm-damon-respect-core-layer-filters-allowance-decision-on-ops-layer.patch
mm-damon-core-initialize-damos-walk_completed-in-damon_new_scheme.patch
mm-madvise-split-out-mmap-locking-operations-for-madvise.patch
mm-madvise-split-out-madvise-input-validity-check.patch
mm-madvise-split-out-madvise-behavior-execution.patch
mm-madvise-remove-redundant-mmap_lock-operations-from-process_madvise.patch
mm-damon-avoid-applying-damos-action-to-same-entity-multiple-times.patch
mm-damon-core-unset-damos-walk_completed-after-confimed-set.patch
mm-damon-core-do-not-call-damos_walk_control-walk-if-walk-is-completed.patch
mm-damon-core-do-damos-walking-in-entire-regions-granularity.patch
mm-damon-introduce-damos-filter-type-hugepage_size-fix.patch
docs-mm-damon-design-fix-typo-on-damos-filters-usage-doc-link.patch
docs-mm-damon-design-document-hugepage_size-filter.patch
docs-damon-move-damos-filter-type-names-and-meaning-to-design-doc.patch
docs-mm-damon-design-clarify-handling-layer-based-filters-evaluation-sequence.patch
docs-mm-damon-design-categorize-damos-filter-types-based-on-handling-layer.patch
mm-damon-implement-a-new-damos-filter-type-for-unmapped-pages.patch
docs-mm-damon-design-document-unmapped-damos-filter-type.patch
mm-damon-add-data-structure-for-monitoring-intervals-auto-tuning.patch
mm-damon-core-implement-intervals-auto-tuning.patch
mm-damon-sysfs-implement-intervals-tuning-goal-directory.patch
mm-damon-sysfs-commit-intervals-tuning-goal.patch
mm-damon-sysfs-implement-a-command-to-update-auto-tuned-monitoring-intervals.patch
docs-mm-damon-design-document-for-intervals-auto-tuning.patch
docs-mm-damon-design-document-for-intervals-auto-tuning-fix.patch
docs-abi-damon-document-intervals-auto-tuning-abi.patch
docs-admin-guide-mm-damon-usage-add-intervals_goal-directory-on-the-hierarchy.patch
mm-damon-core-introduce-damos-ops_filters.patch
mm-damon-paddr-support-ops_filters.patch
mm-damon-core-support-committing-ops_filters.patch
mm-damon-core-put-ops-handled-filters-to-damos-ops_filters.patch
mm-damon-paddr-support-only-damos-ops_filters.patch
mm-damon-add-default-allow-reject-behavior-fields-to-struct-damos.patch
mm-damon-core-set-damos_filter-default-allowance-behavior-based-on-installed-filters.patch
mm-damon-paddr-respect-ops_filters_default_reject.patch
docs-mm-damon-design-update-for-changed-filter-default-behavior.patch
mm-damon-sysfs-schemes-let-damon_sysfs_scheme_set_filters-be-used-for-different-named-directories.patch
mm-damon-sysfs-schemes-implement-core_filters-and-ops_filters-directories.patch
mm-damon-sysfs-schemes-commit-filters-in-coreops_filters-directories.patch
mm-damon-core-expose-damos_filter_for_ops-to-damon-kernel-api-callers.patch
mm-damon-sysfs-schemes-record-filters-of-which-layer-should-be-added-to-the-given-filters-directory.patch
mm-damon-sysfs-schemes-return-error-when-for-attempts-to-install-filters-on-wrong-sysfs-directory.patch
docs-abi-damon-document-coreops_filters-directories.patch
docs-admin-guide-mm-damon-usage-update-for-coreops_filters-directories.patch
Currently on stable trees we have support for netmem/devmem RX but not
TX. It is not safe to forward/redirect an RX unreadable netmem packet
into the device's TX path, as the device may call dma-mapping APIs on
dma addrs that should not be passed to it.
Fix this by preventing the xmit of unreadable skbs.
Tested by configuring tc redirect:
sudo tc qdisc add dev eth1 ingress
sudo tc filter add dev eth1 ingress protocol ip prio 1 flower ip_proto \
tcp src_ip 192.168.1.12 action mirred egress redirect dev eth1
Before, I see unreadable skbs in the driver's TX path passed to dma
mapping APIs.
After, I don't see unreadable skbs in the driver's TX path passed to dma
mapping APIs.
Fixes: 65249feb6b3d ("net: add support for skbs with unreadable frags")
Suggested-by: Jakub Kicinski <kuba(a)kernel.org>
Cc: stable(a)vger.kernel.org
Signed-off-by: Mina Almasry <almasrymina(a)google.com>
---
net/core/dev.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/core/dev.c b/net/core/dev.c
index 30da277c5a6f..63b31afacf84 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3914,6 +3914,9 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device
skb = validate_xmit_xfrm(skb, features, again);
+ if (!skb_frags_readable(skb))
+ goto out_kfree_skb;
+
return skb;
out_kfree_skb:
base-commit: 3c6a041b317a9bb0c707343c0b99d2a29d523390
--
2.48.1.711.g2feabab25a-goog
According to [1], `NonNull<T>` and `#[repr(transparent)]` wrapper types
such as our custom `KBox<T>` have the null pointer optimization only if
`T: Sized`. Thus remove the `Zeroable` implementation for the unsized
case.
Link: https://doc.rust-lang.org/stable/std/option/index.html#representation [1]
Cc: stable(a)vger.kernel.org # v6.12+ (a custom patch will be needed for 6.6.y)
Fixes: 38cde0bd7b67 ("rust: init: add `Zeroable` trait and `init::zeroed` function")
Signed-off-by: Benno Lossin <benno.lossin(a)proton.me>
---
rust/kernel/init.rs | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/rust/kernel/init.rs b/rust/kernel/init.rs
index 7fd1ea8265a5..8bbd5e3398fc 100644
--- a/rust/kernel/init.rs
+++ b/rust/kernel/init.rs
@@ -1418,17 +1418,14 @@ macro_rules! impl_zeroable {
// SAFETY: `T: Zeroable` and `UnsafeCell` is `repr(transparent)`.
{<T: ?Sized + Zeroable>} UnsafeCell<T>,
- // SAFETY: All zeros is equivalent to `None` (option layout optimization guarantee).
+ // SAFETY: All zeros is equivalent to `None` (option layout optimization guarantee:
+ // https://doc.rust-lang.org/stable/std/option/index.html#representation).
Option<NonZeroU8>, Option<NonZeroU16>, Option<NonZeroU32>, Option<NonZeroU64>,
Option<NonZeroU128>, Option<NonZeroUsize>,
Option<NonZeroI8>, Option<NonZeroI16>, Option<NonZeroI32>, Option<NonZeroI64>,
Option<NonZeroI128>, Option<NonZeroIsize>,
-
- // SAFETY: All zeros is equivalent to `None` (option layout optimization guarantee).
- //
- // In this case we are allowed to use `T: ?Sized`, since all zeros is the `None` variant.
- {<T: ?Sized>} Option<NonNull<T>>,
- {<T: ?Sized>} Option<KBox<T>>,
+ {<T>} Option<NonNull<T>>,
+ {<T>} Option<KBox<T>>,
// SAFETY: `null` pointer is valid.
//
base-commit: 7eb172143d5508b4da468ed59ee857c6e5e01da6
--
2.48.1
In commit 392e34b6bc22 ("kbuild: rust: remove the `alloc` crate and
`GlobalAlloc`") we stopped using the upstream `alloc` crate.
Thus remove a few leftover mentions treewide.
Cc: stable(a)vger.kernel.org # Also to 6.12.y after the `alloc` backport lands
Fixes: 392e34b6bc22 ("kbuild: rust: remove the `alloc` crate and `GlobalAlloc`")
Signed-off-by: Miguel Ojeda <ojeda(a)kernel.org>
---
Documentation/rust/quick-start.rst | 2 +-
rust/kernel/lib.rs | 2 +-
scripts/rustdoc_test_gen.rs | 4 ++--
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/Documentation/rust/quick-start.rst b/Documentation/rust/quick-start.rst
index 4aa50e5fcb8c..6d2607870ba4 100644
--- a/Documentation/rust/quick-start.rst
+++ b/Documentation/rust/quick-start.rst
@@ -145,7 +145,7 @@ Rust standard library source
****************************
The Rust standard library source is required because the build system will
-cross-compile ``core`` and ``alloc``.
+cross-compile ``core``.
If ``rustup`` is being used, run::
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index 398242f92a96..7697c60b2d1a 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -6,7 +6,7 @@
//! usage by Rust code in the kernel and is shared by all of them.
//!
//! In other words, all the rest of the Rust code in the kernel (e.g. kernel
-//! modules written in Rust) depends on [`core`], [`alloc`] and this crate.
+//! modules written in Rust) depends on [`core`] and this crate.
//!
//! If you need a kernel C API that is not ported or wrapped yet here, then
//! do so first instead of bypassing this crate.
diff --git a/scripts/rustdoc_test_gen.rs b/scripts/rustdoc_test_gen.rs
index 5ebd42ae4a3f..76aaa8329413 100644
--- a/scripts/rustdoc_test_gen.rs
+++ b/scripts/rustdoc_test_gen.rs
@@ -15,8 +15,8 @@
//! - Test code should be able to define functions and call them, without having to carry
//! the context.
//!
-//! - Later on, we may want to be able to test non-kernel code (e.g. `core`, `alloc` or
-//! third-party crates) which likely use the standard library `assert*!` macros.
+//! - Later on, we may want to be able to test non-kernel code (e.g. `core` or third-party
+//! crates) which likely use the standard library `assert*!` macros.
//!
//! For this reason, instead of the passed context, `kunit_get_current_test()` is used instead
//! (i.e. `current->kunit_test`).
base-commit: 7eb172143d5508b4da468ed59ee857c6e5e01da6
--
2.48.1
The patch titled
Subject: mm/migrate: fix shmem xarray update during migration
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-migrate-fix-shmem-xarray-update-during-migration.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Zi Yan <ziy(a)nvidia.com>
Subject: mm/migrate: fix shmem xarray update during migration
Date: Wed, 5 Mar 2025 15:04:03 -0500
A shmem folio can be either in page cache or in swap cache, but not at the
same time. Namely, once it is in swap cache, folio->mapping should be
NULL, and the folio is no longer in a shmem mapping.
In __folio_migrate_mapping(), folio_test_swapbacked() is used to determine
the number of xarray entries to update, but that conflates the
shmem-in-page-cache case with the shmem-in-swap-cache case. It leads to
xarray multi-index entry corruption, since it turns a sibling entry into a
normal entry during xas_store() (see [1] for a userspace reproduction).
Fix it by using only folio_test_swapcache() to determine whether the
xarray is storing swap cache entries, and hence the right number of xarray
entries to update.
[1] https://lore.kernel.org/linux-mm/Z8idPCkaJW1IChjT@casper.infradead.org/
Note:
In __split_huge_page(), folio_test_anon() && folio_test_swapcache() is
used to get the swap_cache address space, but that ignores the
shmem-in-swap-cache case. It could lead to a NULL pointer dereference when
an in-swap-cache shmem folio is split at __xa_store(), since
!folio_test_anon() is true and folio->mapping is NULL. But fortunately,
its caller split_huge_page_to_list_to_order() bails out early with -EBUSY
when folio->mapping is NULL. So there is no need to take care of it here.
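As a hedged illustration (a hypothetical helper, not code from the patch;
nr_xas_entries() is a made-up name), the corrected rule for how many xarray
entries to update is:
	/*
	 * Shmem folios are swap-backed even while still in the page cache,
	 * so only the swap-cache test says whether the xarray holds nr
	 * separate swap-cache entries or one multi-index page-cache entry.
	 */
	static long nr_xas_entries(struct folio *folio, long nr)
	{
		return folio_test_swapcache(folio) ? nr : 1;
	}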
Link: https://lkml.kernel.org/r/20250305200403.2822855-1-ziy@nvidia.com
Fixes: fc346d0a70a1 ("mm: migrate high-order folios in swap cache correctly")
Reported-by: Liu Shixin <liushixin2(a)huawei.com>
Closes: https://lore.kernel.org/all/28546fb4-5210-bf75-16d6-43e1f8646080@huawei.com/
Suggested-by: Hugh Dickins <hughd(a)google.com>
Signed-off-by: Zi Yan <ziy(a)nvidia.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Barry Song <baohua(a)kernel.org>
Cc: Charan Teja Kalla <quic_charante(a)quicinc.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Lance Yang <ioworker0(a)gmail.com>
Cc: Matthew Wilcow (Oracle) <willy(a)infradead.org>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/migrate.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
--- a/mm/migrate.c~mm-migrate-fix-shmem-xarray-update-during-migration
+++ a/mm/migrate.c
@@ -518,15 +518,13 @@ static int __folio_migrate_mapping(struc
if (folio_test_anon(folio) && folio_test_large(folio))
mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON, 1);
folio_ref_add(newfolio, nr); /* add cache reference */
- if (folio_test_swapbacked(folio)) {
+ if (folio_test_swapbacked(folio))
__folio_set_swapbacked(newfolio);
- if (folio_test_swapcache(folio)) {
- folio_set_swapcache(newfolio);
- newfolio->private = folio_get_private(folio);
- }
+ if (folio_test_swapcache(folio)) {
+ folio_set_swapcache(newfolio);
+ newfolio->private = folio_get_private(folio);
entries = nr;
} else {
- VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
entries = 1;
}
_
Patches currently in -mm which might be from ziy(a)nvidia.com are
mm-migrate-fix-shmem-xarray-update-during-migration.patch
selftests-mm-make-file-backed-thp-split-work-by-writing-pmd-size-data.patch
mm-huge_memory-allow-split-shmem-large-folio-to-any-lower-order.patch
selftests-mm-test-splitting-file-backed-thp-to-any-lower-order.patch