August 2023 - Linux-stable-mirror

[merged mm-hotfixes-stable] hugetlb-do-not-clear-hugetlb-dtor-until-allocating-vmemmap.patch removed from -mm tree

by Andrew Morton

The quilt patch titled Subject: hugetlb: do not clear hugetlb dtor until allocating vmemmap has been removed from the -mm tree. Its filename was hugetlb-do-not-clear-hugetlb-dtor-until-allocating-vmemmap.patch This patch was dropped because it was merged into the mm-hotfixes-stable branch of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm ------------------------------------------------------ From: Mike Kravetz <mike.kravetz(a)oracle.com> Subject: hugetlb: do not clear hugetlb dtor until allocating vmemmap Date: Tue, 11 Jul 2023 15:09:41 -0700 Patch series "Fix hugetlb free path race with memory errors". In the discussion of Jiaqi Yan's series "Improve hugetlbfs read on HWPOISON hugepages" the race window was discovered. https://lore.kernel.org/linux-mm/20230616233447.GB7371@monkey/ Freeing a hugetlb page back to low level memory allocators is performed in two steps. 1) Under hugetlb lock, remove page from hugetlb lists and clear destructor 2) Outside lock, allocate vmemmap if necessary and call low level free Between these two steps, the hugetlb page will appear as a normal compound page. However, vmemmap for tail pages could be missing. If a memory error occurs at this time, we could try to update page flags non-existant page structs. A much more detailed description is in the first patch. The first patch addresses the race window. However, it adds a hugetlb_lock lock/unlock cycle to every vmemmap optimized hugetlb page free operation. This could lead to slowdowns if one is freeing a large number of hugetlb pages. The second path optimizes the update_and_free_pages_bulk routine to only take the lock once in bulk operations. The second patch is technically not a bug fix, but includes a Fixes tag and Cc stable to avoid a performance regression. It can be combined with the first, but was done separately make reviewing easier. This patch (of 2): Freeing a hugetlb page and releasing base pages back to the underlying allocator such as buddy or cma is performed in two steps: - remove_hugetlb_folio() is called to remove the folio from hugetlb lists, get a ref on the page and remove hugetlb destructor. This all must be done under the hugetlb lock. After this call, the page can be treated as a normal compound page or a collection of base size pages. - update_and_free_hugetlb_folio() is called to allocate vmemmap if needed and the free routine of the underlying allocator is called on the resulting page. We can not hold the hugetlb lock here. One issue with this scheme is that a memory error could occur between these two steps. In this case, the memory error handling code treats the old hugetlb page as a normal compound page or collection of base pages. It will then try to SetPageHWPoison(page) on the page with an error. If the page with error is a tail page without vmemmap, a write error will occur when trying to set the flag. Address this issue by modifying remove_hugetlb_folio() and update_and_free_hugetlb_folio() such that the hugetlb destructor is not cleared until after allocating vmemmap. Since clearing the destructor requires holding the hugetlb lock, the clearing is done in remove_hugetlb_folio() if the vmemmap is present. This saves a lock/unlock cycle. Otherwise, destructor is cleared in update_and_free_hugetlb_folio() after allocating vmemmap. Note that this will leave hugetlb pages in a state where they are marked free (by hugetlb specific page flag) and have a ref count. This is not a normal state. The only code that would notice is the memory error code, and it is set up to retry in such a case. A subsequent patch will create a routine to do bulk processing of vmemmap allocation. This will eliminate a lock/unlock cycle for each hugetlb page in the case where we are freeing a large number of pages. Link: https://lkml.kernel.org/r/20230711220942.43706-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20230711220942.43706-2-mike.kravetz@oracle.com Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page") Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com> Reviewed-by: Muchun Song <songmuchun(a)bytedance.com> Tested-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com> Cc: Axel Rasmussen <axelrasmussen(a)google.com> Cc: James Houghton <jthoughton(a)google.com> Cc: Jiaqi Yan <jiaqiyan(a)google.com> Cc: Miaohe Lin <linmiaohe(a)huawei.com> Cc: Michal Hocko <mhocko(a)suse.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/hugetlb.c | 75 +++++++++++++++++++++++++++++++++---------------- 1 file changed, 51 insertions(+), 24 deletions(-) --- a/mm/hugetlb.c~hugetlb-do-not-clear-hugetlb-dtor-until-allocating-vmemmap +++ a/mm/hugetlb.c @@ -1579,9 +1579,37 @@ static inline void destroy_compound_giga unsigned int order) { } #endif +static inline void __clear_hugetlb_destructor(struct hstate *h, + struct folio *folio) +{ + lockdep_assert_held(&hugetlb_lock); + + /* + * Very subtle + * + * For non-gigantic pages set the destructor to the normal compound + * page dtor. This is needed in case someone takes an additional + * temporary ref to the page, and freeing is delayed until they drop + * their reference. + * + * For gigantic pages set the destructor to the null dtor. This + * destructor will never be called. Before freeing the gigantic + * page destroy_compound_gigantic_folio will turn the folio into a + * simple group of pages. After this the destructor does not + * apply. + * + */ + if (hstate_is_gigantic(h)) + folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); + else + folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); +} + /* - * Remove hugetlb folio from lists, and update dtor so that the folio appears - * as just a compound page. + * Remove hugetlb folio from lists. + * If vmemmap exists for the folio, update dtor so that the folio appears + * as just a compound page. Otherwise, wait until after allocating vmemmap + * to update dtor. * * A reference is held on the folio, except in the case of demote. * @@ -1612,31 +1640,19 @@ static void __remove_hugetlb_folio(struc } /* - * Very subtle - * - * For non-gigantic pages set the destructor to the normal compound - * page dtor. This is needed in case someone takes an additional - * temporary ref to the page, and freeing is delayed until they drop - * their reference. - * - * For gigantic pages set the destructor to the null dtor. This - * destructor will never be called. Before freeing the gigantic - * page destroy_compound_gigantic_folio will turn the folio into a - * simple group of pages. After this the destructor does not - * apply. - * - * This handles the case where more than one ref is held when and - * after update_and_free_hugetlb_folio is called. - * - * In the case of demote we do not ref count the page as it will soon - * be turned into a page of smaller size. + * We can only clear the hugetlb destructor after allocating vmemmap + * pages. Otherwise, someone (memory error handling) may try to write + * to tail struct pages. + */ + if (!folio_test_hugetlb_vmemmap_optimized(folio)) + __clear_hugetlb_destructor(h, folio); + + /* + * In the case of demote we do not ref count the page as it will soon + * be turned into a page of smaller size. */ if (!demote) folio_ref_unfreeze(folio, 1); - if (hstate_is_gigantic(h)) - folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); - else - folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); h->nr_huge_pages--; h->nr_huge_pages_node[nid]--; @@ -1705,6 +1721,7 @@ static void __update_and_free_hugetlb_fo { int i; struct page *subpage; + bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio); if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) return; @@ -1735,6 +1752,16 @@ static void __update_and_free_hugetlb_fo if (unlikely(folio_test_hwpoison(folio))) folio_clear_hugetlb_hwpoison(folio); + /* + * If vmemmap pages were allocated above, then we need to clear the + * hugetlb destructor under the hugetlb lock. + */ + if (clear_dtor) { + spin_lock_irq(&hugetlb_lock); + __clear_hugetlb_destructor(h, folio); + spin_unlock_irq(&hugetlb_lock); + } + for (i = 0; i < pages_per_huge_page(h); i++) { subpage = folio_page(folio, i); subpage->flags &= ~(1 << PG_locked | 1 << PG_error | _ Patches currently in -mm which might be from mike.kravetz(a)oracle.com are

2 years, 4 months

1
0
0 0

[merged mm-hotfixes-stable] mm-memory-failure-avoid-false-hwpoison-page-mapped-error-info.patch removed from -mm tree

by Andrew Morton

The quilt patch titled Subject: mm: memory-failure: avoid false hwpoison page mapped error info has been removed from the -mm tree. Its filename was mm-memory-failure-avoid-false-hwpoison-page-mapped-error-info.patch This patch was dropped because it was merged into the mm-hotfixes-stable branch of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm ------------------------------------------------------ From: Miaohe Lin <linmiaohe(a)huawei.com> Subject: mm: memory-failure: avoid false hwpoison page mapped error info Date: Thu, 27 Jul 2023 19:56:42 +0800 folio->_mapcount is overloaded in SLAB, so folio_mapped() has to be done after folio_test_slab() is checked. Otherwise slab folio might be treated as a mapped folio leading to false 'Someone maps the hwpoison page' error info. Link: https://lkml.kernel.org/r/20230727115643.639741-4-linmiaohe@huawei.com Fixes: 230ac719c500 ("mm/hwpoison: don't try to unpoison containment-failed pages") Signed-off-by: Miaohe Lin <linmiaohe(a)huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy(a)infradead.org> Acked-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com> Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/memory-failure.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) --- a/mm/memory-failure.c~mm-memory-failure-avoid-false-hwpoison-page-mapped-error-info +++ a/mm/memory-failure.c @@ -2499,6 +2499,13 @@ int unpoison_memory(unsigned long pfn) goto unlock_mutex; } + if (folio_test_slab(folio) || PageTable(&folio->page) || folio_test_reserved(folio)) + goto unlock_mutex; + + /* + * Note that folio->_mapcount is overloaded in SLAB, so the simple test + * in folio_mapped() has to be done after folio_test_slab() is checked. + */ if (folio_mapped(folio)) { unpoison_pr_info("Unpoison: Someone maps the hwpoison page %#lx\n", pfn, &unpoison_rs); @@ -2511,9 +2518,6 @@ int unpoison_memory(unsigned long pfn) goto unlock_mutex; } - if (folio_test_slab(folio) || PageTable(&folio->page) || folio_test_reserved(folio)) - goto unlock_mutex; - ghp = get_hwpoison_page(p, MF_UNPOISON); if (!ghp) { if (PageHuge(p)) { _ Patches currently in -mm which might be from linmiaohe(a)huawei.com are mm-mm_initc-update-obsolete-comment-in-get_pfn_range_for_nid.patch mm-memory-failure-fix-unexpected-return-value-in-soft_offline_page.patch mm-memory-failure-fix-potential-page-refcnt-leak-in-memory_failure.patch mm-memory-failure-remove-unneeded-page-state-check-in-shake_page.patch memory-tier-use-helper-function-destroy_memory_type.patch mm-memory-failure-remove-unneeded-inline-annotation.patch mm-mm_initc-remove-obsolete-macro-hash_small.patch mm-page_alloc-avoid-false-page-outside-zone-error-info.patch memory-tier-rename-destroy_memory_type-to-put_memory_type.patch mm-remove-obsolete-comment-above-struct-per_cpu_pages.patch mm-memcg-minor-cleanup-for-mem_cgroup_id_max.patch mm-memory-failure-remove-unneeded-pagehuge-check.patch mm-memory-failure-ensure-moving-hwpoison-flag-to-the-raw-error-pages.patch mm-memory-failure-dont-account-hwpoison_filter-filtered-pages.patch mm-memory-failure-use-local-variable-huge-to-check-hugetlb-page.patch mm-memory-failure-remove-unneeded-header-files.patch mm-memory-failure-minor-cleanup-for-comments-and-codestyle.patch mm-memory-failure-fetch-compound-head-after-extra-page-refcnt-is-held.patch mm-memory-failure-fix-race-window-when-trying-to-get-hugetlb-folio.patch mm-huge_memory-use-rmap_none-when-calling-page_add_anon_rmap.patch mm-memcg-fix-obsolete-comment-above-mem_cgroup_max_reclaim_loops.patch mm-memcg-minor-cleanup-for-mc_handle_present_pte.patch memory-tier-use-helper-macro-__attr_rw.patch mm-fix-obsolete-function-name-above-debug_pagealloc_enabled_static.patch mm-mprotect-fix-obsolete-function-name-in-change_pte_range.patch mm-memcg-fix-obsolete-function-name-in-mem_cgroup_protection.patch mm-memory-failure-add-pageoffline-check.patch mm-page_alloc-avoid-unneeded-alike_pages-calculation.patch mm-memcg-update-obsolete-comment-above-parent_mem_cgroup.patch mm-page_alloc-remove-unneeded-variable-base.patch mm-memcg-fix-wrong-function-name-above-obj_cgroup_charge_zswap.patch

2 years, 4 months

1
0
0 0

[merged mm-hotfixes-stable] mm-memory-failure-fix-potential-unexpected-return-value-from-unpoison_memory.patch removed from -mm tree

by Andrew Morton

The quilt patch titled Subject: mm: memory-failure: fix potential unexpected return value from unpoison_memory() has been removed from the -mm tree. Its filename was mm-memory-failure-fix-potential-unexpected-return-value-from-unpoison_memory.patch This patch was dropped because it was merged into the mm-hotfixes-stable branch of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm ------------------------------------------------------ From: Miaohe Lin <linmiaohe(a)huawei.com> Subject: mm: memory-failure: fix potential unexpected return value from unpoison_memory() Date: Thu, 27 Jul 2023 19:56:41 +0800 If unpoison_memory() fails to clear page hwpoisoned flag, return value ret is expected to be -EBUSY. But when get_hwpoison_page() returns 1 and fails to clear page hwpoisoned flag due to races, return value will be unexpected 1 leading to users being confused. And there's a code smell that the variable "ret" is used not only to save the return value of unpoison_memory(), but also the return value from get_hwpoison_page(). Make a further cleanup by using another auto-variable solely to save the return value of get_hwpoison_page() as suggested by Naoya. Link: https://lkml.kernel.org/r/20230727115643.639741-3-linmiaohe@huawei.com Fixes: bf181c582588 ("mm/hwpoison: fix unpoison_memory()") Signed-off-by: Miaohe Lin <linmiaohe(a)huawei.com> Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com> Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi(a)nec.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/memory-failure.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) --- a/mm/memory-failure.c~mm-memory-failure-fix-potential-unexpected-return-value-from-unpoison_memory +++ a/mm/memory-failure.c @@ -2466,7 +2466,7 @@ int unpoison_memory(unsigned long pfn) { struct folio *folio; struct page *p; - int ret = -EBUSY; + int ret = -EBUSY, ghp; unsigned long count = 1; bool huge = false; static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL, @@ -2514,29 +2514,28 @@ int unpoison_memory(unsigned long pfn) if (folio_test_slab(folio) || PageTable(&folio->page) || folio_test_reserved(folio)) goto unlock_mutex; - ret = get_hwpoison_page(p, MF_UNPOISON); - if (!ret) { + ghp = get_hwpoison_page(p, MF_UNPOISON); + if (!ghp) { if (PageHuge(p)) { huge = true; count = folio_free_raw_hwp(folio, false); - if (count == 0) { - ret = -EBUSY; + if (count == 0) goto unlock_mutex; - } } ret = folio_test_clear_hwpoison(folio) ? 0 : -EBUSY; - } else if (ret < 0) { - if (ret == -EHWPOISON) { + } else if (ghp < 0) { + if (ghp == -EHWPOISON) { ret = put_page_back_buddy(p) ? 0 : -EBUSY; - } else + } else { + ret = ghp; unpoison_pr_info("Unpoison: failed to grab page %#lx\n", pfn, &unpoison_rs); + } } else { if (PageHuge(p)) { huge = true; count = folio_free_raw_hwp(folio, false); if (count == 0) { - ret = -EBUSY; folio_put(folio); goto unlock_mutex; } _ Patches currently in -mm which might be from linmiaohe(a)huawei.com are mm-mm_initc-update-obsolete-comment-in-get_pfn_range_for_nid.patch mm-memory-failure-fix-unexpected-return-value-in-soft_offline_page.patch mm-memory-failure-fix-potential-page-refcnt-leak-in-memory_failure.patch mm-memory-failure-remove-unneeded-page-state-check-in-shake_page.patch memory-tier-use-helper-function-destroy_memory_type.patch mm-memory-failure-remove-unneeded-inline-annotation.patch mm-mm_initc-remove-obsolete-macro-hash_small.patch mm-page_alloc-avoid-false-page-outside-zone-error-info.patch memory-tier-rename-destroy_memory_type-to-put_memory_type.patch mm-remove-obsolete-comment-above-struct-per_cpu_pages.patch mm-memcg-minor-cleanup-for-mem_cgroup_id_max.patch mm-memory-failure-remove-unneeded-pagehuge-check.patch mm-memory-failure-ensure-moving-hwpoison-flag-to-the-raw-error-pages.patch mm-memory-failure-dont-account-hwpoison_filter-filtered-pages.patch mm-memory-failure-use-local-variable-huge-to-check-hugetlb-page.patch mm-memory-failure-remove-unneeded-header-files.patch mm-memory-failure-minor-cleanup-for-comments-and-codestyle.patch mm-memory-failure-fetch-compound-head-after-extra-page-refcnt-is-held.patch mm-memory-failure-fix-race-window-when-trying-to-get-hugetlb-folio.patch mm-huge_memory-use-rmap_none-when-calling-page_add_anon_rmap.patch mm-memcg-fix-obsolete-comment-above-mem_cgroup_max_reclaim_loops.patch mm-memcg-minor-cleanup-for-mc_handle_present_pte.patch memory-tier-use-helper-macro-__attr_rw.patch mm-fix-obsolete-function-name-above-debug_pagealloc_enabled_static.patch mm-mprotect-fix-obsolete-function-name-in-change_pte_range.patch mm-memcg-fix-obsolete-function-name-in-mem_cgroup_protection.patch mm-memory-failure-add-pageoffline-check.patch mm-page_alloc-avoid-unneeded-alike_pages-calculation.patch mm-memcg-update-obsolete-comment-above-parent_mem_cgroup.patch mm-page_alloc-remove-unneeded-variable-base.patch mm-memcg-fix-wrong-function-name-above-obj_cgroup_charge_zswap.patch

2 years, 4 months

1
0
0 0

[merged mm-hotfixes-stable] mm-swapfile-fix-wrong-swap-entry-type-for-hwpoisoned-swapcache-page.patch removed from -mm tree

by Andrew Morton

The quilt patch titled Subject: mm/swapfile: fix wrong swap entry type for hwpoisoned swapcache page has been removed from the -mm tree. Its filename was mm-swapfile-fix-wrong-swap-entry-type-for-hwpoisoned-swapcache-page.patch This patch was dropped because it was merged into the mm-hotfixes-stable branch of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm ------------------------------------------------------ From: Miaohe Lin <linmiaohe(a)huawei.com> Subject: mm/swapfile: fix wrong swap entry type for hwpoisoned swapcache page Date: Thu, 27 Jul 2023 19:56:40 +0800 Patch series "A few fixup patches for mm", v2. This series contains a few fixup patches to fix potential unexpected return value, fix wrong swap entry type for hwpoisoned swapcache page and so on. More details can be found in the respective changelogs. This patch (of 3): Hwpoisoned dirty swap cache page is kept in the swap cache and there's simple interception code in do_swap_page() to catch it. But when trying to swapoff, unuse_pte() will wrongly install a general sense of "future accesses are invalid" swap entry for hwpoisoned swap cache page due to unaware of such type of page. The user will receive SIGBUS signal without expected BUS_MCEERR_AR payload. BTW, typo 'hwposioned' is fixed. Link: https://lkml.kernel.org/r/20230727115643.639741-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20230727115643.639741-2-linmiaohe@huawei.com Fixes: 6b970599e807 ("mm: hwpoison: support recovery from ksm_might_need_to_copy()") Signed-off-by: Miaohe Lin <linmiaohe(a)huawei.com> Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com> Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi(a)nec.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/ksm.c | 2 ++ mm/swapfile.c | 8 ++++---- 2 files changed, 6 insertions(+), 4 deletions(-) --- a/mm/ksm.c~mm-swapfile-fix-wrong-swap-entry-type-for-hwpoisoned-swapcache-page +++ a/mm/ksm.c @@ -2784,6 +2784,8 @@ struct page *ksm_might_need_to_copy(stru anon_vma->root == vma->anon_vma->root) { return page; /* still no need to copy it */ } + if (PageHWPoison(page)) + return ERR_PTR(-EHWPOISON); if (!PageUptodate(page)) return page; /* let do_swap_page report the error */ --- a/mm/swapfile.c~mm-swapfile-fix-wrong-swap-entry-type-for-hwpoisoned-swapcache-page +++ a/mm/swapfile.c @@ -1746,7 +1746,7 @@ static int unuse_pte(struct vm_area_stru struct page *swapcache; spinlock_t *ptl; pte_t *pte, new_pte, old_pte; - bool hwposioned = false; + bool hwpoisoned = PageHWPoison(page); int ret = 1; swapcache = page; @@ -1754,7 +1754,7 @@ static int unuse_pte(struct vm_area_stru if (unlikely(!page)) return -ENOMEM; else if (unlikely(PTR_ERR(page) == -EHWPOISON)) - hwposioned = true; + hwpoisoned = true; pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); if (unlikely(!pte || !pte_same_as_swp(ptep_get(pte), @@ -1765,11 +1765,11 @@ static int unuse_pte(struct vm_area_stru old_pte = ptep_get(pte); - if (unlikely(hwposioned || !PageUptodate(page))) { + if (unlikely(hwpoisoned || !PageUptodate(page))) { swp_entry_t swp_entry; dec_mm_counter(vma->vm_mm, MM_SWAPENTS); - if (hwposioned) { + if (hwpoisoned) { swp_entry = make_hwpoison_entry(swapcache); page = swapcache; } else { _ Patches currently in -mm which might be from linmiaohe(a)huawei.com are mm-mm_initc-update-obsolete-comment-in-get_pfn_range_for_nid.patch mm-memory-failure-fix-unexpected-return-value-in-soft_offline_page.patch mm-memory-failure-fix-potential-page-refcnt-leak-in-memory_failure.patch mm-memory-failure-remove-unneeded-page-state-check-in-shake_page.patch memory-tier-use-helper-function-destroy_memory_type.patch mm-memory-failure-remove-unneeded-inline-annotation.patch mm-mm_initc-remove-obsolete-macro-hash_small.patch mm-page_alloc-avoid-false-page-outside-zone-error-info.patch memory-tier-rename-destroy_memory_type-to-put_memory_type.patch mm-remove-obsolete-comment-above-struct-per_cpu_pages.patch mm-memcg-minor-cleanup-for-mem_cgroup_id_max.patch mm-memory-failure-remove-unneeded-pagehuge-check.patch mm-memory-failure-ensure-moving-hwpoison-flag-to-the-raw-error-pages.patch mm-memory-failure-dont-account-hwpoison_filter-filtered-pages.patch mm-memory-failure-use-local-variable-huge-to-check-hugetlb-page.patch mm-memory-failure-remove-unneeded-header-files.patch mm-memory-failure-minor-cleanup-for-comments-and-codestyle.patch mm-memory-failure-fetch-compound-head-after-extra-page-refcnt-is-held.patch mm-memory-failure-fix-race-window-when-trying-to-get-hugetlb-folio.patch mm-huge_memory-use-rmap_none-when-calling-page_add_anon_rmap.patch mm-memcg-fix-obsolete-comment-above-mem_cgroup_max_reclaim_loops.patch mm-memcg-minor-cleanup-for-mc_handle_present_pte.patch memory-tier-use-helper-macro-__attr_rw.patch mm-fix-obsolete-function-name-above-debug_pagealloc_enabled_static.patch mm-mprotect-fix-obsolete-function-name-in-change_pte_range.patch mm-memcg-fix-obsolete-function-name-in-mem_cgroup_protection.patch mm-memory-failure-add-pageoffline-check.patch mm-page_alloc-avoid-unneeded-alike_pages-calculation.patch mm-memcg-update-obsolete-comment-above-parent_mem_cgroup.patch mm-page_alloc-remove-unneeded-variable-base.patch mm-memcg-fix-wrong-function-name-above-obj_cgroup_charge_zswap.patch

2 years, 4 months

1
0
0 0

[merged mm-hotfixes-stable] radix-tree-test-suite-fix-incorrect-allocation-size-for-pthreads.patch removed from -mm tree

by Andrew Morton

The quilt patch titled Subject: radix tree test suite: fix incorrect allocation size for pthreads has been removed from the -mm tree. Its filename was radix-tree-test-suite-fix-incorrect-allocation-size-for-pthreads.patch This patch was dropped because it was merged into the mm-hotfixes-stable branch of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm ------------------------------------------------------ From: Colin Ian King <colin.i.king(a)gmail.com> Subject: radix tree test suite: fix incorrect allocation size for pthreads Date: Thu, 27 Jul 2023 17:09:30 +0100 Currently the pthread allocation for each array item is based on the size of a pthread_t pointer and should be the size of the pthread_t structure, so the allocation is under-allocating the correct size. Fix this by using the size of each element in the pthreads array. Static analysis cppcheck reported: tools/testing/radix-tree/regression1.c:180:2: warning: Size of pointer 'threads' used instead of size of its data. [pointerSize] Link: https://lkml.kernel.org/r/20230727160930.632674-1-colin.i.king@gmail.com Fixes: 1366c37ed84b ("radix tree test harness") Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com> Cc: Konstantin Khlebnikov <koct9i(a)gmail.com> Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- tools/testing/radix-tree/regression1.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/tools/testing/radix-tree/regression1.c~radix-tree-test-suite-fix-incorrect-allocation-size-for-pthreads +++ a/tools/testing/radix-tree/regression1.c @@ -177,7 +177,7 @@ void regression1_test(void) nr_threads = 2; pthread_barrier_init(&worker_barrier, NULL, nr_threads); - threads = malloc(nr_threads * sizeof(pthread_t *)); + threads = malloc(nr_threads * sizeof(*threads)); for (i = 0; i < nr_threads; i++) { arg = i; _ Patches currently in -mm which might be from colin.i.king(a)gmail.com are fs-hfsplus-make-extend-error-rate-limited.patch

2 years, 4 months

1
0
0 0

[merged mm-hotfixes-stable] crypto-cifs-fix-error-handling-in-extract_iter_to_sg.patch removed from -mm tree

by Andrew Morton

The quilt patch titled Subject: crypto, cifs: fix error handling in extract_iter_to_sg() has been removed from the -mm tree. Its filename was crypto-cifs-fix-error-handling-in-extract_iter_to_sg.patch This patch was dropped because it was merged into the mm-hotfixes-stable branch of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm ------------------------------------------------------ From: David Howells <dhowells(a)redhat.com> Subject: crypto, cifs: fix error handling in extract_iter_to_sg() Date: Wed, 26 Jul 2023 11:57:56 +0100 Fix error handling in extract_iter_to_sg(). Pages need to be unpinned, not put in extract_user_to_sg() when handling IOVEC/UBUF sources. The bug may result in a warning like the following: WARNING: CPU: 1 PID: 20384 at mm/gup.c:229 __lse_atomic_add arch/arm64/include/asm/atomic_lse.h:27 [inline] WARNING: CPU: 1 PID: 20384 at mm/gup.c:229 arch_atomic_add arch/arm64/include/asm/atomic.h:28 [inline] WARNING: CPU: 1 PID: 20384 at mm/gup.c:229 raw_atomic_add include/linux/atomic/atomic-arch-fallback.h:537 [inline] WARNING: CPU: 1 PID: 20384 at mm/gup.c:229 atomic_add include/linux/atomic/atomic-instrumented.h:105 [inline] WARNING: CPU: 1 PID: 20384 at mm/gup.c:229 try_grab_page+0x108/0x160 mm/gup.c:252 ... pc : try_grab_page+0x108/0x160 mm/gup.c:229 lr : follow_page_pte+0x174/0x3e4 mm/gup.c:651 ... Call trace: __lse_atomic_add arch/arm64/include/asm/atomic_lse.h:27 [inline] arch_atomic_add arch/arm64/include/asm/atomic.h:28 [inline] raw_atomic_add include/linux/atomic/atomic-arch-fallback.h:537 [inline] atomic_add include/linux/atomic/atomic-instrumented.h:105 [inline] try_grab_page+0x108/0x160 mm/gup.c:252 follow_pmd_mask mm/gup.c:734 [inline] follow_pud_mask mm/gup.c:765 [inline] follow_p4d_mask mm/gup.c:782 [inline] follow_page_mask+0x12c/0x2e4 mm/gup.c:839 __get_user_pages+0x174/0x30c mm/gup.c:1217 __get_user_pages_locked mm/gup.c:1448 [inline] __gup_longterm_locked+0x94/0x8f4 mm/gup.c:2142 internal_get_user_pages_fast+0x970/0xb60 mm/gup.c:3140 pin_user_pages_fast+0x4c/0x60 mm/gup.c:3246 iov_iter_extract_user_pages lib/iov_iter.c:1768 [inline] iov_iter_extract_pages+0xc8/0x54c lib/iov_iter.c:1831 extract_user_to_sg lib/scatterlist.c:1123 [inline] extract_iter_to_sg lib/scatterlist.c:1349 [inline] extract_iter_to_sg+0x26c/0x6fc lib/scatterlist.c:1339 hash_sendmsg+0xc0/0x43c crypto/algif_hash.c:117 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg+0x54/0x60 net/socket.c:748 ____sys_sendmsg+0x270/0x2ac net/socket.c:2494 ___sys_sendmsg+0x80/0xdc net/socket.c:2548 __sys_sendmsg+0x68/0xc4 net/socket.c:2577 __do_sys_sendmsg net/socket.c:2586 [inline] __se_sys_sendmsg net/socket.c:2584 [inline] __arm64_sys_sendmsg+0x24/0x30 net/socket.c:2584 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall+0x48/0x114 arch/arm64/kernel/syscall.c:52 el0_svc_common.constprop.0+0x44/0xe4 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x38/0xa4 arch/arm64/kernel/syscall.c:191 el0_svc+0x2c/0xb0 arch/arm64/kernel/entry-common.c:647 el0t_64_sync_handler+0xc0/0xc4 arch/arm64/kernel/entry-common.c:665 el0t_64_sync+0x19c/0x1a0 arch/arm64/kernel/entry.S:591 Link: https://lkml.kernel.org/r/20571.1690369076@warthog.procyon.org.uk Fixes: 018584697533 ("netfs: Add a function to extract an iterator into a scatterlist") Reported-by: syzbot+9b82859567f2e50c123e(a)syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-mm/000000000000273d0105ff97bf56@google.com/ Signed-off-by: David Howells <dhowells(a)redhat.com> Reviewed-by: David Hildenbrand <david(a)redhat.com> Acked-by: Steve French <stfrench(a)microsoft.com> Cc: Sven Schnelle <svens(a)linux.ibm.com> Cc: Herbert Xu <herbert(a)gondor.apana.org.au> Cc: Jeff Layton <jlayton(a)kernel.org> Cc: Shyam Prasad N <nspmangalore(a)gmail.com> Cc: Rohith Surabattula <rohiths.msft(a)gmail.com> Cc: Jens Axboe <axboe(a)kernel.dk> Cc: "David S. Miller" <davem(a)davemloft.net> Cc: Eric Dumazet <edumazet(a)google.com> Cc: Jakub Kicinski <kuba(a)kernel.org> Cc: Paolo Abeni <pabeni(a)redhat.com> Cc: Matthew Wilcox <willy(a)infradead.org> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- lib/scatterlist.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/lib/scatterlist.c~crypto-cifs-fix-error-handling-in-extract_iter_to_sg +++ a/lib/scatterlist.c @@ -1148,7 +1148,7 @@ static ssize_t extract_user_to_sg(struct failed: while (sgtable->nents > sgtable->orig_nents) - put_page(sg_page(&sgtable->sgl[--sgtable->nents])); + unpin_user_page(sg_page(&sgtable->sgl[--sgtable->nents])); return res; } _ Patches currently in -mm which might be from dhowells(a)redhat.com are mm-merge-folio_has_private-filemap_release_folio-call-pairs.patch mm-netfs-fscache-stop-read-optimisation-when-folio-removed-from-pagecache.patch

2 years, 4 months

1
0
0 0

[merged mm-hotfixes-stable] zsmalloc-fix-races-between-modifications-of-fullness-and-isolated.patch removed from -mm tree

by Andrew Morton

The quilt patch titled Subject: zsmalloc: fix races between modifications of fullness and isolated has been removed from the -mm tree. Its filename was zsmalloc-fix-races-between-modifications-of-fullness-and-isolated.patch This patch was dropped because it was merged into the mm-hotfixes-stable branch of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm ------------------------------------------------------ From: Andrew Yang <andrew.yang(a)mediatek.com> Subject: zsmalloc: fix races between modifications of fullness and isolated Date: Fri, 21 Jul 2023 14:37:01 +0800 We encountered many kernel exceptions of VM_BUG_ON(zspage->isolated == 0) in dec_zspage_isolation() and BUG_ON(!pages[1]) in zs_unmap_object() lately. This issue only occurs when migration and reclamation occur at the same time. With our memory stress test, we can reproduce this issue several times a day. We have no idea why no one else encountered this issue. BTW, we switched to the new kernel version with this defect a few months ago. Since fullness and isolated share the same unsigned int, modifications of them should be protected by the same lock. [andrew.yang(a)mediatek.com: move comment] Link: https://lkml.kernel.org/r/20230727062910.6337-1-andrew.yang@mediatek.com Link: https://lkml.kernel.org/r/20230721063705.11455-1-andrew.yang@mediatek.com Fixes: c4549b871102 ("zsmalloc: remove zspage isolation for migration") Signed-off-by: Andrew Yang <andrew.yang(a)mediatek.com> Reviewed-by: Sergey Senozhatsky <senozhatsky(a)chromium.org> Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno(a)collabora.com> Cc: Matthias Brugger <matthias.bgg(a)gmail.com> Cc: Minchan Kim <minchan(a)kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/zsmalloc.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) --- a/mm/zsmalloc.c~zsmalloc-fix-races-between-modifications-of-fullness-and-isolated +++ a/mm/zsmalloc.c @@ -1798,6 +1798,7 @@ static void replace_sub_page(struct size static bool zs_page_isolate(struct page *page, isolate_mode_t mode) { + struct zs_pool *pool; struct zspage *zspage; /* @@ -1807,9 +1808,10 @@ static bool zs_page_isolate(struct page VM_BUG_ON_PAGE(PageIsolated(page), page); zspage = get_zspage(page); - migrate_write_lock(zspage); + pool = zspage->pool; + spin_lock(&pool->lock); inc_zspage_isolation(zspage); - migrate_write_unlock(zspage); + spin_unlock(&pool->lock); return true; } @@ -1875,12 +1877,12 @@ static int zs_page_migrate(struct page * kunmap_atomic(s_addr); replace_sub_page(class, zspage, newpage, page); + dec_zspage_isolation(zspage); /* * Since we complete the data copy and set up new zspage structure, * it's okay to release the pool's lock. */ spin_unlock(&pool->lock); - dec_zspage_isolation(zspage); migrate_write_unlock(zspage); get_page(newpage); @@ -1897,14 +1899,16 @@ static int zs_page_migrate(struct page * static void zs_page_putback(struct page *page) { + struct zs_pool *pool; struct zspage *zspage; VM_BUG_ON_PAGE(!PageIsolated(page), page); zspage = get_zspage(page); - migrate_write_lock(zspage); + pool = zspage->pool; + spin_lock(&pool->lock); dec_zspage_isolation(zspage); - migrate_write_unlock(zspage); + spin_unlock(&pool->lock); } static const struct movable_operations zsmalloc_mops = { _ Patches currently in -mm which might be from andrew.yang(a)mediatek.com are fs-drop_caches-draining-pages-before-dropping-caches.patch

2 years, 4 months

1
0
0 0

[PATCH v4 0/6] make vma locking more obvious

by Suren Baghdasaryan

During recent vma locking patch reviews Linus and Jann Horn noted a number of issues with vma locking and suggested improvements: 1. walk_page_range() does not have ability to write-lock a vma during the walk when it's done under mmap_write_lock. For example s390_reset_cmma(). 2. Vma locking is hidden inside vm_flags modifiers and is hard to follow. Suggestion is to change vm_flags_reset{_once} to assert that vma is write-locked and require an explicit locking. 3. Same issue with vma_prepare() hiding vma locking. 4. In userfaultfd vm_flags are modified after vma->vm_userfaultfd_ctx and page faults can operate on a context while it's changed. 5. do_brk_flags() and __install_special_mapping() not locking a newly created vma before adding it into the mm. While not strictly a problem, this is fragile if vma is modified after insertion, as in the mmap_region() case which was recently fixed. Suggestion is to always lock a new vma before inserting it and making it visible to page faults. 6. vma_assert_write_locked() for CONFIG_PER_VMA_LOCK=n would benefit from being mmap_assert_write_locked() instead of no-op and then any place which operates on a vma and calls mmap_assert_write_locked() can be converted into vma_assert_write_locked(). I CC'ed stable only on the first patch because others are cleanups and the bug in userfaultfd does not affect stable (lock_vma_under_rcu prevents uffds from being handled under vma lock protection). However I would be happy if the whole series is merged into stable 6.4 since it makes vma locking more maintainable. The patches apply cleanly over Linus' ToT and will conflict when applied over mm-unstable due to missing [1]. The conflict can be easily resolved by ignoring conflicting deletions but probably simpler to take [1] into mm-unstable and avoid later conflict. [1] commit 6c21e066f925 ("mm/mempolicy: Take VMA lock before replacing policy") Changes since v3: - changed vma locking in vma_merge to avoid locking prev when not necessary, per Liam Suren Baghdasaryan (6): mm: enable page walking API to lock vmas during the walk mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and mmap mm: replace mmap with vma write lock assertions when operating on a vma mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once mm: always lock new vma before inserting into vma tree mm: move vma locking out of vma_prepare and dup_anon_vma arch/powerpc/kvm/book3s_hv_uvmem.c | 1 + arch/powerpc/mm/book3s64/subpage_prot.c | 1 + arch/riscv/mm/pageattr.c | 1 + arch/s390/mm/gmap.c | 5 ++++ fs/proc/task_mmu.c | 5 ++++ fs/userfaultfd.c | 6 +++++ include/linux/mm.h | 13 ++++++--- include/linux/pagewalk.h | 11 ++++++++ mm/damon/vaddr.c | 2 ++ mm/hmm.c | 1 + mm/hugetlb.c | 2 +- mm/khugepaged.c | 5 ++-- mm/ksm.c | 25 ++++++++++------- mm/madvise.c | 8 +++--- mm/memcontrol.c | 2 ++ mm/memory-failure.c | 1 + mm/memory.c | 2 +- mm/mempolicy.c | 22 +++++++++------ mm/migrate_device.c | 1 + mm/mincore.c | 1 + mm/mlock.c | 4 ++- mm/mmap.c | 33 +++++++++++++++-------- mm/mprotect.c | 2 ++ mm/pagewalk.c | 36 ++++++++++++++++++++++--- mm/vmscan.c | 1 + 25 files changed, 148 insertions(+), 43 deletions(-) -- 2.41.0.585.gd2178a4bd4-goog

2 years, 4 months

3
9
0 0

+ mm-enable-page-walking-api-to-lock-vmas-during-the-walk.patch added to mm-hotfixes-unstable branch

by Andrew Morton

The patch titled Subject: mm: enable page walking API to lock vmas during the walk has been added to the -mm mm-hotfixes-unstable branch. Its filename is mm-enable-page-walking-api-to-lock-vmas-during-the-walk.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche… This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Suren Baghdasaryan <surenb(a)google.com> Subject: mm: enable page walking API to lock vmas during the walk Date: Fri, 4 Aug 2023 08:27:19 -0700 walk_page_range() and friends often operate under write-locked mmap_lock. With introduction of vma locks, the vmas have to be locked as well during such walks to prevent concurrent page faults in these areas. Add an additional member to mm_walk_ops to indicate locking requirements for the walk. Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb(a)google.com> Suggested-by: Linus Torvalds <torvalds(a)linuxfoundation.org> Suggested-by: Jann Horn <jannh(a)google.com> Cc: David Hildenbrand <david(a)redhat.com> Cc: Davidlohr Bueso <dave(a)stgolabs.net> Cc: Hugh Dickins <hughd(a)google.com> Cc: Johannes Weiner <hannes(a)cmpxchg.org> Cc: Laurent Dufour <ldufour(a)linux.ibm.com> Cc: Liam Howlett <liam.howlett(a)oracle.com> Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org> Cc: Michal Hocko <mhocko(a)suse.com> Cc: Michel Lespinasse <michel(a)lespinasse.org> Cc: Peter Xu <peterx(a)redhat.com> Cc: Vlastimil Babka <vbabka(a)suse.cz> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- arch/powerpc/mm/book3s64/subpage_prot.c | 1 arch/riscv/mm/pageattr.c | 1 arch/s390/mm/gmap.c | 5 +++ fs/proc/task_mmu.c | 5 +++ include/linux/pagewalk.h | 11 ++++++ mm/damon/vaddr.c | 2 + mm/hmm.c | 1 mm/ksm.c | 25 +++++++++------ mm/madvise.c | 3 + mm/memcontrol.c | 2 + mm/memory-failure.c | 1 mm/mempolicy.c | 22 ++++++++----- mm/migrate_device.c | 1 mm/mincore.c | 1 mm/mlock.c | 1 mm/mprotect.c | 1 mm/pagewalk.c | 36 ++++++++++++++++++++-- mm/vmscan.c | 1 18 files changed, 100 insertions(+), 20 deletions(-) --- a/arch/powerpc/mm/book3s64/subpage_prot.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/arch/powerpc/mm/book3s64/subpage_prot.c @@ -145,6 +145,7 @@ static int subpage_walk_pmd_entry(pmd_t static const struct mm_walk_ops subpage_walk_ops = { .pmd_entry = subpage_walk_pmd_entry, + .walk_lock = PGWALK_WRLOCK_VERIFY, }; static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr, --- a/arch/riscv/mm/pageattr.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/arch/riscv/mm/pageattr.c @@ -102,6 +102,7 @@ static const struct mm_walk_ops pageattr .pmd_entry = pageattr_pmd_entry, .pte_entry = pageattr_pte_entry, .pte_hole = pageattr_pte_hole, + .walk_lock = PGWALK_RDLOCK, }; static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, --- a/arch/s390/mm/gmap.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/arch/s390/mm/gmap.c @@ -2514,6 +2514,7 @@ static int thp_split_walk_pmd_entry(pmd_ static const struct mm_walk_ops thp_split_walk_ops = { .pmd_entry = thp_split_walk_pmd_entry, + .walk_lock = PGWALK_WRLOCK_VERIFY, }; static inline void thp_split_mm(struct mm_struct *mm) @@ -2565,6 +2566,7 @@ static int __zap_zero_pages(pmd_t *pmd, static const struct mm_walk_ops zap_zero_walk_ops = { .pmd_entry = __zap_zero_pages, + .walk_lock = PGWALK_WRLOCK, }; /* @@ -2655,6 +2657,7 @@ static const struct mm_walk_ops enable_s .hugetlb_entry = __s390_enable_skey_hugetlb, .pte_entry = __s390_enable_skey_pte, .pmd_entry = __s390_enable_skey_pmd, + .walk_lock = PGWALK_WRLOCK, }; int s390_enable_skey(void) @@ -2692,6 +2695,7 @@ static int __s390_reset_cmma(pte_t *pte, static const struct mm_walk_ops reset_cmma_walk_ops = { .pte_entry = __s390_reset_cmma, + .walk_lock = PGWALK_WRLOCK, }; void s390_reset_cmma(struct mm_struct *mm) @@ -2728,6 +2732,7 @@ static int s390_gather_pages(pte_t *ptep static const struct mm_walk_ops gather_pages_ops = { .pte_entry = s390_gather_pages, + .walk_lock = PGWALK_RDLOCK, }; /* --- a/fs/proc/task_mmu.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/fs/proc/task_mmu.c @@ -757,12 +757,14 @@ static int smaps_hugetlb_range(pte_t *pt static const struct mm_walk_ops smaps_walk_ops = { .pmd_entry = smaps_pte_range, .hugetlb_entry = smaps_hugetlb_range, + .walk_lock = PGWALK_RDLOCK, }; static const struct mm_walk_ops smaps_shmem_walk_ops = { .pmd_entry = smaps_pte_range, .hugetlb_entry = smaps_hugetlb_range, .pte_hole = smaps_pte_hole, + .walk_lock = PGWALK_RDLOCK, }; /* @@ -1244,6 +1246,7 @@ static int clear_refs_test_walk(unsigned static const struct mm_walk_ops clear_refs_walk_ops = { .pmd_entry = clear_refs_pte_range, .test_walk = clear_refs_test_walk, + .walk_lock = PGWALK_WRLOCK, }; static ssize_t clear_refs_write(struct file *file, const char __user *buf, @@ -1621,6 +1624,7 @@ static const struct mm_walk_ops pagemap_ .pmd_entry = pagemap_pmd_range, .pte_hole = pagemap_pte_hole, .hugetlb_entry = pagemap_hugetlb_range, + .walk_lock = PGWALK_RDLOCK, }; /* @@ -1934,6 +1938,7 @@ static int gather_hugetlb_stats(pte_t *p static const struct mm_walk_ops show_numa_ops = { .hugetlb_entry = gather_hugetlb_stats, .pmd_entry = gather_pte_stats, + .walk_lock = PGWALK_RDLOCK, }; /* --- a/include/linux/pagewalk.h~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/include/linux/pagewalk.h @@ -6,6 +6,16 @@ struct mm_walk; +/* Locking requirement during a page walk. */ +enum page_walk_lock { + /* mmap_lock should be locked for read to stabilize the vma tree */ + PGWALK_RDLOCK = 0, + /* vma will be write-locked during the walk */ + PGWALK_WRLOCK = 1, + /* vma is expected to be already write-locked during the walk */ + PGWALK_WRLOCK_VERIFY = 2, +}; + /** * struct mm_walk_ops - callbacks for walk_page_range * @pgd_entry: if set, called for each non-empty PGD (top-level) entry @@ -66,6 +76,7 @@ struct mm_walk_ops { int (*pre_vma)(unsigned long start, unsigned long end, struct mm_walk *walk); void (*post_vma)(struct mm_walk *walk); + enum page_walk_lock walk_lock; }; /* --- a/mm/damon/vaddr.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/mm/damon/vaddr.c @@ -386,6 +386,7 @@ out: static const struct mm_walk_ops damon_mkold_ops = { .pmd_entry = damon_mkold_pmd_entry, .hugetlb_entry = damon_mkold_hugetlb_entry, + .walk_lock = PGWALK_RDLOCK, }; static void damon_va_mkold(struct mm_struct *mm, unsigned long addr) @@ -525,6 +526,7 @@ out: static const struct mm_walk_ops damon_young_ops = { .pmd_entry = damon_young_pmd_entry, .hugetlb_entry = damon_young_hugetlb_entry, + .walk_lock = PGWALK_RDLOCK, }; static bool damon_va_young(struct mm_struct *mm, unsigned long addr, --- a/mm/hmm.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/mm/hmm.c @@ -562,6 +562,7 @@ static const struct mm_walk_ops hmm_walk .pte_hole = hmm_vma_walk_hole, .hugetlb_entry = hmm_vma_walk_hugetlb_entry, .test_walk = hmm_vma_walk_test, + .walk_lock = PGWALK_RDLOCK, }; /** --- a/mm/ksm.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/mm/ksm.c @@ -455,6 +455,12 @@ static int break_ksm_pmd_entry(pmd_t *pm static const struct mm_walk_ops break_ksm_ops = { .pmd_entry = break_ksm_pmd_entry, + .walk_lock = PGWALK_RDLOCK, +}; + +static const struct mm_walk_ops break_ksm_lock_vma_ops = { + .pmd_entry = break_ksm_pmd_entry, + .walk_lock = PGWALK_WRLOCK, }; /* @@ -470,16 +476,17 @@ static const struct mm_walk_ops break_ks * of the process that owns 'vma'. We also do not want to enforce * protection keys here anyway. */ -static int break_ksm(struct vm_area_struct *vma, unsigned long addr) +static int break_ksm(struct vm_area_struct *vma, unsigned long addr, bool lock_vma) { vm_fault_t ret = 0; + const struct mm_walk_ops *ops = lock_vma ? + &break_ksm_lock_vma_ops : &break_ksm_ops; do { int ksm_page; cond_resched(); - ksm_page = walk_page_range_vma(vma, addr, addr + 1, - &break_ksm_ops, NULL); + ksm_page = walk_page_range_vma(vma, addr, addr + 1, ops, NULL); if (WARN_ON_ONCE(ksm_page < 0)) return ksm_page; if (!ksm_page) @@ -565,7 +572,7 @@ static void break_cow(struct ksm_rmap_it mmap_read_lock(mm); vma = find_mergeable_vma(mm, addr); if (vma) - break_ksm(vma, addr); + break_ksm(vma, addr, false); mmap_read_unlock(mm); } @@ -871,7 +878,7 @@ static void remove_trailing_rmap_items(s * in cmp_and_merge_page on one of the rmap_items we would be removing. */ static int unmerge_ksm_pages(struct vm_area_struct *vma, - unsigned long start, unsigned long end) + unsigned long start, unsigned long end, bool lock_vma) { unsigned long addr; int err = 0; @@ -882,7 +889,7 @@ static int unmerge_ksm_pages(struct vm_a if (signal_pending(current)) err = -ERESTARTSYS; else - err = break_ksm(vma, addr); + err = break_ksm(vma, addr, lock_vma); } return err; } @@ -1029,7 +1036,7 @@ static int unmerge_and_remove_all_rmap_i if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma) continue; err = unmerge_ksm_pages(vma, - vma->vm_start, vma->vm_end); + vma->vm_start, vma->vm_end, false); if (err) goto error; } @@ -2530,7 +2537,7 @@ static int __ksm_del_vma(struct vm_area_ return 0; if (vma->anon_vma) { - err = unmerge_ksm_pages(vma, vma->vm_start, vma->vm_end); + err = unmerge_ksm_pages(vma, vma->vm_start, vma->vm_end, true); if (err) return err; } @@ -2668,7 +2675,7 @@ int ksm_madvise(struct vm_area_struct *v return 0; /* just ignore the advice */ if (vma->anon_vma) { - err = unmerge_ksm_pages(vma, start, end); + err = unmerge_ksm_pages(vma, start, end, true); if (err) return err; } --- a/mm/madvise.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/mm/madvise.c @@ -233,6 +233,7 @@ static int swapin_walk_pmd_entry(pmd_t * static const struct mm_walk_ops swapin_walk_ops = { .pmd_entry = swapin_walk_pmd_entry, + .walk_lock = PGWALK_RDLOCK, }; static void shmem_swapin_range(struct vm_area_struct *vma, @@ -534,6 +535,7 @@ regular_folio: static const struct mm_walk_ops cold_walk_ops = { .pmd_entry = madvise_cold_or_pageout_pte_range, + .walk_lock = PGWALK_RDLOCK, }; static void madvise_cold_page_range(struct mmu_gather *tlb, @@ -757,6 +759,7 @@ static int madvise_free_pte_range(pmd_t static const struct mm_walk_ops madvise_free_walk_ops = { .pmd_entry = madvise_free_pte_range, + .walk_lock = PGWALK_RDLOCK, }; static int madvise_free_single_vma(struct vm_area_struct *vma, --- a/mm/memcontrol.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/mm/memcontrol.c @@ -6024,6 +6024,7 @@ static int mem_cgroup_count_precharge_pt static const struct mm_walk_ops precharge_walk_ops = { .pmd_entry = mem_cgroup_count_precharge_pte_range, + .walk_lock = PGWALK_RDLOCK, }; static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm) @@ -6303,6 +6304,7 @@ put: /* get_mctgt_type() gets & locks static const struct mm_walk_ops charge_walk_ops = { .pmd_entry = mem_cgroup_move_charge_pte_range, + .walk_lock = PGWALK_RDLOCK, }; static void mem_cgroup_move_charge(void) --- a/mm/memory-failure.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/mm/memory-failure.c @@ -831,6 +831,7 @@ static int hwpoison_hugetlb_range(pte_t static const struct mm_walk_ops hwp_walk_ops = { .pmd_entry = hwpoison_pte_range, .hugetlb_entry = hwpoison_hugetlb_range, + .walk_lock = PGWALK_RDLOCK, }; /* --- a/mm/mempolicy.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/mm/mempolicy.c @@ -718,6 +718,14 @@ static const struct mm_walk_ops queue_pa .hugetlb_entry = queue_folios_hugetlb, .pmd_entry = queue_folios_pte_range, .test_walk = queue_pages_test_walk, + .walk_lock = PGWALK_RDLOCK, +}; + +static const struct mm_walk_ops queue_pages_lock_vma_walk_ops = { + .hugetlb_entry = queue_folios_hugetlb, + .pmd_entry = queue_folios_pte_range, + .test_walk = queue_pages_test_walk, + .walk_lock = PGWALK_WRLOCK, }; /* @@ -738,7 +746,7 @@ static const struct mm_walk_ops queue_pa static int queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end, nodemask_t *nodes, unsigned long flags, - struct list_head *pagelist) + struct list_head *pagelist, bool lock_vma) { int err; struct queue_pages qp = { @@ -749,8 +757,10 @@ queue_pages_range(struct mm_struct *mm, .end = end, .first = NULL, }; + const struct mm_walk_ops *ops = lock_vma ? + &queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops; - err = walk_page_range(mm, start, end, &queue_pages_walk_ops, &qp); + err = walk_page_range(mm, start, end, ops, &qp); if (!qp.first) /* whole range in hole */ @@ -1078,7 +1088,7 @@ static int migrate_to_node(struct mm_str vma = find_vma(mm, 0); VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))); queue_pages_range(mm, vma->vm_start, mm->task_size, &nmask, - flags | MPOL_MF_DISCONTIG_OK, &pagelist); + flags | MPOL_MF_DISCONTIG_OK, &pagelist, false); if (!list_empty(&pagelist)) { err = migrate_pages(&pagelist, alloc_migration_target, NULL, @@ -1321,12 +1331,8 @@ static long do_mbind(unsigned long start * Lock the VMAs before scanning for pages to migrate, to ensure we don't * miss a concurrently inserted page. */ - vma_iter_init(&vmi, mm, start); - for_each_vma_range(vmi, vma, end) - vma_start_write(vma); - ret = queue_pages_range(mm, start, end, nmask, - flags | MPOL_MF_INVERT, &pagelist); + flags | MPOL_MF_INVERT, &pagelist, true); if (ret < 0) { err = ret; --- a/mm/migrate_device.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/mm/migrate_device.c @@ -279,6 +279,7 @@ next: static const struct mm_walk_ops migrate_vma_walk_ops = { .pmd_entry = migrate_vma_collect_pmd, .pte_hole = migrate_vma_collect_hole, + .walk_lock = PGWALK_RDLOCK, }; /* --- a/mm/mincore.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/mm/mincore.c @@ -176,6 +176,7 @@ static const struct mm_walk_ops mincore_ .pmd_entry = mincore_pte_range, .pte_hole = mincore_unmapped_range, .hugetlb_entry = mincore_hugetlb, + .walk_lock = PGWALK_RDLOCK, }; /* --- a/mm/mlock.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/mm/mlock.c @@ -371,6 +371,7 @@ static void mlock_vma_pages_range(struct { static const struct mm_walk_ops mlock_walk_ops = { .pmd_entry = mlock_pte_range, + .walk_lock = PGWALK_WRLOCK_VERIFY, }; /* --- a/mm/mprotect.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/mm/mprotect.c @@ -568,6 +568,7 @@ static const struct mm_walk_ops prot_non .pte_entry = prot_none_pte_entry, .hugetlb_entry = prot_none_hugetlb_entry, .test_walk = prot_none_test, + .walk_lock = PGWALK_WRLOCK, }; int --- a/mm/pagewalk.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/mm/pagewalk.c @@ -400,6 +400,33 @@ static int __walk_page_range(unsigned lo return err; } +static inline void process_mm_walk_lock(struct mm_struct *mm, + enum page_walk_lock walk_lock) +{ + if (walk_lock == PGWALK_RDLOCK) + mmap_assert_locked(mm); + else + mmap_assert_write_locked(mm); +} + +static inline void process_vma_walk_lock(struct vm_area_struct *vma, + enum page_walk_lock walk_lock) +{ +#ifdef CONFIG_PER_VMA_LOCK + switch (walk_lock) { + case PGWALK_WRLOCK: + vma_start_write(vma); + break; + case PGWALK_WRLOCK_VERIFY: + vma_assert_write_locked(vma); + break; + case PGWALK_RDLOCK: + /* PGWALK_RDLOCK is handled by process_mm_walk_lock */ + break; + } +#endif +} + /** * walk_page_range - walk page table with caller specific callbacks * @mm: mm_struct representing the target process of page table walk @@ -459,7 +486,7 @@ int walk_page_range(struct mm_struct *mm if (!walk.mm) return -EINVAL; - mmap_assert_locked(walk.mm); + process_mm_walk_lock(walk.mm, ops->walk_lock); vma = find_vma(walk.mm, start); do { @@ -474,6 +501,7 @@ int walk_page_range(struct mm_struct *mm if (ops->pte_hole) err = ops->pte_hole(start, next, -1, &walk); } else { /* inside vma */ + process_vma_walk_lock(vma, ops->walk_lock); walk.vma = vma; next = min(end, vma->vm_end); vma = find_vma(mm, vma->vm_end); @@ -549,7 +577,8 @@ int walk_page_range_vma(struct vm_area_s if (start < vma->vm_start || end > vma->vm_end) return -EINVAL; - mmap_assert_locked(walk.mm); + process_mm_walk_lock(walk.mm, ops->walk_lock); + process_vma_walk_lock(vma, ops->walk_lock); return __walk_page_range(start, end, &walk); } @@ -566,7 +595,8 @@ int walk_page_vma(struct vm_area_struct if (!walk.mm) return -EINVAL; - mmap_assert_locked(walk.mm); + process_mm_walk_lock(walk.mm, ops->walk_lock); + process_vma_walk_lock(vma, ops->walk_lock); return __walk_page_range(vma->vm_start, vma->vm_end, &walk); } --- a/mm/vmscan.c~mm-enable-page-walking-api-to-lock-vmas-during-the-walk +++ a/mm/vmscan.c @@ -4284,6 +4284,7 @@ static void walk_mm(struct lruvec *lruve static const struct mm_walk_ops mm_walk_ops = { .test_walk = should_skip_vma, .p4d_entry = walk_pud_range, + .walk_lock = PGWALK_RDLOCK, }; int err; _ Patches currently in -mm which might be from surenb(a)google.com are mm-enable-page-walking-api-to-lock-vmas-during-the-walk.patch swap-remove-remnants-of-polling-from-read_swap_cache_async.patch mm-add-missing-vm_fault_result_trace-name-for-vm_fault_completed.patch mm-drop-per-vma-lock-when-returning-vm_fault_retry-or-vm_fault_completed.patch mm-change-folio_lock_or_retry-to-use-vm_fault-directly.patch mm-handle-swap-page-faults-under-per-vma-lock.patch mm-handle-userfaults-under-vma-lock.patch mm-handle-userfaults-under-vma-lock-fix.patch

2 years, 4 months

1
0
0 0

+ mm-gup-reintroduce-foll_numa-as-foll_honor_numa_fault.patch added to mm-hotfixes-unstable branch

by Andrew Morton

The patch titled Subject: mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT has been added to the -mm mm-hotfixes-unstable branch. Its filename is mm-gup-reintroduce-foll_numa-as-foll_honor_numa_fault.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche… This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: David Hildenbrand <david(a)redhat.com> Subject: mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT Date: Thu, 3 Aug 2023 16:32:02 +0200 Unfortunately commit 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") missed that follow_page() and follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting or due to inaccessible (PROT_NONE) VMAs. As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page faults from gup/gup_fast"): "Other follow_page callers like KSM should not use FOLL_NUMA, or they would fail to get the pages if they use follow_page instead of get_user_pages." liubo reported [1] that smaps_rollup results are imprecise, because they miss accounting of pages that are mapped PROT_NONE. Further, it's easy to reproduce that KSM no longer works on inaccessible VMAs on x86-64, because pte_protnone()/pmd_protnone() also indictaes "true" in inaccessible VMAs, and follow_page() refuses to return such pages right now. As KVM really depends on these NUMA hinting faults, removing the pte_protnone()/pmd_protnone() handling in GUP code completely is not really an option. To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT to restore the original behavior for now and add better comments. Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in is_valid_gup_args(), to add that flag for all external GUP users. Note that there are three GUP-internal __get_user_pages() users that don't end up calling is_valid_gup_args() and consequently won't get FOLL_HONOR_NUMA_FAULT set. 1) get_dump_page(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE and wouldn't have honored NUMA hinting faults already. 2) populate_vma_page_range(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have honored NUMA hinting faults already. 3) faultin_vma_page_range(): we similarly don't want to handle NUMA hinting faults. To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in inaccessible VMAs properly, we have to perform VMA accessibility checks in gup_can_follow_protnone(). As GUP-fast should reject such pages either way in pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and arm64 that both implement pte_protnone() -- let's just always fallback to ordinary GUP when stumbling over pte_protnone()/pmd_protnone(). As Linus notes [2], honoring NUMA faults might only make sense for selected GUP users. So we should really see if we can instead let relevant GUP callers specify it manually, and not trigger NUMA hinting faults from GUP as default. Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag and adding appropriate documenation. While at it, remove a stale comment from follow_trans_huge_pmd(): That comment for pmd_protnone() was added in commit 2b4847e73004 ("mm: numa: serialise parallel get_user_page against THP migration"), which noted: THP does not unmap pages due to a lack of support for migration entries at a PMD level. This allows races with get_user_pages Nowadays, we do have PMD migration entries, so the comment no longer applies. Let's drop it. [1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com [2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs… Link: https://lkml.kernel.org/r/20230803143208.383663-2-david@redhat.com Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") Signed-off-by: David Hildenbrand <david(a)redhat.com> Reported-by: liubo <liubo254(a)huawei.com> Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com Reported-by: Peter Xu <peterx(a)redhat.com> Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/ Acked-by: Mel Gorman <mgorman(a)techsingularity.net> Acked-by: Peter Xu <peterx(a)redhat.com> Cc: Hugh Dickins <hughd(a)google.com> Cc: Jason Gunthorpe <jgg(a)ziepe.ca> Cc: John Hubbard <jhubbard(a)nvidia.com> Cc: Linus Torvalds <torvalds(a)linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org> Cc: Mel Gorman <mgorman(a)suse.de> Cc: Paolo Bonzini <pbonzini(a)redhat.com> Cc: Shuah Khan <shuah(a)kernel.org> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- include/linux/mm.h | 21 +++++++++++++++------ include/linux/mm_types.h | 9 +++++++++ mm/gup.c | 30 ++++++++++++++++++++++++------ mm/huge_memory.c | 3 +-- 4 files changed, 49 insertions(+), 14 deletions(-) --- a/include/linux/mm.h~mm-gup-reintroduce-foll_numa-as-foll_honor_numa_fault +++ a/include/linux/mm.h @@ -3421,15 +3421,24 @@ static inline int vm_fault_to_errno(vm_f * Indicates whether GUP can follow a PROT_NONE mapped page, or whether * a (NUMA hinting) fault is required. */ -static inline bool gup_can_follow_protnone(unsigned int flags) +static inline bool gup_can_follow_protnone(struct vm_area_struct *vma, + unsigned int flags) { /* - * FOLL_FORCE has to be able to make progress even if the VMA is - * inaccessible. Further, FOLL_FORCE access usually does not represent - * application behaviour and we should avoid triggering NUMA hinting - * faults. + * If callers don't want to honor NUMA hinting faults, no need to + * determine if we would actually have to trigger a NUMA hinting fault. */ - return flags & FOLL_FORCE; + if (!(flags & FOLL_HONOR_NUMA_FAULT)) + return true; + + /* + * NUMA hinting faults don't apply in inaccessible (PROT_NONE) VMAs. + * + * Requiring a fault here even for inaccessible VMAs would mean that + * FOLL_FORCE cannot make any progress, because handle_mm_fault() + * refuses to process NUMA hinting faults in inaccessible VMAs. + */ + return !vma_is_accessible(vma); } typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data); --- a/include/linux/mm_types.h~mm-gup-reintroduce-foll_numa-as-foll_honor_numa_fault +++ a/include/linux/mm_types.h @@ -1286,6 +1286,15 @@ enum { FOLL_PCI_P2PDMA = 1 << 10, /* allow interrupts from generic signals */ FOLL_INTERRUPTIBLE = 1 << 11, + /* + * Always honor (trigger) NUMA hinting faults. + * + * FOLL_WRITE implicitly honors NUMA hinting faults because a + * PROT_NONE-mapped page is not writable (exceptions with FOLL_FORCE + * apply). get_user_pages_fast_only() always implicitly honors NUMA + * hinting faults. + */ + FOLL_HONOR_NUMA_FAULT = 1 << 12, /* See also internal only FOLL flags in mm/internal.h */ }; --- a/mm/gup.c~mm-gup-reintroduce-foll_numa-as-foll_honor_numa_fault +++ a/mm/gup.c @@ -597,7 +597,7 @@ static struct page *follow_page_pte(stru pte = ptep_get(ptep); if (!pte_present(pte)) goto no_page; - if (pte_protnone(pte) && !gup_can_follow_protnone(flags)) + if (pte_protnone(pte) && !gup_can_follow_protnone(vma, flags)) goto no_page; page = vm_normal_page(vma, address, pte); @@ -714,7 +714,7 @@ static struct page *follow_pmd_mask(stru if (likely(!pmd_trans_huge(pmdval))) return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); - if (pmd_protnone(pmdval) && !gup_can_follow_protnone(flags)) + if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags)) return no_page_table(vma, flags); ptl = pmd_lock(mm, pmd); @@ -851,6 +851,10 @@ struct page *follow_page(struct vm_area_ if (WARN_ON_ONCE(foll_flags & FOLL_PIN)) return NULL; + /* + * We never set FOLL_HONOR_NUMA_FAULT because callers don't expect + * to fail on PROT_NONE-mapped pages. + */ page = follow_page_mask(vma, address, foll_flags, &ctx); if (ctx.pgmap) put_dev_pagemap(ctx.pgmap); @@ -2227,6 +2231,13 @@ static bool is_valid_gup_args(struct pag gup_flags |= FOLL_UNLOCKABLE; } + /* + * For now, always trigger NUMA hinting faults. Some GUP users like + * KVM require the hint to be as the calling context of GUP is + * functionally similar to a memory reference from task context. + */ + gup_flags |= FOLL_HONOR_NUMA_FAULT; + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ if (WARN_ON_ONCE((gup_flags & (FOLL_PIN | FOLL_GET)) == (FOLL_PIN | FOLL_GET))) @@ -2551,7 +2562,14 @@ static int gup_pte_range(pmd_t pmd, pmd_ struct page *page; struct folio *folio; - if (pte_protnone(pte) && !gup_can_follow_protnone(flags)) + /* + * Always fallback to ordinary GUP on PROT_NONE-mapped pages: + * pte_access_permitted() better should reject these pages + * either way: otherwise, GUP-fast might succeed in + * cases where ordinary GUP would fail due to VMA access + * permissions. + */ + if (pte_protnone(pte)) goto pte_unmap; if (!pte_access_permitted(pte, flags & FOLL_WRITE)) @@ -2970,8 +2988,8 @@ static int gup_pmd_range(pud_t *pudp, pu if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) || pmd_devmap(pmd))) { - if (pmd_protnone(pmd) && - !gup_can_follow_protnone(flags)) + /* See gup_pte_range() */ + if (pmd_protnone(pmd)) return 0; if (!gup_huge_pmd(pmd, pmdp, addr, next, flags, @@ -3151,7 +3169,7 @@ static int internal_get_user_pages_fast( if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_FORCE | FOLL_PIN | FOLL_GET | FOLL_FAST_ONLY | FOLL_NOFAULT | - FOLL_PCI_P2PDMA))) + FOLL_PCI_P2PDMA | FOLL_HONOR_NUMA_FAULT))) return -EINVAL; if (gup_flags & FOLL_PIN) --- a/mm/huge_memory.c~mm-gup-reintroduce-foll_numa-as-foll_honor_numa_fault +++ a/mm/huge_memory.c @@ -1467,8 +1467,7 @@ struct page *follow_trans_huge_pmd(struc if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd)) return ERR_PTR(-EFAULT); - /* Full NUMA hinting faults to serialise migration in fault paths */ - if (pmd_protnone(*pmd) && !gup_can_follow_protnone(flags)) + if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags)) return NULL; if (!pmd_write(*pmd) && gup_must_unshare(vma, flags, page)) _ Patches currently in -mm which might be from david(a)redhat.com are mm-gup-reintroduce-foll_numa-as-foll_honor_numa_fault.patch smaps-use-vm_normal_page_pmd-instead-of-follow_trans_huge_pmd.patch mm-memory_hotplug-document-the-signal_pending-check-in-offline_pages.patch kvm-explicitly-set-foll_honor_numa_fault-in-hva_to_pfn_slow.patch mm-gup-dont-implicitly-set-foll_honor_numa_fault.patch pgtable-improve-pte_protnone-comment.patch selftest-mm-ksm_functional_tests-test-in-mmap_and_merge_range-if-anything-got-merged.patch selftest-mm-ksm_functional_tests-add-prot_none-test.patch

2 years, 4 months

1
0
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror August 2023