From: Peter Xu peterx@redhat.com
commit 7196040e19ad634293acd3eff7083149d7669031 upstream.
Patch series "mm/gup: some cleanups", v5.
This patch (of 5):
Alex reported an invalid page pointer returned by pin_user_pages_remote() from vfio after upstream commit 4b6c33b32296 ("vfio/type1: Prepare for batched pinning with struct vfio_batch").
It turns out that it's not the fault of the vfio commit; however, after vfio switched to a full page buffer to store the page pointers, it started to expose the problem more easily.
The problem is that for VM_PFNMAP vmas we should normally fail with -EFAULT, and then vfio will carry on to handle the MMIO regions. However, when the bug triggered, follow_page_mask() returned -EEXIST for such a page, which made GUP jump over the current page, leaving the corresponding entry in **pages untouched. The caller is not aware of this, so it will dereference that entry as usual even though its contents can be anything.
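To illustrate the failure mode from the caller's side, here is a hedged sketch (BATCH and use_page() are made-up names for illustration; the call matches the pin_user_pages_remote() signature of this era). The caller has no way to notice a skipped slot:

	struct page *pages[BATCH];	/* slots start out with undefined contents */
	long i, ret;

	ret = pin_user_pages_remote(mm, vaddr, BATCH, FOLL_LONGTERM,
				    pages, NULL, NULL);
	for (i = 0; i < ret; i++)
		use_page(pages[i]);	/* an entry skipped on -EEXIST was never
					 * written, so this dereferences garbage */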
We have had that -EEXIST logic since commit 1027e4436b6a ("mm: make GUP handle pfn mapping unless FOLL_GET is requested"), which seems very reasonable. It could be that when we reworked GUP with FOLL_PIN we overlooked that special path in commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), even though that commit rightfully touched up follow_devmap_pud() to check FOLL_PIN when it needs to return -EEXIST.
Attaching the Fixes to the FOLL_PIN rework commit, as it happened later than 1027e4436b6a.
[jhubbard@nvidia.com: added some tags, removed a reference to an out of tree module.]
Link: https://lkml.kernel.org/r/20220207062213.235127-1-jhubbard@nvidia.com
Link: https://lkml.kernel.org/r/20220204020010.68930-1-jhubbard@nvidia.com
Link: https://lkml.kernel.org/r/20220204020010.68930-2-jhubbard@nvidia.com
Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
Signed-off-by: Peter Xu peterx@redhat.com
Signed-off-by: John Hubbard jhubbard@nvidia.com
Reviewed-by: Claudio Imbrenda imbrenda@linux.ibm.com
Reported-by: Alex Williamson alex.williamson@redhat.com
Debugged-by: Alex Williamson alex.williamson@redhat.com
Tested-by: Alex Williamson alex.williamson@redhat.com
Reviewed-by: Christoph Hellwig hch@lst.de
Reviewed-by: Jan Kara jack@suse.cz
Cc: Andrea Arcangeli aarcange@redhat.com
Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Cc: Jason Gunthorpe jgg@ziepe.ca
Cc: David Hildenbrand david@redhat.com
Cc: Lukas Bulwahn lukas.bulwahn@gmail.com
Cc: Matthew Wilcox (Oracle) willy@infradead.org
Cc: Jason Gunthorpe jgg@nvidia.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/gup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index 7bc1ba9ce440..41da0bd61bec 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -465,7 +465,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 		pte_t *pte, unsigned int flags)
 {
 	/* No page to get reference */
-	if (flags & FOLL_GET)
+	if (flags & (FOLL_GET | FOLL_PIN))
 		return -EFAULT;
 
 	if (flags & FOLL_TOUCH) {
From: Muchun Song songmuchun@bytedance.com
commit 2771739a7162782c0aa6424b2e3dd874e884a15d upstream.
The D-cache maintenance inside move_to_new_page() only considers one page; there is still a D-cache maintenance issue for the tail pages of a compound page (e.g. THP or HugeTLB).
THP migration is only enabled on x86_64, arm64 and powerpc; of those, arm64 and powerpc need flush_dcache_page() to maintain consistency between the I-cache and the D-cache.
But there is no issue on arm64 or powerpc, since their icache flush functions already handle compound pages. HugeTLB migration is enabled on arm, arm64, mips, parisc, powerpc, riscv, s390 and sh; arm handles compound pages in its flush_dcache_page(), but most of the others do not.
In theory, the issue exists on many architectures. Fix this by not using flush_dcache_folio() since it is not backportable.
Link: https://lkml.kernel.org/r/20220210123058.79206-3-songmuchun@bytedance.com
Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
Signed-off-by: Muchun Song songmuchun@bytedance.com
Reviewed-by: Zi Yan ziy@nvidia.com
Cc: Axel Rasmussen axelrasmussen@google.com
Cc: David Rientjes rientjes@google.com
Cc: Fam Zheng fam.zheng@bytedance.com
Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Cc: Lars Persson lars.persson@axis.com
Cc: Mike Kravetz mike.kravetz@oracle.com
Cc: Peter Xu peterx@redhat.com
Cc: Xiongchun Duan duanxiongchun@bytedance.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/migrate.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 086a36637467..fc0e14ecd42a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -916,9 +916,12 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		if (!PageMappingFlags(page))
 			page->mapping = NULL;
 
-		if (likely(!is_zone_device_page(newpage)))
-			flush_dcache_page(newpage);
+		if (likely(!is_zone_device_page(newpage))) {
+			int i, nr = compound_nr(newpage);
+
+			for (i = 0; i < nr; i++)
+				flush_dcache_page(newpage + i);
+		}
 	}
 out:
 	return rc;
From: Muchun Song songmuchun@bytedance.com
commit e763243cc6cb1fcc720ec58cfd6e7c35ae90a479 upstream.
userfaultfd calls copy_huge_page_from_user(), which does not do any cache flushing for the target page. The target page will then be mapped into user space at a different (user) address, which might alias with the kernel address that was used to copy the data in.
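As a hedged illustration of the alias hazard on architectures with aliasing D-caches (not the exact userfaultfd code path; kmap_local_page() stands in for however the kernel mapping is obtained):

	void *kaddr = kmap_local_page(page);

	memcpy(kaddr, src, PAGE_SIZE);	/* dirties cache lines indexed by kaddr */
	kunmap_local(kaddr);
	flush_dcache_page(page);	/* without this, a user mapping at a
					 * differently-indexed virtual address
					 * may observe stale data */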
Fix this issue by flushing dcache in copy_huge_page_from_user().
Link: https://lkml.kernel.org/r/20220210123058.79206-4-songmuchun@bytedance.com
Fixes: fa4d75c1de13 ("userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support")
Signed-off-by: Muchun Song songmuchun@bytedance.com
Reviewed-by: Mike Kravetz mike.kravetz@oracle.com
Cc: Axel Rasmussen axelrasmussen@google.com
Cc: David Rientjes rientjes@google.com
Cc: Fam Zheng fam.zheng@bytedance.com
Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Cc: Lars Persson lars.persson@axis.com
Cc: Peter Xu peterx@redhat.com
Cc: Xiongchun Duan duanxiongchun@bytedance.com
Cc: Zi Yan ziy@nvidia.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/memory.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index b69afe3dd597..886925d97759 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5475,6 +5475,8 @@ long copy_huge_page_from_user(struct page *dst_page,
 		if (rc)
 			break;
 
+		flush_dcache_page(subpage);
+
 		cond_resched();
 	}
 	return ret_val;
From: Muchun Song songmuchun@bytedance.com
commit 348923665a0e50ad9fc0b3bb8127d3cb976691cc upstream.
folio_copy() copies the data from one page to the target page, and the target page will then be mapped to a user space address, which might alias with the kernel address used to copy the data in. There are two ways to fix this issue:
1) insert flush_dcache_page() after folio_copy().
2) replace folio_copy() with copy_user_huge_page() which already considers the cache maintenance.
We chose way 2) to fix the issue, since architectures can optimize this situation. It also makes backports easier.
Link: https://lkml.kernel.org/r/20220210123058.79206-5-songmuchun@bytedance.com
Fixes: 8cc5fcbb5be8 ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY")
Signed-off-by: Muchun Song songmuchun@bytedance.com
Reviewed-by: Mike Kravetz mike.kravetz@oracle.com
Cc: Axel Rasmussen axelrasmussen@google.com
Cc: David Rientjes rientjes@google.com
Cc: Fam Zheng fam.zheng@bytedance.com
Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Cc: Lars Persson lars.persson@axis.com
Cc: Peter Xu peterx@redhat.com
Cc: Xiongchun Duan duanxiongchun@bytedance.com
Cc: Zi Yan ziy@nvidia.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/hugetlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a1da8757cc9c..e2dc190c6725 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5820,7 +5820,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			*pagep = NULL;
 			goto out;
 		}
-		folio_copy(page_folio(page), page_folio(*pagep));
+		copy_user_huge_page(page, *pagep, dst_addr, dst_vma,
+				    pages_per_huge_page(h));
 		put_page(*pagep);
 		*pagep = NULL;
 	}
From: Muchun Song songmuchun@bytedance.com
commit 19b482c29b6f3805f1d8e93015847b89e2f7f3b1 upstream.
userfaultfd calls shmem_mfill_atomic_pte(), which does not do any cache flushing for the target page. The target page will then be mapped into user space at a different (user) address, which might alias with the kernel address used to copy the data in. Insert flush_dcache_page() in the non-zero-page case, and replace clear_highpage() with clear_user_highpage(), which already handles the cache maintenance.
Link: https://lkml.kernel.org/r/20220210123058.79206-6-songmuchun@bytedance.com
Fixes: 8d1039634206 ("userfaultfd: shmem: add shmem_mfill_zeropage_pte for userfaultfd support")
Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Signed-off-by: Muchun Song songmuchun@bytedance.com
Reviewed-by: Mike Kravetz mike.kravetz@oracle.com
Cc: Axel Rasmussen axelrasmussen@google.com
Cc: David Rientjes rientjes@google.com
Cc: Fam Zheng fam.zheng@bytedance.com
Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Cc: Lars Persson lars.persson@axis.com
Cc: Peter Xu peterx@redhat.com
Cc: Xiongchun Duan duanxiongchun@bytedance.com
Cc: Zi Yan ziy@nvidia.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/shmem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index a09b29ec2b45..7a46419d331d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2357,8 +2357,10 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 				/* don't free the page */
 				goto out_unacct_blocks;
 			}
+
+			flush_dcache_page(page);
 		} else {		/* ZEROPAGE */
-			clear_highpage(page);
+			clear_user_highpage(page, dst_addr);
 		}
 	} else {
 		page = *pagep;
From: Muchun Song songmuchun@bytedance.com
commit 7c25a0b89a487878b0691e6524fb5a8827322194 upstream.
userfaultfd calls mcopy_atomic_pte() and __mcopy_atomic(), which do not do any cache flushing for the target page. The target page will then be mapped into user space at a different (user) address, which might alias with the kernel address used to copy the data in. Fix this by inserting flush_dcache_page() after copy_from_user() succeeds.
Link: https://lkml.kernel.org/r/20220210123058.79206-7-songmuchun@bytedance.com
Fixes: b6ebaedb4cb1 ("userfaultfd: avoid mmap_sem read recursion in mcopy_atomic")
Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
Signed-off-by: Muchun Song songmuchun@bytedance.com
Cc: Axel Rasmussen axelrasmussen@google.com
Cc: David Rientjes rientjes@google.com
Cc: Fam Zheng fam.zheng@bytedance.com
Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Cc: Lars Persson lars.persson@axis.com
Cc: Mike Kravetz mike.kravetz@oracle.com
Cc: Peter Xu peterx@redhat.com
Cc: Xiongchun Duan duanxiongchun@bytedance.com
Cc: Zi Yan ziy@nvidia.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/userfaultfd.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 885e5adb0168..7259f96faaa0 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -153,6 +153,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 			/* don't free the page */
 			goto out;
 		}
+
+		flush_dcache_page(page);
 	} else {
 		page = *pagep;
 		*pagep = NULL;
@@ -628,6 +630,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 			err = -EFAULT;
 			goto out;
 		}
+		flush_dcache_page(page);
 		goto retry;
 	} else
 		BUG_ON(page);
From: Mel Gorman mgorman@techsingularity.net
commit ca7b59b1de72450b3e696bada3506a519ac5455c upstream.
Patch series "Follow-up on high-order PCP caching", v2.
Commit 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") was primarily aimed at reducing the cost of SLUB cache refills of high-order pages in two ways: firstly, zone lock acquisitions were reduced and, secondly, there were fewer buddy list modifications. This is a follow-up series fixing some issues that became apparent after merging.
Patch 1 is a functional fix. It's harmless but inefficient.
Patches 2-5 reduce the overhead of bulk freeing of PCP pages. While the overhead is small, it's cumulative and noticeable when truncating large files. The changelog for patch 4 includes results of a microbenchmark that deletes large sparse files with data in the page cache; sparse files were used to eliminate filesystem overhead.
Patch 6 addresses issues with high-order PCP pages being stored on PCP lists for too long. Pages freed on a CPU may not be quickly reused, and in some cases this can increase cache miss rates. Details are included in the changelog.
This patch (of 6):
free_pcppages_bulk() prefetches buddies about to be freed, but the order must also be passed in, as PCP lists store multiple orders.
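For reference, the buddy of a 2^order block differs from it only in bit "order" of the pfn, which is why a hardcoded order 0 prefetches the wrong page. A minimal sketch of the calculation (buddy_pfn_of() is an illustrative name, mirroring what __find_buddy_pfn() computes):

static inline unsigned long buddy_pfn_of(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);	/* flip bit 'order' of the pfn */
}

/* e.g. pfn 0x100 at order 3: the buddy is 0x108; with order forced
 * to 0, the code would prefetch 0x101 instead. */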
Link: https://lkml.kernel.org/r/20220217002227.5739-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20220217002227.5739-2-mgorman@techsingularity.net
Fixes: 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
Signed-off-by: Mel Gorman mgorman@techsingularity.net
Reviewed-by: Vlastimil Babka vbabka@suse.cz
Reviewed-by: Aaron Lu aaron.lu@intel.com
Tested-by: Aaron Lu aaron.lu@intel.com
Cc: Dave Hansen dave.hansen@linux.intel.com
Cc: Michal Hocko mhocko@kernel.org
Cc: Jesper Dangaard Brouer brouer@redhat.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/page_alloc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e6f211dcf82e..b2ef0e75fd29 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1432,10 +1432,10 @@ static bool bulkfree_pcp_prepare(struct page *page)
 }
 #endif /* CONFIG_DEBUG_VM */
 
-static inline void prefetch_buddy(struct page *page)
+static inline void prefetch_buddy(struct page *page, unsigned int order)
 {
 	unsigned long pfn = page_to_pfn(page);
-	unsigned long buddy_pfn = __find_buddy_pfn(pfn, 0);
+	unsigned long buddy_pfn = __find_buddy_pfn(pfn, order);
 	struct page *buddy = page + (buddy_pfn - pfn);
 
 	prefetch(buddy);
@@ -1512,7 +1512,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			 * prefetch buddy for the first pcp->batch nr of pages.
 			 */
 			if (prefetch_nr) {
-				prefetch_buddy(page);
+				prefetch_buddy(page, order);
 				prefetch_nr--;
 			}
 		} while (count > 0 && --batch_free && !list_empty(list));
From: Mel Gorman mgorman@techsingularity.net
commit 77fe7f136a7312954b1b8b7eeb4bc91fc3c14a3f upstream.
Eric Dumazet pointed out that commit 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") only checks the head page during PCP refill and allocation operations. This was an oversight and all pages should be checked. This will incur a small performance penalty but it's necessary for correctness.
Link: https://lkml.kernel.org/r/20220310092456.GJ15701@techsingularity.net
Fixes: 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
Signed-off-by: Mel Gorman mgorman@techsingularity.net
Reported-by: Eric Dumazet edumazet@google.com
Acked-by: Eric Dumazet edumazet@google.com
Reviewed-by: Shakeel Butt shakeelb@google.com
Acked-by: Vlastimil Babka vbabka@suse.cz
Acked-by: David Rientjes rientjes@google.com
Cc: Michal Hocko mhocko@kernel.org
Cc: Wei Xu weixugc@google.com
Cc: Greg Thelen gthelen@google.com
Cc: Hugh Dickins hughd@google.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/page_alloc.c | 46 +++++++++++++++++++++++-----------------------
 1 file changed, 23 insertions(+), 23 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2ef0e75fd29..adceee44adf6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2342,23 +2342,36 @@ static inline int check_new_page(struct page *page)
 	return 1;
 }
 
+static bool check_new_pages(struct page *page, unsigned int order)
+{
+	int i;
+	for (i = 0; i < (1 << order); i++) {
+		struct page *p = page + i;
+
+		if (unlikely(check_new_page(p)))
+			return true;
+	}
+
+	return false;
+}
+
 #ifdef CONFIG_DEBUG_VM
 /*
  * With DEBUG_VM enabled, order-0 pages are checked for expected state when
  * being allocated from pcp lists. With debug_pagealloc also enabled, they are
  * also checked when pcp lists are refilled from the free lists.
  */
-static inline bool check_pcp_refill(struct page *page)
+static inline bool check_pcp_refill(struct page *page, unsigned int order)
 {
 	if (debug_pagealloc_enabled_static())
-		return check_new_page(page);
+		return check_new_pages(page, order);
 	else
 		return false;
 }
 
-static inline bool check_new_pcp(struct page *page)
+static inline bool check_new_pcp(struct page *page, unsigned int order)
 {
-	return check_new_page(page);
+	return check_new_pages(page, order);
 }
 #else
 /*
@@ -2366,32 +2379,19 @@ static inline bool check_new_pcp(struct page *page)
  * when pcp lists are being refilled from the free lists. With debug_pagealloc
  * enabled, they are also checked when being allocated from the pcp lists.
  */
-static inline bool check_pcp_refill(struct page *page)
+static inline bool check_pcp_refill(struct page *page, unsigned int order)
 {
-	return check_new_page(page);
+	return check_new_pages(page, order);
 }
-static inline bool check_new_pcp(struct page *page)
+static inline bool check_new_pcp(struct page *page, unsigned int order)
 {
 	if (debug_pagealloc_enabled_static())
-		return check_new_page(page);
+		return check_new_pages(page, order);
 	else
 		return false;
 }
 #endif /* CONFIG_DEBUG_VM */
 
-static bool check_new_pages(struct page *page, unsigned int order)
-{
-	int i;
-	for (i = 0; i < (1 << order); i++) {
-		struct page *p = page + i;
-
-		if (unlikely(check_new_page(p)))
-			return true;
-	}
-
-	return false;
-}
-
 inline void post_alloc_hook(struct page *page, unsigned int order,
 				gfp_t gfp_flags)
 {
@@ -3037,7 +3037,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		if (unlikely(page == NULL))
 			break;
 
-		if (unlikely(check_pcp_refill(page)))
+		if (unlikely(check_pcp_refill(page, order)))
 			continue;
 
 		/*
@@ -3641,7 +3641,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 		page = list_first_entry(list, struct page, lru);
 		list_del(&page->lru);
 		pcp->count -= 1 << order;
-	} while (check_new_pcp(page));
+	} while (check_new_pcp(page, order));
 
 	return page;
 }
From: Naoya Horiguchi naoya.horiguchi@nec.com
commit 046545a661af2beec21de7b90ca0e35f05088a81 upstream.
When an uncorrected memory error is consumed, there is a race between the CMCI from the memory controller reporting an uncorrected error with a UCNA signature, and the core reporting an SRAR-signature machine check when the data is about to be consumed.
If the CMCI wins that race, the page is marked poisoned when uc_decode_notifier() calls memory_failure(), and the machine check processing code then finds the page already poisoned. It calls kill_accessing_process() to make sure a SIGBUS is sent, but returns the wrong error code.
Console log looks like this:
mce: Uncorrected hardware memory error in user-access at 3710b3400
Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
Memory failure: 0x3710b3: already hardware poisoned
Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
mce: Memory error not recovered
kill_accessing_process() is supposed to return -EHWPOISON to indicate that a SIGBUS has already been sent to the process, so that kill_me_maybe() doesn't have to send it again. But the current code simply fails to do this, so fix it to work as intended. This change avoids the noisy "Memory error not recovered" message and skips the duplicate SIGBUS.
[tony.luck@intel.com: reword some parts of commit message]
Link: https://lkml.kernel.org/r/20220113231117.1021405-1-naoya.horiguchi@linux.dev
Fixes: a3f5d80ea401 ("mm,hwpoison: send SIGBUS with error virutal address")
Signed-off-by: Naoya Horiguchi naoya.horiguchi@nec.com
Reported-by: Youquan Song youquan.song@intel.com
Cc: Tony Luck tony.luck@intel.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/memory-failure.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 15dcedbc1730..682eedb5ea75 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -707,8 +707,10 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
 			      (void *)&priv);
 	if (ret == 1 && priv.tk.addr)
 		kill_proc(&priv.tk, pfn, flags);
+	else
+		ret = 0;
 	mmap_read_unlock(p->mm);
-	return ret ? -EFAULT : -EHWPOISON;
+	return ret > 0 ? -EHWPOISON : -EFAULT;
 }
 
 static const char *action_name[] = {
From: Miaohe Lin linmiaohe@huawei.com
commit 5c2a956c3eea173b2bc89f632507c0eeaebf6c4a upstream.
user_shm_lock() forgets to set "allowed" back to 0 when get_ucounts() fails, so a later user_shm_unlock() might do an extra dec_rlimit_ucounts(). Fix this by resetting "allowed" to 0 on that failure path.
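A hedged sketch of the caller pairing that goes wrong (the caller shape is illustrative, modeled on the SysV shm SHM_LOCK/SHM_UNLOCK path): user_shm_unlock() is only reached when user_shm_lock() returned nonzero, so a stale "allowed" of 1 (possible when RLIMIT_MEMLOCK is unlimited) sets up an unbalanced decrement:

	if (user_shm_lock(size, ucounts)) {
		/* ... region treated as locked and accounted ... */
		user_shm_unlock(size, ucounts);	/* does dec_rlimit_ucounts()
						 * again, though the failing
						 * lock path already rolled
						 * back its own increment */
	}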
Link: https://lkml.kernel.org/r/20220310132417.41189-1-linmiaohe@huawei.com
Fixes: d7c9e99aee48 ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
Signed-off-by: Miaohe Lin linmiaohe@huawei.com
Reviewed-by: Andrew Morton akpm@linux-foundation.org
Acked-by: Hugh Dickins hughd@google.com
Cc: Herbert van den Bergh herbert.van.den.bergh@oracle.com
Cc: Chris Mason chris.mason@oracle.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/mlock.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/mm/mlock.c b/mm/mlock.c
index 37f969ec68fa..b565b1aac8d4 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -838,6 +838,7 @@ int user_shm_lock(size_t size, struct ucounts *ucounts)
 	}
 	if (!get_ucounts(ucounts)) {
 		dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
+		allowed = 0;
 		goto out;
 	}
 	allowed = 1;
From: Huang Ying ying.huang@intel.com
commit fc89213a636c3735eb3386f10a34c082271b4192 upstream.
In commit ac16ec835314 ("mm: migrate: support multiple target nodes demotion"), after the first demotion target node is found, we continue to check the next candidate obtained via find_next_best_node(), in order to find all demotion target nodes at the same NUMA distance. But one side effect of find_next_best_node() is that it marks the returned candidate node in the "used" parameter; so even if the candidate fails the subsequent NUMA distance check, it will never be used as a demotion target for any following node. For example, for a system as follows,
node distances:
node   0   1   2   3
   0:  10  21  17  28
   1:  21  10  28  17
   2:  17  28  10  28
   3:  28  17  28  10
when we establish the demotion targets for node 0, node 2 is added to the demotion target node set in the first round. Then, in the second round, node 3 is checked and rejected because distance(0, 3) > distance(0, 2); but node 3 is set in the "used" nodemask anyway. So when we go on to establish the demotion targets for node 1, no node is available. This is wrong: node 3 should be set as the demotion target of node 1.
To fix this, clear the candidate node in the "used" nodemask when it fails the distance check, so that it remains available for the following nodes.
The bug was reproduced, and verified to be fixed by this patch, on a 2-socket server with DRAM and PMEM.
Link: https://lkml.kernel.org/r/20220128055940.1792614-1-ying.huang@intel.com
Fixes: ac16ec835314 ("mm: migrate: support multiple target nodes demotion")
Signed-off-by: "Huang, Ying" ying.huang@intel.com
Reviewed-by: Baolin Wang baolin.wang@linux.alibaba.com
Cc: Baolin Wang baolin.wang@linux.alibaba.com
Cc: Dave Hansen dave.hansen@linux.intel.com
Cc: Zi Yan ziy@nvidia.com
Cc: Oscar Salvador osalvador@suse.de
Cc: Yang Shi shy828301@gmail.com
Cc: zhongjiang-ali zhongjiang-ali@linux.alibaba.com
Cc: Xunlei Pang xlpang@linux.alibaba.com
Cc: Mel Gorman mgorman@techsingularity.net
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/migrate.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index fc0e14ecd42a..ac7673e43dda 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -3085,18 +3085,21 @@ static int establish_migrate_target(int node, nodemask_t *used,
 	if (best_distance != -1) {
 		val = node_distance(node, migration_target);
 		if (val > best_distance)
-			return NUMA_NO_NODE;
+			goto out_clear;
 	}
 
 	index = nd->nr;
 	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
 		      "Exceeds maximum demotion target nodes\n"))
-		return NUMA_NO_NODE;
+		goto out_clear;
 
 	nd->nodes[index] = migration_target;
 	nd->nr++;
 
 	return migration_target;
+out_clear:
+	node_clear(migration_target, *used);
+	return NUMA_NO_NODE;
 }
 
 /*
From: Hugh Dickins hughd@google.com
commit 9d84604b845c3888d1bede43d16ab3ebedb13e24 upstream.
Migration entries do not contribute to a page's reference count: move __split_huge_pmd_locked()'s page_ref_add() into pmd_migration's else block (along with the page_count() check - a page is quite likely to have its reference count frozen to 0 when a migration entry is found).
This fixes a very rare anonymous memory leak that could occur after a split_huge_pmd() raced with an anon split_huge_page() or an anon THP migrate_pages(): the wrongly raised refcount stopped the page (perhaps small, perhaps huge, depending on when the race hit) from ever being freed.
At first I thought there were worse risks, from prematurely unfreezing a frozen page: but now think that would only affect page cache pages, which do not come this way (except for anonymous pages in swap cache, perhaps).
Link: https://lkml.kernel.org/r/84792468-f512-e48f-378c-e34c3641e97@google.com
Fixes: ec0abae6dcdf ("mm/thp: fix __split_huge_pmd_locked() for migration PMD")
Signed-off-by: Hugh Dickins hughd@google.com
Reviewed-by: Yang Shi shy828301@gmail.com
Cc: Ralph Campbell rcampbell@nvidia.com
Cc: Zi Yan ziy@nvidia.com
Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/huge_memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..468fca576bc2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2055,9 +2055,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		young = pmd_young(old_pmd);
 		soft_dirty = pmd_soft_dirty(old_pmd);
 		uffd_wp = pmd_uffd_wp(old_pmd);
+		VM_BUG_ON_PAGE(!page_count(page), page);
+		page_ref_add(page, HPAGE_PMD_NR - 1);
 	}
-	VM_BUG_ON_PAGE(!page_count(page), page);
-	page_ref_add(page, HPAGE_PMD_NR - 1);
 
 	/*
 	 * Withdraw the table only after we mark the pmd entry invalid.
From: Hugh Dickins hughd@google.com
commit bd55b0c2d64e84a75575f548a33a3dfecc135b65 upstream.
PageDoubleMap is maintained differently for anon and for shmem+file: the shmem+file one was never cleared, because a safe place to do so could not be found; so it would blight future use of the cached hugepage until evicted.
See https://lore.kernel.org/lkml/1571938066-29031-1-git-send-email-yang.shi@linu...
But page_add_file_rmap() does provide a safe place to do so (though later than one might wish): allowing testing to return to an initial state without a damaging drop_caches.
Link: https://lkml.kernel.org/r/61c5cf99-a962-9a25-597a-53ab1bd8fbc0@google.com
Fixes: 9a73f61bdb8a ("thp, mlock: do not mlock PTE-mapped file huge pages")
Signed-off-by: Hugh Dickins hughd@google.com
Reviewed-by: Yang Shi shy828301@gmail.com
Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/rmap.c | 11 +++++++++++
 1 file changed, 11 insertions(+)
diff --git a/mm/rmap.c b/mm/rmap.c
index 9e27f9f038d3..444d0d958aff 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1252,6 +1252,17 @@ void page_add_file_rmap(struct page *page, bool compound)
 		}
 		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
 			goto out;
+
+		/*
+		 * It is racy to ClearPageDoubleMap in page_remove_file_rmap();
+		 * but page lock is held by all page_add_file_rmap() compound
+		 * callers, and SetPageDoubleMap below warns if !PageLocked:
+		 * so here is a place that DoubleMap can be safely cleared.
+		 */
+		VM_WARN_ON_ONCE(!PageLocked(page));
+		if (nr == nr_pages && PageDoubleMap(page))
+			ClearPageDoubleMap(page);
+
 		if (PageSwapBacked(page))
 			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
 						nr_pages);
On Thu, 28 Apr 2022, Greg Kroah-Hartman wrote:
From: Hugh Dickins hughd@google.com
commit bd55b0c2d64e84a75575f548a33a3dfecc135b65 upstream.
PageDoubleMap is maintained differently for anon and for shmem+file: the shmem+file one was never cleared, because a safe place to do so could not be found; so it would blight future use of the cached hugepage until evicted.
See https://lore.kernel.org/lkml/1571938066-29031-1-git-send-email-yang.shi@linu...
But page_add_file_rmap() does provide a safe place to do so (though later than one might wish): allowing testing to return to an initial state without a damaging drop_caches.
Link: https://lkml.kernel.org/r/61c5cf99-a962-9a25-597a-53ab1bd8fbc0@google.com
Fixes: 9a73f61bdb8a ("thp, mlock: do not mlock PTE-mapped file huge pages")
Signed-off-by: Hugh Dickins hughd@google.com
Reviewed-by: Yang Shi shy828301@gmail.com
Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
NAK.
I thought we had a long-standing agreement that AUTOSEL does not try to add patches from akpm's tree which had not been marked for stable.
(Whereas, if a developer asks for such a patch to be added to stable later, and verifies the result, that's of course a different matter.)
I've chosen to answer to this patch of my 3 in your 14 AUTOSELs, because this one is just an improvement, not at all a bugfix needed for stable (maybe AUTOSEL noticed "racy" or "safely" in the comments, and misunderstood). The "Fixes" was intended to help any humans who wanted to backport into their trees.
I do recall that this 13/14, and 14/14, are mods to mm/rmap.c which followed other (mm/munlock) mods to mm/rmap.c in 5.18-rc1, which affected the out path of the function involved, and somehow made 14/14 a little cleaner. I'm sorry, but I just don't rate it worth my time at the moment, to verify whether 14/14 happens to have ended up as a correct patch or not.
And nobody can verify them without these AUTOSELs saying to which tree they are targeted - 5.17 I suppose.
Hugh
 mm/rmap.c | 11 +++++++++++
 1 file changed, 11 insertions(+)
diff --git a/mm/rmap.c b/mm/rmap.c
index 9e27f9f038d3..444d0d958aff 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1252,6 +1252,17 @@ void page_add_file_rmap(struct page *page, bool compound)
 		}
 		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
 			goto out;
+
+		/*
+		 * It is racy to ClearPageDoubleMap in page_remove_file_rmap();
+		 * but page lock is held by all page_add_file_rmap() compound
+		 * callers, and SetPageDoubleMap below warns if !PageLocked:
+		 * so here is a place that DoubleMap can be safely cleared.
+		 */
+		VM_WARN_ON_ONCE(!PageLocked(page));
+		if (nr == nr_pages && PageDoubleMap(page))
+			ClearPageDoubleMap(page);
+
 		if (PageSwapBacked(page))
 			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
 						nr_pages);
--
2.36.0
On Thu, Apr 28, 2022 at 09:51:58AM -0700, Hugh Dickins wrote:
On Thu, 28 Apr 2022, Greg Kroah-Hartman wrote:
From: Hugh Dickins hughd@google.com
commit bd55b0c2d64e84a75575f548a33a3dfecc135b65 upstream.
PageDoubleMap is maintained differently for anon and for shmem+file: the shmem+file one was never cleared, because a safe place to do so could not be found; so it would blight future use of the cached hugepage until evicted.
See https://lore.kernel.org/lkml/1571938066-29031-1-git-send-email-yang.shi@linu...
But page_add_file_rmap() does provide a safe place to do so (though later than one might wish): allowing testing to return to an initial state without a damaging drop_caches.
Link: https://lkml.kernel.org/r/61c5cf99-a962-9a25-597a-53ab1bd8fbc0@google.com
Fixes: 9a73f61bdb8a ("thp, mlock: do not mlock PTE-mapped file huge pages")
Signed-off-by: Hugh Dickins hughd@google.com
Reviewed-by: Yang Shi shy828301@gmail.com
Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
NAK.
I thought we had a long-standing agreement that AUTOSEL does not try to add patches from akpm's tree which had not been marked for stable.
True, this was my attempt at saying "hey these all look like they should go to stable trees, why not?"
I've chosen to answer to this patch of my 3 in your 14 AUTOSELs, because this one is just an improvement, not at all a bugfix needed for stable (maybe AUTOSEL noticed "racy" or "safely" in the comments, and misunderstood). The "Fixes" was intended to help any humans who wanted to backport into their trees.
This all was off of the Fixes: tag. Again, if these commits fix something why are they not for stable? I'm a human asking to backport these into the stable trees based on that :)
I do recall that this 13/14, and 14/14, are mods to mm/rmap.c which followed other (mm/munlock) mods to mm/rmap.c in 5.18-rc1, which affected the out path of the function involved, and somehow made 14/14 a little cleaner. I'm sorry, but I just don't rate it worth my time at the moment, to verify whether 14/14 happens to have ended up as a correct patch or not.
And nobody can verify them without these AUTOSELs saying to which tree they are targeted - 5.17 I suppose.
5.17 to start with, older ones based on where the Fixes: tags went to.
So do you really want me to drop these? I will but why are you adding fixes: tags if you don't want people to take them?
thanks,
greg k-h
On Thu, 28 Apr 2022, Greg Kroah-Hartman wrote:
On Thu, Apr 28, 2022 at 09:51:58AM -0700, Hugh Dickins wrote:
On Thu, 28 Apr 2022, Greg Kroah-Hartman wrote:
From: Hugh Dickins hughd@google.com
commit bd55b0c2d64e84a75575f548a33a3dfecc135b65 upstream.
PageDoubleMap is maintained differently for anon and for shmem+file: the shmem+file one was never cleared, because a safe place to do so could not be found; so it would blight future use of the cached hugepage until evicted.
See https://lore.kernel.org/lkml/1571938066-29031-1-git-send-email-yang.shi@linu...
But page_add_file_rmap() does provide a safe place to do so (though later than one might wish): allowing testing to return to an initial state without a damaging drop_caches.
Link: https://lkml.kernel.org/r/61c5cf99-a962-9a25-597a-53ab1bd8fbc0@google.com
Fixes: 9a73f61bdb8a ("thp, mlock: do not mlock PTE-mapped file huge pages")
Signed-off-by: Hugh Dickins hughd@google.com
Reviewed-by: Yang Shi shy828301@gmail.com
Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
NAK.
I thought we had a long-standing agreement that AUTOSEL does not try to add patches from akpm's tree which had not been marked for stable.
True, this was my attempt at saying "hey these all look like they should go to stable trees, why not?"
Okay, it seems I should have read "AUTOSEL" as "Hey, GregKH here, these all look like they should go to stable trees, why not?", which would have drawn a friendlier response.
The answer is that I considered stable at the time, and akpm did too, and none of my three (I've not looked through the other 11) are serious enough to be needed in stable; and I'm cautious about backports, because I know that the tree they went on top of differs thereabouts from 5.17.
Of course I think the patches in 5.18-rc are good, and yes, they're things I've thought worthwhile enough for me personally to port forward over several releases until I had time to send in. But that doesn't make them safe stable candidates, without someone to verify and vouch for the results in this or that tree - I run on a much slower clock than you and most around here, I do not have time for that at present (and would prefer not even to be having this conversation).
But I'm happily overruled if any mm guys think they are worth that extra effort, and will verify and vouch for them.
I've chosen to answer to this patch of my 3 in your 14 AUTOSELs, because this one is just an improvement, not at all a bugfix needed for stable (maybe AUTOSEL noticed "racy" or "safely" in the comments, and misunderstood). The "Fixes" was intended to help any humans who wanted to backport into their trees.
This all was off of the Fixes: tag. Again, if these commits fix something why are they not for stable? I'm a human asking to backport these into the stable trees based on that :)
Your humanity is not in doubt :) But I think we've gone over this too many times - each year? There's a "Fixes:" tag and "Cc: stable" tag, and in akpm's tree we prefer to be able to specify "Fixes:" to help each other, without that automatically implying "Cc: stable". Andrew goes to considerable trouble to determine when "Cc: stable" is appropriate.
I do recall that this 13/14, and 14/14, are mods to mm/rmap.c which followed other (mm/munlock) mods to mm/rmap.c in 5.18-rc1, which affected the out path of the function involved, and somehow made 14/14 a little cleaner. I'm sorry, but I just don't rate it worth my time at the moment, to verify whether 14/14 happens to have ended up as a correct patch or not.
And nobody can verify them without these AUTOSELs saying to which tree they are targeted - 5.17 I suppose.
5.17 to start with, older ones based on where the Fixes: tags went to.
So do you really want me to drop these? I will but why are you adding fixes: tags if you don't want people to take them?
Yes, please drop them - thanks. As to the other 11: I hope authors will speak up one way or the other, but I'll drop out now.
Hugh
+Sasha and Paolo
On Thu, Apr 28, 2022, Hugh Dickins wrote:
On Thu, 28 Apr 2022, Greg Kroah-Hartman wrote:
On Thu, Apr 28, 2022 at 09:51:58AM -0700, Hugh Dickins wrote:
On Thu, 28 Apr 2022, Greg Kroah-Hartman wrote:
From: Hugh Dickins hughd@google.com
commit bd55b0c2d64e84a75575f548a33a3dfecc135b65 upstream.
PageDoubleMap is maintained differently for anon and for shmem+file: the shmem+file one was never cleared, because a safe place to do so could not be found; so it would blight future use of the cached hugepage until evicted.
See https://lore.kernel.org/lkml/1571938066-29031-1-git-send-email-yang.shi@linu...
But page_add_file_rmap() does provide a safe place to do so (though later than one might wish): allowing testing to return to an initial state without a damaging drop_caches.
Link: https://lkml.kernel.org/r/61c5cf99-a962-9a25-597a-53ab1bd8fbc0@google.com
Fixes: 9a73f61bdb8a ("thp, mlock: do not mlock PTE-mapped file huge pages")
Signed-off-by: Hugh Dickins hughd@google.com
Reviewed-by: Yang Shi shy828301@gmail.com
Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
NAK.
I thought we had a long-standing agreement that AUTOSEL does not try to add patches from akpm's tree which had not been marked for stable.
True, this was my attempt at saying "hey these all look like they should go to stable trees, why not?"
Okay, it seems I should have read "AUTOSEL" as "Hey, GregKH here, these all look like they should go to stable trees, why not?", which would have drawn a friendlier response.
FWIW, Sasha has been using MANUALSEL for the KVM tree to solicit an explicit ACK from Paolo for these types of patches. AFAICT, it has been working quite well.
On Thu, Apr 28, 2022 at 10:45:18PM +0000, Sean Christopherson wrote:
+Sasha and Paolo
On Thu, Apr 28, 2022, Hugh Dickins wrote:
On Thu, 28 Apr 2022, Greg Kroah-Hartman wrote:
On Thu, Apr 28, 2022 at 09:51:58AM -0700, Hugh Dickins wrote:
On Thu, 28 Apr 2022, Greg Kroah-Hartman wrote:
From: Hugh Dickins hughd@google.com
commit bd55b0c2d64e84a75575f548a33a3dfecc135b65 upstream.
PageDoubleMap is maintained differently for anon and for shmem+file: the shmem+file one was never cleared, because a safe place to do so could not be found; so it would blight future use of the cached hugepage until evicted.
See https://lore.kernel.org/lkml/1571938066-29031-1-git-send-email-yang.shi@linu...
But page_add_file_rmap() does provide a safe place to do so (though later than one might wish): allowing testing to return to an initial state without a damaging drop_caches.
Link: https://lkml.kernel.org/r/61c5cf99-a962-9a25-597a-53ab1bd8fbc0@google.com
Fixes: 9a73f61bdb8a ("thp, mlock: do not mlock PTE-mapped file huge pages")
Signed-off-by: Hugh Dickins hughd@google.com
Reviewed-by: Yang Shi shy828301@gmail.com
Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
NAK.
I thought we had a long-standing agreement that AUTOSEL does not try to add patches from akpm's tree which had not been marked for stable.
True, this was my attempt at saying "hey these all look like they should go to stable trees, why not?"
Okay, it seems I should have read "AUTOSEL" as "Hey, GregKH here, these all look like they should go to stable trees, why not?", which would have drawn a friendlier response.
FWIW, Sasha has been using MANUALSEL for the KVM tree to solicit an explicit ACK from Paolo for these types of patches. AFAICT, it has been working quite well.
Yes, that is what I should have put here, sorry about that. These were manually picked by me and I am asking if they should be included or not. I'll resend after dropping Hugh's patches from the series.
thanks,
greg k-h
On Thu, Apr 28, 2022 at 12:27:40PM -0700, Hugh Dickins wrote:
On Thu, 28 Apr 2022, Greg Kroah-Hartman wrote:
On Thu, Apr 28, 2022 at 09:51:58AM -0700, Hugh Dickins wrote:
On Thu, 28 Apr 2022, Greg Kroah-Hartman wrote:
From: Hugh Dickins hughd@google.com
commit bd55b0c2d64e84a75575f548a33a3dfecc135b65 upstream.
PageDoubleMap is maintained differently for anon and for shmem+file: the shmem+file one was never cleared, because a safe place to do so could not be found; so it would blight future use of the cached hugepage until evicted.
See https://lore.kernel.org/lkml/1571938066-29031-1-git-send-email-yang.shi@linu...
But page_add_file_rmap() does provide a safe place to do so (though later than one might wish): allowing testing to return to an initial state without a damaging drop_caches.
Link: https://lkml.kernel.org/r/61c5cf99-a962-9a25-597a-53ab1bd8fbc0@google.com
Fixes: 9a73f61bdb8a ("thp, mlock: do not mlock PTE-mapped file huge pages")
Signed-off-by: Hugh Dickins hughd@google.com
Reviewed-by: Yang Shi shy828301@gmail.com
Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
NAK.
I thought we had a long-standing agreement that AUTOSEL does not try to add patches from akpm's tree which had not been marked for stable.
I guess it was only between myself and mm/ :p
True, this was my attempt at saying "hey these all look like they should go to stable trees, why not?"
Okay, it seems I should have read "AUTOSEL" as "Hey, GregKH here, these all look like they should go to stable trees, why not?", which would have drawn a friendlier response.
FRIENDLYGREGBOT :)
The answer is that I considered stable at the time, and akpm did too, and none of my three (I've not looked through the other 11) are serious enough to be needed in stable; and I'm cautious about backports, because I know that the tree they went on top of differs thereabouts from 5.17.
Of course I think the patches in 5.18-rc are good, and yes, they're things I've thought worthwhile enough for me personally to port forward over several releases until I had time to send in. But that doesn't make them safe stable candidates, without someone to verify and vouch for the results in this or that tree - I run on a much slower clock than you and most around here, I do not have time for that at present (and would prefer not even to be having this conversation).
But I'm happily overruled if any mm guys think they are worth that extra effort, and will verify and vouch for them.
What's the extra effort here? We're seeing so many cases where we see issues with LTS kernels and we end up spending so much time triaging and diagnosing them only to find out that they've already been fixed.
Honestly, having them in -stable seems like *less* effort to me.
Hi!
I've chosen to answer to this patch of my 3 in your 14 AUTOSELs, because this one is just an improvement, not at all a bugfix needed for stable (maybe AUTOSEL noticed "racy" or "safely" in the comments, and misunderstood). The "Fixes" was intended to help any humans who wanted to backport into their trees.
This all was off of the Fixes: tag. Again, if these commits fix something why are they not for stable? I'm a human asking to backport these into the stable trees based on that :)
I see this as a repeated pattern: people add a Fixes: tag for trivial things that should not really go to stable (a typo in a comment?), and stable takes it as a serious bug that needs to be fixed in stable.
Best regards, Pavel
From: Hugh Dickins hughd@google.com
commit 5d543f13e2f5580828de885c751d68a35b6a493d upstream.
NR_FILE_MAPPED accounting in mm/rmap.c (for /proc/meminfo "Mapped" and /proc/vmstat "nr_mapped" and the memcg's memory.stat "mapped_file") is slightly flawed for file or shmem huge pages.
It is well thought out, and looks convincing, but there's a racy case when the careful counting in page_remove_file_rmap() (without page lock) gets discarded. So that in a workload like two "make -j20" kernel builds under memory pressure, with cc1 on hugepage text, "Mapped" can easily grow by a spurious 5MB or more on each iteration, ending up implausibly bigger than most other numbers in /proc/meminfo. And, hypothetically, might grow to the point of seriously interfering in mm/vmscan.c's heuristics, which do take NR_FILE_MAPPED into some consideration.
Fixed by moving the __mod_lruvec_page_state() down to where it will not be missed before return (and I've grown a bit tired of that oft-repeated but-not-everywhere comment on the __ness: it gets lost in the move here).
Does page_add_file_rmap() need the same change? I suspect not, because page lock is held in all relevant cases, and its skipping case looks safe; but it's much easier to be sure, if we do make the same change.
Link: https://lkml.kernel.org/r/e02e52a1-8550-a57c-ed29-f51191ea2375@google.com
Fixes: dd78fedde4b9 ("rmap: support file thp")
Signed-off-by: Hugh Dickins hughd@google.com
Reviewed-by: Yang Shi shy828301@gmail.com
Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/rmap.c | 30 ++++++++++++++----------------
 1 file changed, 14 insertions(+), 16 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 444d0d958aff..fa09b5eaff34 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1239,14 +1239,14 @@ void page_add_new_anon_rmap(struct page *page,
  */
 void page_add_file_rmap(struct page *page, bool compound)
 {
-	int i, nr = 1;
+	int i, nr = 0;
 
 	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
 	lock_page_memcg(page);
 	if (compound && PageTransHuge(page)) {
 		int nr_pages = thp_nr_pages(page);
 
-		for (i = 0, nr = 0; i < nr_pages; i++) {
+		for (i = 0; i < nr_pages; i++) {
 			if (atomic_inc_and_test(&page[i]._mapcount))
 				nr++;
 		}
@@ -1279,17 +1279,18 @@ void page_add_file_rmap(struct page *page, bool compound)
 			if (PageMlocked(page))
 				clear_page_mlock(head);
 		}
-		if (!atomic_inc_and_test(&page->_mapcount))
-			goto out;
+		if (atomic_inc_and_test(&page->_mapcount))
+			nr++;
 	}
-	__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
 out:
+	if (nr)
+		__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
 	unlock_page_memcg(page);
 }
 
 static void page_remove_file_rmap(struct page *page, bool compound)
 {
-	int i, nr = 1;
+	int i, nr = 0;
 
 	VM_BUG_ON_PAGE(compound && !PageHead(page), page);
 
@@ -1304,12 +1305,12 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 	if (compound && PageTransHuge(page)) {
 		int nr_pages = thp_nr_pages(page);
 
-		for (i = 0, nr = 0; i < nr_pages; i++) {
+		for (i = 0; i < nr_pages; i++) {
 			if (atomic_add_negative(-1, &page[i]._mapcount))
 				nr++;
 		}
 		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
-			return;
+			goto out;
 		if (PageSwapBacked(page))
 			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
 						-nr_pages);
@@ -1317,16 +1318,13 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 			__mod_lruvec_page_state(page, NR_FILE_PMDMAPPED,
 						-nr_pages);
 	} else {
-		if (!atomic_add_negative(-1, &page->_mapcount))
-			return;
+		if (atomic_add_negative(-1, &page->_mapcount))
+			nr++;
 	}
 
-	/*
-	 * We use the irq-unsafe __{inc|mod}_lruvec_page_state because
-	 * these counters are not modified in interrupt context, and
-	 * pte lock(a spinlock) is held, which implies preemption disabled.
-	 */
-	__mod_lruvec_page_state(page, NR_FILE_MAPPED, -nr);
+out:
+	if (nr)
+		__mod_lruvec_page_state(page, NR_FILE_MAPPED, -nr);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);