The patch titled
Subject: mm: madvise: return correct bytes advised with process_madvise
has been removed from the -mm tree. Its filename was
mm-madvise-return-correct-bytes-advised-with-process_madvise.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Charan Teja Kalla <quic_charante(a)quicinc.com>
Subject: mm: madvise: return correct bytes advised with process_madvise
Patch series "mm: madvise: return correct bytes processed with
process_madvise", v2. With the process_madvise(), always choose to return
non zero processed bytes over an error. This can help the user to know on
which VMA, passed in the 'struct iovec' vector list, is failed to advise
thus can take the decission of retrying/skipping on that VMA.
This patch (of 2):
The process_madvise() system call can return an error even after processing
some of the VMAs passed in the 'struct iovec' vector list, which leaves the
user unsure of where to restart the advice. This also contradicts the
syscall's man page[1] documentation, which states that the "return value
may be less than the total number of requested bytes, if an error occurred
after some iovec elements were already processed.".
Consider a user who passes 10 VMAs in the 'struct iovec' vector list, of
which 9 are processed and one fails. The syscall currently returns only the
error raised on the failed VMA, despite the first 9 VMAs having been
processed, leaving the user unsure of which VMA failed. Returning the
number of bytes processed instead tells the user where the failure
occurred, so the advice can be retried or skipped for that VMA.
[1]https://man7.org/linux/man-pages/man2/process_madvise.2.html.
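For illustration, a minimal userspace sketch of a caller relying on this
partial-return behaviour; it assumes headers new enough to define
__NR_process_madvise and MADV_COLD, and a pidfd obtained elsewhere. The
helper and its error handling are hypothetical, not part of this patch:
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>
static ssize_t advise_regions(int pidfd, struct iovec *iov, size_t n)
{
        size_t i, total = 0;
        ssize_t ret;

        for (i = 0; i < n; i++)
                total += iov[i].iov_len;

        /* No glibc wrapper is assumed; issue the raw syscall. */
        ret = syscall(__NR_process_madvise, pidfd, iov, n, MADV_COLD, 0);
        if (ret < 0)
                perror("process_madvise");      /* nothing was advised */
        else if ((size_t)ret < total)
                /* With the fix, ret tells us how far the kernel got. */
                fprintf(stderr, "advised %zd of %zu bytes\n", ret, total);

        return ret;
}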
Link: https://lkml.kernel.org/r/cover.1647008754.git.quic_charante@quicinc.com
Link: https://lkml.kernel.org/r/125b61a0edcee5c2db8658aed9d06a43a19ccafc.16470087…
Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
Signed-off-by: Charan Teja Kalla <quic_charante(a)quicinc.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Stephen Rothwell <sfr(a)canb.auug.org.au>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/madvise.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/mm/madvise.c~mm-madvise-return-correct-bytes-advised-with-process_madvise
+++ a/mm/madvise.c
@@ -1435,8 +1435,7 @@ SYSCALL_DEFINE5(process_madvise, int, pi
iov_iter_advance(&iter, iovec.iov_len);
}
- if (ret == 0)
- ret = total_len - iov_iter_count(&iter);
+ ret = (total_len - iov_iter_count(&iter)) ? : ret;
release_mm:
mmput(mm);
_
Patches currently in -mm which might be from quic_charante(a)quicinc.com are
The patch titled
Subject: mempolicy: mbind_range() set_policy() after vma_merge()
has been removed from the -mm tree. Its filename was
mempolicy-mbind_range-set_policy-after-vma_merge.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mempolicy: mbind_range() set_policy() after vma_merge()
v2.6.34 commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") introduced
vma_merge() to mbind_range(); but unlike madvise, mlock and mprotect, it
issued a "continue" to the next vma where its precedents went on to update
flags on the current vma before advancing: that left the vma with the wrong
setting in the infamous vma_merge() case 8.
v3.10 commit 1444f92c8498 ("mm: merging memory blocks resets mempolicy")
tried to fix that in vma_adjust(), without fully understanding the issue.
v3.11 commit 3964acd0dbec ("mm: mempolicy: fix mbind_range() &&
vma_adjust() interaction") reverted that, and went about the fix in the
right way, but chose to optimize out an unnecessary mpol_dup() with a
prior mpol_equal() test. But on tmpfs, that also pessimized out the vital
call to its ->set_policy(), leaving the new mbind unenforced.
The user visible effect was that the pages got allocated on the local
node (happened to be 0), after the mbind() caller had specifically
asked for them to be allocated on node 1. There was not any page
migration involved in the case reported: the pages simply got allocated
on the wrong node.
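For illustration only, a minimal sketch of that kind of request, assuming
libnuma's <numaif.h> mbind() wrapper (link with -lnuma) and a machine with
at least two memory nodes; it does not reproduce the vma_merge() case 8
interaction itself:
#include <numaif.h>             /* mbind(), MPOL_BIND */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main(void)
{
        size_t len = 2 * 1024 * 1024;
        /* MAP_SHARED|MAP_ANONYMOUS memory is shmem-backed, so its
         * vm_ops->set_policy() is what the fix makes sure gets called. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        unsigned long nodemask = 1UL << 1;      /* ask for node 1 only */

        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
                perror("mbind");
                return 1;
        }
        memset(p, 1, len);      /* faulting in should allocate on node 1 */
        return 0;
}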
Just delete that optimization now (though it could be made conditional on
vma not having a set_policy). Also remove the "next" variable: it turned
out to be blameless, but also pointless.
Link: https://lkml.kernel.org/r/319e4db9-64ae-4bca-92f0-ade85d342ff@google.com
Fixes: 3964acd0dbec ("mm: mempolicy: fix mbind_range() && vma_adjust() interaction")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Oleg Nesterov <oleg(a)redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mempolicy.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
--- a/mm/mempolicy.c~mempolicy-mbind_range-set_policy-after-vma_merge
+++ a/mm/mempolicy.c
@@ -786,7 +786,6 @@ static int vma_replace_policy(struct vm_
static int mbind_range(struct mm_struct *mm, unsigned long start,
unsigned long end, struct mempolicy *new_pol)
{
- struct vm_area_struct *next;
struct vm_area_struct *prev;
struct vm_area_struct *vma;
int err = 0;
@@ -801,8 +800,7 @@ static int mbind_range(struct mm_struct
if (start > vma->vm_start)
prev = vma;
- for (; vma && vma->vm_start < end; prev = vma, vma = next) {
- next = vma->vm_next;
+ for (; vma && vma->vm_start < end; prev = vma, vma = vma->vm_next) {
vmstart = max(start, vma->vm_start);
vmend = min(end, vma->vm_end);
@@ -817,10 +815,6 @@ static int mbind_range(struct mm_struct
anon_vma_name(vma));
if (prev) {
vma = prev;
- next = vma->vm_next;
- if (mpol_equal(vma_policy(vma), new_pol))
- continue;
- /* vma_merge() joined vma && vma->next, case 8 */
goto replace;
}
if (vma->vm_start != vmstart) {
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-delete-__clearpagewaiters.patch
mm-filemap_unaccount_folio-large-skip-mapcount-fixup.patch
mm-thp-fix-nr_file_mapped-accounting-in-page__file_rmap.patch
mm-warn-on-deleting-redirtied-only-if-accounted.patch
mm-unmap_mapping_range_tree-with-i_mmap_rwsem-shared.patch
The patch titled
Subject: mm: invalidate hwpoison page cache page in fault path
has been removed from the -mm tree. Its filename was
mm-clean-up-hwpoison-page-cache-page-in-fault-path.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Rik van Riel <riel(a)surriel.com>
Subject: mm: invalidate hwpoison page cache page in fault path
Sometimes the page offlining code can leave behind a hwpoisoned clean page
cache page. This can lead to programs being killed over and over and over
again as they fault in the hwpoisoned page, get killed, and then get
re-spawned by whatever wanted to run them.
This is particularly embarrassing when the page was offlined due to having
too many corrected memory errors. Now we are killing tasks due to them
trying to access memory that probably isn't even corrupted.
This problem can be avoided by invalidating the page from the page fault
handler, which already has a branch for dealing with these kinds of pages.
With this patch we simply pretend the page fault was successful if the
page was invalidated, return to userspace, incur another page fault, read
in the file from disk (to a new memory page), and then everything works
again.
Link: https://lkml.kernel.org/r/20220212213740.423efcea@imladris.surriel.com
Signed-off-by: Rik van Riel <riel(a)surriel.com>
Reviewed-by: Miaohe Lin <linmiaohe(a)huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/mm/memory.c~mm-clean-up-hwpoison-page-cache-page-in-fault-path
+++ a/mm/memory.c
@@ -3877,11 +3877,16 @@ static vm_fault_t __do_fault(struct vm_f
return ret;
if (unlikely(PageHWPoison(vmf->page))) {
- if (ret & VM_FAULT_LOCKED)
+ vm_fault_t poisonret = VM_FAULT_HWPOISON;
+ if (ret & VM_FAULT_LOCKED) {
+ /* Retry if a clean page was removed from the cache. */
+ if (invalidate_inode_page(vmf->page))
+ poisonret = 0;
unlock_page(vmf->page);
+ }
put_page(vmf->page);
vmf->page = NULL;
- return VM_FAULT_HWPOISON;
+ return poisonret;
}
if (unlikely(!(ret & VM_FAULT_LOCKED)))
_
Patches currently in -mm which might be from riel(a)surriel.com are
The patch titled
Subject: mm/pages_alloc.c: don't create ZONE_MOVABLE beyond the end of a node
has been removed from the -mm tree. Its filename was
mm-pages_allocc-dont-create-zone_movable-beyond-the-end-of-a-node.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Alistair Popple <apopple(a)nvidia.com>
Subject: mm/pages_alloc.c: don't create ZONE_MOVABLE beyond the end of a node
ZONE_MOVABLE uses the remaining memory in each node. Its starting pfn is
also aligned to MAX_ORDER_NR_PAGES. It is possible for the remaining
memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is not
enough room for ZONE_MOVABLE on that node.
Unfortunately this condition is not checked for. This leads to
zone_movable_pfn[] getting set to a pfn greater than the last pfn in a
node.
calculate_node_totalpages() then sets zone->present_pages to be greater
than zone->spanned_pages which is invalid, as spanned_pages represents the
maximum number of pages in a zone assuming no holes.
Subsequently it is possible free_area_init_core() will observe a zone of
size zero with present pages. In this case it will skip setting up the
zone, including the initialisation of free_lists[].
However populated_zone() checks zone->present_pages to see if a zone has
memory available. This is used by iterators such as walk_zones_in_node().
pagetypeinfo_showfree() uses this to walk the free_list of each zone in
each node, which are assumed to be initialised due to the zone not being
empty. As free_area_init_core() never initialised the free_lists[] this
results in the following kernel crash when trying to read
/proc/pagetypeinfo:
[ 67.534914] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 67.535429] #PF: supervisor read access in kernel mode
[ 67.535789] #PF: error_code(0x0000) - not-present page
[ 67.536128] PGD 0 P4D 0
[ 67.536305] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
[ 67.536696] CPU: 0 PID: 456 Comm: cat Not tainted 5.16.0 #461
[ 67.537096] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[ 67.537638] RIP: 0010:pagetypeinfo_show+0x163/0x460
[ 67.537992] Code: 9e 82 e8 80 57 0e 00 49 8b 06 b9 01 00 00 00 4c 39 f0 75 16 e9 65 02 00 00 48 83 c1 01 48 81 f9 a0 86 01 00 0f 84 48 02 00 00 <48> 8b 00 4c 39 f0 75 e7 48 c7 c2 80 a2 e2 82 48 c7 c6 79 ef e3 82
[ 67.538259] RSP: 0018:ffffc90001c4bd10 EFLAGS: 00010003
[ 67.538259] RAX: 0000000000000000 RBX: ffff88801105f638 RCX: 0000000000000001
[ 67.538259] RDX: 0000000000000001 RSI: 000000000000068b RDI: ffff8880163dc68b
[ 67.538259] RBP: ffffc90001c4bd90 R08: 0000000000000001 R09: ffff8880163dc67e
[ 67.538259] R10: 656c6261766f6d6e R11: 6c6261766f6d6e55 R12: ffff88807ffb4a00
[ 67.538259] R13: ffff88807ffb49f8 R14: ffff88807ffb4580 R15: ffff88807ffb3000
[ 67.538259] FS: 00007f9c83eff5c0(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[ 67.538259] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 67.538259] CR2: 0000000000000000 CR3: 0000000013c8e000 CR4: 0000000000350ef0
[ 67.538259] Call Trace:
[ 67.538259] <TASK>
[ 67.538259] seq_read_iter+0x128/0x460
[ 67.538259] ? aa_file_perm+0x1af/0x5f0
[ 67.538259] proc_reg_read_iter+0x51/0x80
[ 67.538259] ? lock_is_held_type+0xea/0x140
[ 67.538259] new_sync_read+0x113/0x1a0
[ 67.538259] vfs_read+0x136/0x1d0
[ 67.538259] ksys_read+0x70/0xf0
[ 67.538259] __x64_sys_read+0x1a/0x20
[ 67.538259] do_syscall_64+0x3b/0xc0
[ 67.538259] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 67.538259] RIP: 0033:0x7f9c83e23cce
[ 67.538259] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 13 0a 00 e8 c9 e3 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
[ 67.538259] RSP: 002b:00007fff116e1a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 67.538259] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f9c83e23cce
[ 67.538259] RDX: 0000000000020000 RSI: 00007f9c83a2c000 RDI: 0000000000000003
[ 67.538259] RBP: 00007f9c83a2c000 R08: 00007f9c83a2b010 R09: 0000000000000000
[ 67.538259] R10: 00007f9c83f2d7d0 R11: 0000000000000246 R12: 0000000000000000
[ 67.538259] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
[ 67.538259] </TASK>
Fix this by checking that the aligned zone_movable_pfn[] does not exceed
the end of the node, and if it does skip creating a movable zone on this
node.
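As a worked example of that check (all pfn values hypothetical, assuming a
MAX_ORDER_NR_PAGES of 1024 pages, i.e. 4MiB with 4KiB pages), the sketch
below mirrors the arithmetic the patch adds:
#include <stdio.h>
#define MAX_ORDER_NR_PAGES      1024UL  /* hypothetical alignment */
static unsigned long roundup_pfn(unsigned long pfn, unsigned long align)
{
        return (pfn + align - 1) / align * align;
}
int main(void)
{
        unsigned long end_pfn = 0x100200;       /* one past the node's last pfn */
        unsigned long movable_pfn = 0x100100;   /* < MAX_ORDER_NR_PAGES left */

        movable_pfn = roundup_pfn(movable_pfn, MAX_ORDER_NR_PAGES);
        /* 0x100100 rounds up to 0x100400, beyond the end of the node... */
        if (movable_pfn >= end_pfn)
                movable_pfn = 0;        /* ...so skip ZONE_MOVABLE on this node */

        printf("zone_movable_pfn = %#lx\n", movable_pfn);
        return 0;
}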
Link: https://lkml.kernel.org/r/20220215025831.2113067-1-apopple@nvidia.com
Fixes: 2a1e274acf0b ("Create the ZONE_MOVABLE zone")
Signed-off-by: Alistair Popple <apopple(a)nvidia.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Mel Gorman <mgorman(a)techsingularity.net>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
--- a/mm/page_alloc.c~mm-pages_allocc-dont-create-zone_movable-beyond-the-end-of-a-node
+++ a/mm/page_alloc.c
@@ -7951,10 +7951,17 @@ restart:
out2:
/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
- for (nid = 0; nid < MAX_NUMNODES; nid++)
+ for (nid = 0; nid < MAX_NUMNODES; nid++) {
+ unsigned long start_pfn, end_pfn;
+
zone_movable_pfn[nid] =
roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
+ get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
+ if (zone_movable_pfn[nid] >= end_pfn)
+ zone_movable_pfn[nid] = 0;
+ }
+
out:
/* restore the node_state */
node_states[N_MEMORY] = saved_node_state;
_
Patches currently in -mm which might be from apopple(a)nvidia.com are
The patch titled
Subject: mm: don't skip swap entry even if zap_details specified
has been removed from the -mm tree. Its filename was
mm-dont-skip-swap-entry-even-if-zap_details-specified.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Peter Xu <peterx(a)redhat.com>
Subject: mm: don't skip swap entry even if zap_details specified
Patch series "mm: Rework zap ptes on swap entries", v5.
Patch 1 should fix a long-standing bug in zap_pte_range()'s use of
zap_details. The risk is that some swap entries could be skipped when they
should have been zapped.
Migration entries are not the major concern, because file-backed memory is
always zapped in the pattern "first without the page lock, then re-zap with
the page lock", so the second zap will always make sure all migration
entries are recovered.
However, genuine swap entries can be skipped erroneously. A reproducer is
provided in the commit message of patch 1.
Patches 2-4 are cleanups based on patch 1. With the whole patchset applied,
we should have a very clean view of zap_pte_range().
Only patch 1 needs to be backported to stable, if necessary.
This patch (of 4):
The "details" pointer shouldn't be the token to decide whether we should
skip swap entries.
For example, when the callers specified details->zap_mapping==NULL, it
means the user wants to zap all the pages (including COWed pages), then we
need to look into swap entries because there can be private COWed pages
that was swapped out.
Skipping some swap entries when details is non-NULL may lead to wrongly
leaving some of the swap entries while we should have zapped them.
A reproducer of the problem:
===8<===
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <stdio.h>
#include <assert.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
int page_size;
int shmem_fd;
char *buffer;
void main(void)
{
        int ret;
        char val;

        page_size = getpagesize();
        shmem_fd = memfd_create("test", 0);
        assert(shmem_fd >= 0);

        ret = ftruncate(shmem_fd, page_size * 2);
        assert(ret == 0);

        buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE, shmem_fd, 0);
        assert(buffer != MAP_FAILED);

        /* Write private page, swap it out */
        buffer[page_size] = 1;
        madvise(buffer, page_size * 2, MADV_PAGEOUT);

        /* This should drop private buffer[page_size] already */
        ret = ftruncate(shmem_fd, page_size);
        assert(ret == 0);

        /* Recover the size */
        ret = ftruncate(shmem_fd, page_size * 2);
        assert(ret == 0);

        /* Re-read the data, it should be all zero */
        val = buffer[page_size];
        if (val == 0)
                printf("Good\n");
        else
                printf("BUG\n");
}
===8<===
We don't need to touch up the pmd path, because pmd never had an issue with
swap entries. For example, shmem pmd migration is always split to pte
level, and the same goes for swapping of anonymous memory.
Add another helper, should_zap_cows(), so that we can also check whether we
should zap private mappings when no page pointer is specified.
This patch drops that trick, so we handle swap ptes coherently. Meanwhile
we should do the same check upon migration entries, hwpoison entries and
genuine swap entries too.
To be explicit, we should still remember to keep the private entries if
even_cows==false, and always zap them when even_cows==true.
The issue seems to exist since the initial git commit.
[peterx(a)redhat.com: comment tweaks]
Link: https://lkml.kernel.org/r/20220217060746.71256-2-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220217060746.71256-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220216094810.60572-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220216094810.60572-2-peterx@redhat.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Peter Xu <peterx(a)redhat.com>
Reviewed-by: John Hubbard <jhubbard(a)nvidia.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: "Kirill A . Shutemov" <kirill(a)shutemov.name>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory.c | 40 +++++++++++++++++++++++++++++++---------
1 file changed, 31 insertions(+), 9 deletions(-)
--- a/mm/memory.c~mm-dont-skip-swap-entry-even-if-zap_details-specified
+++ a/mm/memory.c
@@ -1313,6 +1313,17 @@ struct zap_details {
struct folio *single_folio; /* Locked folio to be unmapped */
};
+/* Whether we should zap all COWed (private) pages too */
+static inline bool should_zap_cows(struct zap_details *details)
+{
+ /* By default, zap all pages */
+ if (!details)
+ return true;
+
+ /* Or, we zap COWed pages only if the caller wants to */
+ return !details->zap_mapping;
+}
+
/*
* We set details->zap_mapping when we want to unmap shared but keep private
* pages. Return true if skip zapping this page, false otherwise.
@@ -1320,11 +1331,15 @@ struct zap_details {
static inline bool
zap_skip_check_mapping(struct zap_details *details, struct page *page)
{
- if (!details || !page)
+ /* If we can make a decision without *page.. */
+ if (should_zap_cows(details))
+ return false;
+
+ /* E.g. the caller passes NULL for the case of a zero page */
+ if (!page)
return false;
- return details->zap_mapping &&
- (details->zap_mapping != page_rmapping(page));
+ return details->zap_mapping != page_rmapping(page);
}
static unsigned long zap_pte_range(struct mmu_gather *tlb,
@@ -1405,17 +1420,24 @@ again:
continue;
}
- /* If details->check_mapping, we leave swap entries. */
- if (unlikely(details))
- continue;
-
- if (!non_swap_entry(entry))
+ if (!non_swap_entry(entry)) {
+ /* Genuine swap entry, hence a private anon page */
+ if (!should_zap_cows(details))
+ continue;
rss[MM_SWAPENTS]--;
- else if (is_migration_entry(entry)) {
+ } else if (is_migration_entry(entry)) {
struct page *page;
page = pfn_swap_entry_to_page(entry);
+ if (zap_skip_check_mapping(details, page))
+ continue;
rss[mm_counter(page)]--;
+ } else if (is_hwpoison_entry(entry)) {
+ if (!should_zap_cows(details))
+ continue;
+ } else {
+ /* We should have covered all the swap entry types */
+ WARN_ON_ONCE(1);
}
if (unlikely(!free_swap_and_cache(entry)))
print_bad_pte(vma, addr, ptent, NULL);
_
Patches currently in -mm which might be from peterx(a)redhat.com are
The patch titled
Subject: mm: fs: fix lru_cache_disabled race in bh_lru
has been removed from the -mm tree. Its filename was
mm-fs-fix-lru_cache_disabled-race-in-bh_lru.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Minchan Kim <minchan(a)kernel.org>
Subject: mm: fs: fix lru_cache_disabled race in bh_lru
Check lru_cache_disabled under bh_lru_lock. Otherwise, the race below can
occur, and migration of pages containing a buffer_head fails.
CPU 0                                   CPU 1

bh_lru_install
                                        lru_cache_disable
  lru_cache_disabled = false
                                        atomic_inc(&lru_disable_count);
                                        invalidate_bh_lrus_cpu of CPU 0
                                          bh_lru_lock
                                          __invalidate_bh_lrus
                                          bh_lru_unlock
  bh_lru_lock
  install the bh
  bh_lru_unlock
When this race happens, a CMA allocation fails, which is critical for
workloads that depend on CMA.
Link: https://lkml.kernel.org/r/20220308180709.2017638-1-minchan@kernel.org
Fixes: 8cc621d2f45d ("mm: fs: invalidate BH LRU during page migration")
Signed-off-by: Minchan Kim <minchan(a)kernel.org>
Cc: Chris Goldsworthy <cgoldswo(a)codeaurora.org>
Cc: Marcelo Tosatti <mtosatti(a)redhat.com>
Cc: John Dias <joaodias(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/buffer.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
--- a/fs/buffer.c~mm-fs-fix-lru_cache_disabled-race-in-bh_lru
+++ a/fs/buffer.c
@@ -1235,16 +1235,18 @@ static void bh_lru_install(struct buffer
int i;
check_irqs_on();
+ bh_lru_lock();
+
/*
* the refcount of buffer_head in bh_lru prevents dropping the
* attached page(i.e., try_to_free_buffers) so it could cause
* failing page migration.
* Skip putting upcoming bh into bh_lru until migration is done.
*/
- if (lru_cache_disabled())
+ if (lru_cache_disabled()) {
+ bh_lru_unlock();
return;
-
- bh_lru_lock();
+ }
b = this_cpu_ptr(&bh_lrus);
for (i = 0; i < BH_LRU_SIZE; i++) {
_
Patches currently in -mm which might be from minchan(a)kernel.org are
From: Mauricio Faria de Oliveira <mfo(a)canonical.com>
Subject: mm: fix race between MADV_FREE reclaim and blkdev direct IO read
Problem:
=======
Userspace might read the zero-page instead of actual data from a direct IO
read on a block device if madvise(MADV_FREE) has previously been called on
the buffers (this is discussed below), due to a race between page reclaim
on MADV_FREE and blkdev direct IO read.
- Race condition:
==============
During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
if the page is not dirty, then discards its rmap PTE(s) (vs. remap back
if the page is dirty).
However, after try_to_unmap_one() returns to shrink_page_list(), it might
keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
_one_ page reference, from the isolation for page reclaim).
Well, blkdev_direct_IO() gets references for all pages, and on READ
operations it only sets them dirty _later_.
So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
IO read from block devices, and page reclaim happens during
__blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
returns, but BEFORE the pages are set dirty, the situation happens.
The direct IO read eventually completes. Now, when userspace reads the
buffers, the PTE is no longer there and the page fault handler
do_anonymous_page() services that with the zero-page, NOT the data!
A synthetic reproducer is provided.
- Page faults:
===========
If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
happen, because that faults-in all pages as writeable, so
do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
direct IO. The userspace reads don't fault as the PTE is there (thus
zero-page is not used/setup).
But if page reclaim happens AFTER bio_iov_iter_get_pages() / BEFORE the
pages are set dirty, the PTE is no longer there and the subsequent page
faults can't help:
The data read from the block device probably won't generate faults due to
DMA (no MMU), but even if DMA weren't used, the read happens on different
virtual addresses (not user-mapped addresses), because `struct bio_vec`
stores `struct page` to figure out the addresses for the read.
Thus userspace reads (to user-mapped addresses) still fault, then
do_anonymous_page() gets another `struct page` that would address/ map to
other memory than the `struct page` used by `struct bio_vec` for the read.
(The original `struct page` is not available, since it wasn't freed, as
page_ref_freeze() failed due to more page refs. And even if it were
available, its data cannot be trusted anymore.)
Solution:
========
One solution is to check for the expected page reference count in
try_to_unmap_one().
There should be one reference from the isolation (that is also checked in
shrink_page_list() with page_ref_freeze()) plus one or more references
from page mapping(s) (put in discard: label). Further references mean
that rmap/PTE cannot be unmapped/nuked.
(Note: there might be more than one reference from mapping due to
fork()/clone() without CLONE_VM, which use the same `struct page` for
references, until the copy-on-write page gets copied.)
So, additional page references (e.g., from a direct IO read) now prevent
the rmap/PTE from being unmapped/dropped, just as the page is not freed
when shrink_page_list()'s page_ref_freeze() sees extra references.
- Races and Barriers:
==================
The new check in try_to_unmap_one() should be safe in races with
bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
done under the PTE lock.
The fast path doesn't take the lock, but it checks if the PTE has changed
and if so, it drops the reference and leaves the page for the slow path
(which does take that lock).
The fast path requires synchronization w/ full memory barrier: it writes
the page reference count first then it reads the PTE later, while
try_to_unmap() writes PTE first then it reads page refcount.
And a second barrier is needed, as the page dirty flag should not be read
before the page reference count (as in __remove_mapping()). (This can be
a load memory barrier only; no writes are involved.)
Call stack/comments:
- try_to_unmap_one()
  - page_vma_mapped_walk()
    - map_pte()                       # see pte_offset_map_lock():
        pte_offset_map()
        spin_lock()
  - ptep_get_and_clear()              # write PTE
  - smp_mb()                          # (new barrier) GUP fast path
  - page_ref_count()                  # (new check) read refcount
  - page_vma_mapped_walk_done()       # see pte_unmap_unlock():
        pte_unmap()
        spin_unlock()

- bio_iov_iter_get_pages()
  - __bio_iov_iter_get_pages()
    - iov_iter_get_pages()
      - get_user_pages_fast()
        - internal_get_user_pages_fast()

          # fast path
          - lockless_pages_from_mm()
            - gup_{pgd,p4d,pud,pmd,pte}_range()
                ptep = pte_offset_map()               # not _lock()
                pte = ptep_get_lockless(ptep)
                page = pte_page(pte)
                try_grab_compound_head(page)          # inc refcount
                                                      # (RMW/barrier
                                                      #  on success)
                if (pte_val(pte) != pte_val(*ptep))   # read PTE
                        put_compound_head(page)       # dec refcount
                                                      # go slow path

          # slow path
          - __gup_longterm_unlocked()
            - get_user_pages_unlocked()
              - __get_user_pages_locked()
                - __get_user_pages()
                  - follow_{page,p4d,pud,pmd}_mask()
                    - follow_page_pte()
                        ptep = pte_offset_map_lock()
                        pte = *ptep
                        page = vm_normal_page(pte)
                        try_grab_page(page)           # inc refcount
                        pte_unmap_unlock()
- Huge Pages:
==========
Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
(aka lazyfree) pages are PageAnon() && !PageSwapBacked()
(madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
thus should reach shrink_page_list() -> split_huge_page_to_list() before
try_to_unmap[_one](), so it deals with normal pages only.
(And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens,
which should not or be rare, the page refcount should be greater than
mapcount: the head page is referenced by tail pages. That also prevents
checking the head `page` then incorrectly call page_remove_rmap(subpage)
for a tail page, that isn't even in the shrink_page_list()'s page_list (an
effect of split huge pmd/pmvw), as it might happen today in this unlikely
scenario.)
MADV_FREE'd buffers:
===================
So, back to the "if MADV_FREE pages are used as buffers" note. The case
is arguable, and subject to multiple interpretations.
The madvise(2) manual page on the MADV_FREE advice value says:
1) 'After a successful MADV_FREE ... data will be lost when
the kernel frees the pages.'
2) 'the free operation will be canceled if the caller writes
into the page' / 'subsequent writes ... will succeed and
then [the] kernel cannot free those dirtied pages'
3) 'If there is no subsequent write, the kernel can free the
pages at any time.'
Thoughts, questions, considerations... respectively:
1) Since the kernel didn't actually free the page (page_ref_freeze()
failed), should the data not have been lost? (on userspace read.)
2) Should writes performed by the direct IO read be able to cancel
the free operation?
- Should the direct IO read be considered as 'the caller' too,
as it's been requested by 'the caller'?
- Should the bio technique to dirty pages on return to userspace
(bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
be considered in another/special way here?
3) Should an upcoming write from a previously requested direct IO
read be considered as a subsequent write, so the kernel should
not free the pages? (as it's known at the time of page reclaim.)
And lastly:
Technically, the last point seems a reasonable consideration and balance,
as the madvise(2) manual page apparently (and fairly) assumes that 'writes'
are memory accesses from the userspace process (not explicitly considering
writes from the kernel or its corner cases; again, fairly). Plus, the
kernel fix for the corner case of the largely 'non-atomic write'
encompassed by a direct IO read operation is relatively simple, and it
helps.
Reproducer:
==========
@ test.c (simplified, but works)
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
int main() {
        int fd, i;
        char *buf;

        fd = open(DEV, O_RDONLY | O_DIRECT);

        buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
                buf[i] = 1; // init to non-zero

        madvise(buf, BUF_SIZE, MADV_FREE);

        read(fd, buf, BUF_SIZE);

        for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
                printf("%p: 0x%x\n", &buf[i], buf[i]);

        return 0;
}
@ block/fops.c (formerly fs/block_dev.c)
+#include <linux/swap.h>
...
... __blkdev_direct_IO[_simple](...)
{
...
+ if (!strcmp(current->comm, "good"))
+ shrink_all_memory(ULONG_MAX);
+
ret = bio_iov_iter_get_pages(...);
+
+ if (!strcmp(current->comm, "bad"))
+ shrink_all_memory(ULONG_MAX);
...
}
@ shell
# NUM_PAGES=4
# PAGE_SIZE=$(getconf PAGE_SIZE)
# yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
# DEV=$(losetup -f --show test.img)
# gcc -DDEV=\"$DEV\" \
-DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
-DPAGE_SIZE=${PAGE_SIZE} \
test.c -o test
# od -tx1 $DEV
0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
*
0040000
# mv test good
# ./good
0x7f7c10418000: 0x79
0x7f7c10419000: 0x79
0x7f7c1041a000: 0x79
0x7f7c1041b000: 0x79
# mv good bad
# ./bad
0x7fa1b8050000: 0x0
0x7fa1b8051000: 0x0
0x7fa1b8052000: 0x0
0x7fa1b8053000: 0x0
Note: the issue is consistent on v5.17-rc3, but it's intermittent with the
support of MADV_FREE on v4.5 (60%-70% error; needs swap). [wrap
do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c].
- v5.17-rc3:
# for i in {1..1000}; do ./good; done \
| cut -d: -f2 | sort | uniq -c
4000 0x79
# mv good bad
# for i in {1..1000}; do ./bad; done \
| cut -d: -f2 | sort | uniq -c
4000 0x0
# free | grep Swap
Swap: 0 0 0
- v4.5:
# for i in {1..1000}; do ./good; done \
| cut -d: -f2 | sort | uniq -c
4000 0x79
# mv good bad
# for i in {1..1000}; do ./bad; done \
| cut -d: -f2 | sort | uniq -c
2702 0x0
1298 0x79
# swapoff -av
swapoff /swap
# for i in {1..1000}; do ./bad; done \
| cut -d: -f2 | sort | uniq -c
4000 0x79
Ceph/TCMalloc:
=============
For documentation purposes, the use case driving the analysis/fix is Ceph
on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to
release unused memory to the system from the mmap'ed page heap (might be
committed back/used again; it's not munmap'ed.)
- PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise()
- PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing.
Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
just 'disappeared' on Ceph on later Ubuntu releases but is still present
in the kernel, and can be hit by other use cases.
The observed issue seems to be the old Ceph bug #22464 [1], where checksum
mismatches are observed (and instrumentation with buffer dumps shows
zero-pages read from mmap'ed/MADV_FREE'd page ranges).
The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
mostly worked around with a retry mechanism, but other parts of Ceph could
still hit that (rocksdb). Anyway, it's less likely to be hit again as
TCMalloc switched out of MADV_FREE by default.
(Some kernel versions/reports from the Ceph bug, and relation with
the MADV_FREE introduction/changes; TCMalloc versions not checked.)
- 4.4 good
- 4.5 (madv_free: introduction)
- 4.9 bad
- 4.10 good? maybe a swapless system
- 4.12 (madv_free: no longer free instantly on swapless systems)
- 4.13 bad
[1] https://tracker.ceph.com/issues/22464
Thanks:
======
Several people contributed to analysis/discussions/tests/reproducers in
the first stages when drilling down on ceph/tcmalloc/linux kernel:
- Dan Hill
- Dan Streetman
- Dongdong Tao
- Gavin Guo
- Gerald Yang
- Heitor Alves de Siqueira
- Ioanna Alifieraki
- Jay Vosburgh
- Matthew Ruffell
- Ponnuvel Palaniyappan
Reviews, suggestions, corrections, comments:
- Minchan Kim
- Yu Zhao
- Huang, Ying
- John Hubbard
- Christoph Hellwig
[mfo(a)canonical.com: v4]
Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.com
Link: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com
Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages")
Signed-off-by: Mauricio Faria de Oliveira <mfo(a)canonical.com>
Reviewed-by: "Huang, Ying" <ying.huang(a)intel.com>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Dan Hill <daniel.hill(a)canonical.com>
Cc: Dan Streetman <dan.streetman(a)canonical.com>
Cc: Dongdong Tao <dongdong.tao(a)canonical.com>
Cc: Gavin Guo <gavin.guo(a)canonical.com>
Cc: Gerald Yang <gerald.yang(a)canonical.com>
Cc: Heitor Alves de Siqueira <halves(a)canonical.com>
Cc: Ioanna Alifieraki <ioanna-maria.alifieraki(a)canonical.com>
Cc: Jay Vosburgh <jay.vosburgh(a)canonical.com>
Cc: Matthew Ruffell <matthew.ruffell(a)canonical.com>
Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan(a)canonical.com>
Cc: <stable(a)vger.kernel.org>
Cc: Christoph Hellwig <hch(a)infradead.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/rmap.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
--- a/mm/rmap.c~mm-fix-race-between-madv_free-reclaim-and-blkdev-direct-io-read
+++ a/mm/rmap.c
@@ -1588,7 +1588,30 @@ static bool try_to_unmap_one(struct foli
/* MADV_FREE page check */
if (!folio_test_swapbacked(folio)) {
- if (!folio_test_dirty(folio)) {
+ int ref_count, map_count;
+
+ /*
+ * Synchronize with gup_pte_range():
+ * - clear PTE; barrier; read refcount
+ * - inc refcount; barrier; read PTE
+ */
+ smp_mb();
+
+ ref_count = folio_ref_count(folio);
+ map_count = folio_mapcount(folio);
+
+ /*
+ * Order reads for page refcount and dirty flag
+ * (see comments in __remove_mapping()).
+ */
+ smp_rmb();
+
+ /*
+ * The only page refs must be one from isolation
+ * plus the rmap(s) (dropped by discard:).
+ */
+ if (ref_count == 1 + map_count &&
+ !folio_test_dirty(folio)) {
/* Invalidate as we cleared the pte */
mmu_notifier_invalidate_range(mm,
address, address + PAGE_SIZE);
_
Hi All,
Note before: I do not know if I have the e-mail address list correct,
nor am I actually a member of the x86 distribution list. I am on
the linux-pm email list.
When using the intel_pstate CPU frequency scaling driver with HWP disabled,
active mode, powersave scaling governor, the times between calls to the driver
have never exceeded 10 seconds.
Since kernel 5.16-rc4 and commit: b50db7095fe002fa3e16605546cba66bf1b68a3e
" x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"
There are now occasions where the time between calls to the driver can be
hundreds of seconds, which can result in the CPU frequency being left
unnecessarily high for extended periods.
From the number of clock cycles executed between these long
durations one can tell that the CPU has been running code, but
the driver never got called.
Attached are some graphs from trace data acquired using
intel_pstate_tracer.py, where one can observe an idle system between
about 42 and well over 200 seconds of elapsed time, yet CPU10 never gets
called (which would have reduced its pstate request) until an elapsed
time of 167.616 seconds, 126 seconds since the last call. The CPU
frequency never does go to minimum.
For reference, a similar CPU frequency graph is also attached, with
the commit reverted. The CPU frequency drops to minimum over about
10 or 15 seconds.
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
Why this particular configuration, i.e. no-hwp, active, powersave?
Because it is, by far, the easiest to observe what is going on.
... Doug