Linux-stable-mirror December 2022

linux-stable-mirror@lists.linaro.org

349 participants
1333 discussions

[PATCH 5.4 0/2] Fix epoll issue in 5.4 kernels

by Rishabh Bhatnagar

Hi Greg After upgrading to 5.4.211 we were started seeing some nodes getting stuck in our Kubernetes cluster. All nodes are running this kernel version. After taking a closer look it seems that runc was command getting stuck. Looking at the stack it appears the thread is stuck in epoll wait for sometime. [<0>] do_syscall_64+0x48/0xf0 [<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 [<0>] ep_poll+0x48d/0x4e0 [<0>] do_epoll_wait+0xab/0xc0 [<0>] __x64_sys_epoll_pwait+0x4d/0xa0 [<0>] do_syscall_64+0x48/0xf0 [<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 [<0>] futex_wait_queue_me+0xb6/0x110 [<0>] futex_wait+0xe2/0x260 [<0>] do_futex+0x372/0x4f0 [<0>] __x64_sys_futex+0x134/0x180 [<0>] do_syscall_64+0x48/0xf0 [<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 I noticed there are other discussions going on as well regarding this. https://lore.kernel.org/all/Y1pY2n6E1Xa58MXv@kroah.com/ Reverting the below patch does fix the issue: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=… We don't see this issue in latest upstream kernel or even latest 5.10 stable tree. Looking at the patches that went in for 5.10 stable there's one that stands out that seems to be missing in 5.4. 289caf5d8f6c61c6d2b7fd752a7f483cd153f182 (epoll: check for events when removing a timed out thread from the wait queue) Backporting this patch to 5.4 we don't see the hangups anymore. Looks like this patch fixes time out scenarios which might cause missed wake ups. The other patch in the patch series also fixes a race and is needed for the second patch to apply. Roman Penyaev (1): epoll: call final ep_events_available() check under the lock Soheil Hassas Yeganeh (1): epoll: check for events when removing a timed out thread from the wait queue fs/eventpoll.c | 68 ++++++++++++++++++++++++++++++-------------------- 1 file changed, 41 insertions(+), 27 deletions(-) -- 2.37.1

3 years

[merged mm-stable] mm-compaction-fix-fast_isolate_around-to-stay-within-boundaries.patch removed from -mm tree

by Andrew Morton

The quilt patch titled Subject: mm, compaction: fix fast_isolate_around() to stay within boundaries has been removed from the -mm tree. Its filename was mm-compaction-fix-fast_isolate_around-to-stay-within-boundaries.patch This patch was dropped because it was merged into the mm-stable branch of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm ------------------------------------------------------ From: NARIBAYASHI Akira <a.naribayashi(a)fujitsu.com> Subject: mm, compaction: fix fast_isolate_around() to stay within boundaries Date: Wed, 26 Oct 2022 20:24:38 +0900 Depending on the memory configuration, isolate_freepages_block() may scan pages out of the target range and causes panic. Panic can occur on systems with multiple zones in a single pageblock. The reason it is rare is that it only happens in special configurations. Depending on how many similar systems there are, it may be a good idea to fix this problem for older kernels as well. The problem is that pfn as argument of fast_isolate_around() could be out of the target range. Therefore we should consider the case where pfn < start_pfn, and also the case where end_pfn < pfn. This problem should have been addressd by the commit 6e2b7044c199 ("mm, compaction: make fast_isolate_freepages() stay within zone") but there was an oversight. Case1: pfn < start_pfn <at memory compaction for node Y> | node X's zone | node Y's zone +-----------------+------------------------------... pageblock ^ ^ ^ +-----------+-----------+-----------+-----------+... ^ ^ ^ ^ ^ end_pfn ^ start_pfn = cc->zone->zone_start_pfn pfn <---------> scanned range by "Scan After" Case2: end_pfn < pfn <at memory compaction for node X> | node X's zone | node Y's zone +-----------------+------------------------------... pageblock ^ ^ ^ +-----------+-----------+-----------+-----------+... ^ ^ ^ ^ ^ pfn ^ end_pfn start_pfn <---------> scanned range by "Scan Before" It seems that there is no good reason to skip nr_isolated pages just after given pfn. So let perform simple scan from start to end instead of dividing the scan into "Before" and "After". Link: https://lkml.kernel.org/r/20221026112438.236336-1-a.naribayashi@fujitsu.com Fixes: 6e2b7044c199 ("mm, compaction: make fast_isolate_freepages() stay within zone"). Signed-off-by: NARIBAYASHI Akira <a.naribayashi(a)fujitsu.com> Cc: David Rientjes <rientjes(a)google.com> Cc: Mel Gorman <mgorman(a)techsingularity.net> Cc: Vlastimil Babka <vbabka(a)suse.cz> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/compaction.c | 18 +++++------------- 1 file changed, 5 insertions(+), 13 deletions(-) --- a/mm/compaction.c~mm-compaction-fix-fast_isolate_around-to-stay-within-boundaries +++ a/mm/compaction.c @@ -1344,7 +1344,7 @@ move_freelist_tail(struct list_head *fre } static void -fast_isolate_around(struct compact_control *cc, unsigned long pfn, unsigned long nr_isolated) +fast_isolate_around(struct compact_control *cc, unsigned long pfn) { unsigned long start_pfn, end_pfn; struct page *page; @@ -1365,21 +1365,13 @@ fast_isolate_around(struct compact_contr if (!page) return; - /* Scan before */ - if (start_pfn != pfn) { - isolate_freepages_block(cc, &start_pfn, pfn, &cc->freepages, 1, false); - if (cc->nr_freepages >= cc->nr_migratepages) - return; - } - - /* Scan after */ - start_pfn = pfn + nr_isolated; - if (start_pfn < end_pfn) - isolate_freepages_block(cc, &start_pfn, end_pfn, &cc->freepages, 1, false); + isolate_freepages_block(cc, &start_pfn, end_pfn, &cc->freepages, 1, false); /* Skip this pageblock in the future as it's full or nearly full */ if (cc->nr_freepages < cc->nr_migratepages) set_pageblock_skip(page); + + return; } /* Search orders in round-robin fashion */ @@ -1556,7 +1548,7 @@ fast_isolate_freepages(struct compact_co return cc->free_pfn; low_pfn = page_to_pfn(page); - fast_isolate_around(cc, low_pfn, nr_isolated); + fast_isolate_around(cc, low_pfn); return low_pfn; } _ Patches currently in -mm which might be from a.naribayashi(a)fujitsu.com are

3 years

[merged mm-stable] mm-gup-disallow-foll_forcefoll_write-on-hugetlb-mappings.patch removed from -mm tree

by Andrew Morton

The quilt patch titled Subject: mm/gup: disallow FOLL_FORCE|FOLL_WRITE on hugetlb mappings has been removed from the -mm tree. Its filename was mm-gup-disallow-foll_forcefoll_write-on-hugetlb-mappings.patch This patch was dropped because it was merged into the mm-stable branch of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm ------------------------------------------------------ From: David Hildenbrand <david(a)redhat.com> Subject: mm/gup: disallow FOLL_FORCE|FOLL_WRITE on hugetlb mappings Date: Mon, 31 Oct 2022 16:25:24 +0100 hugetlb does not support fake write-faults (write faults without write permissions). However, we are currently able to trigger a FAULT_FLAG_WRITE fault on a VMA without VM_WRITE. If we'd ever want to support FOLL_FORCE|FOLL_WRITE, we'd have to teach hugetlb to: (1) Leave the page mapped R/O after the fake write-fault, like maybe_mkwrite() does. (2) Allow writing to an exclusive anon page that's mapped R/O when FOLL_FORCE is set, like can_follow_write_pte(). E.g., __follow_hugetlb_must_fault() needs adjustment. For now, it's not clear if that added complexity is really required. History tolds us that FOLL_FORCE is dangerous and that we better limit its use to a bare minimum. -------------------------------------------------------------------------- #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> #include <stdint.h> #include <sys/mman.h> #include <linux/mman.h> int main(int argc, char **argv) { char *map; int mem_fd; map = mmap(NULL, 2 * 1024 * 1024u, PROT_READ, MAP_PRIVATE|MAP_ANON|MAP_HUGETLB|MAP_HUGE_2MB, -1, 0); if (map == MAP_FAILED) { fprintf(stderr, "mmap() failed: %d\n", errno); return 1; } mem_fd = open("/proc/self/mem", O_RDWR); if (mem_fd < 0) { fprintf(stderr, "open(/proc/self/mem) failed: %d\n", errno); return 1; } if (pwrite(mem_fd, "0", 1, (uintptr_t) map) == 1) { fprintf(stderr, "write() succeeded, which is unexpected\n"); return 1; } printf("write() failed as expected: %d\n", errno); return 0; } -------------------------------------------------------------------------- Fortunately, we have a sanity check in hugetlb_wp() in place ever since commit 1d8d14641fd9 ("mm/hugetlb: support write-faults in shared mappings"), that bails out instead of silently mapping a page writable in a !PROT_WRITE VMA. Consequently, above reproducer triggers a warning, similar to the one reported by szsbot: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 3612 at mm/hugetlb.c:5313 hugetlb_wp+0x20a/0x1af0 mm/hugetlb.c:5313 Modules linked in: CPU: 1 PID: 3612 Comm: syz-executor250 Not tainted 6.1.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 RIP: 0010:hugetlb_wp+0x20a/0x1af0 mm/hugetlb.c:5313 Code: ea 03 80 3c 02 00 0f 85 31 14 00 00 49 8b 5f 20 31 ff 48 89 dd 83 e5 02 48 89 ee e8 70 ab b7 ff 48 85 ed 75 5b e8 76 ae b7 ff <0f> 0b 41 bd 40 00 00 00 e8 69 ae b7 ff 48 b8 00 00 00 00 00 fc ff RSP: 0018:ffffc90003caf620 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000008640070 RCX: 0000000000000000 RDX: ffff88807b963a80 RSI: ffffffff81c4ed2a RDI: 0000000000000007 RBP: 0000000000000000 R08: 0000000000000007 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000008c07e R12: ffff888023805800 R13: 0000000000000000 R14: ffffffff91217f38 R15: ffff88801d4b0360 FS: 0000555555bba300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fff7a47a1b8 CR3: 000000002378d000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> hugetlb_no_page mm/hugetlb.c:5755 [inline] hugetlb_fault+0x19cc/0x2060 mm/hugetlb.c:5874 follow_hugetlb_page+0x3f3/0x1850 mm/hugetlb.c:6301 __get_user_pages+0x2cb/0xf10 mm/gup.c:1202 __get_user_pages_locked mm/gup.c:1434 [inline] __get_user_pages_remote+0x18f/0x830 mm/gup.c:2187 get_user_pages_remote+0x84/0xc0 mm/gup.c:2260 __access_remote_vm+0x287/0x6b0 mm/memory.c:5517 ptrace_access_vm+0x181/0x1d0 kernel/ptrace.c:61 generic_ptrace_pokedata kernel/ptrace.c:1323 [inline] ptrace_request+0xb46/0x10c0 kernel/ptrace.c:1046 arch_ptrace+0x36/0x510 arch/x86/kernel/ptrace.c:828 __do_sys_ptrace kernel/ptrace.c:1296 [inline] __se_sys_ptrace kernel/ptrace.c:1269 [inline] __x64_sys_ptrace+0x178/0x2a0 kernel/ptrace.c:1269 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd [...] So let's silence that warning by teaching GUP code that FOLL_FORCE -- so far -- does not apply to hugetlb. Note that FOLL_FORCE for read-access seems to be working as expected. The assumption is that this has been broken forever, only ever since above commit, we actually detect the wrong handling and WARN_ON_ONCE(). I assume this has been broken at least since 2014, when mm/gup.c came to life. I failed to come up with a suitable Fixes tag quickly. Link: https://lkml.kernel.org/r/20221031152524.173644-1-david@redhat.com Fixes: 1d8d14641fd9 ("mm/hugetlb: support write-faults in shared mappings") Signed-off-by: David Hildenbrand <david(a)redhat.com> Reported-by: <syzbot+f0b97304ef90f0d0b1dc(a)syzkaller.appspotmail.com> Cc: Mike Kravetz <mike.kravetz(a)oracle.com> Cc: Peter Xu <peterx(a)redhat.com> Cc: John Hubbard <jhubbard(a)nvidia.com> Cc: Jason Gunthorpe <jgg(a)nvidia.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/gup.c | 3 +++ 1 file changed, 3 insertions(+) --- a/mm/gup.c~mm-gup-disallow-foll_forcefoll_write-on-hugetlb-mappings +++ a/mm/gup.c @@ -1009,6 +1009,9 @@ static int check_vma_flags(struct vm_are if (!(vm_flags & VM_WRITE)) { if (!(gup_flags & FOLL_FORCE)) return -EFAULT; + /* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */ + if (is_vm_hugetlb_page(vma)) + return -EFAULT; /* * We used to let the write,force case do COW in a * VM_MAYWRITE VM_SHARED !VM_WRITE vma, so ptrace could _ Patches currently in -mm which might be from david(a)redhat.com are selftests-vm-add-ksm-unmerge-tests.patch mm-pagewalk-dont-trigger-test_walk-in-walk_page_vma.patch selftests-vm-add-test-to-measure-madv_unmergeable-performance.patch mm-ksm-simplify-break_ksm-to-not-rely-on-vm_fault_write.patch mm-remove-vm_fault_write.patch mm-ksm-fix-ksm-cow-breaking-with-userfaultfd-wp-via-fault_flag_unshare.patch mm-pagewalk-add-walk_page_range_vma.patch mm-ksm-convert-break_ksm-to-use-walk_page_range_vma.patch mm-gup-remove-foll_migration.patch

3 years

Jump to page:

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror December 2022