From: Johannes Weiner <hannes(a)cmpxchg.org>
Subject: mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges
While upgrading from 4.16 to 5.2, we noticed these allocation errors in
the log of the new kernel:
[ 8642.253395] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
[ 8642.269170] cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
[ 8642.293009] node 0: slabs: 5, objs: 170, free: 0
slab_out_of_memory+1
___slab_alloc+969
__slab_alloc+14
kmem_cache_alloc+346
inet_twsk_alloc+60
tcp_time_wait+46
tcp_fin+206
tcp_data_queue+2034
tcp_rcv_state_process+784
tcp_v6_do_rcv+405
__release_sock+118
tcp_close+385
inet_release+46
__sock_release+55
sock_close+17
__fput+170
task_work_run+127
exit_to_usermode_loop+191
do_syscall_64+212
entry_SYSCALL_64_after_hwframe+68
accompanied by an increase in machines going completely radio silent under
memory pressure.
One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account sock
objects to kmemcg"), which made these slab caches subject to cgroup memory
accounting and control.
The problem with that is that cgroups, unlike the page allocator, do not
maintain dedicated atomic reserves. As a cgroup's usage hovers at its
limit, atomic allocations - such as done during network rx - can fail
consistently for extended periods of time. The kernel is not able to
operate under these conditions.
We don't want to revert the culprit patch, because it indeed tracks a
potentially substantial amount of memory used by a cgroup.
We also don't want to implement dedicated atomic reserves for cgroups.
There is no point in keeping a fixed margin of unused bytes in the
cgroup's memory budget to accommodate a consumer that is impossible to
predict - we'd be wasting memory and getting into configuration headaches,
not unlike what we have going with min_free_kbytes. We do this for
physical memory because we have to, but cgroups are an accounting game.
Instead, account these privileged allocations to the cgroup, but let them
bypass the configured limit if they have to. This way, we get the
benefits of accounting the consumed memory and have it exert pressure on
the rest of the cgroup, but like with the page allocator, we shift the
burden of reclaiming on behalf of atomic allocations onto the regular
allocations that can block.
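For illustration, a minimal sketch of the resulting charge path
(simplified and renamed; the real logic lives in try_charge() in
mm/memcontrol.c and also handles swap accounting, reclaim retries and OOM):
static int try_charge_sketch(struct mem_cgroup *memcg, gfp_t gfp_mask,
			     unsigned int nr_pages)
{
	struct page_counter *counter;

	/* Fits under the configured limit: nothing special to do. */
	if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
		return 0;

	/*
	 * No dedicated atomic reserve: charge the pages over the limit
	 * and let later blockable allocations do the reclaim work.
	 */
	if (gfp_mask & __GFP_ATOMIC) {
		page_counter_charge(&memcg->memory, nr_pages);
		return 0;
	}

	/* ... reclaim, retry and OOM handling for requests that can block ... */
	return -ENOMEM;
}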
Link: http://lkml.kernel.org/r/20191022233708.365764-1-hannes@cmpxchg.org
Fixes: e699e2c6a654 ("net, mm: account sock objects to kmemcg")
Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb(a)google.com>
Cc: Suleiman Souhlal <suleiman(a)google.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: <stable(a)vger.kernel.org> [4.18+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 9 +++++++++
1 file changed, 9 insertions(+)
--- a/mm/memcontrol.c~mm-memcontrol-fix-network-errors-from-failing-__gfp_atomic-charges
+++ a/mm/memcontrol.c
@@ -2535,6 +2535,15 @@ retry:
}
/*
+ * Memcg doesn't have a dedicated reserve for atomic
+ * allocations. But like the global atomic pool, we need to
+ * put the burden of reclaim on regular allocation requests
+ * and let these go through as privileged allocations.
+ */
+ if (gfp_mask & __GFP_ATOMIC)
+ goto force;
+
+ /*
* Unlike in global OOM situations, memcg is not in a physical
* memory shortage. Allow dying and OOM-killed tasks to
* bypass the last charges so that they can exit quickly and
_
From: Roman Gushchin <guro(a)fb.com>
Subject: mm: slab: make page_cgroup_ino() to recognize non-compound slab pages properly
page_cgroup_ino() doesn't return a valid memcg pointer for non-compound
slab pages, because it depends on both the PgHead and PgSlab flags being
set to determine the memory cgroup from the kmem_cache. It's correct for
compound pages, but not for generic small pages. Those don't have PgHead
set, so it ends up returning zero.
Fix this by replacing the condition with PageSlab() && !PageTail().
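For illustration, how the relevant page flags line up for the pages this
function can see (PageSlab() is true for tail pages of a compound slab as
well, as the mm/slab.h comment in the diff below notes):
/*
 *                        PageSlab()  PageHead()  PageTail()
 *  order-0 slab page        true       false       false
 *  compound slab head       true       true        false
 *  compound slab tail       true       false       true
 *
 * The old "PageHead(page) && PageSlab(page)" test misses the first row,
 * so order-0 slab pages fell through to READ_ONCE(page->mem_cgroup),
 * which is no longer set for slab pages. The new "PageSlab(page) &&
 * !PageTail(page)" test accepts the first two rows and still excludes
 * tail pages, whose slab_cache pointer is not set.
 */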
Before this patch:
[root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
0x0000000000000080 38 0 _______S___________________________________ slab
After this patch:
[root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
0x0000000000000080 147 0 _______S___________________________________ slab
Also, hwpoison_filter_task() uses the output of page_cgroup_ino() in order
to filter error injection events based on memcg. So if
page_cgroup_ino() fails to return the memcg pointer, we just fail to
inject the memory error. Considering that the hwpoison filter is for
testing, affected users are limited and the impact should be marginal.
[n-horiguchi(a)ah.jp.nec.com: changelog additions]
Link: http://lkml.kernel.org/r/20191031012151.2722280-1-guro@fb.com
Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <guro(a)fb.com>
Reviewed-by: Shakeel Butt <shakeelb(a)google.com>
Acked-by: David Rientjes <rientjes(a)google.com>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: Daniel Jordan <daniel.m.jordan(a)oracle.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 2 +-
mm/slab.h | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
--- a/mm/memcontrol.c~mm-slab-make-page_cgroup_ino-to-recognize-non-compound-slab-pages-properly
+++ a/mm/memcontrol.c
@@ -484,7 +484,7 @@ ino_t page_cgroup_ino(struct page *page)
unsigned long ino = 0;
rcu_read_lock();
- if (PageHead(page) && PageSlab(page))
+ if (PageSlab(page) && !PageTail(page))
memcg = memcg_from_slab_page(page);
else
memcg = READ_ONCE(page->mem_cgroup);
--- a/mm/slab.h~mm-slab-make-page_cgroup_ino-to-recognize-non-compound-slab-pages-properly
+++ a/mm/slab.h
@@ -323,8 +323,8 @@ static inline struct kmem_cache *memcg_r
* Expects a pointer to a slab page. Please note, that PageSlab() check
* isn't sufficient, as it returns true also for tail compound slab pages,
* which do not have slab_cache pointer set.
- * So this function assumes that the page can pass PageHead() and PageSlab()
- * checks.
+ * So this function assumes that the page can pass PageSlab() && !PageTail()
+ * check.
*
* The kmem_cache can be reparented asynchronously. The caller must ensure
* the memcg lifetime, e.g. by taking rcu_read_lock() or cgroup_mutex.
_
From: Kevin Hao <haokexin(a)gmail.com>
Subject: dump_stack: avoid the livelock of the dump_lock
In the current code, we use atomic_cmpxchg() to serialize the output of
dump_stack(), but this implementation suffers from the thundering herd
problem. We have observed this kind of livelock on a Marvell cn96xx
board (24 CPUs) when heavily using dump_stack() in a kprobe handler.
Actually we can let the competitors wait for the lock to be released
before jumping to atomic_cmpxchg(). This will definitely mitigate the
thundering herd problem. Thanks Linus for the suggestion.
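For context, the serialization pattern around the hunk below then looks
roughly like this (a simplified paraphrase of lib/dump_stack.c, not the
verbatim code):
asmlinkage __visible void dump_stack(void)
{
	unsigned long flags;
	int was_locked, old, cpu;

retry:
	local_irq_save(flags);
	cpu = smp_processor_id();
	old = atomic_cmpxchg(&dump_lock, -1, cpu);
	if (old == -1) {
		was_locked = 0;			/* we took the lock */
	} else if (old == cpu) {
		was_locked = 1;			/* nested dump on this CPU */
	} else {
		local_irq_restore(flags);
		/* spin quietly until the holder drops the lock */
		do { cpu_relax(); } while (atomic_read(&dump_lock) != -1);
		goto retry;
	}

	__dump_stack();

	if (!was_locked)
		atomic_set(&dump_lock, -1);
	local_irq_restore(flags);
}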
[akpm(a)linux-foundation.org: fix comment]
Link: http://lkml.kernel.org/r/20191030031637.6025-1-haokexin@gmail.com
Fixes: b58d977432c8 ("dump_stack: serialize the output from dump_stack()")
Signed-off-by: Kevin Hao <haokexin(a)gmail.com>
Suggested-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/dump_stack.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
--- a/lib/dump_stack.c~dump_stack-avoid-the-livelock-of-the-dump_lock
+++ a/lib/dump_stack.c
@@ -106,7 +106,12 @@ retry:
was_locked = 1;
} else {
local_irq_restore(flags);
- cpu_relax();
+ /*
+ * Wait for the lock to release before jumping to
+ * atomic_cmpxchg() in order to mitigate the thundering herd
+ * problem.
+ */
+ do { cpu_relax(); } while (atomic_read(&dump_lock) != -1);
goto retry;
}
_
From: Michal Hocko <mhocko(a)suse.com>
Subject: mm, vmstat: hide /proc/pagetypeinfo from normal users
/proc/pagetypeinfo is a debugging tool for examining internal page
allocator state with respect to fragmentation. It is not very useful for
anything else, so normal users really do not need to read this file.
Waiman Long has noticed that reading this file can have negative side
effects, because zone->lock is necessary for gathering the data and that
a) interferes with the page allocator and its users and b) can lead to
hard lockups on large machines which have very long free_lists.
Reduce both issues by simply not exporting the file to regular users.
Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org
Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Reported-by: Waiman Long <longman(a)redhat.com>
Acked-by: Mel Gorman <mgorman(a)suse.de>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Acked-by: Waiman Long <longman(a)redhat.com>
Acked-by: Rafael Aquini <aquini(a)redhat.com>
Acked-by: David Rientjes <rientjes(a)google.com>
Reviewed-by: Andrew Morton <akpm(a)linux-foundation.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Cc: Jann Horn <jannh(a)google.com>
Cc: Song Liu <songliubraving(a)fb.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmstat.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/vmstat.c~mm-vmstat-hide-proc-pagetypeinfo-from-normal-users
+++ a/mm/vmstat.c
@@ -1972,7 +1972,7 @@ void __init init_mm_internals(void)
#endif
#ifdef CONFIG_PROC_FS
proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
- proc_create_seq("pagetypeinfo", 0444, NULL, &pagetypeinfo_op);
+ proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
proc_create_seq("vmstat", 0444, NULL, &vmstat_op);
proc_create_seq("zoneinfo", 0444, NULL, &zoneinfo_op);
#endif
_
From: Yang Shi <yang.shi(a)linux.alibaba.com>
Subject: mm: thp: handle page cache THP correctly in PageTransCompoundMap
We have a use case that uses tmpfs as the QEMU memory backend and we would
like to take advantage of THP as well. But our test shows the EPT is not
PMD mapped even though the underlying THP is PMD mapped on the host. The
number shown by /sys/kernel/debug/kvm/largepages is much less than the
number of PMD mapped shmem pages, as below:
7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
Size: 4194304 kB
[snip]
AnonHugePages: 0 kB
ShmemPmdMapped: 579584 kB
[snip]
Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages
12
And some benchmarks do worse than with anonymous THPs.
By digging into the code we figured out that commit 127393fbe597 ("mm:
thp: kvm: fix memory corruption in KVM with THP enabled") checks if there
is a single PTE mapping on the page for anonymous THP when setting up the
EPT map. But the _mapcount < 0 check doesn't fit page cache THP, since
every subpage of page cache THP would get _mapcount inc'ed once it is PMD
mapped, so PageTransCompoundMap() always returns false for page cache THP.
This prevents KVM from setting up a PMD mapped EPT entry.
So we need to handle page cache THP correctly. However, when a page cache
THP's PMD gets split, the kernel just removes the mapping instead of
setting up PTE mappings like anonymous THP does. Before KVM calls
get_user_pages() the subpages may get PTE mapped even though it is still a
THP, since the page cache THP may be mapped by other processes in the
meantime.
So check the subpage's _mapcount against the compound_mapcount to tell
whether the THP has been PTE mapped. Although this may report some false
negative cases (PTE mapped by other processes), it does not look trivial
to make this fully accurate.
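For illustration, the bookkeeping this comparison relies on (assuming the
usual convention that both _mapcount and compound_mapcount start at -1;
raw atomic values shown):
/*
 *  file THP, PMD mapped once, no PTE maps anywhere:
 *      compound_mapcount == 0, every subpage's _mapcount == 0
 *      -> _mapcount == compound_mapcount -> report "PMD mapped"
 *
 *  same THP, but one subpage additionally PTE mapped elsewhere:
 *      compound_mapcount == 0, that subpage's _mapcount == 1
 *      -> _mapcount != compound_mapcount -> report "not PMD-only"
 *
 * Anonymous THP keeps the old "_mapcount < 0" test because its subpages
 * only get _mapcount bumped when they are PTE mapped, not for the PMD
 * mapping itself.
 */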
With this fix, /sys/kernel/debug/kvm/largepages shows a reasonable number
of pages PMD mapped by EPT, as below:
7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted)
Size: 4194304 kB
[snip]
AnonHugePages: 0 kB
ShmemPmdMapped: 557056 kB
[snip]
Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages
271
And the benchmarks perform the same as with anonymous THPs.
[yang.shi(a)linux.alibaba.com: v4]
Link: http://lkml.kernel.org/r/1571865575-42913-1-git-send-email-yang.shi@linux.a…
Link: http://lkml.kernel.org/r/1571769577-89735-1-git-send-email-yang.shi@linux.a…
Fixes: dd78fedde4b9 ("rmap: support file thp")
Signed-off-by: Yang Shi <yang.shi(a)linux.alibaba.com>
Reported-by: Gang Deng <gavin.dg(a)linux.alibaba.com>
Tested-by: Gang Deng <gavin.dg(a)linux.alibaba.com>
Suggested-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm.h | 5 -----
include/linux/mm_types.h | 5 +++++
include/linux/page-flags.h | 20 ++++++++++++++++++--
3 files changed, 23 insertions(+), 7 deletions(-)
--- a/include/linux/mm.h~mm-thp-handle-page-cache-thp-correctly-in-pagetranscompoundmap
+++ a/include/linux/mm.h
@@ -695,11 +695,6 @@ static inline void *kvcalloc(size_t n, s
extern void kvfree(const void *addr);
-static inline atomic_t *compound_mapcount_ptr(struct page *page)
-{
- return &page[1].compound_mapcount;
-}
-
static inline int compound_mapcount(struct page *page)
{
VM_BUG_ON_PAGE(!PageCompound(page), page);
--- a/include/linux/mm_types.h~mm-thp-handle-page-cache-thp-correctly-in-pagetranscompoundmap
+++ a/include/linux/mm_types.h
@@ -221,6 +221,11 @@ struct page {
#endif
} _struct_page_alignment;
+static inline atomic_t *compound_mapcount_ptr(struct page *page)
+{
+ return &page[1].compound_mapcount;
+}
+
/*
* Used for sizing the vmemmap region on some architectures
*/
--- a/include/linux/page-flags.h~mm-thp-handle-page-cache-thp-correctly-in-pagetranscompoundmap
+++ a/include/linux/page-flags.h
@@ -622,12 +622,28 @@ static inline int PageTransCompound(stru
*
* Unlike PageTransCompound, this is safe to be called only while
* split_huge_pmd() cannot run from under us, like if protected by the
- * MMU notifier, otherwise it may result in page->_mapcount < 0 false
+ * MMU notifier, otherwise it may result in page->_mapcount check false
* positives.
+ *
+ * We have to treat page cache THP differently since every subpage of it
+ * would get _mapcount inc'ed once it is PMD mapped. But it may be PTE
+ * mapped in the current process, so compare the subpage's _mapcount with
+ * the compound_mapcount to filter out the PTE mapped case.
*/
static inline int PageTransCompoundMap(struct page *page)
{
- return PageTransCompound(page) && atomic_read(&page->_mapcount) < 0;
+ struct page *head;
+
+ if (!PageTransCompound(page))
+ return 0;
+
+ if (PageAnon(page))
+ return atomic_read(&page->_mapcount) < 0;
+
+ head = compound_head(page);
+ /* File THP is PMD mapped and not PTE mapped */
+ return atomic_read(&page->_mapcount) ==
+ atomic_read(compound_mapcount_ptr(head));
}
/*
_
From: Mel Gorman <mgorman(a)techsingularity.net>
Subject: mm, meminit: recalculate pcpu batch and high limits after init completes
Deferred memory initialisation updates zone->managed_pages during the
initialisation phase, but before that finishes, the per-cpu page allocator
(pcpu) calculates the number of pages allocated/freed in batches as well
as the maximum number of pages allowed on a per-cpu list. As
zone->managed_pages is not up to date yet, the pcpu initialisation
calculates inappropriately low batch and high values.
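For reference, roughly how those limits are derived (a simplified
paraphrase of zone_batchsize() in mm/page_alloc.c, not the verbatim code):
static int zone_batchsize_sketch(unsigned long managed_pages)
{
	/* roughly 0.1% of the zone, capped at about 1MB worth of pages */
	int batch = managed_pages / 1024;

	if (batch * PAGE_SIZE > 1024 * 1024)
		batch = (1024 * 1024) / PAGE_SIZE;
	batch /= 4;
	if (batch < 1)
		batch = 1;

	/* round down to a power-of-two boundary minus one */
	return rounddown_pow_of_two(batch + batch / 2) - 1;
}
The per-cpu high limit is then 6 * batch (visible in the numbers further
down: 63/378 and 7/42), so a zone whose managed_pages is still largely
uninitialised is stuck with small limits until they are recalculated.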
This increases zone lock contention quite severely in some cases with the
degree of severity depending on how many CPUs share a local zone and the
size of the zone. A private report indicated that kernel build times were
excessive with extremely high system CPU usage. A perf profile indicated
that a large chunk of time was lost on zone->lock contention.
This patch recalculates the pcpu batch and high values after deferred
initialisation completes for every populated zone in the system. It was
tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
workload -- allmodconfig and all available CPUs.
mmtests configuration: config-workload-kernbench-max Configuration was
modified to build on a fresh XFS partition.
kernbench
5.4.0-rc3 5.4.0-rc3
vanilla resetpcpu-v2
Amean user-256 13249.50 ( 0.00%) 16401.31 * -23.79%*
Amean syst-256 14760.30 ( 0.00%) 4448.39 * 69.86%*
Amean elsp-256 162.42 ( 0.00%) 119.13 * 26.65%*
Stddev user-256 42.97 ( 0.00%) 19.15 ( 55.43%)
Stddev syst-256 336.87 ( 0.00%) 6.71 ( 98.01%)
Stddev elsp-256 2.46 ( 0.00%) 0.39 ( 84.03%)
5.4.0-rc3 5.4.0-rc3
vanilla resetpcpu-v2
Duration User 39766.24 49221.79
Duration System 44298.10 13361.67
Duration Elapsed 519.11 388.87
The patch reduces system CPU usage by 69.86% and total build time by
26.65%. The variance of system CPU usage is also much reduced.
Before the patch, this was the breakdown of batch and high values over all zones:
256 batch: 1
256 batch: 63
512 batch: 7
256 high: 0
256 high: 378
512 high: 42
512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After
the patch:
256 batch: 1
768 batch: 63
256 high: 0
768 high: 378
[mgorman(a)techsingularity.net: fix merge/linkage snafu]
Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.net
Link: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman(a)techsingularity.net>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Matt Fleming <matt(a)codeblueprint.co.uk>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Qian Cai <cai(a)lca.pw>
Cc: <stable(a)vger.kernel.org> [4.1+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
--- a/mm/page_alloc.c~mm-meminit-recalculate-pcpu-batch-and-high-limits-after-init-completes
+++ a/mm/page_alloc.c
@@ -1948,6 +1948,14 @@ void __init page_alloc_init_late(void)
wait_for_completion(&pgdat_init_all_done_comp);
/*
+ * The number of managed pages has changed due to the initialisation
+ * so the pcpu batch and high limits needs to be updated or the limits
+ * will be artificially small.
+ */
+ for_each_populated_zone(zone)
+ zone_pcp_update(zone);
+
+ /*
* We initialized the rest of the deferred pages. Permanently disable
* on-demand struct page initialization.
*/
@@ -8514,7 +8522,6 @@ void free_contig_range(unsigned long pfn
WARN(count != 0, "%d pages are still in use!\n", count);
}
-#ifdef CONFIG_MEMORY_HOTPLUG
/*
* The zone indicated has a new number of managed_pages; batch sizes and percpu
* page high values need to be recalulated.
@@ -8528,7 +8535,6 @@ void __meminit zone_pcp_update(struct zo
per_cpu_ptr(zone->pageset, cpu));
mutex_unlock(&pcp_batch_high_lock);
}
-#endif
void zone_pcp_reset(struct zone *zone)
{
_
On Thu, Oct 31, 2019 at 12:40:36PM +0000, Sasha Levin wrote:
> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: 7f192e3cd316b fork: add clone3.
>
> The bot has tested the following trees: v5.3.8.
>
> v5.3.8: Failed to apply! Possible dependencies:
> 78f6face5af34 ("sched: add kernel-doc for struct clone_args")
>
>
> NOTE: The patch will not be queued to stable trees until it is upstream.
>
> How should we proceed with this patch?
Hey Sasha,
This has now landed in mainline (cf. [2]).
I would suggest to backport [1] together with [2].
The patch in [1] only documents struct clone_args and has no functional
changes.
If you prefer to only backport a v5.3 specific version of [2] you can
find it inline (cf. [3]) including the base commit info for the 5.3 stable
tree.
Christian
[1]: 78f6face5af3 ("sched: add kernel-doc for struct clone_args")
[2]: fa729c4df558 ("clone3: validate stack arguments")
[3]:
From 5bc5279d0dfa90cc6af385b6e3f65958f223ccab Mon Sep 17 00:00:00 2001
From: Christian Brauner <christian.brauner(a)ubuntu.com>
Date: Thu, 31 Oct 2019 12:36:08 +0100
Subject: [PATCH] clone3: validate stack arguments
Validate the stack arguments and set up the stack depending on whether it
is growing down or up.
Legacy clone() required userspace to know in which direction the stack is
growing and pass down the stack pointer appropriately. To make things more
confusing, microblaze uses a variant of the clone() syscall selected by
CONFIG_CLONE_BACKWARDS3 that takes an additional stack_size argument.
IA64 has a separate clone2() syscall which also takes an additional
stack_size argument. Finally, parisc has a stack that is growing upwards.
Userspace therefore has a lot of nasty code like the following:
#define __STACK_SIZE (8 * 1024 * 1024)
pid_t sys_clone(int (*fn)(void *), void *arg, int flags, int *pidfd)
{
pid_t ret;
void *stack;
stack = malloc(__STACK_SIZE);
if (!stack)
return -ENOMEM;
#ifdef __ia64__
ret = __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
#elif defined(__parisc__) /* stack grows up */
ret = clone(fn, stack, flags | SIGCHLD, arg, pidfd);
#else
ret = clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
#endif
return ret;
}
or even crazier variants such as [3].
With clone3() we have the ability to validate the stack. We can check that
when stack_size is passed, the stack pointer is valid and the other way
around. We can also check that the memory area userspace gave us is fine to
use via access_ok(). Furthermore, we probably should not require
userspace to know in which direction the stack is growing. It is easy
for us to do this in the kernel and I couldn't find the original
reasoning behind exposing this detail to userspace.
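For comparison, a sketch of what a caller could do on a kernel with this
change applied (an illustration, not taken from the patch or the
kselftests; it assumes uapi headers new enough to provide struct
clone_args and __NR_clone3, and it goes through syscall(2) directly since
neither glibc nor musl exposes a clone3() wrapper yet):
#define _GNU_SOURCE
#include <linux/sched.h>	/* struct clone_args */
#include <sys/syscall.h>
#include <sys/types.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define STACK_SIZE (8 * 1024 * 1024)

static pid_t fork_with_new_stack(void)
{
	void *stack = malloc(STACK_SIZE);
	pid_t pid;

	if (!stack)
		return -1;

	struct clone_args args = {
		.exit_signal = SIGCHLD,
		/* always the lowest address; the kernel now picks the right end */
		.stack       = (uint64_t)(uintptr_t)stack,
		.stack_size  = STACK_SIZE,
	};

	pid = syscall(__NR_clone3, &args, sizeof(args));
	if (pid == 0) {
		/* child: already running on the freshly allocated stack */
		_exit(EXIT_SUCCESS);
	}
	return pid;	/* parent: child's pid, or -1 with errno set */
}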
/* Intentional user visible API change */
clone3() was released with 5.3. Currently, it is not documented and it is
very unclear to userspace how the stack and stack_size arguments have to
be passed. After talking to glibc folks we concluded that trying to change
clone3() to set up the stack instead of requiring userspace to do this is
the right course of action.
Note that this is an explicit change in user-visible behavior we introduce
with this patch. If it breaks someone's use-case we will revert! (And then
e.g. place the new behavior under an appropriate flag.)
Breaking someone's use-case is very unlikely though. First, neither glibc
nor musl currently expose a wrapper for clone3(). Second, there is no real
motivation for anyone to use clone3() directly since it does not provide
features that legacy clone doesn't. New features for clone3() will first
happen in v5.5 which is why v5.4 is still a good time to try and make that
change now and backport it to v5.3. Searches on [4] did not reveal any
packages calling clone3().
[1]: https://lore.kernel.org/r/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17…
[2]: https://lore.kernel.org/r/20191028172143.4vnnjpdljfnexaq5@wittgenstein
[3]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa…
[4]: https://codesearch.debian.net
Fixes: 7f192e3cd316 ("fork: add clone3")
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Jann Horn <jannh(a)google.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Florian Weimer <fweimer(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: linux-api(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: <stable(a)vger.kernel.org> # 5.3
Cc: GNU C Library <libc-alpha(a)sourceware.org>
Signed-off-by: Christian Brauner <christian.brauner(a)ubuntu.com>
Acked-by: Arnd Bergmann <arnd(a)arndb.de>
Acked-by: Aleksa Sarai <cyphar(a)cyphar.com>
Link: https://lore.kernel.org/r/20191031113608.20713-1-christian.brauner@ubuntu.c…
---
kernel/fork.c | 33 ++++++++++++++++++++++++++++++++-
1 file changed, 32 insertions(+), 1 deletion(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 3647097e6783..8bbd39585301 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2586,7 +2586,35 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
return 0;
}
-static bool clone3_args_valid(const struct kernel_clone_args *kargs)
+/**
+ * clone3_stack_valid - check and prepare stack
+ * @kargs: kernel clone args
+ *
+ * Verify that the stack arguments userspace gave us are sane.
+ * In addition, set the stack direction for userspace since it's easy for us to
+ * determine.
+ */
+static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
+{
+ if (kargs->stack == 0) {
+ if (kargs->stack_size > 0)
+ return false;
+ } else {
+ if (kargs->stack_size == 0)
+ return false;
+
+ if (!access_ok((void __user *)kargs->stack, kargs->stack_size))
+ return false;
+
+#if !defined(CONFIG_STACK_GROWSUP) && !defined(CONFIG_IA64)
+ kargs->stack += kargs->stack_size;
+#endif
+ }
+
+ return true;
+}
+
+static bool clone3_args_valid(struct kernel_clone_args *kargs)
{
/*
* All lower bits of the flag word are taken.
@@ -2606,6 +2634,9 @@ static bool clone3_args_valid(const struct kernel_clone_args *kargs)
kargs->exit_signal)
return false;
+ if (!clone3_stack_valid(kargs))
+ return false;
+
return true;
}
base-commit: db0655e705be645ad673b0a70160921e088517c0
--
2.23.0
On Tue, Nov 05, 2019 at 08:39:12AM +0100, Marta Rybczynska wrote:
> Looks good to me. However, please note that the new ioctl made it already to 5.3.8.
It wasn't in 5.3, but it seems like you are right and it somehow got
picked for the stable releases.
Sasha, can you please revert 76d609da9ed1cc0dc780e2b539d7b827ce28f182
in 5.3-stable ASAP and make sure crap like backporting new ABIs that
haven't seen a release yet is never ever going to happen again?