From: Johannes Weiner <hannes(a)cmpxchg.org>
Subject: mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges
While upgrading from 4.16 to 5.2, we noticed these allocation errors in
the log of the new kernel:
[ 8642.253395] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
[ 8642.269170] cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
[ 8642.293009] node 0: slabs: 5, objs: 170, free: 0
slab_out_of_memory+1
___slab_alloc+969
__slab_alloc+14
kmem_cache_alloc+346
inet_twsk_alloc+60
tcp_time_wait+46
tcp_fin+206
tcp_data_queue+2034
tcp_rcv_state_process+784
tcp_v6_do_rcv+405
__release_sock+118
tcp_close+385
inet_release+46
__sock_release+55
sock_close+17
__fput+170
task_work_run+127
exit_to_usermode_loop+191
do_syscall_64+212
entry_SYSCALL_64_after_hwframe+68
accompanied by an increase in machines going completely radio silent under
memory pressure.
One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account sock
objects to kmemcg"), which made these slab caches subject to cgroup memory
accounting and control.
The problem with that is that cgroups, unlike the page allocator, do not
maintain dedicated atomic reserves. As a cgroup's usage hovers at its
limit, atomic allocations - such as done during network rx - can fail
consistently for extended periods of time. The kernel is not able to
operate under these conditions.
We don't want to revert the culprit patch, because it indeed tracks a
potentially substantial amount of memory used by a cgroup.
We also don't want to implement dedicated atomic reserves for cgroups.
There is no point in keeping a fixed margin of unused bytes in the
cgroup's memory budget to accommodate a consumer that is impossible to
predict - we'd be wasting memory and getting into configuration headaches,
not unlike what we have going with min_free_kbytes. We do this for
physical memory because we have to, but cgroups are an accounting game.
Instead, account these privileged allocations to the cgroup, but let them
bypass the configured limit if they have to. This way, we get the
benefits of accounting the consumed memory and have it exert pressure on
the rest of the cgroup, but like with the page allocator, we shift the
burden of reclaiming on behalf of atomic allocations onto the regular
allocations that can block.
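For illustration, a minimal sketch of the resulting charge path
(simplified and renamed; the real logic lives in try_charge() in
mm/memcontrol.c and also handles swap accounting, reclaim retries and OOM):
static int try_charge_sketch(struct mem_cgroup *memcg, gfp_t gfp_mask,
			     unsigned int nr_pages)
{
	struct page_counter *counter;

	/* Fits under the configured limit: nothing special to do. */
	if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
		return 0;

	/*
	 * No dedicated atomic reserve: charge the pages over the limit
	 * and let later blockable allocations do the reclaim work.
	 */
	if (gfp_mask & __GFP_ATOMIC) {
		page_counter_charge(&memcg->memory, nr_pages);
		return 0;
	}

	/* ... reclaim, retry and OOM handling for requests that can block ... */
	return -ENOMEM;
}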
Link: http://lkml.kernel.org/r/20191022233708.365764-1-hannes@cmpxchg.org
Fixes: e699e2c6a654 ("net, mm: account sock objects to kmemcg")
Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb(a)google.com>
Cc: Suleiman Souhlal <suleiman(a)google.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: <stable(a)vger.kernel.org> [4.18+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 9 +++++++++
1 file changed, 9 insertions(+)
--- a/mm/memcontrol.c~mm-memcontrol-fix-network-errors-from-failing-__gfp_atomic-charges
+++ a/mm/memcontrol.c
@@ -2535,6 +2535,15 @@ retry:
}
/*
+ * Memcg doesn't have a dedicated reserve for atomic
+ * allocations. But like the global atomic pool, we need to
+ * put the burden of reclaim on regular allocation requests
+ * and let these go through as privileged allocations.
+ */
+ if (gfp_mask & __GFP_ATOMIC)
+ goto force;
+
+ /*
* Unlike in global OOM situations, memcg is not in a physical
* memory shortage. Allow dying and OOM-killed tasks to
* bypass the last charges so that they can exit quickly and
_
From: Roman Gushchin <guro(a)fb.com>
Subject: mm: slab: make page_cgroup_ino() to recognize non-compound slab pages properly
page_cgroup_ino() doesn't return a valid memcg pointer for non-compound
slab pages, because it depends on both the PgHead and PgSlab flags being
set to determine the memory cgroup from the kmem_cache. It's correct for
compound pages, but not for generic small pages. Those don't have PgHead
set, so it ends up returning zero.
Fix this by replacing the condition with PageSlab() && !PageTail().
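For illustration, how the relevant page flags line up for the pages this
function can see (PageSlab() is true for tail pages of a compound slab as
well, as the mm/slab.h comment in the diff below notes):
/*
 *                        PageSlab()  PageHead()  PageTail()
 *  order-0 slab page        true       false       false
 *  compound slab head       true       true        false
 *  compound slab tail       true       false       true
 *
 * The old "PageHead(page) && PageSlab(page)" test misses the first row,
 * so order-0 slab pages fell through to READ_ONCE(page->mem_cgroup),
 * which is no longer set for slab pages. The new "PageSlab(page) &&
 * !PageTail(page)" test accepts the first two rows and still excludes
 * tail pages, whose slab_cache pointer is not set.
 */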
Before this patch:
[root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
0x0000000000000080 38 0 _______S___________________________________ slab
After this patch:
[root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
0x0000000000000080 147 0 _______S___________________________________ slab
Also, hwpoison_filter_task() uses the output of page_cgroup_ino() in order
to filter error injection events based on memcg. So if
page_cgroup_ino() fails to return the memcg pointer, we just fail to
inject the memory error. Considering that the hwpoison filter is for
testing, affected users are limited and the impact should be marginal.
[n-horiguchi(a)ah.jp.nec.com: changelog additions]
Link: http://lkml.kernel.org/r/20191031012151.2722280-1-guro@fb.com
Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <guro(a)fb.com>
Reviewed-by: Shakeel Butt <shakeelb(a)google.com>
Acked-by: David Rientjes <rientjes(a)google.com>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: Daniel Jordan <daniel.m.jordan(a)oracle.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 2 +-
mm/slab.h | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
--- a/mm/memcontrol.c~mm-slab-make-page_cgroup_ino-to-recognize-non-compound-slab-pages-properly
+++ a/mm/memcontrol.c
@@ -484,7 +484,7 @@ ino_t page_cgroup_ino(struct page *page)
unsigned long ino = 0;
rcu_read_lock();
- if (PageHead(page) && PageSlab(page))
+ if (PageSlab(page) && !PageTail(page))
memcg = memcg_from_slab_page(page);
else
memcg = READ_ONCE(page->mem_cgroup);
--- a/mm/slab.h~mm-slab-make-page_cgroup_ino-to-recognize-non-compound-slab-pages-properly
+++ a/mm/slab.h
@@ -323,8 +323,8 @@ static inline struct kmem_cache *memcg_r
* Expects a pointer to a slab page. Please note, that PageSlab() check
* isn't sufficient, as it returns true also for tail compound slab pages,
* which do not have slab_cache pointer set.
- * So this function assumes that the page can pass PageHead() and PageSlab()
- * checks.
+ * So this function assumes that the page can pass PageSlab() && !PageTail()
+ * check.
*
* The kmem_cache can be reparented asynchronously. The caller must ensure
* the memcg lifetime, e.g. by taking rcu_read_lock() or cgroup_mutex.
_
From: Kevin Hao <haokexin(a)gmail.com>
Subject: dump_stack: avoid the livelock of the dump_lock
In the current code, we use atomic_cmpxchg() to serialize the output of
dump_stack(), but this implementation suffers from the thundering herd
problem. We have observed this kind of livelock on a Marvell cn96xx
board (24 CPUs) when heavily using dump_stack() in a kprobe handler.
Actually we can let the competitors wait for the lock to be released
before jumping to atomic_cmpxchg(). This will definitely mitigate the
thundering herd problem. Thanks Linus for the suggestion.
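For context, the serialization pattern around the hunk below then looks
roughly like this (a simplified paraphrase of lib/dump_stack.c, not the
verbatim code):
asmlinkage __visible void dump_stack(void)
{
	unsigned long flags;
	int was_locked, old, cpu;

retry:
	local_irq_save(flags);
	cpu = smp_processor_id();
	old = atomic_cmpxchg(&dump_lock, -1, cpu);
	if (old == -1) {
		was_locked = 0;			/* we took the lock */
	} else if (old == cpu) {
		was_locked = 1;			/* nested dump on this CPU */
	} else {
		local_irq_restore(flags);
		/* spin quietly until the holder drops the lock */
		do { cpu_relax(); } while (atomic_read(&dump_lock) != -1);
		goto retry;
	}

	__dump_stack();

	if (!was_locked)
		atomic_set(&dump_lock, -1);
	local_irq_restore(flags);
}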
[akpm(a)linux-foundation.org: fix comment]
Link: http://lkml.kernel.org/r/20191030031637.6025-1-haokexin@gmail.com
Fixes: b58d977432c8 ("dump_stack: serialize the output from dump_stack()")
Signed-off-by: Kevin Hao <haokexin(a)gmail.com>
Suggested-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/dump_stack.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
--- a/lib/dump_stack.c~dump_stack-avoid-the-livelock-of-the-dump_lock
+++ a/lib/dump_stack.c
@@ -106,7 +106,12 @@ retry:
was_locked = 1;
} else {
local_irq_restore(flags);
- cpu_relax();
+ /*
+ * Wait for the lock to release before jumping to
+ * atomic_cmpxchg() in order to mitigate the thundering herd
+ * problem.
+ */
+ do { cpu_relax(); } while (atomic_read(&dump_lock) != -1);
goto retry;
}
_
From: Michal Hocko <mhocko(a)suse.com>
Subject: mm, vmstat: hide /proc/pagetypeinfo from normal users
/proc/pagetypeinfo is a debugging tool for examining internal page
allocator state with respect to fragmentation. It is not very useful for
anything else, so normal users really do not need to read this file.
Waiman Long has noticed that reading this file can have negative side
effects, because zone->lock is necessary for gathering the data and that
a) interferes with the page allocator and its users and b) can lead to
hard lockups on large machines which have very long free_lists.
Reduce both issues by simply not exporting the file to regular users.
Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org
Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Reported-by: Waiman Long <longman(a)redhat.com>
Acked-by: Mel Gorman <mgorman(a)suse.de>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Acked-by: Waiman Long <longman(a)redhat.com>
Acked-by: Rafael Aquini <aquini(a)redhat.com>
Acked-by: David Rientjes <rientjes(a)google.com>
Reviewed-by: Andrew Morton <akpm(a)linux-foundation.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Cc: Jann Horn <jannh(a)google.com>
Cc: Song Liu <songliubraving(a)fb.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmstat.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/vmstat.c~mm-vmstat-hide-proc-pagetypeinfo-from-normal-users
+++ a/mm/vmstat.c
@@ -1972,7 +1972,7 @@ void __init init_mm_internals(void)
#endif
#ifdef CONFIG_PROC_FS
proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
- proc_create_seq("pagetypeinfo", 0444, NULL, &pagetypeinfo_op);
+ proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
proc_create_seq("vmstat", 0444, NULL, &vmstat_op);
proc_create_seq("zoneinfo", 0444, NULL, &zoneinfo_op);
#endif
_
From: Yang Shi <yang.shi(a)linux.alibaba.com>
Subject: mm: thp: handle page cache THP correctly in PageTransCompoundMap
We have a use case that uses tmpfs as the QEMU memory backend and we would
like to take advantage of THP as well. But our test shows the EPT is not
PMD mapped even though the underlying THP is PMD mapped on the host. The
number shown by /sys/kernel/debug/kvm/largepages is much less than the
number of PMD mapped shmem pages, as below:
7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
Size: 4194304 kB
[snip]
AnonHugePages: 0 kB
ShmemPmdMapped: 579584 kB
[snip]
Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages
12
And some benchmarks do worse than with anonymous THPs.
By digging into the code we figured out that commit 127393fbe597 ("mm:
thp: kvm: fix memory corruption in KVM with THP enabled") checks if there
is a single PTE mapping on the page for anonymous THP when setting up the
EPT map. But the _mapcount < 0 check doesn't fit page cache THP, since
every subpage of page cache THP would get _mapcount inc'ed once it is PMD
mapped, so PageTransCompoundMap() always returns false for page cache THP.
This prevents KVM from setting up a PMD mapped EPT entry.
So we need to handle page cache THP correctly. However, when a page cache
THP's PMD gets split, the kernel just removes the mapping instead of
setting up PTE mappings like anonymous THP does. Before KVM calls
get_user_pages() the subpages may get PTE mapped even though it is still a
THP, since the page cache THP may be mapped by other processes in the
meantime.
So check the subpage's _mapcount against the compound_mapcount to tell
whether the THP has been PTE mapped. Although this may report some false
negative cases (PTE mapped by other processes), it does not look trivial
to make this fully accurate.
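For illustration, the bookkeeping this comparison relies on (assuming the
usual convention that both _mapcount and compound_mapcount start at -1;
raw atomic values shown):
/*
 *  file THP, PMD mapped once, no PTE maps anywhere:
 *      compound_mapcount == 0, every subpage's _mapcount == 0
 *      -> _mapcount == compound_mapcount -> report "PMD mapped"
 *
 *  same THP, but one subpage additionally PTE mapped elsewhere:
 *      compound_mapcount == 0, that subpage's _mapcount == 1
 *      -> _mapcount != compound_mapcount -> report "not PMD-only"
 *
 * Anonymous THP keeps the old "_mapcount < 0" test because its subpages
 * only get _mapcount bumped when they are PTE mapped, not for the PMD
 * mapping itself.
 */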
With this fix, /sys/kernel/debug/kvm/largepages shows a reasonable number
of pages PMD mapped by EPT, as below:
7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted)
Size: 4194304 kB
[snip]
AnonHugePages: 0 kB
ShmemPmdMapped: 557056 kB
[snip]
Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages
271
And the benchmarks perform the same as with anonymous THPs.
[yang.shi(a)linux.alibaba.com: v4]
Link: http://lkml.kernel.org/r/1571865575-42913-1-git-send-email-yang.shi@linux.a…
Link: http://lkml.kernel.org/r/1571769577-89735-1-git-send-email-yang.shi@linux.a…
Fixes: dd78fedde4b9 ("rmap: support file thp")
Signed-off-by: Yang Shi <yang.shi(a)linux.alibaba.com>
Reported-by: Gang Deng <gavin.dg(a)linux.alibaba.com>
Tested-by: Gang Deng <gavin.dg(a)linux.alibaba.com>
Suggested-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm.h | 5 -----
include/linux/mm_types.h | 5 +++++
include/linux/page-flags.h | 20 ++++++++++++++++++--
3 files changed, 23 insertions(+), 7 deletions(-)
--- a/include/linux/mm.h~mm-thp-handle-page-cache-thp-correctly-in-pagetranscompoundmap
+++ a/include/linux/mm.h
@@ -695,11 +695,6 @@ static inline void *kvcalloc(size_t n, s
extern void kvfree(const void *addr);
-static inline atomic_t *compound_mapcount_ptr(struct page *page)
-{
- return &page[1].compound_mapcount;
-}
-
static inline int compound_mapcount(struct page *page)
{
VM_BUG_ON_PAGE(!PageCompound(page), page);
--- a/include/linux/mm_types.h~mm-thp-handle-page-cache-thp-correctly-in-pagetranscompoundmap
+++ a/include/linux/mm_types.h
@@ -221,6 +221,11 @@ struct page {
#endif
} _struct_page_alignment;
+static inline atomic_t *compound_mapcount_ptr(struct page *page)
+{
+ return &page[1].compound_mapcount;
+}
+
/*
* Used for sizing the vmemmap region on some architectures
*/
--- a/include/linux/page-flags.h~mm-thp-handle-page-cache-thp-correctly-in-pagetranscompoundmap
+++ a/include/linux/page-flags.h
@@ -622,12 +622,28 @@ static inline int PageTransCompound(stru
*
* Unlike PageTransCompound, this is safe to be called only while
* split_huge_pmd() cannot run from under us, like if protected by the
- * MMU notifier, otherwise it may result in page->_mapcount < 0 false
+ * MMU notifier, otherwise it may result in page->_mapcount check false
* positives.
+ *
+ * We have to treat page cache THP differently since every subpage of it
+ * would get _mapcount inc'ed once it is PMD mapped. But it may be PTE
+ * mapped in the current process, so compare the subpage's _mapcount with
+ * the compound_mapcount to filter out the PTE mapped case.
*/
static inline int PageTransCompoundMap(struct page *page)
{
- return PageTransCompound(page) && atomic_read(&page->_mapcount) < 0;
+ struct page *head;
+
+ if (!PageTransCompound(page))
+ return 0;
+
+ if (PageAnon(page))
+ return atomic_read(&page->_mapcount) < 0;
+
+ head = compound_head(page);
+ /* File THP is PMD mapped and not PTE mapped */
+ return atomic_read(&page->_mapcount) ==
+ atomic_read(compound_mapcount_ptr(head));
}
/*
_
From: Mel Gorman <mgorman(a)techsingularity.net>
Subject: mm, meminit: recalculate pcpu batch and high limits after init completes
Deferred memory initialisation updates zone->managed_pages during the
initialisation phase, but before that finishes, the per-cpu page allocator
(pcpu) calculates the number of pages allocated/freed in batches as well
as the maximum number of pages allowed on a per-cpu list. As
zone->managed_pages is not up to date yet, the pcpu initialisation
calculates inappropriately low batch and high values.
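For reference, roughly how those limits are derived (a simplified
paraphrase of zone_batchsize() in mm/page_alloc.c, not the verbatim code):
static int zone_batchsize_sketch(unsigned long managed_pages)
{
	/* roughly 0.1% of the zone, capped at about 1MB worth of pages */
	int batch = managed_pages / 1024;

	if (batch * PAGE_SIZE > 1024 * 1024)
		batch = (1024 * 1024) / PAGE_SIZE;
	batch /= 4;
	if (batch < 1)
		batch = 1;

	/* round down to a power-of-two boundary minus one */
	return rounddown_pow_of_two(batch + batch / 2) - 1;
}
The per-cpu high limit is then 6 * batch (visible in the numbers further
down: 63/378 and 7/42), so a zone whose managed_pages is still largely
uninitialised is stuck with small limits until they are recalculated.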
This increases zone lock contention quite severely in some cases with the
degree of severity depending on how many CPUs share a local zone and the
size of the zone. A private report indicated that kernel build times were
excessive with extremely high system CPU usage. A perf profile indicated
that a large chunk of time was lost on zone->lock contention.
This patch recalculates the pcpu batch and high values after deferred
initialisation completes for every populated zone in the system. It was
tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
workload -- allmodconfig and all available CPUs.
mmtests configuration: config-workload-kernbench-max Configuration was
modified to build on a fresh XFS partition.
kernbench
5.4.0-rc3 5.4.0-rc3
vanilla resetpcpu-v2
Amean user-256 13249.50 ( 0.00%) 16401.31 * -23.79%*
Amean syst-256 14760.30 ( 0.00%) 4448.39 * 69.86%*
Amean elsp-256 162.42 ( 0.00%) 119.13 * 26.65%*
Stddev user-256 42.97 ( 0.00%) 19.15 ( 55.43%)
Stddev syst-256 336.87 ( 0.00%) 6.71 ( 98.01%)
Stddev elsp-256 2.46 ( 0.00%) 0.39 ( 84.03%)
5.4.0-rc3 5.4.0-rc3
vanilla resetpcpu-v2
Duration User 39766.24 49221.79
Duration System 44298.10 13361.67
Duration Elapsed 519.11 388.87
The patch reduces system CPU usage by 69.86% and total build time by
26.65%. The variance of system CPU usage is also much reduced.
Before the patch, this was the breakdown of batch and high values over all zones:
256 batch: 1
256 batch: 63
512 batch: 7
256 high: 0
256 high: 378
512 high: 42
512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After
the patch:
256 batch: 1
768 batch: 63
256 high: 0
768 high: 378
[mgorman(a)techsingularity.net: fix merge/linkage snafu]
Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.net
Link: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman(a)techsingularity.net>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Matt Fleming <matt(a)codeblueprint.co.uk>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Qian Cai <cai(a)lca.pw>
Cc: <stable(a)vger.kernel.org> [4.1+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
--- a/mm/page_alloc.c~mm-meminit-recalculate-pcpu-batch-and-high-limits-after-init-completes
+++ a/mm/page_alloc.c
@@ -1948,6 +1948,14 @@ void __init page_alloc_init_late(void)
wait_for_completion(&pgdat_init_all_done_comp);
/*
+ * The number of managed pages has changed due to the initialisation
+ * so the pcpu batch and high limits needs to be updated or the limits
+ * will be artificially small.
+ */
+ for_each_populated_zone(zone)
+ zone_pcp_update(zone);
+
+ /*
* We initialized the rest of the deferred pages. Permanently disable
* on-demand struct page initialization.
*/
@@ -8514,7 +8522,6 @@ void free_contig_range(unsigned long pfn
WARN(count != 0, "%d pages are still in use!\n", count);
}
-#ifdef CONFIG_MEMORY_HOTPLUG
/*
* The zone indicated has a new number of managed_pages; batch sizes and percpu
* page high values need to be recalulated.
@@ -8528,7 +8535,6 @@ void __meminit zone_pcp_update(struct zo
per_cpu_ptr(zone->pageset, cpu));
mutex_unlock(&pcp_batch_high_lock);
}
-#endif
void zone_pcp_reset(struct zone *zone)
{
_
On Thu, Oct 31, 2019 at 12:40:36PM +0000, Sasha Levin wrote:
> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: 7f192e3cd316b fork: add clone3.
>
> The bot has tested the following trees: v5.3.8.
>
> v5.3.8: Failed to apply! Possible dependencies:
> 78f6face5af34 ("sched: add kernel-doc for struct clone_args")
>
>
> NOTE: The patch will not be queued to stable trees until it is upstream.
>
> How should we proceed with this patch?
Hey Sasha,
This has now landed in mainline (cf. [2]).
I would suggest to backport [1] together with [2].
The patch in [1] only documents struct clone_args and has no functional
changes.
If you prefer to only backport a v5.3 specific version of [2] you can
find it inline (cf. [3]) including the base commit info for the 5.3 stable
tree.
Christian
[1]: 78f6face5af3 ("sched: add kernel-doc for struct clone_args")
[2]: fa729c4df558 ("clone3: validate stack arguments")
[3]:
From 5bc5279d0dfa90cc6af385b6e3f65958f223ccab Mon Sep 17 00:00:00 2001
From: Christian Brauner <christian.brauner(a)ubuntu.com>
Date: Thu, 31 Oct 2019 12:36:08 +0100
Subject: [PATCH] clone3: validate stack arguments
Validate the stack arguments and set up the stack depending on whether it
is growing down or up.
Legacy clone() required userspace to know in which direction the stack is
growing and pass down the stack pointer appropriately. To make things more
confusing, microblaze uses a variant of the clone() syscall selected by
CONFIG_CLONE_BACKWARDS3 that takes an additional stack_size argument.
IA64 has a separate clone2() syscall which also takes an additional
stack_size argument. Finally, parisc has a stack that is growing upwards.
Userspace therefore has a lot of nasty code like the following:
#define __STACK_SIZE (8 * 1024 * 1024)
pid_t sys_clone(int (*fn)(void *), void *arg, int flags, int *pidfd)
{
pid_t ret;
void *stack;
stack = malloc(__STACK_SIZE);
if (!stack)
return -ENOMEM;
#ifdef __ia64__
ret = __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
#elif defined(__parisc__) /* stack grows up */
ret = clone(fn, stack, flags | SIGCHLD, arg, pidfd);
#else
ret = clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
#endif
return ret;
}
or even crazier variants such as [3].
With clone3() we have the ability to validate the stack. We can check that
when stack_size is passed, the stack pointer is valid and the other way
around. We can also check that the memory area userspace gave us is fine to
use via access_ok(). Furthermore, we probably should not require
userspace to know in which direction the stack is growing. It is easy
for us to do this in the kernel and I couldn't find the original
reasoning behind exposing this detail to userspace.
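For comparison, a sketch of what a caller could do on a kernel with this
change applied (an illustration, not taken from the patch or the
kselftests; it assumes uapi headers new enough to provide struct
clone_args and __NR_clone3, and it goes through syscall(2) directly since
neither glibc nor musl exposes a clone3() wrapper yet):
#define _GNU_SOURCE
#include <linux/sched.h>	/* struct clone_args */
#include <sys/syscall.h>
#include <sys/types.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define STACK_SIZE (8 * 1024 * 1024)

static pid_t fork_with_new_stack(void)
{
	void *stack = malloc(STACK_SIZE);
	pid_t pid;

	if (!stack)
		return -1;

	struct clone_args args = {
		.exit_signal = SIGCHLD,
		/* always the lowest address; the kernel now picks the right end */
		.stack       = (uint64_t)(uintptr_t)stack,
		.stack_size  = STACK_SIZE,
	};

	pid = syscall(__NR_clone3, &args, sizeof(args));
	if (pid == 0) {
		/* child: already running on the freshly allocated stack */
		_exit(EXIT_SUCCESS);
	}
	return pid;	/* parent: child's pid, or -1 with errno set */
}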
/* Intentional user visible API change */
clone3() was released with 5.3. Currently, it is not documented and it is
very unclear to userspace how the stack and stack_size arguments have to
be passed. After talking to glibc folks we concluded that trying to change
clone3() to set up the stack instead of requiring userspace to do this is
the right course of action.
Note that this is an explicit change in user-visible behavior we introduce
with this patch. If it breaks someone's use-case we will revert! (And then
e.g. place the new behavior under an appropriate flag.)
Breaking someone's use-case is very unlikely though. First, neither glibc
nor musl currently expose a wrapper for clone3(). Second, there is no real
motivation for anyone to use clone3() directly since it does not provide
features that legacy clone doesn't. New features for clone3() will first
happen in v5.5 which is why v5.4 is still a good time to try and make that
change now and backport it to v5.3. Searches on [4] did not reveal any
packages calling clone3().
[1]: https://lore.kernel.org/r/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17…
[2]: https://lore.kernel.org/r/20191028172143.4vnnjpdljfnexaq5@wittgenstein
[3]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa…
[4]: https://codesearch.debian.net
Fixes: 7f192e3cd316 ("fork: add clone3")
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Jann Horn <jannh(a)google.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Florian Weimer <fweimer(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: linux-api(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: <stable(a)vger.kernel.org> # 5.3
Cc: GNU C Library <libc-alpha(a)sourceware.org>
Signed-off-by: Christian Brauner <christian.brauner(a)ubuntu.com>
Acked-by: Arnd Bergmann <arnd(a)arndb.de>
Acked-by: Aleksa Sarai <cyphar(a)cyphar.com>
Link: https://lore.kernel.org/r/20191031113608.20713-1-christian.brauner@ubuntu.c…
---
kernel/fork.c | 33 ++++++++++++++++++++++++++++++++-
1 file changed, 32 insertions(+), 1 deletion(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 3647097e6783..8bbd39585301 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2586,7 +2586,35 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
return 0;
}
-static bool clone3_args_valid(const struct kernel_clone_args *kargs)
+/**
+ * clone3_stack_valid - check and prepare stack
+ * @kargs: kernel clone args
+ *
+ * Verify that the stack arguments userspace gave us are sane.
+ * In addition, set the stack direction for userspace since it's easy for us to
+ * determine.
+ */
+static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
+{
+ if (kargs->stack == 0) {
+ if (kargs->stack_size > 0)
+ return false;
+ } else {
+ if (kargs->stack_size == 0)
+ return false;
+
+ if (!access_ok((void __user *)kargs->stack, kargs->stack_size))
+ return false;
+
+#if !defined(CONFIG_STACK_GROWSUP) && !defined(CONFIG_IA64)
+ kargs->stack += kargs->stack_size;
+#endif
+ }
+
+ return true;
+}
+
+static bool clone3_args_valid(struct kernel_clone_args *kargs)
{
/*
* All lower bits of the flag word are taken.
@@ -2606,6 +2634,9 @@ static bool clone3_args_valid(const struct kernel_clone_args *kargs)
kargs->exit_signal)
return false;
+ if (!clone3_stack_valid(kargs))
+ return false;
+
return true;
}
base-commit: db0655e705be645ad673b0a70160921e088517c0
--
2.23.0
On Tue, Nov 05, 2019 at 08:39:12AM +0100, Marta Rybczynska wrote:
> Looks good to me. However, please note that the new ioctl made it already to 5.3.8.
It wasn't in 5.3, but it seems like you are right and it somehow got
picked for the stable releases.
Sasha, can you please revert 76d609da9ed1cc0dc780e2b539d7b827ce28f182
in 5.3-stable ASAP and make sure crap like backporting new ABIs that
haven't seen a release yet is never ever going to happen again?