The patch titled
Subject: selftests/vm: fix map_hugetlb length used for testing read and write
has been added to the -mm tree. Its filename is
selftests-vm-fix-map_hugetlb-length-used-for-testing-read-and-write.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/selftests-vm-fix-map_hugetlb-lengt…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/selftests-vm-fix-map_hugetlb-lengt…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Christophe Leroy <christophe.leroy(a)c-s.fr>
Subject: selftests/vm: fix map_hugetlb length used for testing read and write
Commit fa7b9a805c79 ("tools/selftest/vm: allow choosing mem size and page
size in map_hugetlb") added the possibility to change the size of the
memory mapped for the test, but left the read and write tests using the
default length. This goes unnoticed when mapping a length greater than
the default one, but it segfaults otherwise.
Fix read_bytes() and write_bytes() by giving them the real length.
Also fix the call to munmap().
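As an illustration of the pre-fix failure mode, a minimal sketch (LENGTH
is the test's 256MB default, 'length' the user-chosen mapping size; ADDR
and 'flags' stand in for the test's own mmap arguments):

	/* Sketch of the bug being fixed, not the test source itself. */
	addr = mmap(ADDR, length, PROT_READ | PROT_WRITE, flags, -1, 0);
	write_bytes(addr);	/* still walked LENGTH bytes, so a mapping
				 * shorter than LENGTH was overrun: SIGSEGV */
	munmap(addr, LENGTH);	/* likewise wrong whenever length != LENGTH */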
Link: http://lkml.kernel.org/r/9a404a13c871c4bd0ba9ede68f69a1225180dd7e.158097838…
Fixes: fa7b9a805c79 ("tools/selftest/vm: allow choosing mem size and page size in map_hugetlb")
Signed-off-by: Christophe Leroy <christophe.leroy(a)c-s.fr>
Reviewed-by: Leonardo Bras <leonardo(a)linux.ibm.com>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
tools/testing/selftests/vm/map_hugetlb.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
--- a/tools/testing/selftests/vm/map_hugetlb.c~selftests-vm-fix-map_hugetlb-length-used-for-testing-read-and-write
+++ a/tools/testing/selftests/vm/map_hugetlb.c
@@ -45,20 +45,20 @@ static void check_bytes(char *addr)
printf("First hex is %x\n", *((unsigned int *)addr));
}
-static void write_bytes(char *addr)
+static void write_bytes(char *addr, size_t length)
{
unsigned long i;
- for (i = 0; i < LENGTH; i++)
+ for (i = 0; i < length; i++)
*(addr + i) = (char)i;
}
-static int read_bytes(char *addr)
+static int read_bytes(char *addr, size_t length)
{
unsigned long i;
check_bytes(addr);
- for (i = 0; i < LENGTH; i++)
+ for (i = 0; i < length; i++)
if (*(addr + i) != (char)i) {
printf("Mismatch at %lu\n", i);
return 1;
@@ -96,11 +96,11 @@ int main(int argc, char **argv)
printf("Returned address is %p\n", addr);
check_bytes(addr);
- write_bytes(addr);
- ret = read_bytes(addr);
+ write_bytes(addr, length);
+ ret = read_bytes(addr, length);
/* munmap() length of MAP_HUGETLB memory must be hugepage aligned */
- if (munmap(addr, LENGTH)) {
+ if (munmap(addr, length)) {
perror("munmap");
exit(1);
}
_
Patches currently in -mm which might be from christophe.leroy(a)c-s.fr are
selftests-vm-fix-map_hugetlb-length-used-for-testing-read-and-write.patch
From: Eric Biggers <ebiggers(a)google.com>
After request_module(), nothing is stopping the module from being
unloaded until someone takes a reference to it via try_get_module().
The WARN_ONCE() in get_fs_type() is thus user-reachable, via userspace
running 'rmmod' concurrently.
Since WARN_ONCE() is for kernel bugs only, not for user-reachable
situations, downgrade this warning to pr_warn_once().
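The window in question, as a simplified sketch of the get_fs_type() path
(names from fs/filesystems.c, control flow condensed):

	request_module("fs-%.*s", len, name);	/* modprobe succeeded */
	/* No reference is held yet, so a concurrent 'rmmod' can unload
	 * the freshly loaded module right here. */
	fs = __get_fs_type(name, len);		/* NULL is then legitimate */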
Cc: stable(a)vger.kernel.org
Cc: Alexei Starovoitov <ast(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Jeff Vander Stoep <jeffv(a)google.com>
Cc: Jessica Yu <jeyu(a)kernel.org>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Luis Chamberlain <mcgrof(a)kernel.org>
Cc: NeilBrown <neilb(a)suse.com>
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
---
fs/filesystems.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/filesystems.c b/fs/filesystems.c
index 77bf5f95362da..90b8d879fbaf3 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -272,7 +272,9 @@ struct file_system_type *get_fs_type(const char *name)
fs = __get_fs_type(name, len);
if (!fs && (request_module("fs-%.*s", len, name) == 0)) {
fs = __get_fs_type(name, len);
- WARN_ONCE(!fs, "request_module fs-%.*s succeeded, but still no fs?\n", len, name);
+ if (!fs)
+ pr_warn_once("request_module fs-%.*s succeeded, but still no fs?\n",
+ len, name);
}
if (dot && fs && !(fs->fs_flags & FS_HAS_SUBTYPE)) {
--
2.25.1
From: Eric Biggers <ebiggers(a)google.com>
It's long been possible to disable kernel module autoloading completely
(while still allowing manual module insertion) by setting
/proc/sys/kernel/modprobe to the empty string. This can be preferable
to setting it to a nonexistent file since it avoids the overhead of an
attempted execve(), avoids potential deadlocks, and avoids the call to
security_kernel_module_request() and thus on SELinux-based systems
eliminates the need to write SELinux rules to dontaudit module_request.
However, when module autoloading is disabled in this way,
request_module() returns 0. This is broken because callers expect 0 to
mean that the module was successfully loaded.
Apparently this was never noticed because this method of disabling
module autoloading isn't used much, and also most callers don't use the
return value of request_module() since it's always necessary to check
whether the module registered its functionality or not anyway. But
improperly returning 0 can indeed confuse a few callers, for example
get_fs_type() in fs/filesystems.c where it causes a WARNING to be hit:
	if (!fs && (request_module("fs-%.*s", len, name) == 0)) {
		fs = __get_fs_type(name, len);
		WARN_ONCE(!fs, "request_module fs-%.*s succeeded, but still no fs?\n", len, name);
	}
This is easily reproduced with:

	echo > /proc/sys/kernel/modprobe
	mount -t NONEXISTENT none /

It causes:

	request_module fs-NONEXISTENT succeeded, but still no fs?
	WARNING: CPU: 1 PID: 1106 at fs/filesystems.c:275 get_fs_type+0xd6/0xf0
	[...]
This should actually use pr_warn_once() rather than WARN_ONCE(), since
it's also user-reachable if userspace immediately unloads the module.
Regardless, request_module() should correctly return an error when it
fails. So let's make it return -ENOENT, which matches the error when
the modprobe binary doesn't exist.
I've also sent patches to document and test this case.
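For callers, the visible change is that a disabled modprobe path now
reports failure. A hedged sketch of the resulting generic caller pattern
(illustrative, not taken from this patch):

	int err = request_module("foo");
	if (err)
		return err;	/* now includes -ENOENT when autoloading is
				 * disabled via an empty modprobe_path */
	/* Even on success the module may already be unloaded again, so
	 * still check that the expected functionality registered. */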
Cc: stable(a)vger.kernel.org
Cc: Alexei Starovoitov <ast(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Jeff Vander Stoep <jeffv(a)google.com>
Cc: Jessica Yu <jeyu(a)kernel.org>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Luis Chamberlain <mcgrof(a)kernel.org>
Cc: NeilBrown <neilb(a)suse.com>
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
---
kernel/kmod.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/kmod.c b/kernel/kmod.c
index bc6addd9152b4..a2de58de6ab62 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -120,7 +120,7 @@ static int call_modprobe(char *module_name, int wait)
* invoke it.
*
* If module auto-loading support is disabled then this function
- * becomes a no-operation.
+ * simply returns -ENOENT.
*/
int __request_module(bool wait, const char *fmt, ...)
{
@@ -137,7 +137,7 @@ int __request_module(bool wait, const char *fmt, ...)
WARN_ON_ONCE(wait && current_is_async());
if (!modprobe_path[0])
- return 0;
+ return -ENOENT;
va_start(args, fmt);
ret = vsnprintf(module_name, MODULE_NAME_LEN, fmt, args);
--
2.25.1
The patch titled
Subject: mm: do not allow MADV_PAGEOUT for CoW pages
has been added to the -mm tree. Its filename is
mm-do-not-allow-madv_pageout-for-cow-pages.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-do-not-allow-madv_pageout-for-c…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-do-not-allow-madv_pageout-for-c…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Michal Hocko <mhocko(a)suse.com>
Subject: mm: do not allow MADV_PAGEOUT for CoW pages
Jann has brought up a very interesting point [1]. While shared pages are
normally excluded from MADV_PAGEOUT, CoW pages can easily be reclaimed
that way. This can lead to all sorts of hard-to-debug problems, e.g. the
performance problems outlined by Daniel [2].
There are runtime environments where substantial memory is shared among
security domains via CoW, and an easy way to reclaim that memory, which
is what MADV_{COLD,PAGEOUT} offers, can lead either to performance
degradation for the parent process, which might be more privileged, or
even open side channel attacks.
The feasibility of the latter is not really clear to me TBH, but there is
no real reason for the exposure at this stage. There seems to be no real
use case that depends on reclaiming CoW memory via madvise at this stage,
so it is much easier to simply disallow it, and that is what this patch
does. Put simply, MADV_{PAGEOUT,COLD} can operate only on exclusively
owned memory, which is a straightforward semantic.
[1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKu…
[2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zR…
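As a hedged userspace sketch of the pattern this patch disallows
(illustrative only; assumes <sys/mman.h>, <string.h>, <unistd.h> and a
Linux 5.4+ MADV_PAGEOUT definition):

	/* A child evicting a parent's memory through CoW pages. */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(p, 1, len);		/* populate in the parent */
	if (fork() == 0) {
		/* These pages are CoW-shared with the parent; before this
		 * patch MADV_PAGEOUT here could reclaim them out from
		 * under the (possibly more privileged) parent. */
		madvise(p, len, MADV_PAGEOUT);
		_exit(0);
	}
	/* With the patch, page_mapcount() != 1 makes madvise skip them. */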
Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz
Fixes: 9c276cc65a58 ("mm: introduce MADV_COLD")
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Reported-by: Jann Horn <jannh(a)google.com>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Daniel Colascione <dancol(a)google.com>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: "Joel Fernandes (Google)" <joel(a)joelfernandes.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/madvise.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
--- a/mm/madvise.c~mm-do-not-allow-madv_pageout-for-cow-pages
+++ a/mm/madvise.c
@@ -335,12 +335,14 @@ static int madvise_cold_or_pageout_pte_r
}
page = pmd_page(orig_pmd);
+
+ /* Do not interfere with other mappings of this page */
+ if (page_mapcount(page) != 1)
+ goto huge_unlock;
+
if (next - addr != HPAGE_PMD_SIZE) {
int err;
- if (page_mapcount(page) != 1)
- goto huge_unlock;
-
get_page(page);
spin_unlock(ptl);
lock_page(page);
@@ -426,6 +428,10 @@ regular_page:
continue;
}
+ /* Do not interfere with other mappings of this page */
+ if (page_mapcount(page) != 1)
+ continue;
+
VM_BUG_ON_PAGE(PageTransCompound(page), page);
if (pte_young(ptent)) {
_
Patches currently in -mm which might be from mhocko(a)suse.com are
mm-do-not-allow-madv_pageout-for-cow-pages.patch
The patch titled
Subject: mm, memcg: throttle allocators based on ancestral memory.high
has been added to the -mm tree. Its filename is
mm-memcg-throttle-allocators-based-on-ancestral-memoryhigh.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-throttle-allocators-based…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-memcg-throttle-allocators-based…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Chris Down <chris(a)chrisdown.name>
Subject: mm, memcg: throttle allocators based on ancestral memory.high
Prior to this commit, we only directly check the affected cgroup's
memory.high against its usage. However, it's possible that we are being
reclaimed as a result of hitting an ancestor's memory.high and should be
penalised based on that instead.
This patch changes memory.high overage throttling to use the largest
overage among the cgroup and its ancestors when considering how many
penalty jiffies to charge. This ensures that we penalise poorly behaving
cgroups in the same way regardless of the level of the hierarchy at
which memory.high was breached.
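A hedged shell illustration of the scenario this addresses (cgroup v2;
paths and values hypothetical):

	mkdir -p /sys/fs/cgroup/parent/child
	echo 100M > /sys/fs/cgroup/parent/memory.high
	echo $$ > /sys/fs/cgroup/parent/child/cgroup.procs
	# Allocations that push the parent past 100M are now throttled
	# based on the parent's overage, even though the child itself has
	# no memory.high set; previously the child's zero overage meant no
	# penalty at all.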
Link: http://lkml.kernel.org/r/8cd132f84bd7e16cdb8fde3378cdbf05ba00d387.158403614…
Fixes: 0e4b01df8659 ("mm, memcg: throttle allocators when failing reclaim over memory.high")
Reported-by: Johannes Weiner <hannes(a)cmpxchg.org>
Signed-off-by: Chris Down <chris(a)chrisdown.name>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: <stable(a)vger.kernel.org> [5.4.x+]
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Nathan Chancellor <natechancellor(a)gmail.com>
Cc: Roman Gushchin <guro(a)fb.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 93 ++++++++++++++++++++++++++++------------------
1 file changed, 58 insertions(+), 35 deletions(-)
--- a/mm/memcontrol.c~mm-memcg-throttle-allocators-based-on-ancestral-memoryhigh
+++ a/mm/memcontrol.c
@@ -2308,28 +2308,41 @@ static void high_work_func(struct work_s
#define MEMCG_DELAY_SCALING_SHIFT 14
/*
- * Scheduled by try_charge() to be executed from the userland return path
- * and reclaims memory over the high limit.
+ * Get the number of jiffies that we should penalise a mischievous cgroup which
+ * is exceeding its memory.high by checking both it and its ancestors.
*/
-void mem_cgroup_handle_over_high(void)
+static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
+ unsigned int nr_pages)
{
- unsigned long usage, high, clamped_high;
- unsigned long pflags;
- unsigned long penalty_jiffies, overage;
- unsigned int nr_pages = current->memcg_nr_pages_over_high;
- struct mem_cgroup *memcg;
+ unsigned long penalty_jiffies;
+ u64 max_overage = 0;
- if (likely(!nr_pages))
- return;
+ do {
+ unsigned long usage, high;
+ u64 overage;
+
+ usage = page_counter_read(&memcg->memory);
+ high = READ_ONCE(memcg->high);
+
+ /*
+ * Prevent division by 0 in overage calculation by acting as if
+ * it was a threshold of 1 page
+ */
+ high = max(high, 1UL);
+
+ overage = usage - high;
+ overage <<= MEMCG_DELAY_PRECISION_SHIFT;
+ overage = div64_u64(overage, high);
+
+ if (overage > max_overage)
+ max_overage = overage;
+ } while ((memcg = parent_mem_cgroup(memcg)) &&
+ !mem_cgroup_is_root(memcg));
- memcg = get_mem_cgroup_from_mm(current->mm);
- reclaim_high(memcg, nr_pages, GFP_KERNEL);
- current->memcg_nr_pages_over_high = 0;
+ if (!max_overage)
+ return 0;
/*
- * memory.high is breached and reclaim is unable to keep up. Throttle
- * allocators proactively to slow down excessive growth.
- *
* We use overage compared to memory.high to calculate the number of
* jiffies to sleep (penalty_jiffies). Ideally this value should be
* fairly lenient on small overages, and increasingly harsh when the
@@ -2337,24 +2350,9 @@ void mem_cgroup_handle_over_high(void)
* its crazy behaviour, so we exponentially increase the delay based on
* overage amount.
*/
-
- usage = page_counter_read(&memcg->memory);
- high = READ_ONCE(memcg->high);
-
- if (usage <= high)
- goto out;
-
- /*
- * Prevent division by 0 in overage calculation by acting as if it was a
- * threshold of 1 page
- */
- clamped_high = max(high, 1UL);
-
- overage = div64_u64((u64)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT,
- clamped_high);
-
- penalty_jiffies = ((u64)overage * overage * HZ)
- >> (MEMCG_DELAY_PRECISION_SHIFT + MEMCG_DELAY_SCALING_SHIFT);
+ penalty_jiffies = max_overage * max_overage * HZ;
+ penalty_jiffies >>= MEMCG_DELAY_PRECISION_SHIFT;
+ penalty_jiffies >>= MEMCG_DELAY_SCALING_SHIFT;
/*
* Factor in the task's own contribution to the overage, such that four
@@ -2371,7 +2369,32 @@ void mem_cgroup_handle_over_high(void)
* application moving forwards and also permit diagnostics, albeit
* extremely slowly.
*/
- penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
+ return min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
+}
+
+/*
+ * Scheduled by try_charge() to be executed from the userland return path
+ * and reclaims memory over the high limit.
+ */
+void mem_cgroup_handle_over_high(void)
+{
+ unsigned long penalty_jiffies;
+ unsigned long pflags;
+ unsigned int nr_pages = current->memcg_nr_pages_over_high;
+ struct mem_cgroup *memcg;
+
+ if (likely(!nr_pages))
+ return;
+
+ memcg = get_mem_cgroup_from_mm(current->mm);
+ reclaim_high(memcg, nr_pages, GFP_KERNEL);
+ current->memcg_nr_pages_over_high = 0;
+
+ /*
+ * memory.high is breached and reclaim is unable to keep up. Throttle
+ * allocators proactively to slow down excessive growth.
+ */
+ penalty_jiffies = calculate_high_delay(memcg, nr_pages);
/*
* Don't sleep if the amount of jiffies this memcg owes us is so low
_
Patches currently in -mm which might be from chris(a)chrisdown.name are
mm-memcg-fix-corruption-on-64-bit-divisor-in-memoryhigh-throttling.patch
mm-memcg-throttle-allocators-based-on-ancestral-memoryhigh.patch