The patch titled
Subject: userfaultfd: fix a race between writeprotect and exit_mmap()
has been added to the -mm tree. Its filename is
userfaultfd-fix-a-race-between-writeprotect-and-exit_mmap.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/userfaultfd-fix-a-race-between-wr…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/userfaultfd-fix-a-race-between-wr…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Nadav Amit <namit(a)vmware.com>
Subject: userfaultfd: fix a race between writeprotect and exit_mmap()
A race is possible when a process exits, its VMAs are removed by
exit_mmap() and at the same time userfaultfd_writeprotect() is called.
The race was detected by KASAN on a development kernel, but it appears to
be possible on vanilla kernels as well.
Use mmget_not_zero() to prevent the race as done in other userfaultfd
operations.
Link: https://lkml.kernel.org/r/20210921200247.25749-1-namit@vmware.com
Fixes: 63b2d4174c4ad ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl")
Signed-off-by: Nadav Amit <namit(a)vmware.com>
Tested-by: Li Wang <liwang(a)redhat.com>
Reviewed-by: Peter Xu <peterx(a)redhat.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/userfaultfd.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
--- a/fs/userfaultfd.c~userfaultfd-fix-a-race-between-writeprotect-and-exit_mmap
+++ a/fs/userfaultfd.c
@@ -1827,9 +1827,15 @@ static int userfaultfd_writeprotect(stru
if (mode_wp && mode_dontwake)
return -EINVAL;
- ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
- uffdio_wp.range.len, mode_wp,
- &ctx->mmap_changing);
+ if (mmget_not_zero(ctx->mm)) {
+ ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
+ uffdio_wp.range.len, mode_wp,
+ &ctx->mmap_changing);
+ mmput(ctx->mm);
+ } else {
+ return -ESRCH;
+ }
+
if (ret)
return ret;
_
Patches currently in -mm which might be from namit(a)vmware.com are
userfaultfd-fix-a-race-between-writeprotect-and-exit_mmap.patch
The patch titled
Subject: mm: fix uninitialized use in overcommit_policy_handler
has been added to the -mm tree. Its filename is
mm-fix-the-uninitialized-use-in-overcommit_policy_handler.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-fix-the-uninitialized-use-in-o…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-fix-the-uninitialized-use-in-o…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Chen Jun <chenjun102(a)huawei.com>
Subject: mm: fix uninitialized use in overcommit_policy_handler
We get an unexpected value of /proc/sys/vm/overcommit_memory after running
the following program:
int main()
{
int fd = open("/proc/sys/vm/overcommit_memory", O_RDWR)
write(fd, "1", 1);
write(fd, "2", 1);
close(fd);
}
write(fd, "2", 1) will pass *ppos = 1 to proc_dointvec_minmax.
proc_dointvec_minmax will return 0 without setting new_policy.
t.data = &new_policy;
ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos)
-->do_proc_dointvec
-->__do_proc_dointvec
if (write) {
if (proc_first_pos_non_zero_ignore(ppos, table))
goto out;
sysctl_overcommit_memory = new_policy;
so sysctl_overcommit_memory will be set to an uninitialized value.
Check whether new_policy has been changed by proc_dointvec_minmax.
Link: https://lkml.kernel.org/r/20210923020524.13289-1-chenjun102@huawei.com
Fixes: 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy"
Signed-off-by: Chen Jun <chenjun102(a)huawei.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Reviewed-by: Feng Tang <feng.tang(a)intel.com>
Cc: Rui Xiang <rui.xiang(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/util.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/mm/util.c~mm-fix-the-uninitialized-use-in-overcommit_policy_handler
+++ a/mm/util.c
@@ -787,7 +787,7 @@ int overcommit_policy_handler(struct ctl
size_t *lenp, loff_t *ppos)
{
struct ctl_table t;
- int new_policy;
+ int new_policy = -1;
int ret;
/*
@@ -805,7 +805,7 @@ int overcommit_policy_handler(struct ctl
t = *table;
t.data = &new_policy;
ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
- if (ret)
+ if (ret || new_policy == -1)
return ret;
mm_compute_batch(new_policy);
_
Patches currently in -mm which might be from chenjun102(a)huawei.com are
mm-fix-the-uninitialized-use-in-overcommit_policy_handler.patch
The patch titled
Subject: mm/userfaultfd: selftests: fix memory corruption with thp enabled
has been added to the -mm tree. Its filename is
mm-userfaultfd-selftests-fix-memory-corruption-with-thp-enabled.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-userfaultfd-selftests-fix-memo…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-userfaultfd-selftests-fix-memo…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Peter Xu <peterx(a)redhat.com>
Subject: mm/userfaultfd: selftests: fix memory corruption with thp enabled
In RHEL's gating selftests we've encountered memory corruption in the uffd
event test even with upstream kernel:
# ./userfaultfd anon 128 4
nr_pages: 32768, nr_pages_per_cpu: 32768
bounces: 3, mode: rnd racing read, userfaults: 6240 missing (6240) 14729 wp (14729)
bounces: 2, mode: racing read, userfaults: 1444 missing (1444) 28877 wp (28877)
bounces: 1, mode: rnd read, userfaults: 6055 missing (6055) 14699 wp (14699)
bounces: 0, mode: read, userfaults: 82 missing (82) 25196 wp (25196)
testing uffd-wp with pagemap (pgsize=4096): done
testing uffd-wp with pagemap (pgsize=2097152): done
testing events (fork, remap, remove): ERROR: nr 32427 memory corruption 0 1 (errno=0, line=963)
ERROR: faulting process failed (errno=0, line=1117)
It can be easily reproduced when global thp enabled, which is the default for
RHEL.
It's also known as a side effect of commit 0db282ba2c12 ("selftest: use
mmap instead of posix_memalign to allocate memory", 2021-07-23), which is
imho right itself on using mmap() to make sure the addresses will be
untagged even on arm.
The problem is, for each test we allocate buffers using two
allocate_area() calls. We assumed these two buffers won't affect each
other, however they could, because mmap() could have found that the two
buffers are near each other and having the same VMA flags, so they got
merged into one VMA.
It won't be a big problem if thp is not enabled, but when thp is
agressively enabled it means when initializing the src buffer it could
accidentally setup part of the dest buffer too when there's a shared THP
that overlaps the two regions. Then some of the dest buffer won't be able
to be trapped by userfaultfd missing mode, then it'll cause memory
corruption as described.
To fix it, do release_pages() after initializing the src buffer.
Since the previous two release_pages() calls are after
uffd_test_ctx_clear() which will unmap all the buffers anyway (which is
stronger than release pages; as unmap() also tear town pgtables), drop
them as they shouldn't really be anything useful.
We can mark the Fixes tag upon 0db282ba2c12 as it's reported to only
happen there, however the real "Fixes" IMHO should be 8ba6e8640844, as
before that commit we'll always do explicit release_pages() before
registration of uffd, and 8ba6e8640844 changed that logic by adding extra
unmap/map and we didn't release the pages at the right place. Meanwhile I
don't have a solid glue anyway on whether posix_memalign() could always
avoid triggering this bug, hence it's safer to attach this fix to commit
8ba6e8640844.
Link: https://lkml.kernel.org/r/20210923232512.210092-1-peterx@redhat.com
Fixes: 8ba6e8640844 ("userfaultfd/selftests: reinitialize test context in each test")
Signed-off-by: Peter Xu <peterx(a)redhat.com>
Reported-by: Li Wang <liwan(a)redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1994931
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
tools/testing/selftests/vm/userfaultfd.c | 23 ++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
--- a/tools/testing/selftests/vm/userfaultfd.c~mm-userfaultfd-selftests-fix-memory-corruption-with-thp-enabled
+++ a/tools/testing/selftests/vm/userfaultfd.c
@@ -414,9 +414,6 @@ static void uffd_test_ctx_init_ext(uint6
uffd_test_ops->allocate_area((void **)&area_src);
uffd_test_ops->allocate_area((void **)&area_dst);
- uffd_test_ops->release_pages(area_src);
- uffd_test_ops->release_pages(area_dst);
-
userfaultfd_open(features);
count_verify = malloc(nr_pages * sizeof(unsigned long long));
@@ -437,6 +434,26 @@ static void uffd_test_ctx_init_ext(uint6
*(area_count(area_src, nr) + 1) = 1;
}
+ /*
+ * After initialization of area_src, we must explicitly release pages
+ * for area_dst to make sure it's fully empty. Otherwise we could have
+ * some area_dst pages be errornously initialized with zero pages,
+ * hence we could hit memory corruption later in the test.
+ *
+ * One example is when THP is globally enabled, above allocate_area()
+ * calls could have the two areas merged into a single VMA (as they
+ * will have the same VMA flags so they're mergeable). When we
+ * initialize the area_src above, it's possible that some part of
+ * area_dst could have been faulted in via one huge THP that will be
+ * shared between area_src and area_dst. It could cause some of the
+ * area_dst won't be trapped by missing userfaults.
+ *
+ * This release_pages() will guarantee even if that happened, we'll
+ * proactively split the thp and drop any accidentally initialized
+ * pages within area_dst.
+ */
+ uffd_test_ops->release_pages(area_dst);
+
pipefd = malloc(sizeof(int) * nr_cpus * 2);
if (!pipefd)
err("pipefd");
_
Patches currently in -mm which might be from peterx(a)redhat.com are
mm-userfaultfd-selftests-fix-memory-corruption-with-thp-enabled.patch
mm-smaps-fix-shmem-pte-hole-swap-calculation.patch
mm-smaps-use-vma-vm_pgoff-directly-when-counting-partial-swap.patch
mm-smaps-simplify-shmem-handling-of-pte-holes.patch
mm-memcg-drop-swp_entry_t-in-mc_handle_file_pte.patch
mm-shmem-unconditionally-set-pte-dirty-in-mfill_atomic_install_pte.patch
mm-clear-vmf-pte-after-pte_unmap_same-returns.patch
mm-drop-first_index-last_index-in-zap_details.patch
mm-add-zap_skip_check_mapping-helper.patch
mm-hugetlb-drop-__unmap_hugepage_range-definition-from-hugetlbh.patch
On Sun, Jan 03, 2021 at 07:25:36PM +0100, Helge Deller wrote:
> On 1/3/21 3:27 PM, kernel test robot wrote:
> > FYI, we noticed the following commit (built with gcc-9):
> >
> > commit: 30a3a192730a997bc4afff5765254175b6fb64f3 ("[PATCH] proc/wchan: Use printk format instead of lookup_symbol_name()")
> > url: https://github.com/0day-ci/linux/commits/Helge-Deller/proc-wchan-Use-printk…
> > base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 09162bc32c880a791c6c0668ce0745cf7958f576
> >
> > in testcase: leaking-addresses
> > version: leaking-addresses-x86_64-4f19048-1_20201111
> > [...]
> > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
> I don't see anything wrong with the wchan patch (30a3a192730a997bc4afff5765254175b6fb64f3),
> or that it could have leaked anything.
>
> Maybe the kernel test robot picked up the wchan patch by mistake ?
>
> > [...]
> > [2 wchan] 0xffffc9000000003c
^^^^^
As the root cause of a kernel address exposure, Jann pointed out[2]
commit 152c432b128c, which I've tracked to here, only to discover this
regression was, indeed, reported. :(
So, we have a few things:
1) wchan has been reporting "0" in the default x86 config (ORC unwinder)
for 4 years now.
2) non-x86 or non-ORC, wchan has been leaking raw kernel addresses since
commit 152c432b128c (v5.12).
3) the output of scripts/leaking_addresses.pl is hard to read. :)
We can fix 1 and 2 with:
https://lore.kernel.org/lkml/20210923233105.4045080-1-keescook@chromium.org/
(though that will need a Cc: stable now...)
If we don't do that, we still need to revert 152c432b128c in v5.12 and
later.
We should likely make leaking_addresses.pl a little more readable while
we're at it.
-Kees
[1] https://lore.kernel.org/lkml/20210921193249.el476vlhg5k6lfcq@shells.gnugene…
[2] https://lore.kernel.org/lkml/CAG48ez2zC=+PuNgezH53HBPZ8CXU5H=vkWx7nJs60G8RX…
--
Kees Cook
On Wed, Sep 22, 2021 at 05:06:53PM +0800, Li Wang wrote:
> Hi,
Li,
>
> I confirmed this patch (applied on 5.14) gets rid of the below userfaultfd
> test failure.
>
> # ./userfaultfd anon 16 2
> nr_pages: 4096, nr_pages_per_cpu: 256
> bounces: 1, mode: rnd read, userfaults: 313 missing
> (51+34+37+26+41+28+15+20+16+12+13+7+10+2+0+1) 995 wp
> (121+79+96+53+90+104+48+61+56+82+56+41+49+26+11+22)
> bounces: 0, mode: read, userfaults: 64 missing
> (15+8+10+6+5+2+4+3+3+1+4+0+0+2+0+1) 2157 wp
> (223+274+189+141+116+132+203+153+143+126+110+114+101+66+42+24)
> testing uffd-wp with pagemap (pgsize=4096): done
> testing uffd-wp with pagemap (pgsize=2097152): done
> testing UFFDIO_ZEROPAGE: done.
> testing signal delivery: done.
> testing events (fork, remap, remove): ERROR: nr 3933 memory corruption 0 1
> (errno=0, line=963)
> ERROR: faulting process failed (errno=0, line=1117)
Just to keep a record within this thread - my understanding is above issue is a
separate issue from what Nadav has fixed. The other fix could be:
https://lore.kernel.org/lkml/20210923232512.210092-1-peterx@redhat.com/
When verify with Nadav's patch, please check whether you have thp enabled
globally:
# echo always > /sys/kernel/mm/transparent_hugepage/enabled
Thanks,
--
Peter Xu
If the state is not idle then rdma_bind_addr() will immediately fail and
no change to global state should happen.
For instance if the state is already RDMA_CM_LISTEN then this will corrupt
the src_addr and would cause the test in cma_cancel_operation():
if (cma_any_addr(cma_src_addr(id_priv)) && !id_priv->cma_dev)
To view a mangled src_addr, eg with a IPv6 loopback address but an IPv4
family, failing the test.
This would manifest as this trace from syzkaller:
BUG: KASAN: use-after-free in __list_add_valid+0x93/0xa0 lib/list_debug.c:26
Read of size 8 at addr ffff8881546491e0 by task syz-executor.1/32204
CPU: 1 PID: 32204 Comm: syz-executor.1 Not tainted 5.12.0-rc8-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x141/0x1d7 lib/dump_stack.c:120
print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:232
__kasan_report mm/kasan/report.c:399 [inline]
kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:416
__list_add_valid+0x93/0xa0 lib/list_debug.c:26
__list_add include/linux/list.h:67 [inline]
list_add_tail include/linux/list.h:100 [inline]
cma_listen_on_all drivers/infiniband/core/cma.c:2557 [inline]
rdma_listen+0x787/0xe00 drivers/infiniband/core/cma.c:3751
ucma_listen+0x16a/0x210 drivers/infiniband/core/ucma.c:1102
ucma_write+0x259/0x350 drivers/infiniband/core/ucma.c:1732
vfs_write+0x28e/0xa30 fs/read_write.c:603
ksys_write+0x1ee/0x250 fs/read_write.c:658
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
Which is likely indicating that an rdma_id_private was destroyed without
doing cma_cancel_listens().
Instead of trying to re-use the src_addr memory to indirectly create an
any address build one explicitly on the stack and bind to that as any
other normal flow would do.
Cc: stable(a)vger.kernel.org
Fixes: 732d41c545bb ("RDMA/cma: Make the locking for automatic state transition more clear")
Reported-by: syzbot+6bb0528b13611047209c(a)syzkaller.appspotmail.com
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
---
drivers/infiniband/core/cma.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index c40791baced588..a1315b4da1a6bf 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -3771,9 +3771,13 @@ int rdma_listen(struct rdma_cm_id *id, int backlog)
int ret;
if (!cma_comp_exch(id_priv, RDMA_CM_ADDR_BOUND, RDMA_CM_LISTEN)) {
+ struct sockaddr_in any_in = {
+ .sin_family = AF_INET,
+ .sin_addr.s_addr = htonl(INADDR_ANY),
+ };
+
/* For a well behaved ULP state will be RDMA_CM_IDLE */
- id->route.addr.src_addr.ss_family = AF_INET;
- ret = rdma_bind_addr(id, cma_src_addr(id_priv));
+ ret = rdma_bind_addr(id, (struct sockaddr *)&any_in);
if (ret)
return ret;
if (WARN_ON(!cma_comp_exch(id_priv, RDMA_CM_ADDR_BOUND,
base-commit: ad17bbef3dd573da937816edc0ab84fed6a17fa6
--
2.33.0