The patch titled
Subject: mm/memory.c: avoid access flag update TLB flush for retried page fault
has been removed from the -mm tree. Its filename was
mm-avoid-access-flag-update-tlb-flush-for-retried-page-fault.patch
This patch was dropped because it was nacked
------------------------------------------------------
From: Yang Shi <yang.shi(a)linux.alibaba.com>
Subject: mm/memory.c: avoid access flag update TLB flush for retried page fault
Recently we found a regression when running the will_it_scale/page_fault3 test
on ARM64: over 70% down for the multi-process cases and over 20% down
for the multi-thread cases. It turns out the regression is caused by
commit 89b15332af7c0312a41e50846819ca6613b58b4c ("mm: drop mmap_sem before
calling balance_dirty_pages() in write fault").

The test mmaps a memory-sized file and then writes to the mapping; this
makes all memory dirty and triggers dirty page throttling, at which point
that upstream commit releases mmap_sem and retries the page fault. The
retried page fault sees the correct PTEs installed by the first try, then
updates the dirty bit, clears the read-only bit, and flushes the TLBs on
ARM. The regression is caused by this excessive TLB flushing. x86 is fine
since it doesn't clear the read-only bit, so there is no need to flush the
TLB for this case.
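For reference, a rough userspace reproducer in the spirit of page_fault3
might look like the sketch below. This is hypothetical, not the actual
will_it_scale source; the file size and the madvise-based reset are
assumptions. Every store hits a write fault on a clean shared file page,
and once enough pages are dirty the fault path reaches
balance_dirty_pages() and gets retried.

	#include <stdlib.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define MAP_SIZE	(128UL << 20)	/* assumed size, not from the report */

	int main(void)
	{
		long page = sysconf(_SC_PAGESIZE);
		char path[] = "/tmp/pf3-XXXXXX";
		int fd = mkstemp(path);
		unsigned long off;
		char *map;

		if (fd < 0 || ftruncate(fd, MAP_SIZE))
			return 1;
		unlink(path);

		map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (map == MAP_FAILED)
			return 1;

		for (;;) {
			/* each store takes a write fault on a clean file page */
			for (off = 0; off < MAP_SIZE; off += page)
				map[off] = 1;
			/* zap the PTEs so the next pass write-faults again */
			if (madvise(map, MAP_SIZE, MADV_DONTNEED))
				return 1;
		}
	}
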
A page fault can be retried due to:

1. Waiting for page readahead
2. Waiting for a page to be swapped in
3. Waiting for dirty page throttling

The first two cases don't have PTEs set up at all, so the retried page
fault installs the PTEs itself and never reaches this code. But case #3
usually has PTEs installed already, so the retried page fault does reach
the dirty bit and read-only bit update. It seems unnecessary to modify
those bits again for #3, since they should already have been set by the
first attempt.

Of course, a parallel page fault may set up the PTEs, but we only need to
care about the write fault. If the parallel page fault set up a writable
and dirty PTE, the retried fault doesn't need to do anything extra. If the
parallel page fault set up a clean read-only PTE, the retried fault should
just call do_wp_page() and then return, as the code snippet below shows:
	if (vmf->flags & FAULT_FLAG_WRITE) {
		if (!pte_write(entry))
			return do_wp_page(vmf);
	}
With this fix the test result gets back to normal.
[yang.shi(a)linux.alibaba.com: incorporate comment from Will Deacon, update commit log per discussion]
Link: http://lkml.kernel.org/r/1594848990-55657-1-git-send-email-yang.shi@linux.a…
Link: http://lkml.kernel.org/r/1594148072-91273-1-git-send-email-yang.shi@linux.a…
Signed-off-by: Yang Shi <yang.shi(a)linux.alibaba.com>
Reported-by: Xu Yu <xuyu(a)linux.alibaba.com>
Debugged-by: Xu Yu <xuyu(a)linux.alibaba.com>
Tested-by: Xu Yu <xuyu(a)linux.alibaba.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Josef Bacik <josef(a)toxicpanda.com>
Cc: Hillf Danton <hdanton(a)sina.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
--- a/mm/memory.c~mm-avoid-access-flag-update-tlb-flush-for-retried-page-fault
+++ a/mm/memory.c
@@ -4241,8 +4241,14 @@ static vm_fault_t handle_pte_fault(struc
 	if (vmf->flags & FAULT_FLAG_WRITE) {
 		if (!pte_write(entry))
 			return do_wp_page(vmf);
-		entry = pte_mkdirty(entry);
 	}
+
+	if (vmf->flags & FAULT_FLAG_TRIED)
+		goto unlock;
+
+	if (vmf->flags & FAULT_FLAG_WRITE)
+		entry = pte_mkdirty(entry);
+
 	entry = pte_mkyoung(entry);
 	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
 				vmf->flags & FAULT_FLAG_WRITE)) {
_
Patches currently in -mm which might be from yang.shi(a)linux.alibaba.com are
mm-filemap-clear-idle-flag-for-writes.patch
mm-filemap-add-missing-fgp_-flags-in-kerneldoc-comment-for-pagecache_get_page.patch
mm-thp-remove-debug_cow-switch.patch
If the link register was zeroed out, do not attempt to use it for
address calculations; there are currently no fixup handlers for such
accesses, so they can lead to a panic during unwind. Since panicking
triggers another unwind, this can turn into an infinite loop. If this
occurs during start_kernel(), it can prevent the kernel from booting.

Commit 59b6359dd92d ("ARM: 8702/1: head-common.S: Clear lr before
jumping to start_kernel()") intentionally zeros out the link register in
__mmap_switched, which tail-calls into start_kernel(). Test for this
condition so that unwinding stops correctly when it is initiated within
start_kernel().
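For illustration, a C analogue of the termination check added below: a
frame-pointer walker that stops as soon as the saved lr of the next frame
is zero. The frame layout and names here are assumptions made for the
sketch, not the kernel's actual arch/arm unwinder code.

	/* Hypothetical frame record: lr saved at [fp + 4], as in the diff below. */
	struct clang_frame {
		unsigned long fp;	/* caller's saved frame pointer */
		unsigned long lr;	/* caller's saved return address */
	};

	static void walk_stack(const struct clang_frame *frame)
	{
		while (frame) {
			unsigned long saved_lr = frame->lr;

			/*
			 * __mmap_switched zeroes lr before tail-calling
			 * start_kernel(), so a zero saved lr means there is
			 * no caller left to report: stop here instead of
			 * dereferencing around address 0 and panicking
			 * mid-unwind.
			 */
			if (!saved_lr)
				break;

			/* record/print saved_lr here ... */
			frame = (const struct clang_frame *)frame->fp;
		}
	}
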
Cc: stable(a)vger.kernel.org
Fixes: 6dc5fd93b2f1 ("ARM: 8900/1: UNWINDER_FRAME_POINTER implementation for Clang")
Reported-by: Miles Chen <miles.chen(a)mediatek.com>
Signed-off-by: Nick Desaulniers <ndesaulniers(a)google.com>
---
arch/arm/lib/backtrace-clang.S | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/arm/lib/backtrace-clang.S b/arch/arm/lib/backtrace-clang.S
index 6174c45f53a5..5388ac664c12 100644
--- a/arch/arm/lib/backtrace-clang.S
+++ b/arch/arm/lib/backtrace-clang.S
@@ -144,6 +144,8 @@ for_each_frame: tst frame, mask @ Check for address exceptions
  */
 1003:		ldr	sv_lr, [sv_fp, #4]	@ get saved lr from next frame
+		tst	sv_lr, #0		@ If there's no previous lr,
+		beq	finished_setup		@ we're done.
 		ldr	r0, [sv_lr, #-4]	@ get call instruction
 		ldr	r3, .Lopcode+4
 		and	r2, r3, r0		@ is this a bl call
--
2.28.0.163.g6104cc2f0b6-goog
Hi Andy,
On Fri, Aug 07, 2020 at 10:55:11AM -0700, Andy Lutomirski wrote:
> >> This is still another non-cryptographic PRNG.
> >
> > Absolutely. During some discussions regarding the possibility of using
> > CSPRNGs, orders around hundreds of CPU cycles were mentioned for them,
> > which can definitely be a huge waste of precious resources for some
> > workloads, possibly causing the addition of a few percent extra machines
> > in certain environments just to keep the average load under a certain
> > threshold.
>
> I think the real random.c can run plenty fast. It's ChaCha20 plus ludicrous
> overhead right now. I'm working (slowly) on making the overhead go away. I'm
> hoping to have something testable in a few days. As it stands, there is a
> ton of indirection, a pile of locks, multiple time comparisons, per-node and
> percpu buffers (why both?), wasted bits due to alignment, and probably other
> things that can be cleaned up. I'm trying to come up with something that is
> fast and has easy-to-understand semantics.
>
> You can follow along at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=rando…
Thanks, we'll see. I developed a quick test tool that's meant to be easy
to use to measure the performance impact on connect/accept. I have not
yet run it on a modified PRNG to verify that it works; I'll send it once
I've tested it. I'd definitely like to see no measurable performance
drop, and ideally even a small performance increase (as Tausworthe isn't
the lightest thing around either, so we do have a little margin).
Willy
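A minimal connect/accept micro-benchmark of the kind described above
might look like the following sketch. It is hypothetical, not the actual
test tool mentioned in the mail; it simply times loopback connect/accept
pairs, which is the path the discussion is concerned about.

	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <time.h>
	#include <unistd.h>

	int main(void)
	{
		struct sockaddr_in addr;
		socklen_t alen = sizeof(addr);
		int lsock, i, n = 10000;
		struct timespec t0, t1;
		double secs;

		lsock = socket(AF_INET, SOCK_STREAM, 0);
		memset(&addr, 0, sizeof(addr));
		addr.sin_family = AF_INET;
		addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
		addr.sin_port = 0;		/* let the kernel pick a port */
		if (bind(lsock, (struct sockaddr *)&addr, sizeof(addr)) ||
		    listen(lsock, 128) ||
		    getsockname(lsock, (struct sockaddr *)&addr, &alen))
			return 1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < n; i++) {
			int c = socket(AF_INET, SOCK_STREAM, 0);
			int s;

			if (connect(c, (struct sockaddr *)&addr, sizeof(addr)))
				return 1;
			s = accept(lsock, NULL, NULL);
			close(s);
			close(c);
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);

		secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
		printf("%d connect/accept pairs in %.3fs (%.0f/s)\n", n, secs, n / secs);
		return 0;
	}
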
The patch below does not apply to the 5.8-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d1719f70d0a5b83b12786a7dbc5b9fe396469016 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Thu, 30 Jul 2020 13:43:53 -0600
Subject: [PATCH] io_uring: don't touch 'ctx' after installing file descriptor
As soon as we install the file descriptor, we have to assume that it
can get arbitrarily closed. We currently account memory (and note that
we did) after installing the ring fd, which means there is a potential
use-after-free if the fd is closed right after being installed, but
before we fiddle with the ctx.
In fact, syzbot reported this exact scenario:
BUG: KASAN: use-after-free in io_account_mem fs/io_uring.c:7397 [inline]
BUG: KASAN: use-after-free in io_uring_create fs/io_uring.c:8369 [inline]
BUG: KASAN: use-after-free in io_uring_setup+0x2797/0x2910 fs/io_uring.c:8400
Read of size 1 at addr ffff888087a41044 by task syz-executor.5/18145
CPU: 0 PID: 18145 Comm: syz-executor.5 Not tainted 5.8.0-rc7-next-20200729-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x18f/0x20d lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
__kasan_report mm/kasan/report.c:513 [inline]
kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
io_account_mem fs/io_uring.c:7397 [inline]
io_uring_create fs/io_uring.c:8369 [inline]
io_uring_setup+0x2797/0x2910 fs/io_uring.c:8400
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45c429
Code: 8d b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 5b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f8f121d0c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000008540 RCX: 000000000045c429
RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000196
RBP: 000000000078bf38 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000078bf0c
R13: 00007fff86698cff R14: 00007f8f121d19c0 R15: 000000000078bf0c
Move the accounting of the locked memory used by the ring to before we
get and install the ring file descriptor.
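The rule being enforced here is the usual one for anon-inode style
interfaces: finish every access to the private context before fd_install()
publishes the fd. A hedged sketch of that pattern follows; the names
(struct my_ctx, publish_ctx) are hypothetical and this is not the actual
io_uring_create() code.

	#include <linux/anon_inodes.h>
	#include <linux/err.h>
	#include <linux/file.h>
	#include <linux/fs.h>

	/* Hypothetical private context; stands in for the ring context here. */
	struct my_ctx {
		bool accounted;
	};

	static int publish_ctx(struct my_ctx *ctx, const struct file_operations *fops)
	{
		struct file *file;
		int fd;

		fd = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
		if (fd < 0)
			return fd;

		file = anon_inode_getfile("[my-ctx]", fops, ctx, O_RDWR);
		if (IS_ERR(file)) {
			put_unused_fd(fd);
			return PTR_ERR(file);
		}

		/* Last point where touching ctx is safe; do all accounting here. */
		ctx->accounted = true;

		/*
		 * After fd_install() the fd is visible to userspace; another
		 * thread may close() it and tear ctx down at any moment, so
		 * ctx must not be dereferenced past this point.
		 */
		fd_install(fd, file);
		return fd;
	}
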
Cc: stable(a)vger.kernel.org
Reported-by: syzbot+9d46305e76057f30c74e(a)syzkaller.appspotmail.com
Fixes: 309758254ea6 ("io_uring: report pinned memory usage")
Reviewed-by: Stefano Garzarella <sgarzare(a)redhat.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index fabf0b692384..33702f3b5af8 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8329,6 +8329,15 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
 		ret = -EFAULT;
 		goto err;
 	}
+
+	/*
+	 * Account memory _before_ installing the file descriptor. Once
+	 * the descriptor is installed, it can get closed at any time.
+	 */
+	io_account_mem(ctx, ring_pages(p->sq_entries, p->cq_entries),
+		       ACCT_LOCKED);
+	ctx->limit_mem = limit_mem;
+
 	/*
 	 * Install ring fd as the very last thing, so we don't risk someone
 	 * having closed it before we finish setup
@@ -8338,9 +8347,6 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
 		goto err;
 
 	trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags);
-	io_account_mem(ctx, ring_pages(p->sq_entries, p->cq_entries),
-		       ACCT_LOCKED);
-	ctx->limit_mem = limit_mem;
 	return ret;
 err:
 	io_ring_ctx_wait_and_kill(ctx);