From: Minchan Kim <minchan(a)kernel.org>
Subject: zram: fix broken page writeback
commit 0d8359620d9b ("zram: support page writeback") introduced two
problems. It overwrites writeback_store's return value as kstrtol's
return value, which makes return value zero so user could see zero as
return value of write syscall even though it wrote data successfully.
It also breaks index value in the loop in that it doesn't increase the
index any longer. It means it can write only first starting block index
so user couldn't write all idle pages in the zram so lose memory saving
chance.
This patch fixes those issues.
Link: https://lkml.kernel.org/r/20210312173949.2197662-2-minchan@kernel.org
Fixes: 0d8359620d9b("zram: support page writeback")
Signed-off-by: Minchan Kim <minchan(a)kernel.org>
Reported-by: Amos Bianchi <amosbianchi(a)google.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky(a)gmail.com>
Cc: John Dias <joaodias(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/block/zram/zram_drv.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/drivers/block/zram/zram_drv.c~zram-fix-broken-page-writeback
+++ a/drivers/block/zram/zram_drv.c
@@ -638,8 +638,8 @@ static ssize_t writeback_store(struct de
if (strncmp(buf, PAGE_WB_SIG, sizeof(PAGE_WB_SIG) - 1))
return -EINVAL;
- ret = kstrtol(buf + sizeof(PAGE_WB_SIG) - 1, 10, &index);
- if (ret || index >= nr_pages)
+ if (kstrtol(buf + sizeof(PAGE_WB_SIG) - 1, 10, &index) ||
+ index >= nr_pages)
return -EINVAL;
nr_pages = 1;
@@ -663,7 +663,7 @@ static ssize_t writeback_store(struct de
goto release_init_lock;
}
- while (nr_pages--) {
+ for (; nr_pages != 0; index++, nr_pages--) {
struct bio_vec bvec;
bvec.bv_page = page;
_
From: Minchan Kim <minchan(a)kernel.org>
Subject: zram: fix return value on writeback_store
writeback_store's return value is overwritten by submit_bio_wait's return
value. Thus, writeback_store will return zero since there was no IO
error. In the end, write syscall from userspace will see the zero as
return value, which could make the process stall to keep trying the write
until it will succeed.
Link: https://lkml.kernel.org/r/20210312173949.2197662-1-minchan@kernel.org
Fixes: 3b82a051c101("drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store")
Signed-off-by: Minchan Kim <minchan(a)kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky(a)gmail.com>
Cc: Colin Ian King <colin.king(a)canonical.com>
Cc: John Dias <joaodias(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/block/zram/zram_drv.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
--- a/drivers/block/zram/zram_drv.c~zram-fix-return-value-on-writeback_store
+++ a/drivers/block/zram/zram_drv.c
@@ -627,7 +627,7 @@ static ssize_t writeback_store(struct de
struct bio_vec bio_vec;
struct page *page;
ssize_t ret = len;
- int mode;
+ int mode, err;
unsigned long blk_idx = 0;
if (sysfs_streq(buf, "idle"))
@@ -728,12 +728,17 @@ static ssize_t writeback_store(struct de
* XXX: A single page IO would be inefficient for write
* but it would be not bad as starter.
*/
- ret = submit_bio_wait(&bio);
- if (ret) {
+ err = submit_bio_wait(&bio);
+ if (err) {
zram_slot_lock(zram, index);
zram_clear_flag(zram, index, ZRAM_UNDER_WB);
zram_clear_flag(zram, index, ZRAM_IDLE);
zram_slot_unlock(zram, index);
+ /*
+ * Return last IO error unless every IO were
+ * not suceeded.
+ */
+ ret = err;
continue;
}
_
From: Zhou Guanghui <zhouguanghui1(a)huawei.com>
Subject: mm/memcg: set memcg when splitting page
As described in the split_page() comment, for the non-compound high order
page, the sub-pages must be freed individually. If the memcg of the first
page is valid, the tail pages cannot be uncharged when be freed.
For example, when alloc_pages_exact is used to allocate 1MB continuous
physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
set). When make_alloc_exact free the unused 1MB and free_pages_exact free
the applied 1MB, actually, only 4KB(one page) is uncharged.
Therefore, the memcg of the tail page needs to be set when splitting a
page.
Michel:
There are at least two explicit users of __GFP_ACCOUNT with
alloc_exact_pages added recently. See 7efe8ef274024 ("KVM: arm64:
Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
just a theoretical issue.
Link: https://lkml.kernel.org/r/20210304074053.65527-3-zhouguanghui1@huawei.com
Signed-off-by: Zhou Guanghui <zhouguanghui1(a)huawei.com>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Reviewed-by: Zi Yan <ziy(a)nvidia.com>
Reviewed-by: Shakeel Butt <shakeelb(a)google.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Hanjun Guo <guohanjun(a)huawei.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Nicholas Piggin <npiggin(a)gmail.com>
Cc: Rui Xiang <rui.xiang(a)huawei.com>
Cc: Tianhong Ding <dingtianhong(a)huawei.com>
Cc: Weilong Chen <chenweilong(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 1 +
1 file changed, 1 insertion(+)
--- a/mm/page_alloc.c~mm-memcg-set-memcg-when-split-page
+++ a/mm/page_alloc.c
@@ -3314,6 +3314,7 @@ void split_page(struct page *page, unsig
for (i = 1; i < (1 << order); i++)
set_page_refcounted(page + i);
split_page_owner(page, 1 << order);
+ split_page_memcg(page, 1 << order);
}
EXPORT_SYMBOL_GPL(split_page);
_
From: Nadav Amit <namit(a)vmware.com>
Subject: mm/userfaultfd: fix memory corruption due to writeprotect
Userfaultfd self-test fails occasionally, indicating a memory corruption.
Analyzing this problem indicates that there is a real bug since mmap_lock
is only taken for read in mwriteprotect_range() and defers flushes, and
since there is insufficient consideration of concurrent deferred TLB
flushes in wp_page_copy(). Although the PTE is flushed from the TLBs in
wp_page_copy(), this flush takes place after the copy has already been
performed, and therefore changes of the page are possible between the time
of the copy and the time in which the PTE is flushed.
To make matters worse, memory-unprotection using userfaultfd also poses a
problem. Although memory unprotection is logically a promotion of PTE
permissions, and therefore should not require a TLB flush, the current
userrfaultfd code might actually cause a demotion of the architectural PTE
permission: when userfaultfd_writeprotect() unprotects memory region, it
unintentionally *clears* the RW-bit if it was already set. Note that this
unprotecting a PTE that is not write-protected is a valid use-case: the
userfaultfd monitor might ask to unprotect a region that holds both
write-protected and write-unprotected PTEs.
The scenario that happens in selftests/vm/userfaultfd is as follows:
cpu0 cpu1 cpu2
---- ---- ----
[ Writable PTE
cached in TLB ]
userfaultfd_writeprotect()
[ write-*unprotect* ]
mwriteprotect_range()
mmap_read_lock()
change_protection()
change_protection_range()
...
change_pte_range()
[ *clear* “write”-bit ]
[ defer TLB flushes ]
[ page-fault ]
...
wp_page_copy()
cow_user_page()
[ copy page ]
[ write to old
page ]
...
set_pte_at_notify()
A similar scenario can happen:
cpu0 cpu1 cpu2 cpu3
---- ---- ---- ----
[ Writable PTE
cached in TLB ]
userfaultfd_writeprotect()
[ write-protect ]
[ deferred TLB flush ]
userfaultfd_writeprotect()
[ write-unprotect ]
[ deferred TLB flush]
[ page-fault ]
wp_page_copy()
cow_user_page()
[ copy page ]
... [ write to page ]
set_pte_at_notify()
This race exists since commit 292924b26024 ("userfaultfd: wp: apply
_PAGE_UFFD_WP bit"). Yet, as Yu Zhao pointed, these races became apparent
since commit 09854ba94c6a ("mm: do_wp_page() simplification") which made
wp_page_copy() more likely to take place, specifically if page_count(page)
> 1.
To resolve the aforementioned races, check whether there are pending
flushes on uffd-write-protected VMAs, and if there are, perform a flush
before doing the COW.
Further optimizations will follow to avoid during uffd-write-unprotect
unnecassary PTE write-protection and TLB flushes.
Link: https://lkml.kernel.org/r/20210304095423.3825684-1-namit@vmware.com
Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Signed-off-by: Nadav Amit <namit(a)vmware.com>
Suggested-by: Yu Zhao <yuzhao(a)google.com>
Reviewed-by: Peter Xu <peterx(a)redhat.com>
Tested-by: Peter Xu <peterx(a)redhat.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Pavel Emelyanov <xemul(a)openvz.org>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Will Deacon <will(a)kernel.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: <stable(a)vger.kernel.org> [5.9+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/mm/memory.c~mm-userfaultfd-fix-memory-corruption-due-to-writeprotect
+++ a/mm/memory.c
@@ -3097,6 +3097,14 @@ static vm_fault_t do_wp_page(struct vm_f
return handle_userfault(vmf, VM_UFFD_WP);
}
+ /*
+ * Userfaultfd write-protect can defer flushes. Ensure the TLB
+ * is flushed in this case before copying.
+ */
+ if (unlikely(userfaultfd_wp(vmf->vma) &&
+ mm_tlb_flush_pending(vmf->vma->vm_mm)))
+ flush_tlb_page(vmf->vma, vmf->address);
+
vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
if (!vmf->page) {
/*
_
From: Andrey Konovalov <andreyknvl(a)google.com>
Subject: kasan: fix KASAN_STACK dependency for HW_TAGS
There's a runtime failure when running HW_TAGS-enabled kernel built with
GCC on hardware that doesn't support MTE. GCC-built kernels always have
CONFIG_KASAN_STACK enabled, even though stack instrumentation isn't
supported by HW_TAGS. Having that config enabled causes KASAN to issue
MTE-only instructions to unpoison kernel stacks, which causes the failure.
Fix the issue by disallowing CONFIG_KASAN_STACK when HW_TAGS is used.
(The commit that introduced CONFIG_KASAN_HW_TAGS specified proper
dependency for CONFIG_KASAN_STACK_ENABLE but not for CONFIG_KASAN_STACK.)
Link: https://lkml.kernel.org/r/59e75426241dbb5611277758c8d4d6f5f9298dac.16152154…
Fixes: 6a63a63ff1ac ("kasan: introduce CONFIG_KASAN_HW_TAGS")
Signed-off-by: Andrey Konovalov <andreyknvl(a)google.com>
Reported-by: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Andrey Ryabinin <aryabinin(a)virtuozzo.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: Peter Collingbourne <pcc(a)google.com>
Cc: Evgenii Stepanov <eugenis(a)google.com>
Cc: Branislav Rankov <Branislav.Rankov(a)arm.com>
Cc: Kevin Brodsky <kevin.brodsky(a)arm.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/Kconfig.kasan | 1 +
1 file changed, 1 insertion(+)
--- a/lib/Kconfig.kasan~kasan-fix-kasan_stack-dependency-for-hw_tags
+++ a/lib/Kconfig.kasan
@@ -156,6 +156,7 @@ config KASAN_STACK_ENABLE
config KASAN_STACK
int
+ depends on KASAN_GENERIC || KASAN_SW_TAGS
default 1 if KASAN_STACK_ENABLE || CC_IS_GCC
default 0
_
From: Andrey Konovalov <andreyknvl(a)google.com>
Subject: kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC
Currently, kasan_free_nondeferred_pages()->kasan_free_pages() is called
after debug_pagealloc_unmap_pages(). This causes a crash when
debug_pagealloc is enabled, as HW_TAGS KASAN can't set tags on an
unmapped page.
This patch puts kasan_free_nondeferred_pages() before
debug_pagealloc_unmap_pages() and arch_free_page(), which can also make
the page unavailable.
Link: https://lkml.kernel.org/r/24cd7db274090f0e5bc3adcdc7399243668e3171.16149873…
Fixes: 94ab5b61ee16 ("kasan, arm64: enable CONFIG_KASAN_HW_TAGS")
Signed-off-by: Andrey Konovalov <andreyknvl(a)google.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Andrey Ryabinin <aryabinin(a)virtuozzo.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: Peter Collingbourne <pcc(a)google.com>
Cc: Evgenii Stepanov <eugenis(a)google.com>
Cc: Branislav Rankov <Branislav.Rankov(a)arm.com>
Cc: Kevin Brodsky <kevin.brodsky(a)arm.com>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
--- a/mm/page_alloc.c~kasan-mm-fix-crash-with-hw_tags-and-debug_pagealloc
+++ a/mm/page_alloc.c
@@ -1282,6 +1282,12 @@ static __always_inline bool free_pages_p
kernel_poison_pages(page, 1 << order);
/*
+ * With hardware tag-based KASAN, memory tags must be set before the
+ * page becomes unavailable via debug_pagealloc or arch_free_page.
+ */
+ kasan_free_nondeferred_pages(page, order);
+
+ /*
* arch_free_page() can make the page's contents inaccessible. s390
* does this. So nothing which can access the page's contents should
* happen after this.
@@ -1290,8 +1296,6 @@ static __always_inline bool free_pages_p
debug_pagealloc_unmap_pages(page, 1 << order);
- kasan_free_nondeferred_pages(page, order);
-
return true;
}
_
From: Suren Baghdasaryan <surenb(a)google.com>
Subject: mm/madvise: replace ptrace attach requirement for process_madvise
process_madvise currently requires ptrace attach capability.
PTRACE_MODE_ATTACH gives one process complete control over another
process. It effectively removes the security boundary between the two
processes (in one direction). Granting ptrace attach capability even to a
system process is considered dangerous since it creates an attack surface.
This severely limits the usage of this API.
The operations process_madvise can perform do not affect the correctness
of the operation of the target process; they only affect where the data is
physically located (and therefore, how fast it can be accessed). What we
want is the ability for one process to influence another process in order
to optimize performance across the entire system while leaving the
security boundary intact.
Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ and
CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata and
CAP_SYS_NICE for influencing process performance.
Link: https://lkml.kernel.org/r/20210303185807.2160264-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
Reviewed-by: Kees Cook <keescook(a)chromium.org>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Acked-by: David Rientjes <rientjes(a)google.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Jeff Vander Stoep <jeffv(a)google.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Tim Murray <timmurray(a)google.com>
Cc: Florian Weimer <fweimer(a)redhat.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: James Morris <jmorris(a)namei.org>
Cc: <stable(a)vger.kernel.org> [5.10+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/madvise.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
--- a/mm/madvise.c~mm-madvise-replace-ptrace-attach-requirement-for-process_madvise
+++ a/mm/madvise.c
@@ -1198,12 +1198,22 @@ SYSCALL_DEFINE5(process_madvise, int, pi
goto release_task;
}
- mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+ /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
+ mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
if (IS_ERR_OR_NULL(mm)) {
ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
goto release_task;
}
+ /*
+ * Require CAP_SYS_NICE for influencing process performance. Note that
+ * only non-destructive hints are currently supported.
+ */
+ if (!capable(CAP_SYS_NICE)) {
+ ret = -EPERM;
+ goto release_mm;
+ }
+
total_len = iov_iter_count(&iter);
while (iov_iter_count(&iter)) {
@@ -1218,6 +1228,7 @@ SYSCALL_DEFINE5(process_madvise, int, pi
if (ret == 0)
ret = total_len - iov_iter_count(&iter);
+release_mm:
mmput(mm);
release_task:
put_task_struct(task);
_
From: Arnd Bergmann <arnd(a)arndb.de>
Subject: linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP*
Separating compiler-clang.h from compiler-gcc.h inadventently dropped the
definitions of the three HAVE_BUILTIN_BSWAP macros, which requires falling
back to the open-coded version and hoping that the compiler detects it.
Since all versions of clang support the __builtin_bswap interfaces, add
back the flags and have the headers pick these up automatically.
This results in a 4% improvement of compilation speed for arm defconfig.
Note: it might also be worth revisiting which architectures set
CONFIG_ARCH_USE_BUILTIN_BSWAP for one compiler or the other, today this is
set on six architectures (arm32, csky, mips, powerpc, s390, x86), while
another ten architectures define custom helpers (alpha, arc, ia64, m68k,
mips, nios2, parisc, sh, sparc, xtensa), and the rest (arm64, h8300,
hexagon, microblaze, nds32, openrisc, riscv) just get the unoptimized
version and rely on the compiler to detect it.
A long time ago, the compiler builtins were architecture specific, but
nowadays, all compilers that are able to build the kernel have correct
implementations of them, though some may not be as optimized as the inline
asm versions.
The patch that dropped the optimization landed in v4.19, so as discussed
it would be fairly safe to backport this revert to stable kernels to the
4.19/5.4/5.10 stable kernels, but there is a remaining risk for
regressions, and it has no known side-effects besides compile speed.
Link: https://lkml.kernel.org/r/20210226161151.2629097-1-arnd@kernel.org
Link: https://lore.kernel.org/lkml/20210225164513.3667778-1-arnd@kernel.org/
Fixes: 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually exclusive")
Signed-off-by: Arnd Bergmann <arnd(a)arndb.de>
Reviewed-by: Nathan Chancellor <nathan(a)kernel.org>
Reviewed-by: Kees Cook <keescook(a)chromium.org>
Acked-by: Miguel Ojeda <ojeda(a)kernel.org>
Acked-by: Nick Desaulniers <ndesaulniers(a)google.com>
Acked-by: Luc Van Oostenryck <luc.vanoostenryck(a)gmail.com>
Cc: Masahiro Yamada <masahiroy(a)kernel.org>
Cc: Nick Hu <nickhu(a)andestech.com>
Cc: Greentime Hu <green.hu(a)gmail.com>
Cc: Vincent Chen <deanbo422(a)gmail.com>
Cc: Paul Walmsley <paul.walmsley(a)sifive.com>
Cc: Palmer Dabbelt <palmer(a)dabbelt.com>
Cc: Albert Ou <aou(a)eecs.berkeley.edu>
Cc: Guo Ren <guoren(a)kernel.org>
Cc: Randy Dunlap <rdunlap(a)infradead.org>
Cc: Sami Tolvanen <samitolvanen(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: Arvind Sankar <nivedita(a)alum.mit.edu>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/compiler-clang.h | 6 ++++++
1 file changed, 6 insertions(+)
--- a/include/linux/compiler-clang.h~linux-compiler-clangh-define-have_builtin_bswap
+++ a/include/linux/compiler-clang.h
@@ -31,6 +31,12 @@
#define __no_sanitize_thread
#endif
+#if defined(CONFIG_ARCH_USE_BUILTIN_BSWAP)
+#define __HAVE_BUILTIN_BSWAP32__
+#define __HAVE_BUILTIN_BSWAP64__
+#define __HAVE_BUILTIN_BSWAP16__
+#endif /* CONFIG_ARCH_USE_BUILTIN_BSWAP */
+
#if __has_feature(undefined_behavior_sanitizer)
/* GCC does not have __SANITIZE_UNDEFINED__ */
#define __no_sanitize_undefined \
_