Hi,
Please consider applying the patch from this thread to 5.8.y:
commit f80b08fc44536a311a9f3182e50f318b79076425
The fix should also go into 5.4.y, however the patch needs some minor
adjustments due to surrounding context differences. Attached below is a
version I have tested against 5.4.71.
This solves a "page allocation failure" error that can be reproduced
both on physical hardware, and also under qemu-system-arm. The test
consists of repeatedly running md5sum on a large file. In my tests the
file contains 1GB of random data, while the system has only 256MB RAM.
No other tasks are running or consuming significant memory.
After some time (between 1 and 200 iterations) the kernel reports a page
allocation failure. Additional failures occur fairly quickly thereafter.
The md5sum is correctly computed in each case. The OOM is not invoked.
The backtrace shows a 0-order GFP_ATOMIC was requested, with quite a
bit of memory available, and yet the allocation fails.
Similar error also occurs when "md5sum" is replaced by "scp" or "nc".
The backtrace again shows a 0-order with GFP_ATOMIC that fails, with
plenty of memory available according to the Mem-Info dump.
The problem does not occur under 4.9.y or 4.19.y. Bisction has found
that the problem started to occur with 688fcbfc06e4 ("mm/vmalloc: modify
struct vmap_area to reduce its size") during the 5.4 dev cycle.
I can provide additional logs and details if interested.
Thanks,
Ralph
Below is the f80b08fc445 commit, tweaked to apply to 5.4.y.
From: Charan Teja Reddy <charante(a)codeaurora.org>
Subject: [PATCH] mm, page_alloc: skip ->waternark_boost for atomic order-0
allocations
[upstream commit f80b08fc44536a311a9f3182e50f318b79076425
with context adjusted to match linux-5.4.y]
When boosting is enabled, it is observed that rate of atomic order-0
allocation failures are high due to the fact that free levels in the
system are checked with ->watermark_boost offset. This is not a problem
for sleepable allocations but for atomic allocations which looks like
regression.
This problem is seen frequently on system setup of Android kernel running
on Snapdragon hardware with 4GB RAM size. When no extfrag event occurred
in the system, ->watermark_boost factor is zero, thus the watermark
configurations in the system are:
_watermark = (
[WMARK_MIN] = 1272, --> ~5MB
[WMARK_LOW] = 9067, --> ~36MB
[WMARK_HIGH] = 9385), --> ~38MB
watermark_boost = 0
After launching some memory hungry applications in Android which can cause
extfrag events in the system to an extent that ->watermark_boost can be
set to max i.e. default boost factor makes it to 150% of high watermark.
_watermark = (
[WMARK_MIN] = 1272, --> ~5MB
[WMARK_LOW] = 9067, --> ~36MB
[WMARK_HIGH] = 9385), --> ~38MB
watermark_boost = 14077, -->~57MB
With default system configuration, for an atomic order-0 allocation to
succeed, having free memory of ~2MB will suffice. But boosting makes the
min_wmark to ~61MB thus for an atomic order-0 allocation to be successful
system should have minimum of ~23MB of free memory(from calculations of
zone_watermark_ok(), min = 3/4(min/2)). But failures are observed despite
system is having ~20MB of free memory. In the testing, this is
reproducible as early as first 300secs since boot and with furtherlowram
configurations(<2GB) it is observed as early as first 150secs since boot.
These failures can be avoided by excluding the ->watermark_boost in
watermark caluculations for atomic order-0 allocations.
[akpm(a)linux-foundation.org: fix comment grammar, reflow comment]
[charante(a)codeaurora.org: fix suggested by Mel Gorman]
Link: http://lkml.kernel.org/r/31556793-57b1-1c21-1a9d-22674d9bd938@codeaurora.org
Signed-off-by: Charan Teja Reddy <charante(a)codeaurora.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Vinayak Menon <vinmenon(a)codeaurora.org>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Link: http://lkml.kernel.org/r/1589882284-21010-1-git-send-email-charante@codeaur…
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Ralph Siemsen <ralph.siemsen(a)linaro.org>
---
mm/page_alloc.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aff0bb4629bd..b0e9ea4c220e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3484,7 +3484,8 @@ bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
}
static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
- unsigned long mark, int classzone_idx, unsigned int alloc_flags)
+ unsigned long mark, int classzone_idx,
+ unsigned int alloc_flags, gfp_t gfp_mask)
{
long free_pages = zone_page_state(z, NR_FREE_PAGES);
long cma_pages = 0;
@@ -3505,8 +3506,23 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
if (!order && (free_pages - cma_pages) > mark + z->lowmem_reserve[classzone_idx])
return true;
- return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
- free_pages);
+ if (__zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+ free_pages))
+ return true;
+ /*
+ * Ignore watermark boosting for GFP_ATOMIC order-0 allocations
+ * when checking the min watermark. The min watermark is the
+ * point where boosting is ignored so that kswapd is woken up
+ * when below the low watermark.
+ */
+ if (unlikely(!order && (gfp_mask & __GFP_ATOMIC) && z->watermark_boost
+ && ((alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN))) {
+ mark = z->_watermark[WMARK_MIN];
+ return __zone_watermark_ok(z, order, mark, classzone_idx,
+ alloc_flags, free_pages);
+ }
+
+ return false;
}
bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
@@ -3647,7 +3663,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
if (!zone_watermark_fast(zone, order, mark,
- ac_classzone_idx(ac), alloc_flags)) {
+ ac_classzone_idx(ac), alloc_flags,
+ gfp_mask)) {
int ret;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
--
2.17.1
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 1b02d9e770cd7087f34c743f85ccf5ea8372b047 Mon Sep 17 00:00:00 2001
From: Bartosz Golaszewski <bgolaszewski(a)baylibre.com>
Date: Tue, 8 Sep 2020 15:07:49 +0200
Subject: [PATCH] gpio: mockup: fix resource leak in error path
If the module init function fails after creating the debugs directory,
it's never removed. Add proper cleanup calls to avoid this resource
leak.
Fixes: 9202ba2397d1 ("gpio: mockup: implement event injecting over debugfs")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Bartosz Golaszewski <bgolaszewski(a)baylibre.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
diff --git a/drivers/gpio/gpio-mockup.c b/drivers/gpio/gpio-mockup.c
index bc345185db26..1652897fdf90 100644
--- a/drivers/gpio/gpio-mockup.c
+++ b/drivers/gpio/gpio-mockup.c
@@ -552,6 +552,7 @@ static int __init gpio_mockup_init(void)
err = platform_driver_register(&gpio_mockup_driver);
if (err) {
gpio_mockup_err("error registering platform driver\n");
+ debugfs_remove_recursive(gpio_mockup_dbg_dir);
return err;
}
@@ -582,6 +583,7 @@ static int __init gpio_mockup_init(void)
gpio_mockup_err("error registering device");
platform_driver_unregister(&gpio_mockup_driver);
gpio_mockup_unregister_pdevs();
+ debugfs_remove_recursive(gpio_mockup_dbg_dir);
return PTR_ERR(pdev);
}
DIR_INDEX has been introduced as a compat ext4 feature. That means that
even kernels / tools that don't understand the feature may modify the
filesystem. This works because for kernels not understanding indexed dir
format, internal htree nodes appear just as empty directory entries.
Index dir aware kernels then check the htree structure is still
consistent before using the data. This all worked reasonably well until
metadata checksums were introduced. The problem is that these
effectively made DIR_INDEX only ro-compatible because internal htree
nodes store checksums in a different place than normal directory blocks.
Thus any modification ignorant to DIR_INDEX (or just clearing
EXT4_INDEX_FL from the inode) will effectively cause checksum mismatch
and trigger kernel errors. So we have to be more careful when dealing
with indexed directories on filesystems with checksumming enabled.
1) We just disallow loading and directory inodes with EXT4_INDEX_FL when
DIR_INDEX is not enabled. This is harsh but it should be very rare (it
means someone disabled DIR_INDEX on existing filesystem and didn't run
e2fsck), e2fsck can fix the problem, and we don't want to answer the
difficult question: "Should we rather corrupt the directory more or
should we ignore that DIR_INDEX feature is not set?"
2) When we find out htree structure is corrupted (but the filesystem and
the directory should in support htrees), we continue just ignoring htree
information for reading but we refuse to add new entries to the
directory to avoid corrupting it more.
CC: stable(a)vger.kernel.org
Fixes: dbe89444042a ("ext4: Calculate and verify checksums for htree nodes")
Signed-off-by: Jan Kara <jack(a)suse.cz>
---
fs/ext4/dir.c | 14 ++++++++------
fs/ext4/ext4.h | 5 ++++-
fs/ext4/inode.c | 13 +++++++++++++
fs/ext4/namei.c | 7 +++++++
4 files changed, 32 insertions(+), 7 deletions(-)
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 9f00fc0bf21d..cb9ea593b544 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -129,12 +129,14 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx)
if (err != ERR_BAD_DX_DIR) {
return err;
}
- /*
- * We don't set the inode dirty flag since it's not
- * critical that it get flushed back to the disk.
- */
- ext4_clear_inode_flag(file_inode(file),
- EXT4_INODE_INDEX);
+ /* Can we just clear INDEX flag to ignore htree information? */
+ if (!ext4_has_metadata_csum(sb)) {
+ /*
+ * We don't set the inode dirty flag since it's not
+ * critical that it gets flushed back to the disk.
+ */
+ ext4_clear_inode_flag(inode, EXT4_INODE_INDEX);
+ }
}
if (ext4_has_inline_data(inode)) {
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f8578caba40d..1fd6c1e2ce2a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2482,8 +2482,11 @@ void ext4_insert_dentry(struct inode *inode,
struct ext4_filename *fname);
static inline void ext4_update_dx_flag(struct inode *inode)
{
- if (!ext4_has_feature_dir_index(inode->i_sb))
+ if (!ext4_has_feature_dir_index(inode->i_sb)) {
+ /* ext4_iget() should have caught this... */
+ WARN_ON_ONCE(ext4_has_feature_metadata_csum(inode->i_sb));
ext4_clear_inode_flag(inode, EXT4_INODE_INDEX);
+ }
}
static const unsigned char ext4_filetype_table[] = {
DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 629a25d999f0..d33135308c1b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4615,6 +4615,19 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
ret = -EFSCORRUPTED;
goto bad_inode;
}
+ /*
+ * If dir_index is not enabled but there's dir with INDEX flag set,
+ * we'd normally treat htree data as empty space. But with metadata
+ * checksumming that corrupts checksums so forbid that.
+ */
+ if (!ext4_has_feature_dir_index(sb) && ext4_has_metadata_csum(sb) &&
+ ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) {
+ ext4_error_inode(inode, function, line, 0,
+ "iget: Dir with htree data on filesystem "
+ "without dir_index feature.");
+ ret = -EFSCORRUPTED;
+ goto bad_inode;
+ }
ei->i_disksize = inode->i_size;
#ifdef CONFIG_QUOTA
ei->i_reserved_quota = 0;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 1cb42d940784..deb9f7a02976 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2207,6 +2207,13 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
retval = ext4_dx_add_entry(handle, &fname, dir, inode);
if (!retval || (retval != ERR_BAD_DX_DIR))
goto out;
+ /* Can we just ignore htree data? */
+ if (ext4_has_metadata_csum(sb)) {
+ EXT4_ERROR_INODE(dir,
+ "Directory has corrupted htree index.");
+ retval = -EFSCORRUPTED;
+ goto out;
+ }
ext4_clear_inode_flag(dir, EXT4_INODE_INDEX);
dx_fallback++;
ext4_mark_inode_dirty(handle, dir);
--
2.16.4
Retry.
On Wed, Oct 28, 2020 at 10:10:35AM -0700, Guenter Roeck wrote:
> On Tue, Oct 27, 2020 at 02:50:58PM +0100, Greg Kroah-Hartman wrote:
> > This is the start of the stable review cycle for the 4.19.153 release.
> > There are 264 patches in this series, all will be posted as a response
> > to this one. If anyone has any issues with these being applied, please
> > let me know.
> >
> > Responses should be made by Thu, 29 Oct 2020 13:53:47 +0000.
> > Anything received after that time might be too late.
> >
>
> Build results:
> total: 155 pass: 152 fail: 3
> Failed builds:
> i386:tools/perf
> powerpc:ppc6xx_defconfig
> x86_64:tools/perf
> Qemu test results:
> total: 417 pass: 417 fail: 0
>
> perf failures are as usual. powerpc:
>
> arch/powerpc/kernel/tau_6xx.c: In function 'TAU_init':
> include/linux/workqueue.h:427:24: error: too many arguments for format
>
> Tested-by: Guenter Roeck <linux(a)roeck-us.net>
>
> Guenter
This series replaces all the use of security_capable(current_cred(),
...) with ns_capable{,_noaudit}() which set PF_SUPERPRIV.
This initially come from a review of Landlock by Jann Horn:
https://lore.kernel.org/lkml/CAG48ez1FQVkt78129WozBwFbVhAPyAr9oJAHFHAbbNxEB…
Mickaël Salaün (2):
ptrace: Set PF_SUPERPRIV when checking capability
seccomp: Set PF_SUPERPRIV when checking capability
kernel/ptrace.c | 18 ++++++------------
kernel/seccomp.c | 5 ++---
2 files changed, 8 insertions(+), 15 deletions(-)
base-commit: 3650b228f83adda7e5ee532e2b90429c03f7b9ec
--
2.28.0
This reverts commit 00fdec98d9881bf5173af09aebd353ab3b9ac729.
(but only from 5.2 and prior kernels)
The original commit was a preventive fix based on code-review and was
auto-picked for stable back-port (for better or worse).
It was OK for v5.3+ kernels, but turned up needing an implicit change
68e5c6f073bcf70 "(ARC: entry: EV_Trap expects r10 (vs. r9) to have
exception cause)" merged in v5.3 which itself was not backported.
So to summarize the stable backport of this patch for v5.2 and prior
kernels is busted and it won't boot.
The obvious solution is backport 68e5c6f073bcf70 but that is a pain as
it doesn't revert cleanly and each of affected kernels (so far v4.19,
v4.14, v4.9, v4.4) needs a slightly different massaged varaint.
So the easier fix is to simply revert the backport from 5.2 and prior.
The issue was not a big deal as it would cause strace to sporadically
not work correctly.
Waldemar Brodkorb first reported this when running ARC uClibc regressions
on latest stable kernels (with offending backport). Once he bisected it,
the analysis was trivial, so thx to him for this.
Reported-by: Waldemar Brodkorb <wbx(a)uclibc-ng.org>
Bisected-by: Waldemar Brodkorb <wbx(a)uclibc-ng.org>
Cc: stable <stable(a)vger.kernel.org> # 5.2 and prior
Signed-off-by: Vineet Gupta <vgupta(a)synopsys.com>
---
arch/arc/kernel/entry.S | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/arch/arc/kernel/entry.S b/arch/arc/kernel/entry.S
index ea00c8a17f07..60406ec62eb8 100644
--- a/arch/arc/kernel/entry.S
+++ b/arch/arc/kernel/entry.S
@@ -165,6 +165,7 @@ END(EV_Extension)
tracesys:
; save EFA in case tracer wants the PC of traced task
; using ERET won't work since next-PC has already committed
+ lr r12, [efa]
GET_CURR_TASK_FIELD_PTR TASK_THREAD, r11
st r12, [r11, THREAD_FAULT_ADDR] ; thread.fault_address
@@ -207,9 +208,15 @@ tracesys_exit:
; Breakpoint TRAP
; ---------------------------------------------
trap_with_param:
- mov r0, r12 ; EFA in case ptracer/gdb wants stop_pc
+
+ ; stop_pc info by gdb needs this info
+ lr r0, [efa]
mov r1, sp
+ ; Now that we have read EFA, it is safe to do "fake" rtie
+ ; and get out of CPU exception mode
+ FAKE_RET_FROM_EXCPN
+
; Save callee regs in case gdb wants to have a look
; SP will grow up by size of CALLEE Reg-File
; NOTE: clobbers r12
@@ -236,10 +243,6 @@ ENTRY(EV_Trap)
EXCEPTION_PROLOGUE
- lr r12, [efa]
-
- FAKE_RET_FROM_EXCPN
-
;============ TRAP 1 :breakpoints
; Check ECR for trap with arg (PROLOGUE ensures r10 has ECR)
bmsk.f 0, r10, 7
@@ -247,6 +250,9 @@ ENTRY(EV_Trap)
;============ TRAP (no param): syscall top level
+ ; First return from Exception to pure K mode (Exception/IRQs renabled)
+ FAKE_RET_FROM_EXCPN
+
; If syscall tracing ongoing, invoke pre-post-hooks
GET_CURR_THR_INFO_FLAGS r10
btst r10, TIF_SYSCALL_TRACE
--
2.25.1
As the error capture will compress user buffers as directed to by the
user, it can take an arbitrary amount of time and space. Break up the
compression loops with a call to cond_resched(), that will allow other
processes to schedule (avoiding the soft lockups) and also serve as a
warning should we try to make this loop atomic in the future.
Signed-off-by: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
Reviewed-by: Mika Kuoppala <mika.kuoppala(a)linux.intel.com>
---
drivers/gpu/drm/i915/i915_gpu_error.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 3e6cbb0d1150..a635ec8d0b94 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -311,6 +311,8 @@ static int compress_page(struct i915_vma_compress *c,
if (zlib_deflate(zstream, Z_NO_FLUSH) != Z_OK)
return -EIO;
+
+ cond_resched();
} while (zstream->avail_in);
/* Fallback to uncompressed if we increase size? */
@@ -397,6 +399,7 @@ static int compress_page(struct i915_vma_compress *c,
if (!(wc && i915_memcpy_from_wc(ptr, src, PAGE_SIZE)))
memcpy(ptr, src, PAGE_SIZE);
dst->pages[dst->page_count++] = ptr;
+ cond_resched();
return 0;
}
--
2.20.1