When investigating a slab cache bloat problem, significant amount of
negative dentry cache was seen, but confusingly they neither got shrunk
by reclaimer (the host has very tight memory) nor be shrunk by dropping
cache. The vmcore shows there are over 14M negative dentry objects on lru,
but tracing result shows they were even not scanned at all. The further
investigation shows the memcg's vfs shrinker_map bit is not set. So the
reclaimer or dropping cache just skip calling vfs shrinker. So we have
to reboot the hosts to get the memory back.
I didn't manage to come up with a reproducer in test environment, and the
problem can't be reproduced after rebooting. But it seems there is race
between shrinker map bit clear and reparenting by code inspection. The
hypothesis is elaborated as below.
The memcg hierarchy on our production environment looks like:
root
/ \
system user
The main workloads are running under user slice's children, and it creates
and removes memcg frequently. So reparenting happens very often under user
slice, but no task is under user slice directly.
So with the frequent reparenting and tight memory pressure, the below
hypothetical race condition may happen:
CPU A CPU B
reparent
dst->nr_items == 0
shrinker:
total_objects == 0
add src->nr_items to dst
set_bit
return SHRINK_EMPTY
clear_bit
child memcg offline
replace child's kmemcg_id with
parent's (in memcg_offline_kmem())
list_lru_del() between shrinker runs
see parent's kmemcg_id
dec dst->nr_items
reparent again
dst->nr_items may go negative
due to concurrent list_lru_del()
The second run of shrinker:
read nr_items without any
synchronization, so it may
see intermediate negative
nr_items then total_objects
may return 0 coincidently
keep the bit cleared
dst->nr_items != 0
skip set_bit
add scr->nr_item to dst
After this point dst->nr_item may never go zero, so reparenting will not
set shrinker_map bit anymore. And since there is no task under user
slice directly, so no new object will be added to its lru to set the
shrinker map bit either. That bit is kept cleared forever.
How does list_lru_del() race with reparenting? It is because
reparenting replaces children's kmemcg_id to parent's without protecting
from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
actually deleting items from child's lru, but dec'ing parent's nr_items,
so the parent's nr_items may go negative as commit
2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
free kmemcg_id on css offline") says.
Since it is impossible that dst->nr_items goes negative and
src->nr_items goes zero at the same time, so it seems we could set the
shrinker map bit iff src->nr_items != 0. We could synchronize
list_lru_count_one() and reparenting with nlru->lock, but it seems
checking src->nr_items in reparenting is the simplest and avoids lock
contention.
Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
Suggested-by: Roman Gushchin <guro(a)fb.com>
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Kirill Tkhai <ktkhai(a)virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb(a)google.com>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: <stable(a)vger.kernel.org> v4.19+
Signed-off-by: Yang Shi <shy828301(a)gmail.com>
---
v4: * Fixed the spelling errors found by Shakeel
* Added ack/review tag from Kirill and Shakeel
v3: * Revised commit log per Roman's suggestion
* Added Roman's reviewed-by tag
v2: * Incorporated Roman's suggestion
* Incorporated Kirill's suggestion
* Changed the subject of patch to get align with the new fix
* Added fixes tag
mm/list_lru.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 5aa6e44bc2ae..fe230081690b 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
struct list_lru_node *nlru = &lru->node[nid];
int dst_idx = dst_memcg->kmemcg_id;
struct list_lru_one *src, *dst;
- bool set;
/*
* Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -546,11 +545,12 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
dst = list_lru_from_memcg_idx(nlru, dst_idx);
list_splice_init(&src->list, &dst->list);
- set = (!dst->nr_items && src->nr_items);
- dst->nr_items += src->nr_items;
- if (set)
+
+ if (src->nr_items) {
+ dst->nr_items += src->nr_items;
memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
- src->nr_items = 0;
+ src->nr_items = 0;
+ }
spin_unlock_irq(&nlru->lock);
}
--
2.26.2
This is the start of the stable review cycle for the 4.4.247 release.
There are 24 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 03 Dec 2020 08:46:29 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.247-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.4.247-rc1
Filipe Manana <fdmanana(a)suse.com>
btrfs: fix lockdep splat when reading qgroup config on mount
Alan Stern <stern(a)rowland.harvard.edu>
USB: core: Fix regression in Hercules audio card
Johan Hovold <johan(a)kernel.org>
USB: core: add endpoint-blacklist quirk
Anand K Mistry <amistry(a)google.com>
x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpb
Alan Stern <stern(a)rowland.harvard.edu>
USB: core: Change %pK for __user pointers to %px
Masami Hiramatsu <mhiramat(a)kernel.org>
perf probe: Fix to die_entrypc() returns error correctly
Ard Biesheuvel <ardb(a)kernel.org>
efivarfs: revert "fix memory leak in efivarfs_create()"
Krzysztof Kozlowski <krzk(a)kernel.org>
nfc: s3fwrn5: use signed integer for parsing GPIO numbers
Xiongfeng Wang <wangxiongfeng2(a)huawei.com>
IB/mthca: fix return value of error branch in mthca_init_cq()
Michael Chan <michael.chan(a)broadcom.com>
bnxt_en: Release PCI regions when DMA mask setup fails during probe.
Dexuan Cui <decui(a)microsoft.com>
video: hyperv_fb: Fix the cache type when mapping the VRAM
Zhang Changzhong <zhangchangzhong(a)huawei.com>
bnxt_en: fix error return code in bnxt_init_board()
Stanley Chu <stanley.chu(a)mediatek.com>
scsi: ufs: Fix race between shutdown and runtime resume flow
Mike Christie <michael.christie(a)oracle.com>
scsi: target: iscsi: Fix cmd abort fabric stop race
Lee Duncan <lduncan(a)suse.com>
scsi: libiscsi: Fix NOP race condition
Sugar Zhang <sugar.zhang(a)rock-chips.com>
dmaengine: pl330: _prep_dma_memcpy: Fix wrong burst size
Jens Axboe <axboe(a)kernel.dk>
proc: don't allow async path resolution of /proc/self components
Brian Masney <bmasney(a)redhat.com>
x86/xen: don't unbind uninitialized lock_kicker_irq
Pablo Ceballos <pceballos(a)google.com>
HID: hid-sensor-hub: Fix issue with devices with no report ID
Hans de Goede <hdegoede(a)redhat.com>
Input: i8042 - allow insmod to succeed on devices without an i8042 controller
Frank Yang <puilp0502(a)gmail.com>
HID: cypress: Support Varmilo Keyboards' media hotkeys
Paolo Bonzini <pbonzini(a)redhat.com>
KVM: x86: Fix split-irqchip vs interrupt injection window request
Qu Wenruo <wqu(a)suse.com>
btrfs: inode: Verify inode mode to avoid NULL pointer dereference
Qu Wenruo <wqu(a)suse.com>
btrfs: tree-checker: Enhance chunk checker to validate chunk profile
-------------
Diffstat:
Makefile | 4 +--
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kernel/cpu/bugs.c | 4 +--
arch/x86/kvm/irq.c | 2 +-
arch/x86/kvm/x86.c | 18 +++++++------
arch/x86/xen/spinlock.c | 12 ++++++++-
drivers/dma/pl330.c | 2 +-
drivers/hid/hid-cypress.c | 44 +++++++++++++++++++++++++++----
drivers/hid/hid-ids.h | 2 ++
drivers/hid/hid-sensor-hub.c | 3 ++-
drivers/infiniband/hw/mthca/mthca_cq.c | 10 ++++---
drivers/input/serio/i8042.c | 12 ++++++++-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 ++-
drivers/nfc/s3fwrn5/i2c.c | 4 +--
drivers/scsi/libiscsi.c | 23 ++++++++++------
drivers/scsi/ufs/ufshcd.c | 6 +----
drivers/target/iscsi/iscsi_target.c | 17 +++++++++---
drivers/usb/core/config.c | 11 ++++++++
drivers/usb/core/devio.c | 4 +--
drivers/usb/core/quirks.c | 38 ++++++++++++++++++++++++++
drivers/usb/core/usb.h | 3 +++
drivers/video/fbdev/hyperv_fb.c | 7 ++++-
fs/btrfs/inode.c | 41 ++++++++++++++++++++++------
fs/btrfs/qgroup.c | 2 +-
fs/btrfs/tests/inode-tests.c | 1 +
fs/btrfs/volumes.c | 7 +++++
fs/efivarfs/inode.c | 2 ++
fs/efivarfs/super.c | 1 -
fs/proc/self.c | 7 +++++
include/linux/usb/quirks.h | 3 +++
include/scsi/libiscsi.h | 3 +++
tools/perf/util/dwarf-aux.c | 8 ++++++
32 files changed, 246 insertions(+), 59 deletions(-)
evm_inode_init_security() requires an HMAC key to calculate the HMAC on
initial xattrs provided by LSMs. However, it checks generically whether a
key has been loaded, including also public keys, which is not correct as
public keys are not suitable to calculate the HMAC.
Originally, support for signature verification was introduced to verify a
possibly immutable initial ram disk, when no new files are created, and to
switch to HMAC for the root filesystem. By that time, an HMAC key should
have been loaded and usable to calculate HMACs for new files.
More recently support for requiring an HMAC key was removed from the
kernel, so that signature verification can be used alone. Since this is a
legitimate use case, evm_inode_init_security() should not return an error
when no HMAC key has been loaded.
This patch fixes this problem by replacing the evm_key_loaded() check with
a check of the EVM_INIT_HMAC flag in evm_initialized.
Cc: stable(a)vger.kernel.org # 4.5.x
Fixes: 26ddabfe96b ("evm: enable EVM when X509 certificate is loaded")
Signed-off-by: Roberto Sassu <roberto.sassu(a)huawei.com>
Reviewed-by: Mimi Zohar <zohar(a)linux.ibm.com>
---
security/integrity/evm/evm_main.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/security/integrity/evm/evm_main.c b/security/integrity/evm/evm_main.c
index 76d19146d74b..001e001eae01 100644
--- a/security/integrity/evm/evm_main.c
+++ b/security/integrity/evm/evm_main.c
@@ -530,7 +530,8 @@ int evm_inode_init_security(struct inode *inode,
struct evm_xattr *xattr_data;
int rc;
- if (!evm_key_loaded() || !evm_protected_xattr(lsm_xattr->name))
+ if (!(evm_initialized & EVM_INIT_HMAC) ||
+ !evm_protected_xattr(lsm_xattr->name))
return 0;
xattr_data = kzalloc(sizeof(*xattr_data), GFP_NOFS);
--
2.27.GIT
When investigating a slab cache bloat problem, significant amount of
negative dentry cache was seen, but confusingly they neither got shrunk
by reclaimer (the host has very tight memory) nor be shrunk by dropping
cache. The vmcore shows there are over 14M negative dentry objects on lru,
but tracing result shows they were even not scanned at all. The further
investigation shows the memcg's vfs shrinker_map bit is not set. So the
reclaimer or dropping cache just skip calling vfs shrinker. So we have
to reboot the hosts to get the memory back.
I didn't manage to come up with a reproducer in test environment, and the
problem can't be reproduced after rebooting. But it seems there is race
between shrinker map bit clear and reparenting by code inspection. The
hypothesis is elaborated as below.
The memcg hierarchy on our production environment looks like:
root
/ \
system user
The main workloads are running under user slice's children, and it creates
and removes memcg frequently. So reparenting happens very often under user
slice, but no task is under user slice directly.
So with the frequent reparenting and tight memory pressure, the below
hypothetical race condition may happen:
CPU A CPU B
reparent
dst->nr_items == 0
shrinker:
total_objects == 0
add src->nr_items to dst
set_bit
retrun SHRINK_EMPTY
clear_bit
child memcg offline
replace child's kmemcg_id to
parent's (in memcg_offline_kmem())
list_lru_del() between shrinker runs
see parent's kmemcg_id
dec dst->nr_items
reparent again
dst->nr_items may go negative
due to concurrent list_lru_del()
The second run of shrinker:
read nr_items without any
synchronization, so it may
see intermediate negative
nr_items then total_objects
may return 0 conincidently
keep the bit cleared
dst->nr_items != 0
skip set_bit
add scr->nr_item to dst
After this point dst->nr_item may never go zero, so reparenting will not
set shrinker_map bit anymore. And since there is no task under user
slice directly, so no new object will be added to its lru to set the
shrinker map bit either. That bit is kept cleared forever.
How does list_lru_del() race with reparenting? It is because
reparenting replaces childen's kmemcg_id to parent's without protecting
from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
actually deleting items from child's lru, but dec'ing parent's nr_items,
so the parent's nr_items may go negative as commit
2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
free kmemcg_id on css offline") says.
Since it is impossible that dst->nr_items goes negative and
src->nr_items goes zero at the same time, so it seems we could set the
shrinker map bit iff src->nr_items != 0. We could synchronize
list_lru_count_one() and reparenting with nlru->lock, but it seems
checking src->nr_items in reparenting is the simplest and avoids lock
contention.
Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
Suggested-by: Roman Gushchin <guro(a)fb.com>
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: Kirill Tkhai <ktkhai(a)virtuozzo.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: <stable(a)vger.kernel.org> v4.19+
Signed-off-by: Yang Shi <shy828301(a)gmail.com>
---
v3: * Revised commit log per Roman's suggestion
* Added Roman's reviewed-by tag
v2: * Incorporated Roman's suggestion
* Incorporated Kirill's suggestion
* Changed the subject of patch to get align with the new fix
* Added fixes tag
mm/list_lru.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 5aa6e44bc2ae..fe230081690b 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
struct list_lru_node *nlru = &lru->node[nid];
int dst_idx = dst_memcg->kmemcg_id;
struct list_lru_one *src, *dst;
- bool set;
/*
* Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -546,11 +545,12 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
dst = list_lru_from_memcg_idx(nlru, dst_idx);
list_splice_init(&src->list, &dst->list);
- set = (!dst->nr_items && src->nr_items);
- dst->nr_items += src->nr_items;
- if (set)
+
+ if (src->nr_items) {
+ dst->nr_items += src->nr_items;
memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
- src->nr_items = 0;
+ src->nr_items = 0;
+ }
spin_unlock_irq(&nlru->lock);
}
--
2.26.2