Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
Hi, Greg,
good finding. See comments below.
On 01.06.2020 06:22, Greg Thelen wrote:
Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()") a memcg aware shrinker is only called when the per-memcg per-node shrinker_map indicates that the shrinker may have objects to release to the memcg and node.
shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs shrinker which advertises per memcg and numa awareness. The shmem shrinker releases memory by splitting hugepages that extend beyond i_size.
Shmem does not currently set bits in shrinker_map. So, starting with b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under pressure. This leads to undeserved memcg OOM kills. Example that reliably sees memcg OOM kill in unpatched kernel:

  FS=/tmp/fs
  CONTAINER=/cgroup/memory/tmpfs_shrinker
  mkdir -p $FS
  mount -t tmpfs -o huge=always nodev $FS
  # Create 1000 MB container, which shouldn't suffer OOM.
  mkdir $CONTAINER
  echo 1000M > $CONTAINER/memory.limit_in_bytes
  echo $BASHPID >> $CONTAINER/cgroup.procs
  # Create 4000 files.  Ideally each file uses 4k data page + a little
  # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
  # fit within container's 1000 MB.  But if data pages use 2MB
  # hugepages (due to aggressive huge=always) then files consume 8GB,
  # which hits memcg 1000 MB limit.
  for i in {1..4000}; do
    echo . > $FS/$i
  done
v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") maintains the per-node per-memcg shrinker bitmap for THP shrinker. But there's no such logic in shmem. Make shmem set the per-memcg per-node shrinker bits when it modifies inodes to have shrinkable pages.
Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()")
Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Greg Thelen <gthelen@google.com>
---
 mm/shmem.c | 61 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 35 insertions(+), 26 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index bd8840082c94..e11090f78cb5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
         return 0;
 }
 
+/*
+ * Expose inode and optional page to shrinker as having a possibly splittable
+ * hugepage that reaches beyond i_size.
+ */
+static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
+                               struct inode *inode, struct page *page)
+{
+        struct shmem_inode_info *info = SHMEM_I(inode);
+
+        spin_lock(&sbinfo->shrinklist_lock);
+        /*
+         * _careful to defend against unlocked access to ->shrink_list in
+         * shmem_unused_huge_shrink()
+         */
+        if (list_empty_careful(&info->shrinklist)) {
+                list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
+                sbinfo->shrinklist_len++;
+        }
+        spin_unlock(&sbinfo->shrinklist_lock);
+
+#ifdef CONFIG_MEMCG
+        if (page && PageTransHuge(page))
+                memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
+                                       inode->i_sb->s_shrink.id);
+#endif
+}
+
 static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 {
         struct inode *inode = d_inode(dentry);
@@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
                          * to shrink under memory pressure.
                          */
                         if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-                                spin_lock(&sbinfo->shrinklist_lock);
-                                /*
-                                 * _careful to defend against unlocked access to
-                                 * ->shrink_list in shmem_unused_huge_shrink()
-                                 */
-                                if (list_empty_careful(&info->shrinklist)) {
-                                        list_add_tail(&info->shrinklist,
-                                                        &sbinfo->shrinklist);
-                                        sbinfo->shrinklist_len++;
-                                }
-                                spin_unlock(&sbinfo->shrinklist_lock);
+                                struct page *page;
+
+                                page = find_get_page(inode->i_mapping,
+                                        (newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT);
+                                shmem_shrinker_add(sbinfo, inode, page);
+                                if (page)
+                                        put_page(page);
1) I'd move the PageTransHuge() check from shmem_shrinker_add() to here. When the page is not transparent huge, it looks strange and completely useless to add the inode to the shrinklist and then skip memcg_set_shrinker_bit(). Nothing should be added to the shrinklist in that case.
Ack, I'll take a look at this.
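Something like the untested sketch below is what I have in mind for v2: the shmem_setattr() call site does the PageTransHuge() test, so a non-huge page never touches the shrinklist, and shmem_shrinker_add() keeps only the list handling plus memcg_set_shrinker_bit() (names as in the patch above):

        if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
                struct page *page;

                /* Page straddling the new i_size, if any. */
                page = find_get_page(inode->i_mapping,
                        (newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT);
                /*
                 * Only a huge page extending beyond i_size is worth
                 * exposing to the shrinker; otherwise skip the
                 * shrinklist entirely.
                 */
                if (page && PageTransHuge(page))
                        shmem_shrinker_add(sbinfo, inode, page);
                if (page)
                        put_page(page);
        }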
2) In general I think this "last inode page splitter" does not fit the SHRINKER_MEMCG_AWARE concept, and shmem_unused_huge_shrink() should be reworked as a new, separate !memcg-aware shrinker instead of the .nr_cached_objects callback of the generic fs shrinker.
CC: Kirill Shutemov
Kirill, are there any fundamental reasons to keep this shrinking logic in the generic fs shrinker? Are there any blockers to converting it into a separate !memcg-aware shrinker?
Making the shmem shrinker !memcg-aware seems like it would require that tail pages beyond i_size not be charged to the memcg. Otherwise memcg pressure (which only calls memcg-aware shrinkers) won't uncharge them. Currently the entire compound page is charged.
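For concreteness, I read the suggestion as roughly the hypothetical sketch below (not part of this patch; "huge_shrinker" would be a new field of struct shmem_sb_info, and error handling is elided), reusing the existing shmem_unused_huge_shrink() logic:

/* Hypothetical: a separate, !memcg-aware per-sb shrinker. */
static unsigned long shmem_huge_shrink_count(struct shrinker *shrink,
                                             struct shrink_control *sc)
{
        struct shmem_sb_info *sbinfo =
                container_of(shrink, struct shmem_sb_info, huge_shrinker);

        return READ_ONCE(sbinfo->shrinklist_len);
}

static unsigned long shmem_huge_shrink_scan(struct shrinker *shrink,
                                            struct shrink_control *sc)
{
        struct shmem_sb_info *sbinfo =
                container_of(shrink, struct shmem_sb_info, huge_shrinker);

        /* Reuse the existing hugepage splitting logic. */
        return shmem_unused_huge_shrink(sbinfo, sc, sc->nr_to_scan);
}

/* registration, e.g. at mount time in shmem_fill_super(): */
        sbinfo->huge_shrinker.count_objects = shmem_huge_shrink_count;
        sbinfo->huge_shrinker.scan_objects = shmem_huge_shrink_scan;
        sbinfo->huge_shrinker.seeks = DEFAULT_SEEKS;
        /* note: no SHRINKER_MEMCG_AWARE, so only global reclaim calls it */
        register_shrinker(&sbinfo->huge_shrinker);

But without the tail pages being uncharged, memcg-only pressure still wouldn't reach such a shrinker, which is why this patch sets the shrinker_map bits instead.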
                         }
                 }
         }
@@ -1889,21 +1912,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
         if (PageTransHuge(page) &&
                         DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
                         hindex + HPAGE_PMD_NR - 1) {
-                /*
-                 * Part of the huge page is beyond i_size: subject
-                 * to shrink under memory pressure.
-                 */
-                spin_lock(&sbinfo->shrinklist_lock);
-                /*
-                 * _careful to defend against unlocked access to
-                 * ->shrink_list in shmem_unused_huge_shrink()
-                 */
-                if (list_empty_careful(&info->shrinklist)) {
-                        list_add_tail(&info->shrinklist,
-                                        &sbinfo->shrinklist);
-                        sbinfo->shrinklist_len++;
-                }
-                spin_unlock(&sbinfo->shrinklist_lock);
+                shmem_shrinker_add(sbinfo, inode, page);
         }
 
         /*
Thanks, Kirill