Yang Shi <shy828301@gmail.com> wrote:
On Sun, May 31, 2020 at 8:22 PM Greg Thelen <gthelen@google.com> wrote:
Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()") a memcg aware shrinker is only called when the per-memcg per-node shrinker_map indicates that the shrinker may have objects to release to the memcg and node.
shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs shrinker, which advertises per-memcg and NUMA awareness. The shmem shrinker releases memory by splitting hugepages that extend beyond i_size.
Shmem does not currently set bits in shrinker_map. So, starting with b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under pressure. This leads to undeserved memcg OOM kills.

Example that reliably sees memcg OOM kill in unpatched kernel:

    FS=/tmp/fs
    CONTAINER=/cgroup/memory/tmpfs_shrinker
    mkdir -p $FS
    mount -t tmpfs -o huge=always nodev $FS
    # Create 1000 MB container, which shouldn't suffer OOM.
    mkdir $CONTAINER
    echo 1000M > $CONTAINER/memory.limit_in_bytes
    echo $BASHPID >> $CONTAINER/cgroup.procs
    # Create 4000 files. Ideally each file uses 4k data page + a little
    # metadata. Assume 8k total per-file, 32MB (4000*8k) should easily
    # fit within container's 1000 MB. But if data pages use 2MB
    # hugepages (due to aggressive huge=always) then files consume 8GB,
    # which hits memcg 1000 MB limit.
    for i in {1..4000}; do
        echo . > $FS/$i
    done
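The arithmetic in the reproducer's comment can be checked directly. The following is only a rough sketch: the variable names are invented for illustration, and the 8k per-file figure is the assumption stated in the comment, not a measurement.

```sh
# Back-of-the-envelope check of the footprint estimates in the reproducer.
files=4000
small_kb=8             # assumed per-file cost with a 4k data page plus metadata
huge_kb=$((2 * 1024))  # one 2MB hugepage per file under huge=always
limit_mb=1000          # memory.limit_in_bytes used by the reproducer

small_total_mb=$((files * small_kb / 1024))
huge_total_mb=$((files * huge_kb / 1024))

echo "4k pages:  ${small_total_mb} MB against a ${limit_mb} MB limit"
echo "hugepages: ${huge_total_mb} MB against a ${limit_mb} MB limit"
```

With 1024-based units and integer division the small case comes out slightly under the 32MB quoted in the comment, and the hugepage case is roughly 8GB, about eight times the memcg limit.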
It looks like all the inodes that have a tail THP beyond i_size are on one single list, and the shrinker just splits the first nr_to_scan inodes. But since the list is not memcg aware, it seems it may split THPs that are not charged to the victim memcg, and the victim memcg may still suffer a premature OOM, right?
Correct. shmem_unused_huge_shrink() is not memcg aware. In response to memcg pressure it will split the post-i_size tails of nr_to_scan tmpfs inodes regardless of whether they're charged to the under-pressure memcg. do_shrink_slab() looks like it'll repeatedly call shmem_unused_huge_shrink(), so it will split the tails of many inodes. So I think it'll avoid the OOM by over-shrinking. This is not ideal, but it seems better than an undeserved OOM kill.
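To make the over-shrinking concrete, here is a toy model in shell. It is not kernel code. The list contents, the batch size, and the "loop until the victim has nothing left" stopping rule are all assumptions made for illustration; the real do_shrink_slab() loop is driven by the shrinker's object counts, not by this condition.

```sh
# Toy model: each word is the memcg that one inode's huge tail is charged to.
# The list is global, so entries from unrelated memcgs are mixed in.
shrinklist="other other other victim other victim other other victim"
nr_to_scan=2      # assumed batch size per shrinker call
calls=0
victim_split=0
other_split=0

# True while the list still holds an entry charged to the victim memcg.
has_victim() {
    case " $1 " in
        *" victim "*) return 0 ;;
        *) return 1 ;;
    esac
}

while has_victim "$shrinklist"; do
    calls=$((calls + 1))
    n=0
    rest=""
    for owner in $shrinklist; do
        if [ "$n" -lt "$nr_to_scan" ]; then
            # Split this tail no matter which memcg it is charged to.
            if [ "$owner" = victim ]; then
                victim_split=$((victim_split + 1))
            else
                other_split=$((other_split + 1))
            fi
            n=$((n + 1))
        else
            rest="$rest $owner"
        fi
    done
    shrinklist=$rest
done

echo "calls=$calls victim_split=$victim_split other_split=$other_split"
```

In this model the victim's three tails are eventually split, but only after six tails belonging to other memcgs have been split along the way.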
I think the solution Kirill Tkhai suggested, a memcg-aware index, would solve both:
1) avoid premature OOM by registering the shrinker to respond to memcg pressure
2) avoid shrinking/splitting inodes unrelated to the under-pressure memcg
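As a rough illustration of point 2, the toy model below tags each entry with its memcg and has the shrinker touch only entries charged to the memcg under pressure. It is not kernel code: the "memcg:inode" encoding and the shrink_memcg helper are hypothetical. A real implementation would keep per-memcg lists rather than filter one list.

```sh
# Toy model: "memcg:inode" entries for inodes with a huge tail past i_size.
entries="other:1 other:2 victim:3 other:4 victim:5 other:6"

# shrink_memcg MEMCG NR (hypothetical helper): split up to NR entries charged
# to MEMCG. Leaves the split entries in $split and the remainder in $entries.
shrink_memcg() {
    want=$1
    nr=$2
    split=""
    rest=""
    n=0
    for e in $entries; do
        owner=${e%%:*}
        if [ "$n" -lt "$nr" ] && [ "$owner" = "$want" ]; then
            split="$split $e"
            n=$((n + 1))
        else
            rest="$rest $e"
        fi
    done
    split=${split# }
    entries=${rest# }
}

shrink_memcg victim 2
echo "split:     $split"
echo "untouched: $entries"
```

Here a single call with nr_to_scan of 2 splits only the victim's two entries and leaves every entry charged to the other memcg alone.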
I can certainly look into that (thanks Kirill for the pointers). In the short term I'm still interested in avoiding premature OOMs with the original thread (i.e. restore pre-4.19 behavior to shmem shrinker for memcg pressure). I plan to test and repost v2.