On Fri, Jun 07, 2024 at 03:48:06PM +0200, Jesper Dangaard Brouer wrote:
From: Shakeel Butt <shakeelb@google.com>
commit d4a5b369ad6d8aae552752ff438dddde653a72ec upstream.
One of our workloads (Postgres 14 + sysbench OLTP) regressed on newer upstream kernel and on further investigation, it seems like the cause is the always synchronous rstat flush in the count_shadow_nodes() added by the commit f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats"). On further inspection it seems like we don't really need accurate stats in this function as it was already approximating the amount of appropriate shadow entries to keep for maintaining the refault information. Since there is already 2 sec periodic rstat flush, we don't need exact stats here. Let's ratelimit the rstat flush in this code path.
Link: https://lkml.kernel.org/r/20231228073055.4046430-1-shakeelb@google.com
Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
In production with kernel v6.6 we are observing excessive cgroup rstat flushing caused by the extra call to mem_cgroup_flush_stats() in count_shadow_nodes(), introduced in commit f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats"). That commit is part of v6.6. We request a backport of commit d4a5b369ad6d ("mm: ratelimit stat flush from workingset shrinker"), as it has a Fixes tag for this commit.
IMHO it is worth explaining the call path that makes count_shadow_nodes() cause excessive cgroup rstat flushing. Function shrink_node() first calls mem_cgroup_flush_stats() on its own, and then invokes shrink_node_memcgs(). Function shrink_node_memcgs() iterates over cgroups via mem_cgroup_iter(), calling shrink_slab() for each one. shrink_slab() calls do_shrink_slab(), which via shrinker->count_objects() invokes count_shadow_nodes(), and count_shadow_nodes() does another mem_cgroup_flush_stats() call, which seems unnecessary.
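To make the cost of that call path concrete, here is a rough userspace model of it. It is only a sketch and not kernel code: the struct, the model_* function names, and the tick values are all hypothetical, and the ratelimit check is simplified to a single "next allowed flush time" comparison. It counts how many flushes one shrink_node() pass would trigger across N cgroups, with and without ratelimiting in the count_shadow_nodes() step.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model only: names echo the kernel functions, nothing here is kernel API. */

#define MODEL_FLUSH_PERIOD 2000 /* assumed ticks standing in for the 2 sec periodic flush */

struct model {
    uint64_t now;             /* current time in ticks */
    uint64_t flush_next_time; /* ratelimited callers skip flushing until after this time */
    unsigned long flushes;    /* flushes performed so far */
};

/* Unconditional flush: always pays the cost, then pushes out the next deadline. */
static void model_flush_stats(struct model *m)
{
    m->flushes++;
    m->flush_next_time = m->now + 2 * MODEL_FLUSH_PERIOD;
}

/* Ratelimited flush: only flushes when the last flush is considered stale. */
static void model_flush_stats_ratelimited(struct model *m)
{
    if (m->now > m->flush_next_time)
        model_flush_stats(m);
}

/* Stand-in for count_shadow_nodes(): one flush request per invocation. */
static void model_count_shadow_nodes(struct model *m, bool ratelimited)
{
    if (ratelimited)
        model_flush_stats_ratelimited(m);
    else
        model_flush_stats(m);
}

/*
 * Stand-in for shrink_node(): one flush up front, then one
 * count_shadow_nodes() call per cgroup visited by the iteration.
 * Returns the number of flushes performed during this pass.
 */
static unsigned long model_shrink_node(struct model *m, int nr_cgroups,
                                       bool ratelimited)
{
    m->flushes = 0;
    model_flush_stats(m);
    for (int i = 0; i < nr_cgroups; i++)
        model_count_shadow_nodes(m, ratelimited);
    return m->flushes;
}
```

In this model, a pass over 100 cgroups starting from a zeroed struct performs 101 flushes without ratelimiting and 1 flush with it, because the flush at the top of the pass has just refreshed the deadline. The real kernel has more moving parts (other shrinkers, concurrent flushers, the periodic worker), so treat the numbers as illustrative only.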
The backport differs slightly because v6.6.32 does not contain commit 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing") from v6.8.
Now queued up, thanks.
greg k-h