Re: [PATCH] Partially revert "mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones"

20 Aug 2019


      On Tue, Aug 20, 2019 at 5:20 AM Roman Gushchin guro@fb.com wrote:
...
On Sun, Aug 18, 2019 at 08:30:15AM +0800, Yafang Shao wrote:
...
On Sun, Aug 18, 2019 at 3:14 AM Roman Gushchin guro@fb.com wrote:
...
On Sat, Aug 17, 2019 at 11:33:57AM +0800, Yafang Shao wrote:
...
On Sat, Aug 17, 2019 at 8:47 AM Roman Gushchin guro@fb.com wrote:
...
Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync
with the hierarchical ones") effectively decreased the precision of
per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.
That's good for displaying in memory.stat, but brings a serious regression
into the reclaim process.
One issue I've discovered and debugged is the following:
lruvec_lru_size() can return 0 instead of the actual number of pages
in the lru list, preventing the kernel to reclaim last remaining
pages. Result is yet another dying memory cgroups flooding.
The opposite is also happening: scanning an empty lru list
is the waste of cpu time.
Also, inactive_list_is_low() can return incorrect values, preventing
the active lru from being scanned and freed. It can fail both because
the size of active and inactive lists are inaccurate, and because
the number of workingset refaults isn't precise. In other words,
the result is pretty random.
I'm not sure, if using the approximate number of slab pages in
count_shadow_number() is acceptable, but issues described above
are enough to partially revert the patch.
Let's keep per-memcg vmstat_local batched (they are only used for
displaying stats to the userspace), but keep lruvec stats precise.
This change fixes the dead memcg flooding on my setup.
That will make some misunderstanding if the local counters are not in
sync with the hierarchical ones
(someone may doubt whether there're something leaked.).
Sure, but the actual leakage is a much more serious issue.
...
If we have to do it like this, I think we should better document this behavior.
Lru size calculations can be done using per-zone counters, which is
actually cheaper, because the number of zones is usually smaller than
the number of cpus. I'll send a corresponding patch on Monday.
Looks like a good idea.
...
Maybe other use cases can also be converted?
We'd better keep the behavior the same across counters. I think you
can have a try.
As I said, consistency of counters is important, but not nearly as important
as the real behavior of the system. Especially because we talk about
per-node memcg statistics, which I believe is mostly used for debugging.
So for now I think the right thing to do is to revert the change to fix
the memory reclaim process. And then we can discuss how to get counters
right.
Sure.
Thanks
Yafang

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH] Partially revert "mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones"