Re: [PATCH] Revert "memcg: cleanup racy sum avoidance code"

18 Aug 2022


On Wed, Aug 17, 2022 at 05:21:39PM +0000, Shakeel Butt <shakeelb@google.com> wrote:
...
> $ grep "sock " /mnt/memory/job/memory.stat
> sock 253952
> total_sock 18446744073708724224
>
> Re-run after a couple of seconds:
>
> $ grep "sock " /mnt/memory/job/memory.stat
> sock 253952
> total_sock 53248
>
> For now we are only seeing this issue on large machines (256 CPUs) and
> only with the 'sock' stat. I think the networking stack increases the stat on
> one CPU and decreases it on another CPU much more often. So, this
> negative sock is due to the rstat flusher flushing the stats on the CPU that
> has seen the decrement of sock but missing the CPU that has the increments. A
> typical race condition.
This theory adds up :-) (Provided the numbers.)
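
For reference, the reported total_sock value can be decoded as a negative sum printed as an unsigned 64-bit number. The sketch below is only an illustration in plain userspace C, not kernel code: as_signed(), partial_sum() and the 4 KiB page size are assumptions made for the example, and the per-CPU model is a simplification of what the rstat flusher does.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages. */
#define PAGE_SIZE_BYTES 4096

/*
 * Reinterpret an unsigned 64-bit counter as signed. The conversion is
 * implementation-defined in ISO C for out-of-range values, but yields
 * the two's-complement result on common compilers.
 */
static int64_t as_signed(uint64_t raw)
{
	return (int64_t)raw;
}

/*
 * Hypothetical model of a partial flush: sum only the per-CPU deltas
 * whose CPU has already been flushed. If the flushed CPUs hold the
 * decrements and the unflushed ones hold the matching increments,
 * the intermediate sum is negative.
 */
static int64_t partial_sum(const int64_t *delta, const int *flushed,
			   size_t ncpus)
{
	int64_t sum = 0;

	for (size_t i = 0; i < ncpus; i++)
		if (flushed[i])
			sum += delta[i];
	return sum;
}
```

With these helpers, as_signed(18446744073708724224ULL) is -827392 bytes, i.e. -202 pages under the 4 KiB assumption, which is consistent with a reader observing the decrements before the matching increments.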
...
> For an easy stable backport, a revert is the simplest solution.
Sounds reasonable.
...
> For a long-term solution, I am thinking of two directions. The first is to
> reduce the race window by optimizing the rstat flusher. The second is, if
> the reader sees a negative stat value, to force a flush and restart the
> stat collection. Basically a retry, but a limited one.
Or just stick with the revert, since it already reduces the observed
error by rounding to zero in a simple way.
(Or if the imprecision was worth extra storage, use two-stage flushing
to accumulate (cpus x cgroups) and assign in two steps.)
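
To make the two options discussed above concrete, here is a minimal userspace sketch of "round negative values to zero" combined with a bounded "force flush and retry". It is not the kernel implementation: struct stat_model, force_flush(), MAX_FLUSH_RETRIES and the split between flushed and pending deltas are all hypothetical names and values chosen for illustration.

```c
#include <stdint.h>

/* Hypothetical retry bound; the thread only says "retry but limited". */
#define MAX_FLUSH_RETRIES 2

/* Toy model: what the reader sees vs. deltas not yet flushed. */
struct stat_model {
	int64_t flushed;	/* sum visible to the reader */
	int64_t pending;	/* per-CPU deltas still to be folded in */
};

/* Rounding to zero: never report a negative stat to the reader. */
static uint64_t clamp_stat(int64_t x)
{
	return x < 0 ? 0 : (uint64_t)x;
}

/* Fold all pending deltas into the visible sum. */
static void force_flush(struct stat_model *s)
{
	s->flushed += s->pending;
	s->pending = 0;
}

/*
 * Bounded retry: on a negative value, force a flush and read again,
 * at most MAX_FLUSH_RETRIES times, then fall back to clamping.
 */
static uint64_t read_with_retry(struct stat_model *s)
{
	int64_t x = s->flushed;

	for (int i = 0; i < MAX_FLUSH_RETRIES && x < 0; i++) {
		force_flush(s);
		x = s->flushed;
	}
	return clamp_stat(x);
}
```

In this model, clamping alone would report 0 for the transient negative value, while the retry variant would report the settled value at the cost of an extra flush.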
Thanks,
Michal
