On Tue, Apr 30, 2019 at 01:41:16PM -0700, Vaibhav Rustagi wrote:
On Wed, Apr 24, 2019 at 11:53 AM Greg KH gregkh@linuxfoundation.org wrote:
A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing? A: Top-posting. Q: What is the most annoying thing in e-mail?
A: No. Q: Should I include quotations after my reply?
http://daringfireball.net/2007/07/on_top
On Wed, Apr 24, 2019 at 10:35:51AM -0700, Vaibhav Rustagi wrote:
Apologies for sending a non-plain text e-mail previously.
This issue is encountered in actual production environments by our customers, who are constantly creating and tearing down containers (using kubernetes for the workload). Kubernetes constantly reads the memory.stat file for memory accounting, and over time (around a week) zombie memcgs accumulate, the response time for reading memory.stat increases, and customer applications get affected.
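The mechanism behind the slowdown is that a memory.stat read aggregates counters over the whole memcg subtree, so every offline ("zombie") memcg that still holds charged pages adds to the cost of each read. A toy model (plain Python, not kernel code; the counter names and sizes are made up for illustration) of why read time grows with accumulated zombies:

```python
# Toy model: a memory.stat read walks every descendant memcg, live or
# zombie, so its cost is O(total memcgs), not O(live containers).

def read_memory_stat(memcgs):
    """Simulate one memory.stat read: a full pass over all memcgs."""
    total = {"cache": 0, "rss": 0}
    for cg in memcgs:
        total["cache"] += cg["cache"]
        total["rss"] += cg["rss"]
    return total

# A handful of live cgroups...
memcgs = [{"cache": 4096, "rss": 8192} for _ in range(4)]
# ...plus the zombies: each torn-down container can leave an offline memcg
# behind while page cache is still charged to it.
memcgs += [{"cache": 4096, "rss": 0} for _ in range(100_000)]

# Every read now has to visit all 100,004 memcgs.
stats = read_memory_stat(memcgs)
```

Since kubelet performs this read frequently for every container, the per-read cost multiplies across the fleet as zombies accumulate.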
Please define "affected". Their apps still run properly, so all should be fine, it would be kubernetes that sees the slowdowns, not the application. How exactly does this show up to an end-user?
Over time, as the zombie cgroups accumulate, kubelet (the process doing the frequent memory.stat reads) becomes more CPU intensive, and all other user containers running on the same machine are starved of CPU. It affects the user containers in at least two ways that we know of: (1) Users experience liveness probe failures, where their applications do not complete in the expected amount of time.
"expected amount of time" is interesting to claim in a shared environment :)
(2) New user jobs cannot be scheduled.
Really? This slows down starting new processes? Or is this just slowing down your system overall?
There is certainly a possibility of reducing the adverse effect at the Kubernetes level as well, and we are investigating that. But the requested kernel patches help by not exacerbating the problem.
I understand this is a kernel issue, but if you see this happen, just updating to a modern kernel should be fine.
The repro steps mentioned previously were just used for testing the patches locally.
Yes, we are moving to 4.19, but we are also supporting 4.14 until Jan 2020 (so production environments will still contain the 4.14 kernel).
If you are already moving to 4.19, this seems like as good a reason as any (hint, I can give you more) to move off of 4.14 at this point in time. There's no real need to keep 4.14 around, given that you don't have any out-of-tree code in your kernels, so it should be simple to just update at the next reboot, right?
Based on past experience, major kernel upgrades sometimes introduce new regressions as well. So while we are working to roll out kernel 4.19, it may not be a practical solution for all users.
If you are not doing the exact same testing scenario for a new 4.14.y kernel release as you are for a move to 4.19.y, then your "roll out" process is broken.
Given that 4.19.y is now 6 months old, I would have expected any "new regressions" to have already been reported. Please just use a new kernel, and if you have regressions, we will work to address them.
thanks,
greg k-h