Re: [PATCH] mm, memcg: cg2 memory{.swap,}.peak write handlers

16 Jul 2024

      On Tue, Jul 16, 2024 at 9:19 AM Michal Hocko mhocko@suse.com wrote:
...
On Tue 16-07-24 08:47:59, David Finkel wrote:
...
On Tue, Jul 16, 2024 at 3:20 AM Michal Hocko mhocko@suse.com wrote:
...
On Mon 15-07-24 16:46:36, David Finkel wrote:
...
...
On Mon, Jul 15, 2024 at 4:38 PM David Finkel davidf@vimeo.com wrote:
...
Other mechanisms for querying the peak memory usage of either a process
or v1 memory cgroup allow for resetting the high watermark. Restore
parity with those mechanisms.
For example:

Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
the high watermark.
Writing "5" to the clear_refs pseudo-file in a process's proc
directory resets the peak RSS.
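
The per-process mechanism can be sketched roughly as follows. This is an illustration only: the helper names and the `proc_root` parameter are hypothetical, and it assumes the peak RSS is read from the `VmHWM` line of `/proc/<pid>/status`.

```python
from pathlib import Path


def read_peak_rss_kib(status_text: str) -> int:
    """Return the VmHWM value (peak RSS, in kiB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmHWM:"):
            # The line looks like "VmHWM:      1234 kB".
            return int(line.split()[1])
    raise ValueError("no VmHWM line found")


def reset_peak_rss(pid="self", proc_root="/proc") -> None:
    """Write "5" to clear_refs, resetting the peak RSS to the current RSS.

    proc_root is a hypothetical parameter so the helper can be pointed at
    a scratch directory when experimenting.
    """
    Path(proc_root, str(pid), "clear_refs").write_text("5")
```

Typical use would be `reset_peak_rss()` before a unit of work and `read_peak_rss_kib(Path("/proc/self/status").read_text())` after it.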

This change copies the cgroup v1 behavior so any write to the
memory.peak and memory.swap.peak pseudo-files resets the high watermark
to the current usage.
This behavior is particularly useful for work scheduling systems that
need to track memory usage of worker processes/cgroups per-work-item.
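
A minimal sketch of that per-work-item pattern, assuming a kernel with this patch applied (without it, memory.peak is read-only and the open below would fail). The function name, the cgroup path in the usage note, and the `run_item` callback are hypothetical. The write and the read go through one file descriptor, which should behave the same whether the reset applies cgroup-wide, as described here, or more narrowly.

```python
import os


def measure_peak(cgroup_dir, run_item, filename="memory.peak"):
    """Reset the high watermark, run one work item, and return the new peak.

    Assumes any write to the file resets the watermark to the current
    usage, as the patch description states.
    """
    fd = os.open(os.path.join(cgroup_dir, filename), os.O_RDWR)
    try:
        os.write(fd, b"reset\n")      # the written content is not interpreted
        run_item()                    # the work whose peak we want to observe
        os.lseek(fd, 0, os.SEEK_SET)  # re-read the value from the start
        return int(os.read(fd, 64).decode().strip())
    finally:
        os.close(fd)
```

For example, `measure_peak("/sys/fs/cgroup/workers/w0", do_work)` would return the peak usage in bytes for that one item, where both the path and `do_work` are placeholders.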
Since memory can't be squeezed like CPU can (the OOM-killer has
opinions),
I do not understand the OOM-killer reference here. Why does it matter?
Could you explain please?
Sure, we're attempting to bin-pack work based on past items of the same
type. With CPU, we can provision for the mean CPU-time per-wall-time to
get a loose "cores" concept that we use for bin-packing. With CPU, if we
end up with a bit of contention, everything just gets a bit slower while
the scheduler arbitrates among cgroups.
However, with memory, you only have so much physical memory for the
outer memcg. If we pack things too tightly on memory, the OOM-killer is
going to kill something to free up memory. In some cases that's fine,
but provisioning for the peak memory for that "type" of work-item mostly
avoids this issue.
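
To make the provisioning idea concrete, here is a rough sketch of packing work items by their observed peak memory. The thread does not say which packing algorithm is actually used; first-fit decreasing is assumed here purely for illustration, and the function name and inputs are hypothetical.

```python
def pack_by_peak(peaks, capacity):
    """Group work items so the summed peak memory per group fits capacity.

    peaks maps an item name to its observed peak usage (any consistent
    unit); capacity is the memory limit of the outer memcg in that unit.
    Uses first-fit decreasing, an assumed heuristic.
    """
    bins = []  # each entry: [remaining capacity, [item names]]
    for name, peak in sorted(peaks.items(), key=lambda kv: kv[1], reverse=True):
        if peak > capacity:
            raise ValueError(f"{name} can never fit: {peak} > {capacity}")
        for entry in bins:
            if entry[0] >= peak:
                entry[0] -= peak
                entry[1].append(name)
                break
        else:
            # No existing group has room, so open a new one.
            bins.append([capacity - peak, [name]])
    return [names for _, names in bins]
```

Because each group is sized by peak rather than mean usage, the items in a group should not push the memcg over its limit even if all of them hit their peaks at once, as long as future items behave like past ones.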
It is still not clear to me how the memory reclaim falls into that. Are
your workloads mostly unreclaimable (e.g. mostly anon consumers without
any swap)? Why am I asking? Well, if the workload's memory is
reclaimable then the peak memory consumption is largely misleading
because an unknown portion of that memory consumption is hidden by the
reclaimed portion of it. This is not really specific to the write
handlers to reset the value though, so I do not want to digress from this
patch too much. I do not have objections to the patch itself. Clarifying
the usecase with your followup here would be nice.
Thanks, I'm happy to clarify things!
That's a good point about peak-RSS being unreliable if the memory's
reclaimable.
The memory is mostly unreclaimable. It's almost all anonymous mmap,
with a few local files that would be resident in the buffer cache (but
generally aren't mmapped).
We don't run with swap enabled on the systems for a few reasons.
In particular, Kubernetes disallows swap, which ties our hands, but
even if it didn't, demand paging from disk tends to stall any useful
work, so we'd rather see the OOM-killer invoked anyway.
(We actually have some plans for disabling OOM-kills in these cgroups
and letting the userspace process managing these memcgs handle
work-throttling and worker-killing when there are OOM conditions, but
that's another story :) )
...
Thanks for the clarification!
Michal Hocko
SUSE Labs
-- 
David Finkel
Senior Principal Software Engineer, Core Services
