On 9/20/2023 10:43 AM, Michal Hocko wrote:
On Wed 20-09-23 01:11:01, Jeremi Piotrowski wrote:
On Sun, Sep 17, 2023 at 09:12:40PM +0200, Greg Kroah-Hartman wrote:
6.1-stable review patch. If anyone has any objections, please let me know.
Hi Greg/Michal,
This commit breaks userspace, which makes it a bad commit for mainline and an even worse commit for stable.
We ingested 6.1.54 into our nightly testing and found that runc fails to gather cgroup statistics (when reading kmem.limit_in_bytes). The same code is vendored into kubelet and kubelet fails to start if this operation fails. 6.1.53 is fine.
Could you expand some more on why the file is read? It hasn't supported writing for some time, so how does reading it help in any sense?
Anyway, I do agree that the stable backport should be reverted.
This file is read together with all the other memcg files. Each prefix:
memory memory.memsw memory.kmem memory.kmem.tcp
is combined with each of these suffixes:
.usage_in_bytes .max_usage_in_bytes .failcnt .limit_in_bytes
and the resulting file is read. The values are then forwarded to other components for scheduling decisions. You want to know the limit when checking the usage (is the usage close to the limit or not?).
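As a rough illustration of the prefix × suffix scheme described above, here is a minimal Python sketch (runc itself is written in Go; `memcg_v1_stat_files` and `usage_ratio` are hypothetical helper names, not runc's API). It shows that `memory.kmem.limit_in_bytes` is just one of 16 names produced the same way, and why the limit is needed to interpret the usage.

```python
from itertools import product

# Prefixes and suffixes as listed in the mail; every combination is a
# cgroup v1 memory controller file that the consumer reads together.
PREFIXES = ["memory", "memory.memsw", "memory.kmem", "memory.kmem.tcp"]
SUFFIXES = [".usage_in_bytes", ".max_usage_in_bytes", ".failcnt", ".limit_in_bytes"]


def memcg_v1_stat_files():
    """Return every prefix+suffix file name (hypothetical helper)."""
    return [prefix + suffix for prefix, suffix in product(PREFIXES, SUFFIXES)]


def usage_ratio(usage, limit):
    """Fraction of the limit in use: the 'is usage close to the limit?' check.

    Guards against a zero limit to avoid ZeroDivisionError.
    """
    return usage / limit if limit else 0.0


files = memcg_v1_stat_files()
print(len(files))  # 16
```

Because the names are generated rather than discovered, a consumer built this way has no notion of one file in the middle of the set being optional.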
Userspace tolerates MEMCG/MEMCG_KMEM being disabled, but having a single file out of the set missing is an anomaly. So maybe we could keep the dummy file just for the sake of consistency? Cgroupv1 is legacy after all.
Address this by wiping out the file completely and effectively get back to pre 4.5 era and CONFIG_MEMCG_KMEM=n configuration.
When reading, runc detects MEMCG_KMEM=n by checking for kmem.usage_in_bytes. If that file is present, runc expects the other kmem files to be there too (including kmem.limit_in_bytes). So this change is not effectively the same as a CONFIG_MEMCG_KMEM=n configuration.
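The probe-then-read pattern described above can be sketched in Python as follows. This is an assumption-laden illustration of the behaviour as described in this thread, not runc's actual Go code; `read_kmem_stats` and `KMEM_SUFFIXES` are hypothetical names.

```python
import os

KMEM_SUFFIXES = [".usage_in_bytes", ".max_usage_in_bytes", ".failcnt", ".limit_in_bytes"]


def read_kmem_stats(cgroup_dir):
    """Hypothetical sketch of the probe-then-read pattern.

    Returns None when memory.kmem.usage_in_bytes is absent (treated like
    CONFIG_MEMCG_KMEM=n). Otherwise every kmem file is required, so a
    missing memory.kmem.limit_in_bytes raises FileNotFoundError.
    """
    probe = os.path.join(cgroup_dir, "memory.kmem.usage_in_bytes")
    if not os.path.exists(probe):
        return None  # the whole kmem group is missing: tolerated
    stats = {}
    for suffix in KMEM_SUFFIXES:
        path = os.path.join(cgroup_dir, "memory.kmem" + suffix)
        with open(path) as f:  # raises if this single file is missing
            stats[suffix.lstrip(".")] = int(f.read().strip())
    return stats
```

With no kmem files at all this returns None, with all four present it returns a dict of values, and with only memory.kmem.limit_in_bytes removed (the situation after the backport) it raises, which is the failure reported from nightly testing.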
Here's a link to the PR that would be needed to handle this change in userspace (not merged yet and would need to be propagated through the ecosystem):
Thanks. Does that mean the revert is still necessary for Linus' tree, or do you expect that the fix can be merged and propagated in a reasonable time?
We can probably get runc and currently supported kubernetes versions patched in time before 6.6 (or the next LTS kernel) hits LTS distros.
But there are still plenty of users running cgroup v1 with unsupported Kubernetes versions who take kernel updates as they come, so this might get reported again next year if it stays in mainline.