From: Hui Zhu <zhuhui@kylinos.cn>
This series adds BPF struct_ops support to the memory controller, enabling dynamic control over memory pressure through the memcg_nr_pages_over_high mechanism. This allows administrators to suppress low-priority cgroups' memory usage based on custom policies implemented in BPF programs.
Background and Motivation
The memory controller provides the memory.high limit, which throttles a cgroup once its usage exceeds that limit. However, the current implementation applies the same throttling policy to every cgroup, with no way to account for priority or workload characteristics.
This series introduces a BPF hook that allows reporting additional "pages over high" for specific cgroups, effectively increasing memory pressure and throttling for lower-priority workloads when higher-priority cgroups need resources.
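For illustration, a policy program for this hook could look roughly like the sketch below. The struct_ops type and hook name follow this series; the hook signature, the cgroup-id comparison, and the low_prio_cgrp_id/high_prio_under_pressure globals are illustrative assumptions rather than the exact interface from PATCH 1/3:

/* Sketch only: hook signature and field accesses are assumptions. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

const volatile u64 low_prio_cgrp_id;	/* assumed: set by the loader */
bool high_prio_under_pressure;		/* assumed: set by a tracepoint prog */

SEC("struct_ops/memcg_nr_pages_over_high")
unsigned long BPF_PROG(nr_pages_over_high, struct mem_cgroup *memcg)
{
	/* Report extra "pages over high" only for the low-priority cgroup,
	 * and only while the high-priority cgroup is under pressure. */
	if (!high_prio_under_pressure)
		return 0;
	if (memcg->css.cgroup->kn->id != low_prio_cgrp_id)
		return 0;
	return 1024;	/* additional pages over high to report */
}

SEC(".struct_ops.link")
struct memcg_bpf_ops prio_ops = {
	.memcg_nr_pages_over_high = (void *)nr_pages_over_high,
};

char _license[] SEC("license") = "GPL";

A loader would then attach the struct_ops map with libbpf's bpf_map__attach_struct_ops(), much like the sample in PATCH 3/3.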
Use Case: Priority-Based Memory Management
Consider a system running both latency-sensitive services and batch processing workloads. When the high-priority service experiences memory pressure (detected via page scan events), the BPF program can artificially inflate the "over high" count for low-priority cgroups, causing them to be throttled more aggressively and freeing up memory for the critical workload.
Implementation
This series builds upon Roman Gushchin's BPF OOM patch series in [1].
The implementation adds:

1. A memcg_bpf_ops struct_ops type with a memcg_nr_pages_over_high hook
2. Integration into the memory pressure calculation paths
3. Cgroup hierarchy management (inheritance during online/offline)
4. SRCU protection for safe concurrent access
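Conceptually, the read side folds the BPF-reported pages into the existing over-high accounting under SRCU, along the lines of the simplified sketch below (the memcg_bpf_srcu domain and the bpf_ops pointer on struct mem_cgroup are assumed names; the actual code lives in PATCH 1/3):

/* Simplified sketch of an SRCU-protected call site, not the code from
 * mm/bpf_memcontrol.c; names are assumptions noted above. */
static unsigned long bpf_memcg_nr_pages_over_high(struct mem_cgroup *memcg)
{
	struct memcg_bpf_ops *ops;
	unsigned long extra = 0;
	int idx;

	idx = srcu_read_lock(&memcg_bpf_srcu);
	ops = srcu_dereference(memcg->bpf_ops, &memcg_bpf_srcu);
	if (ops && ops->memcg_nr_pages_over_high)
		extra = ops->memcg_nr_pages_over_high(memcg);
	srcu_read_unlock(&memcg_bpf_srcu, idx);

	return extra;
}

SRCU keeps this read path cheap and sleep-tolerant while still letting the struct_ops map be detached or replaced concurrently.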
Why Not PSI?
This implementation does not use PSI for triggering, as discussed in [2]. Instead, the sample code monitors PGSCAN events via tracepoints, which provides more direct feedback on memory pressure.
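As a rough illustration of that approach, a tracepoint program can accumulate scan counts and flip the pressure flag consumed by the struct_ops sketch above. The tracepoint choice, argument list, and threshold here are assumptions; the actual trigger logic is in the PATCH 3/3 sample:

/* Sketch: count reclaim scanning and signal pressure. The real sample
 * attributes scanning to the high-priority cgroup; this version does not. */
u64 total_scanned;

SEC("tp_btf/mm_vmscan_lru_shrink_inactive")
int BPF_PROG(on_lru_shrink, int nid, unsigned long nr_scanned)
{
	total_scanned += nr_scanned;
	if (total_scanned > 1024)	/* arbitrary demo threshold */
		high_prio_under_pressure = true;
	return 0;
}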
Example Results
Testing on x86_64 QEMU (10 CPU, 4GB RAM, cache=none swap):

root@ubuntu:~# cat /proc/sys/vm/swappiness
60
root@ubuntu:~# mkdir /sys/fs/cgroup/high
root@ubuntu:~# mkdir /sys/fs/cgroup/low
root@ubuntu:~# ./memcg /sys/fs/cgroup/low /sys/fs/cgroup/high 100 1024
Successfully attached!
root@ubuntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes 80% \
        --vm-method all --seed 2025 --metrics -t 60 \
        & cgexec -g memory:high stress-ng --vm 4 --vm-keep --vm-bytes 80% \
        --vm-method all --seed 2025 --metrics -t 60
[1] 1075
stress-ng: info:  [1075] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [1076] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [1075] dispatching hogs: 4 vm
stress-ng: info:  [1076] dispatching hogs: 4 vm
stress-ng: metrc: [1076] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [1076]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [1076] vm             21033377     60.47    158.04      3.66    347825.55      130076.67        66.85        834836
stress-ng: info:  [1076] skipped: 0
stress-ng: info:  [1076] passed: 4: vm (4)
stress-ng: info:  [1076] failed: 0
stress-ng: info:  [1076] metrics untrustworthy: 0
stress-ng: info:  [1076] successful run completed in 1 min, 0.72 secs
root@ubuntu:~#
stress-ng: metrc: [1075] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [1075]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [1075] vm                11568     65.05      0.00      0.21       177.83       56123.74         0.08          3200
stress-ng: info:  [1075] skipped: 0
stress-ng: info:  [1075] passed: 4: vm (4)
stress-ng: info:  [1075] failed: 0
stress-ng: info:  [1075] metrics untrustworthy: 0
stress-ng: info:  [1075] successful run completed in 1 min, 5.06 secs
Results show the low-priority cgroup (/sys/fs/cgroup/low) was significantly throttled:
- High-priority cgroup: 21,033,377 bogo ops at 347,825 ops/s
- Low-priority cgroup: 11,568 bogo ops at 177 ops/s
The stress-ng process in the low-priority cgroup experienced a ~99.9% slowdown in memory operations compared to the high-priority cgroup, demonstrating effective priority enforcement through BPF-controlled memory pressure.
Patch Overview
PATCH 1/3: Core kernel implementation
- Adds memcg_bpf_ops struct_ops support
- Implements cgroup lifecycle management
- Integrates the hook into the pressure calculation

PATCH 2/3: Selftest suite
- Validates attach/detach behavior
- Tests hierarchy inheritance
- Verifies throttling effectiveness

PATCH 3/3: Sample programs
- Demonstrates PGSCAN-based triggering
- Shows priority-based throttling
- Provides a reference implementation
Changelog:
v2:
- Per Tejun Heo's comments, rebased on Roman Gushchin's BPF OOM patch
  series [1] and added hierarchical delegation support.
- Per the comments of Roman Gushchin and Michal Hocko, designed
  concrete use case scenarios and provided test results.
[1] https://lore.kernel.org/lkml/20251027231727.472628-1-roman.gushchin@linux.de...
[2] https://lore.kernel.org/lkml/1d9a162605a3f32ac215430131f7745488deaa34@linux....
Hui Zhu (3):
  mm: memcontrol: Add BPF struct_ops for memory pressure control
  selftests/bpf: Add tests for memcg_bpf_ops
  samples/bpf: Add memcg priority control example
 MAINTAINERS                                           |   5 +
 include/linux/memcontrol.h                            |   2 +
 mm/bpf_memcontrol.c                                   | 241 ++++++++++++-
 mm/bpf_memcontrol.h                                   |  73 ++++
 mm/memcontrol.c                                       |  27 +-
 samples/bpf/.gitignore                                |   1 +
 samples/bpf/Makefile                                  |   9 +-
 samples/bpf/memcg.bpf.c                               |  95 +++++
 samples/bpf/memcg.c                                   | 204 +++++++++++
 .../selftests/bpf/prog_tests/memcg_ops.c              | 340 ++++++++++++++++++
 .../selftests/bpf/progs/memcg_ops_over_high.c         |  95 +++++
 11 files changed, 1082 insertions(+), 10 deletions(-)
 create mode 100644 mm/bpf_memcontrol.h
 create mode 100644 samples/bpf/memcg.bpf.c
 create mode 100644 samples/bpf/memcg.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
 create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops_over_high.c