v3:
 - Take up Johannes' suggestion of just skipping the !usage case and fix
   the test_memcontrol selftest to address the rest of the min/low
   failures.
The test_memcontrol selftest consistently fails its test_memcg_low sub-test and sporadically fails its test_memcg_min sub-test. This patchset fixes the test_memcg_min and test_memcg_low failures by skipping the !usage case in shrink_node_memcgs() and adjusting the test_memcontrol selftest to address the other causes of the test failures.
Note that I decided not to use the suggested mem_cgroup_usage() call, as it is a real function defined in mm/memcontrol.c that is not available when CONFIG_MEMCG isn't defined.
Waiman Long (2):
  mm/vmscan: Skip memcg with !usage in shrink_node_memcgs()
  selftests: memcg: Increase error tolerance of child memory.current check
    in test_memcg_protection()
 mm/vmscan.c                                      |  4 ++++
 tools/testing/selftests/cgroup/test_memcontrol.c | 11 ++++++++---
 2 files changed, 12 insertions(+), 3 deletions(-)
The test_memcontrol selftest consistently fails its test_memcg_low sub-test because two of its test child cgroups, which have a memory.low of 0 or an effective memory.low of 0, still have low events generated for them, since mem_cgroup_below_low() uses the ">=" operator when comparing against elow.
The two failed use cases are as follows:
1) memory.low is set to 0, but low events can still be triggered, so the cgroup may have a non-zero low event count. I doubt users expect that, as they didn't set memory.low at all.
2) memory.low is set to a non-zero value, but the cgroup has no task in it, so its effective low value is 0. Again, it may have a non-zero low event count if memory reclaim happens. This is probably not a result expected by users, and it is really doubtful that users will check an empty cgroup with no task in it and expect some non-zero event counts.
In the first case, even though memory.low isn't set, the cgroup may still have some low protection if memory.low is set in its parent, so low events may still be recorded. The test_memcontrol.c test has to be modified to account for that.
For the second case, it really doesn't make sense to generate low events when the cgroup has 0 usage, so we need to skip this corner case in shrink_node_memcgs().
With this patch applied, the test_memcg_low sub-test finishes successfully in most runs, though both the test_memcg_low and test_memcg_min sub-tests may still fail occasionally if the memory.current values fall outside of the expected ranges.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/vmscan.c                                      | 4 ++++
 tools/testing/selftests/cgroup/test_memcontrol.c | 7 ++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b620d74b0f66..2a2957b9dc99 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5963,6 +5963,10 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
 		mem_cgroup_calculate_protection(target_memcg, memcg);
 
+		/* Skip memcg with no usage */
+		if (!page_counter_read(&memcg->memory))
+			continue;
+
 		if (mem_cgroup_below_min(target_memcg, memcg)) {
 			/*
 			 * Hard protection.
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 16f5d74ae762..bab826b6b7b0 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -525,8 +525,13 @@ static int test_memcg_protection(const char *root, bool min)
 		goto cleanup;
 	}
 
+	/*
+	 * Child 2 has memory.low=0, but some low protection is still being
+	 * distributed down from its parent with memory.low=50M. So the low
+	 * event count will be non-zero.
+	 */
 	for (i = 0; i < ARRAY_SIZE(children); i++) {
-		int no_low_events_index = 1;
+		int no_low_events_index = 2;
 		long low, oom;
 
 		oom = cg_read_key_long(children[i], "memory.events", "oom ");
Hi Waiman,
kernel test robot noticed the following build errors:
[auto build test ERROR on tj-cgroup/for-next]
[also build test ERROR on akpm-mm/mm-everything linus/master v6.14 next-20250404]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url:    https://github.com/intel-lab-lkp/linux/commits/Waiman-Long/mm-vmscan-Skip-me...
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20250406024010.1177927-2-longman%40redhat.com
patch subject: [PATCH v3 1/2] mm/vmscan: Skip memcg with !usage in shrink_node_memcgs()
config: arc-randconfig-002-20250406 (https://download.01.org/0day-ci/archive/20250406/202504061257.GMkEJUOs-lkp@i...)
compiler: arc-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250406/202504061257.GMkEJUOs-lkp@i...)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504061257.GMkEJUOs-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/vmscan.c: In function 'shrink_node_memcgs':
mm/vmscan.c:5929:46: error: invalid use of undefined type 'struct mem_cgroup'
    5929 |                 if (!page_counter_read(&memcg->memory))
         |                                              ^~
vim +5929 mm/vmscan.c
  5890	
  5891	static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
  5892	{
  5893		struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
  5894		struct mem_cgroup_reclaim_cookie reclaim = {
  5895			.pgdat = pgdat,
  5896		};
  5897		struct mem_cgroup_reclaim_cookie *partial = &reclaim;
  5898		struct mem_cgroup *memcg;
  5899	
  5900		/*
  5901		 * In most cases, direct reclaimers can do partial walks
  5902		 * through the cgroup tree, using an iterator state that
  5903		 * persists across invocations. This strikes a balance between
  5904		 * fairness and allocation latency.
  5905		 *
  5906		 * For kswapd, reliable forward progress is more important
  5907		 * than a quick return to idle. Always do full walks.
  5908		 */
  5909		if (current_is_kswapd() || sc->memcg_full_walk)
  5910			partial = NULL;
  5911	
  5912		memcg = mem_cgroup_iter(target_memcg, NULL, partial);
  5913		do {
  5914			struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
  5915			unsigned long reclaimed;
  5916			unsigned long scanned;
  5917	
  5918			/*
  5919			 * This loop can become CPU-bound when target memcgs
  5920			 * aren't eligible for reclaim - either because they
  5921			 * don't have any reclaimable pages, or because their
  5922			 * memory is explicitly protected. Avoid soft lockups.
  5923			 */
  5924			cond_resched();
  5925	
  5926			mem_cgroup_calculate_protection(target_memcg, memcg);
  5927	
  5928			/* Skip memcg with no usage */
> 5929			if (!page_counter_read(&memcg->memory))
  5930				continue;
  5931	
  5932			if (mem_cgroup_below_min(target_memcg, memcg)) {
  5933				/*
  5934				 * Hard protection.
  5935				 * If there is no reclaimable memory, OOM.
  5936				 */
  5937				continue;
  5938			} else if (mem_cgroup_below_low(target_memcg, memcg)) {
  5939				/*
  5940				 * Soft protection.
  5941				 * Respect the protection only as long as
  5942				 * there is an unprotected supply
  5943				 * of reclaimable memory from other cgroups.
  5944				 */
  5945				if (!sc->memcg_low_reclaim) {
  5946					sc->memcg_low_skipped = 1;
  5947					continue;
  5948				}
  5949				memcg_memory_event(memcg, MEMCG_LOW);
  5950			}
  5951	
  5952			reclaimed = sc->nr_reclaimed;
  5953			scanned = sc->nr_scanned;
  5954	
  5955			shrink_lruvec(lruvec, sc);
  5956	
  5957			shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
  5958				    sc->priority);
  5959	
  5960			/* Record the group's reclaim efficiency */
  5961			if (!sc->proactive)
  5962				vmpressure(sc->gfp_mask, memcg, false,
  5963					   sc->nr_scanned - scanned,
  5964					   sc->nr_reclaimed - reclaimed);
  5965	
  5966			/* If partial walks are allowed, bail once goal is reached */
  5967			if (partial && sc->nr_reclaimed >= sc->nr_to_reclaim) {
  5968				mem_cgroup_iter_break(target_memcg, memcg);
  5969				break;
  5970			}
  5971		} while ((memcg = mem_cgroup_iter(target_memcg, memcg, partial)));
  5972	}
  5973	
Hi Waiman,
kernel test robot noticed the following build errors:
[auto build test ERROR on tj-cgroup/for-next]
[also build test ERROR on akpm-mm/mm-everything linus/master v6.14 next-20250404]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url:    https://github.com/intel-lab-lkp/linux/commits/Waiman-Long/mm-vmscan-Skip-me...
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20250406024010.1177927-2-longman%40redhat.com
patch subject: [PATCH v3 1/2] mm/vmscan: Skip memcg with !usage in shrink_node_memcgs()
config: arm-randconfig-001-20250406 (https://download.01.org/0day-ci/archive/20250406/202504061254.DqfqHfM7-lkp@i...)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project 92c93f5286b9ff33f27ff694d2dc33da1c07afdd)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250406/202504061254.DqfqHfM7-lkp@i...)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504061254.DqfqHfM7-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/vmscan.c:5929:32: error: incomplete definition of type 'struct mem_cgroup'
    5929 |                 if (!page_counter_read(&memcg->memory))
         |                                        ~~~~~^
   include/linux/mm_types.h:33:8: note: forward declaration of 'struct mem_cgroup'
      33 | struct mem_cgroup;
         |        ^
   1 error generated.
The test_memcg_protection() function is used for the test_memcg_min and test_memcg_low sub-tests. This function generates a set of parent/child cgroups like:
  parent:  memory.min/low = 50M
  child 0: memory.min/low = 75M, memory.current = 50M
  child 1: memory.min/low = 25M, memory.current = 50M
  child 2: memory.min/low = 0,   memory.current = 50M
After applying memory pressure, the function expects the following actual memory usages.
  parent:  memory.current ~= 50M
  child 0: memory.current ~= 29M
  child 1: memory.current ~= 21M
  child 2: memory.current ~= 0
In reality, the actual memory usages can differ quite a bit from the expected values. The test uses an error tolerance of 10% with the values_close() helper.
Both the test_memcg_min and test_memcg_low sub-tests can fail sporadically because the actual memory usage exceeds the 10% error tolerance. Below is a sample of the usage data from failing test runs:
  Child  Actual usage  Expected usage   %err
  -----  ------------  --------------   ----
    1      16990208       22020096     -12.9%
    1      17252352       22020096     -12.1%
    0      37699584       30408704     +10.7%
    1      14368768       22020096     -21.0%
    1      16871424       22020096     -13.2%
The current 10% error tolerance might have been right when test_memcontrol.c was first introduced in the v4.18 kernel, but memory reclaim has certainly evolved quite a bit since then, which may result in a bit more run-to-run variation than previously expected.
Increase the error tolerance to 15% for child 0 and 20% for child 1 to minimize the chance of this type of failure. The tolerance is bigger for child 1 because an upswing in child 0 corresponds to a smaller %err than a similar downswing in child 1 due to the way %err is used in values_close().
Before this patch, 100 test runs of test_memcontrol produced the following results:
     17 not ok 1 test_memcg_min
     22 not ok 2 test_memcg_low
After applying this patch, there were no failures of test_memcg_min or test_memcg_low in 100 test runs.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 tools/testing/selftests/cgroup/test_memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index bab826b6b7b0..8f4f2479650e 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -495,10 +495,10 @@ static int test_memcg_protection(const char *root, bool min)
 	for (i = 0; i < ARRAY_SIZE(children); i++)
 		c[i] = cg_read_long(children[i], "memory.current");
 
-	if (!values_close(c[0], MB(29), 10))
+	if (!values_close(c[0], MB(29), 15))
 		goto cleanup;
 
-	if (!values_close(c[1], MB(21), 10))
+	if (!values_close(c[1], MB(21), 20))
 		goto cleanup;
 
 	if (c[3] != 0)
On Sat, Apr 05, 2025 at 10:40:10PM -0400, Waiman Long wrote:
Ideally we want to calculate these values dynamically based on the machine size (number of cpus and total memory size).
We can calculate the memcg error margin and scale memcg sizes if necessary. It's the only way to make it pass on both a 2-CPU VM and a 512-CPU physical server.
Not a blocker for this patch, just an idea for the future.
Thanks!
On 4/8/25 6:22 PM, Roman Gushchin wrote:
Ideally we want to calculate these values dynamically based on the machine size (number of cpus and total memory size).
We can calculate the memcg error margin and scale memcg sizes if necessary. It's the only way to make it pass on both a 2-CPU VM and a 512-CPU physical server.
Not a blocker for this patch, just an idea for the future.
Thanks for the suggestion.
As I said in a previous mail, the way the test works is by waiting until the memory.current of the parent is close to 50M, then checking the memory.current values of its children to see how much usage each of them has. I am not sure the number of CPUs or the total memory size is really a factor here; we will probably need to run some experiments to find out. Anyway, that can be a future patch if they turn out to be a factor.
Cheers, Longman