This patch series addresses two issues in demote_folio_list() and can_demote() in the reclaim/demotion path.
Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim") introduces the cpuset.mems_effective check and applies it to can_demote(). However:
1. It does not apply this check in demote_folio_list(), which leads to situations where pages are demoted to nodes that are explicitly excluded from the task's cpuset.mems.
2. In can_demote(), it checks only the nodes in the immediate next demotion hierarchy rather than all allowed demotion targets. This can cause pages to never be demoted if the nodes in the next demotion hierarchy are not set in mems_effective.
To address these bugs, implement a new function mem_cgroup_filter_mems_allowed() to filter out nodes that are not set in mems_effective, and update demote_folio_list() and can_demote() accordingly.
Reproducing Bug 1: Assume a system with 4 nodes, where nodes 0-1 are top-tier and nodes 2-3 are far-tier memory. All nodes have equal capacity.
Test script to reproduce:

  echo 1 > /sys/kernel/mm/numa/demotion_enabled
  mkdir /sys/fs/cgroup/test
  echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
  echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
  echo $$ > /sys/fs/cgroup/test/cgroup.procs
  swapoff -a

  # Expectation: Allocations should respect the node 0-2 limit.
  # Observation: Node 3 shows significant allocation (its MemFree drops).
  stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1
Reproducing Bug 2: Assume a system with 6 nodes, where nodes 0-2 are top-tier, node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes. All nodes have equal capacity.
Test script:

  echo 1 > /sys/kernel/mm/numa/demotion_enabled
  mkdir /sys/fs/cgroup/test
  echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
  echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
  echo $$ > /sys/fs/cgroup/test/cgroup.procs
  swapoff -a

  # Expectation: Pages are demoted to nodes 4-5.
  # Observation: No pages are demoted before the OOM kill.
  stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2
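For both cases, per-node free memory can be watched from a second shell while stress-ng runs. This snippet is illustrative and not part of the posted scripts; it only relies on the standard sysfs per-node meminfo files:

  # Refresh per-node MemFree once a second; on an affected kernel, Bug 1
  # shows node 3 draining even though it is outside cpuset.mems.
  watch -n 1 'grep MemFree /sys/devices/system/node/node*/meminfo'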
Bing Jiao (2):
  mm/vmscan: respect mems_effective in demote_folio_list()
  mm/vmscan: check all allowed targets in can_demote()
 include/linux/cpuset.h     |  6 +++---
 include/linux/memcontrol.h |  6 +++---
 kernel/cgroup/cpuset.c     | 12 +++++-------
 mm/memcontrol.c            |  5 +++--
 mm/vmscan.c                | 27 ++++++++++++++++++---------
 5 files changed, 32 insertions(+), 24 deletions(-)
-- 
2.52.0.351.gbe84eed79e-goog
Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim") introduces the cpuset.mems_effective check and applies it to can_demote(). However, it does not apply this check in demote_folio_list().
This omission leads to situations where pages are demoted to nodes that are explicitly excluded from the task's cpuset.mems. The impact is two-fold:
1. Resource Isolation: This bug breaks the resource isolation provided by cpuset.mems. It allows pages to be demoted to nodes that are dedicated to other tasks or are intended for hot-unplugging.
2. Performance Issue: In multi-tier systems, users use cpuset.mems to bind tasks to specific performance tiers (e.g., avoiding the slowest tiers for latency-sensitive data). This bug can cause unexpected latency spikes if pages are demoted to the farthest nodes.
To address the bug, implement a new function mem_cgroup_filter_mems_allowed() to filter out nodes that are not set in mems_effective, and update demote_folio_list() to use this filtering logic. This ensures that demotion targets respect the task's memory placement constraints.
Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao <bingjiao@google.com>
---
 include/linux/cpuset.h     |  6 ++++++
 include/linux/memcontrol.h |  7 +++++++
 kernel/cgroup/cpuset.c     | 18 ++++++++++++++++++
 mm/memcontrol.c            |  6 ++++++
 mm/vmscan.c                | 13 ++++++++++---
 5 files changed, 47 insertions(+), 3 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index a98d3330385c..0e94548e2d24 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -175,6 +175,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 }
 
 extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask);
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -305,6 +306,11 @@ static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
 {
 	return true;
 }
+
+static inline void cpuset_node_filter_allowed(struct cgroup *cgroup,
+					      nodemask_t *mask)
+{
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fd400082313a..7cfd71c57caa 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1742,6 +1742,8 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
 
 bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
 
+void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
+
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
 
 static inline bool memcg_is_dying(struct mem_cgroup *memcg)
@@ -1816,6 +1818,11 @@ static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
 	return true;
 }
 
+static inline void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg,
+						  nodemask_t *mask)
+{
+}
+
 static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
 {
 }
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6e6eb09b8db6..2925bd6bca91 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4452,6 +4452,24 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
 	return allowed;
 }
 
+void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
+{
+	struct cgroup_subsys_state *css;
+	struct cpuset *cs;
+
+	if (!cpuset_v2())
+		return;
+
+	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
+	if (!css)
+		return;
+
+	/* Follows the same assumption as cpuset_node_allowed() */
+	cs = container_of(css, struct cpuset, css);
+	nodes_and(*mask, *mask, cs->effective_mems);
+	css_put(css);
+}
+
 /**
  * cpuset_spread_node() - On which node to begin search for a page
  * @rotor: round robin rotor
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75fc22a33b28..f414653867de 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5602,6 +5602,12 @@ bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
 	return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
 }
 
+void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
+{
+	if (memcg)
+		cpuset_node_filter_allowed(memcg->css.cgroup, mask);
+}
+
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled() || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 453d654727c1..4d23c491e914 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1018,7 +1018,8 @@ static struct folio *alloc_demote_folio(struct folio *src,
  * Folios which are not demoted are left on @demote_folios.
  */
 static unsigned int demote_folio_list(struct list_head *demote_folios,
-				      struct pglist_data *pgdat)
+				      struct pglist_data *pgdat,
+				      struct mem_cgroup *memcg)
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
@@ -1032,7 +1033,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		 */
 		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
 			__GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = target_nid,
 		.nmask = &allowed_mask,
 		.reason = MR_DEMOTION,
 	};
@@ -1044,6 +1044,13 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		return 0;
 
 	node_get_allowed_targets(pgdat, &allowed_mask);
+	/* Filter the given nmask based on cpuset.mems.allowed */
+	mem_cgroup_filter_mems_allowed(memcg, &allowed_mask);
+	if (nodes_empty(allowed_mask))
+		return 0;
+	if (!node_isset(target_nid, allowed_mask))
+		target_nid = node_random(&allowed_mask);
+	mtc.nid = target_nid;
 
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_folios, alloc_demote_folio, NULL,
@@ -1565,7 +1572,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* 'folio_list' is always empty here */
 
 	/* Migrate folios selected for demotion */
-	nr_demoted = demote_folio_list(&demote_folios, pgdat);
+	nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
 	nr_reclaimed += nr_demoted;
 	stat->nr_demoted += nr_demoted;
 	/* Folios that could not be demoted are still in @demote_folios */
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#opti...
Rule: add the tag "Cc: stable@vger.kernel.org" in the sign-off area to have the patch automatically included in the stable tree.
Subject: [PATCH v2 1/2] mm/vmscan: respect mems_effective in demote_folio_list()
Link: https://lore.kernel.org/stable/20251221233635.3761887-2-bingjiao%40google.co...
Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim") introduces the cpuset.mems_effective check and applies it to can_demote(). However, it checks only the nodes in the immediate next demotion hierarchy and does not check all allowed demotion targets. This can cause pages to never be demoted if the nodes in the next demotion hierarchy are not set in mems_effective.
To address the bug, use mem_cgroup_filter_mems_allowed() to filter the allowed targets obtained from node_get_allowed_targets(). Also remove functions that are now unused.
Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao <bingjiao@google.com>
---
 include/linux/cpuset.h     |  6 ------
 include/linux/memcontrol.h |  7 -------
 kernel/cgroup/cpuset.c     | 28 ++++------------------------
 mm/memcontrol.c            |  5 -----
 mm/vmscan.c                | 14 ++++++++------
 5 files changed, 12 insertions(+), 48 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 0e94548e2d24..ed7c27276e71 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -174,7 +174,6 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
 extern void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask);
 #else /* !CONFIG_CPUSETS */
 
@@ -302,11 +301,6 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
-{
-	return true;
-}
-
 static inline void cpuset_node_filter_allowed(struct cgroup *cgroup,
 					      nodemask_t *mask)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7cfd71c57caa..41aab33499b5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1740,8 +1740,6 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
 	rcu_read_unlock();
 }
 
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
-
 void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
 
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
@@ -1813,11 +1811,6 @@ static inline ino_t page_cgroup_ino(struct page *page)
 	return 0;
 }
 
-static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
-{
-	return true;
-}
-
 static inline void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg,
 						  nodemask_t *mask)
 {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2925bd6bca91..339779571508 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4416,11 +4416,10 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
 	return allowed;
 }
 
-bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
 {
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
-	bool allowed;
 
 	/*
 	 * In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
@@ -4428,15 +4427,15 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
 	 * so return true to avoid taking a global lock on the empty check.
 	 */
 	if (!cpuset_v2())
-		return true;
+		return;
 
 	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
 	if (!css)
-		return true;
+		return;
 
 	/*
 	 * Normally, accessing effective_mems would require the cpuset_mutex
-	 * or callback_lock - but node_isset is atomic and the reference
+	 * or callback_lock - but it is acceptable and the reference
 	 * taken via cgroup_get_e_css is sufficient to protect css.
 	 *
 	 * Since this interface is intended for use by migration paths, we
@@ -4447,25 +4446,6 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
 	 * cannot make strong isolation guarantees, so this is acceptable.
 	 */
 	cs = container_of(css, struct cpuset, css);
-	allowed = node_isset(nid, cs->effective_mems);
-	css_put(css);
-	return allowed;
-}
-
-void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
-{
-	struct cgroup_subsys_state *css;
-	struct cpuset *cs;
-
-	if (!cpuset_v2())
-		return;
-
-	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
-	if (!css)
-		return;
-
-	/* Follows the same assumption as cpuset_node_allowed() */
-	cs = container_of(css, struct cpuset, css);
 	nodes_and(*mask, *mask, cs->effective_mems);
 	css_put(css);
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f414653867de..ebf5df3c8ca1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5597,11 +5597,6 @@ subsys_initcall(mem_cgroup_swap_init);
 
 #endif /* CONFIG_SWAP */
 
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
-{
-	return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
-}
-
 void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
 {
 	if (memcg)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4d23c491e914..fa4d51af7f44 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -344,19 +344,21 @@ static void flush_reclaim_state(struct scan_control *sc)
 static bool can_demote(int nid, struct scan_control *sc,
 		       struct mem_cgroup *memcg)
 {
-	int demotion_nid;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	nodemask_t allowed_mask;
 
-	if (!numa_demotion_enabled)
+	if (!pgdat || !numa_demotion_enabled)
 		return false;
 	if (sc && sc->no_demotion)
 		return false;
 
-	demotion_nid = next_demotion_node(nid);
-	if (demotion_nid == NUMA_NO_NODE)
+	node_get_allowed_targets(pgdat, &allowed_mask);
+	if (nodes_empty(allowed_mask))
 		return false;
 
-	/* If demotion node isn't in the cgroup's mems_allowed, fall back */
-	return mem_cgroup_node_allowed(memcg, demotion_nid);
+	/* Filter the given nmask based on cpuset.mems.allowed */
+	mem_cgroup_filter_mems_allowed(memcg, &allowed_mask);
+	return !nodes_empty(allowed_mask);
 }
 
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
On 2025/12/22 14:09, Bing Jiao wrote:
On Mon, Dec 22, 2025 at 10:51:49AM +0800, Chen Ridong wrote:
On 2025/12/22 7:36, Bing Jiao wrote:
-void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
-{
-	struct cgroup_subsys_state *css;
-	struct cpuset *cs;
-
-	if (!cpuset_v2())
-		return;
-
-	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
-	if (!css)
-		return;
-
-	/* Follows the same assumption as cpuset_node_allowed() */
-	cs = container_of(css, struct cpuset, css);
 	nodes_and(*mask, *mask, cs->effective_mems);
 	css_put(css);
 }
Oh, I see you merged these two functions here.
However, I think cpuset_get_mem_allowed would be more versatile in general use.
You can then check whether the returned nodemask intersects with your target mask. In the future, there may be scenarios where users simply want to retrieve the effective masks directly.
Hi Ridong, thank you for the suggestions.
I agree that returning a nodemask would provide greater versatility.
I think cpuset_get_mem_allowed_relax() would be a better name, since we do not need the locking and online-memory guarantees of the similar function cpuset_mems_allowed().
I think the key difference between cpuset_mems_allowed and the helper you intend to implement lies not in locking or online memory guarantees, but in the input parameter: you want to retrieve cpuset->effective_mems for a cgroup from another subsystem.
cs->effective_mems should typically include only online nodes, except during brief transitional periods such as hotplug operations. Similarly, the node migration logic also requires online nodes.
Therefore, cpuset_get_mem_allowed seems acceptable to me.
Additionally, you may consider calling guarantee_online_mems inside your new helper to ensure consistency.
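Something along these lines, as a rough, untested sketch of that suggestion (the helper name cpuset_get_mem_allowed() comes from the discussion above; the guarantee_online_mems() call and the node_states[N_MEMORY] fallback are illustrative assumptions, not posted code):

/* Sketch: return the effective mems of @cgroup's cpuset, clamped to
 * online nodes via guarantee_online_mems(), per the suggestion above. */
nodemask_t cpuset_get_mem_allowed(struct cgroup *cgroup)
{
	struct cgroup_subsys_state *css;
	struct cpuset *cs;
	/* Fallback when no cpuset css is found: all nodes with memory. */
	nodemask_t mask = node_states[N_MEMORY];

	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
	if (!css)
		return mask;

	cs = container_of(css, struct cpuset, css);
	guarantee_online_mems(cs, &mask);
	css_put(css);
	return mask;
}

/* A caller such as cpuset_node_filter_allowed() then only intersects: */
void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
{
	nodemask_t e_mask = cpuset_get_mem_allowed(cgroup);

	nodes_and(*mask, *mask, e_mask);
}

Note the cpuset_v2() check is deliberately omitted in this sketch, matching the review comment that the helper should be common rather than v2-only.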
On 2025/12/22 7:36, Bing Jiao wrote:
Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim") introduces the cpuset.mems_effective check and applies it to can_demote(). However, it does not apply this check in demote_folio_list().
This omission leads to situations where pages are demoted to nodes that are explicitly excluded from the task's cpuset.mems. The impact is two-fold:
1. Resource Isolation: This bug breaks the resource isolation provided by cpuset.mems. It allows pages to be demoted to nodes that are dedicated to other tasks or are intended for hot-unplugging.
2. Performance Issue: In multi-tier systems, users use cpuset.mems to bind tasks to specific performance tiers (e.g., avoiding the slowest tiers for latency-sensitive data). This bug can cause unexpected latency spikes if pages are demoted to the farthest nodes.
To address the bug, implement a new function mem_cgroup_filter_mems_allowed() to filter out nodes that are not set in mems_effective, and update demote_folio_list() to use this filtering logic. This ensures that demotion targets respect the task's memory placement constraints.
Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao <bingjiao@google.com>
---
 include/linux/cpuset.h     |  6 ++++++
 include/linux/memcontrol.h |  7 +++++++
 kernel/cgroup/cpuset.c     | 18 ++++++++++++++++++
 mm/memcontrol.c            |  6 ++++++
 mm/vmscan.c                | 13 ++++++++++---
 5 files changed, 47 insertions(+), 3 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index a98d3330385c..0e94548e2d24 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -175,6 +175,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 }
 
 extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask);
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -305,6 +306,11 @@ static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
 {
 	return true;
 }
+
+static inline void cpuset_node_filter_allowed(struct cgroup *cgroup,
+					      nodemask_t *mask)
+{
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fd400082313a..7cfd71c57caa 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1742,6 +1742,8 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
 
 bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
 
+void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
+
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
 
 static inline bool memcg_is_dying(struct mem_cgroup *memcg)
@@ -1816,6 +1818,11 @@ static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
 	return true;
 }
 
+static inline void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg,
+						  nodemask_t *mask)
+{
+}
+
 static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
 {
 }
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6e6eb09b8db6..2925bd6bca91 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4452,6 +4452,24 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
 	return allowed;
 }
 
+void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
+{
+	struct cgroup_subsys_state *css;
+	struct cpuset *cs;
+
+	if (!cpuset_v2())
+		return;
+
+	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
+	if (!css)
+		return;
+
+	/* Follows the same assumption as cpuset_node_allowed() */
+	cs = container_of(css, struct cpuset, css);
+	nodes_and(*mask, *mask, cs->effective_mems);
+	css_put(css);
+}
The functions cpuset_node_filter_allowed and cpuset_node_allowed are similar. We should create a helper function to obtain cs->effective_mems, which can then be used by both cpuset_node_filter_allowed and cpuset_node_allowed.
For example:
nodemask_t cpuset_get_mem_allowed(struct cgroup *cgroup)
{
	...
}

bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
{
	nodemask_t e_mask = cpuset_get_mem_allowed(cgroup);

	return node_isset(nid, e_mask);
}

void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
{
	nodemask_t e_mask = cpuset_get_mem_allowed(cgroup);

	nodes_and(*mask, *mask, e_mask);
}
Also, I do not think we should distinguish between cgroup v1 and v2 here. This should be a common function; at least based on its name, it should not be solely for v2.
+
 /**
  * cpuset_spread_node() - On which node to begin search for a page
  * @rotor: round robin rotor
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75fc22a33b28..f414653867de 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5602,6 +5602,12 @@ bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
 	return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
 }
 
+void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
+{
+	if (memcg)
+		cpuset_node_filter_allowed(memcg->css.cgroup, mask);
+}
+
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled() || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 453d654727c1..4d23c491e914 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1018,7 +1018,8 @@ static struct folio *alloc_demote_folio(struct folio *src,
  * Folios which are not demoted are left on @demote_folios.
  */
 static unsigned int demote_folio_list(struct list_head *demote_folios,
-				      struct pglist_data *pgdat)
+				      struct pglist_data *pgdat,
+				      struct mem_cgroup *memcg)
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
@@ -1032,7 +1033,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		 */
 		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
 			__GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = target_nid,
 		.nmask = &allowed_mask,
 		.reason = MR_DEMOTION,
 	};
@@ -1044,6 +1044,13 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		return 0;
 
 	node_get_allowed_targets(pgdat, &allowed_mask);
+	/* Filter the given nmask based on cpuset.mems.allowed */
+	mem_cgroup_filter_mems_allowed(memcg, &allowed_mask);
+	if (nodes_empty(allowed_mask))
+		return 0;
+	if (!node_isset(target_nid, allowed_mask))
+		target_nid = node_random(&allowed_mask);
+	mtc.nid = target_nid;
 
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_folios, alloc_demote_folio, NULL,
@@ -1565,7 +1572,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* 'folio_list' is always empty here */
 
 	/* Migrate folios selected for demotion */
-	nr_demoted = demote_folio_list(&demote_folios, pgdat);
+	nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
 	nr_reclaimed += nr_demoted;
 	stat->nr_demoted += nr_demoted;
 	/* Folios that could not be demoted are still in @demote_folios */
On 2025/12/22 7:36, Bing Jiao wrote:
Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim") introduces the cpuset.mems_effective check and applies it to can_demote(). However, it checks only the nodes in the immediate next demotion hierarchy and does not check all allowed demotion targets. This can cause pages to never be demoted if the nodes in the next demotion hierarchy are not set in mems_effective.
To address the bug, use mem_cgroup_filter_mems_allowed() to filter the allowed targets obtained from node_get_allowed_targets(). Also remove functions that are now unused.
Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao <bingjiao@google.com>
---
 include/linux/cpuset.h     |  6 ------
 include/linux/memcontrol.h |  7 -------
 kernel/cgroup/cpuset.c     | 28 ++++------------------------
 mm/memcontrol.c            |  5 -----
 mm/vmscan.c                | 14 ++++++++------
 5 files changed, 12 insertions(+), 48 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 0e94548e2d24..ed7c27276e71 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -174,7 +174,6 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
 extern void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask);
 #else /* !CONFIG_CPUSETS */
 
@@ -302,11 +301,6 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
-{
-	return true;
-}
-
 static inline void cpuset_node_filter_allowed(struct cgroup *cgroup,
 					      nodemask_t *mask)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7cfd71c57caa..41aab33499b5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1740,8 +1740,6 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
 	rcu_read_unlock();
 }
 
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
-
 void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
 
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
@@ -1813,11 +1811,6 @@ static inline ino_t page_cgroup_ino(struct page *page)
 	return 0;
 }
 
-static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
-{
-	return true;
-}
-
 static inline void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg,
 						  nodemask_t *mask)
 {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2925bd6bca91..339779571508 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4416,11 +4416,10 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
 	return allowed;
 }
 
-bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
 {
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
-	bool allowed;
 
 	/*
 	 * In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
@@ -4428,15 +4427,15 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
 	 * so return true to avoid taking a global lock on the empty check.
 	 */
 	if (!cpuset_v2())
-		return true;
+		return;
 
 	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
 	if (!css)
-		return true;
+		return;
 
 	/*
 	 * Normally, accessing effective_mems would require the cpuset_mutex
-	 * or callback_lock - but node_isset is atomic and the reference
+	 * or callback_lock - but it is acceptable and the reference
 	 * taken via cgroup_get_e_css is sufficient to protect css.
 	 *
 	 * Since this interface is intended for use by migration paths, we
@@ -4447,25 +4446,6 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
 	 * cannot make strong isolation guarantees, so this is acceptable.
 	 */
 	cs = container_of(css, struct cpuset, css);
-	allowed = node_isset(nid, cs->effective_mems);
-	css_put(css);
-	return allowed;
-}
-
-void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
-{
-	struct cgroup_subsys_state *css;
-	struct cpuset *cs;
-
-	if (!cpuset_v2())
-		return;
-
-	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
-	if (!css)
-		return;
-
-	/* Follows the same assumption as cpuset_node_allowed() */
-	cs = container_of(css, struct cpuset, css);
 	nodes_and(*mask, *mask, cs->effective_mems);
 	css_put(css);
 }
Oh, I see you merged these two functions here.
However, I think cpuset_get_mem_allowed would be more versatile in general use.
You can then check whether the returned nodemask intersects with your target mask. In the future, there may be scenarios where users simply want to retrieve the effective masks directly.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f414653867de..ebf5df3c8ca1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5597,11 +5597,6 @@ subsys_initcall(mem_cgroup_swap_init);
 
 #endif /* CONFIG_SWAP */
 
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
-{
-	return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
-}
-
 void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
 {
 	if (memcg)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4d23c491e914..fa4d51af7f44 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -344,19 +344,21 @@ static void flush_reclaim_state(struct scan_control *sc)
 static bool can_demote(int nid, struct scan_control *sc,
 		       struct mem_cgroup *memcg)
 {
-	int demotion_nid;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	nodemask_t allowed_mask;
 
-	if (!numa_demotion_enabled)
+	if (!pgdat || !numa_demotion_enabled)
 		return false;
 	if (sc && sc->no_demotion)
 		return false;
 
-	demotion_nid = next_demotion_node(nid);
-	if (demotion_nid == NUMA_NO_NODE)
+	node_get_allowed_targets(pgdat, &allowed_mask);
+	if (nodes_empty(allowed_mask))
 		return false;
 
-	/* If demotion node isn't in the cgroup's mems_allowed, fall back */
-	return mem_cgroup_node_allowed(memcg, demotion_nid);
+	/* Filter the given nmask based on cpuset.mems.allowed */
+	mem_cgroup_filter_mems_allowed(memcg, &allowed_mask);
+	return !nodes_empty(allowed_mask);
 }
 
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
On Mon, Dec 22, 2025 at 10:51:49AM +0800, Chen Ridong wrote:
On 2025/12/22 7:36, Bing Jiao wrote:
-void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
-{
-	struct cgroup_subsys_state *css;
-	struct cpuset *cs;
-
-	if (!cpuset_v2())
-		return;
-
-	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
-	if (!css)
-		return;
-
-	/* Follows the same assumption as cpuset_node_allowed() */
-	cs = container_of(css, struct cpuset, css);
 	nodes_and(*mask, *mask, cs->effective_mems);
 	css_put(css);
 }
Oh, I see you merged these two functions here.
However, I think cpuset_get_mem_allowed would be more versatile in general use.
You can then check whether the returned nodemask intersects with your target mask. In the future, there may be scenarios where users simply want to retrieve the effective masks directly.
Hi Ridong, thank you for the suggestions.
I agree that returning a nodemask would provide greater versatility.
I think cpuset_get_mem_allowed_relax() would be a better name, since we do not need the locking and online-memory guarantees of the similar function cpuset_mems_allowed().
Best,
Bing