This patch is an improvement of a previous version [1], which was not easy to understand. The issue persists in the newest kernel, so I have to resend the fix. As the implementation has changed, I drop Roman's ack from the previous version.
Here's the explanation of this issue: memory.{low,min} won't take effect if the to-be-reclaimed memcg is the sc->target_mem_cgroup. That can also be seen in the implementation of mem_cgroup_protected(), see below:

	mem_cgroup_protected
		if (memcg == root) [2]
			return MEMCG_PROT_NONE;
But this rule is ignored in mem_cgroup_protection(), which reads memory.{emin, elow} as the protection regardless of which memcg it is.
How does this issue happen? Because mem_cgroup_protected() forgets to clear memory.{emin, elow} when the memcg is the target_mem_cgroup [2].
An example to illustrate this issue:

	root_mem_cgroup
	      /
	     A
	memory.max:     1024M
	memory.min:     512M
	memory.current: 800M ('current' must be greater than 'min')

Once kswapd starts to reclaim memcg A, it assigns 512M to memory.emin of A. Then kswapd stops. As a result, the memory values of A will be:

	root_mem_cgroup
	      /
	     A
	memory.max:     1024M
	memory.min:     512M
	memory.current: 512M (approximately)
	memory.emin:    512M
Then a new workload starts to run in memcg A, and it will soon trigger memcg reclaim in A. As memcg A is the target_mem_cgroup of this reclaimer, the reclaimer returns directly without touching memory.{emin, elow} [2]. The memory values of A will then be:

	root_mem_cgroup
	      /
	     A
	memory.max:     1024M
	memory.min:     512M
	memory.current: 1024M (approximately)
	memory.emin:    512M

This memory.emin will then be used in mem_cgroup_protection() to get the scan count, which is obviously a wrong scan count.
[1]. https://lore.kernel.org/linux-mm/20200216145249.6900-1-laoar.shao@gmail.com/
Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
Cc: Chris Down <chris@chrisdown.name>
Cc: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h | 13 +++++++++++--
 mm/vmscan.c                |  4 ++--
 2 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d275c72c4f8e..114cfe06bf60 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -344,12 +344,20 @@ static inline bool mem_cgroup_disabled(void)
 	return !cgroup_subsys_enabled(memory_cgrp_subsys);
 }
 
-static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
+static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
+						  struct mem_cgroup *memcg,
 						  bool in_low_reclaim)
 {
 	if (mem_cgroup_disabled())
 		return 0;
 
+	/*
+	 * Memcg protection won't take effect if the memcg is the target
+	 * root memcg.
+	 */
+	if (root == memcg)
+		return 0;
+
 	if (in_low_reclaim)
 		return READ_ONCE(memcg->memory.emin);
 
@@ -835,7 +843,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 {
 }
 
-static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
+static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
+						  struct mem_cgroup *memcg,
 						  bool in_low_reclaim)
 {
 	return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b06868fc4926..ad2782f754ab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2346,9 +2346,9 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		unsigned long protection;
 
 		lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
-		protection = mem_cgroup_protection(memcg,
+		protection = mem_cgroup_protection(sc->target_mem_cgroup,
+						   memcg,
 						   sc->memcg_low_reclaim);
-
 		if (protection) {
 			/*
 			 * Scale a cgroup's reclaim pressure by proportioning
Hi Yafang,
I'm afraid I'm just as confused as Michal was about the intent of this patch.
Can you please be more concise and clear about the practical ramifications and demonstrate some pathological behaviour? I really can't visualise what's wrong here from your explanation, and if I can't understand it as the person who wrote this code, I am not surprised others are also confused :-)
Or maybe Roman can try to explain, since he acked the previous patch? At least to me, the emin/elow behaviour seems fairly non-trivial to reason about right now.
Thanks.
On Thu, Apr 23, 2020 at 04:33:23PM +0100, Chris Down wrote:
Hi Yafang,
I'm afraid I'm just as confused as Michal was about the intent of this patch.
Can you please be more concise and clear about the practical ramifications and demonstrate some pathological behaviour? I really can't visualise what's wrong here from your explanation, and if I can't understand it as the person who wrote this code, I am not surprised others are also confused :-)
Or maybe Roman can try to explain, since he acked the previous patch? At least to me, the emin/elow behaviour seems fairly non-trivial to reason about right now.
Hi Chris!
So the thing is that emin/elow cached values are shared between global and targeted (caused by memory.max) reclaim. It's racy by design, but in general it should work ok, because in the end we'll reclaim or not approximately the same amount of memory.
In the case which Yafang described, the emin value calculated in the process of global reclaim leads to a slowdown of the targeted reclaim. It's not a tragedy, but not perfect either. It seems that the proposed patch makes it better, and as of now I don't see any bad consequences.
Thanks!
On Fri, Apr 24, 2020 at 5:13 AM Roman Gushchin guro@fb.com wrote:
Hi Chris!
So the thing is that emin/elow cached values are shared between global and targeted (caused by memory.max) reclaim. It's racy by design, but in general it should work ok, because in the end we'll reclaim or not approximately the same amount of memory.
In the case which Yafang described, the emin value calculated in the process of global reclaim leads to a slowdown of the targeted reclaim. It's not a tragedy, but not perfect either. It seems that the proposed patch makes it better, and as of now I don't see any bad consequences.
Thanks for your explanation. It makes the issue clearer. If you don't mind, I will copy some of your comments to improve the commit log.
On Thu 23-04-20 14:13:19, Roman Gushchin wrote:
Hi Chris!
So the thing is that emin/elow cached values are shared between global and targeted (caused by memory.max) reclaim. It's racy by design, but in general it should work ok, because in the end we'll reclaim or not approximately the same amount of memory.
In the case which Yafang described, the emin value calculated in the process of global reclaim leads to a slowdown of the targeted reclaim. It's not a tragedy, but not perfect either. It seems that the proposed patch makes it better, and as of now I don't see any bad consequences.
Do we have any means to quantify the effect?
I do understand the racy nature of the effective protection values. We do update them in mem_cgroup_protected and that handles the reclaim_target == memcg case already. So why do we care later on in mem_cgroup_protection? And why don't we care about any other concurrent reclaimers which have a different reclaim memcg target? This just doesn't make any sense.
Either we do care about races because they are harmful, and then we care for all possible cases, or we don't, and then this patch doesn't really add any big value. Or I still miss the point.
On Fri, Apr 24, 2020 at 6:40 PM Michal Hocko mhocko@kernel.org wrote:
On Thu 23-04-20 14:13:19, Roman Gushchin wrote:
Hi Chris!
So the thing is that emin/elow cached values are shared between global and targeted (caused by memory.max) reclaim. It's racy by design, but in general it should work ok, because in the end we'll reclaim or not approximately the same amount of memory.
In the case which Yafang described, the emin value calculated in the process of global reclaim leads to a slowdown of the targeted reclaim. It's not a tragedy, but not perfect either. It seems that the proposed patch makes it better, and as of now I don't see any bad consequences.
Do we have any means to quantify the effect?
I do understand the racy nature of the effective protection values. We do update them in mem_cgroup_protected and that handles the reclaim_target == memcg case already. So why do we care later on in mem_cgroup_protection? And why don't we care about any other concurrent reclaimers which have a different reclaim memcg target? This just doesn't make any sense.
No, you missed the point. The issue I pointed out isn't related to races. Roman also explained that the race is not the point.
Either we do care about races because they are harmful, and then we care for all possible cases, or we don't, and then this patch doesn't really add any big value. Or I still miss the point.
The real point is that memcg reclaim will get a wrong value from mem_cgroup_protection() after memory.emin is set. Suppose the target memcg has memory.current of 2G and memory.min of 2G; in memcg reclaim, the scan count will be (SWAP_CLUSTER_MAX >> sc->priority) rather than (lruvec_size >> sc->priority). That's a slowdown, as Roman explained as well.
On Thu, Apr 23, 2020 at 11:33 PM Chris Down chris@chrisdown.name wrote:
Hi Yafang,
I'm afraid I'm just as confused as Michal was about the intent of this patch.
Can you please be more concise and clear about the practical ramifications and demonstrate some pathological behaviour? I really can't visualise what's wrong here from your explanation, and if I can't understand it as the person who wrote this code, I am not surprised others are also confused :-)
Hi Chris,
If the author can't deeply understand the code written by himself/herself, I think the author should do more testing on his/her patches. Regarding the issue in this case, my understanding is that you know the benefit of proportional reclaim, but I'm wondering whether you know the loss if the proportion is not correct. I don't mean to offend you; I'm just trying to explain how the community should cooperate.
Or maybe Roman can try to explain, since he acked the previous patch? At least to me, the emin/elow behaviour seems fairly non-trivial to reason about right now.
Yafang Shao writes:
If the author can't deeply understand the code written by himself/herself, I think the author should do more testing on his/her patches. Regarding the issue in this case, my understanding is that you know the benefit of proportional reclaim, but I'm wondering whether you know the loss if the proportion is not correct. I don't mean to offend you; I'm just trying to explain how the community should cooperate.
I'm pretty sure that since multiple people on mm list have already expressed confusion at this patch, this isn't a question of testing, but of lack of clarity in usage :-)
Promoting "testing" as a panacea for this issue misses a significant part of the real problem: that the intended semantics and room for allowed races is currently unclear, which is why there is a general sense of confusion around your proposed patch and what it solves. If more testing would help, then the benefit of your patch should be patently obvious -- but it isn't.
On Fri, Apr 24, 2020 at 8:18 PM Chris Down chris@chrisdown.name wrote:
Yafang Shao writes:
If the author can't deeply understand the code written by himself/herself, I think the author should do more testing on his/her patches. Regarding the issue in this case, my understanding is that you know the benefit of proportional reclaim, but I'm wondering whether you know the loss if the proportion is not correct. I don't mean to offend you; I'm just trying to explain how the community should cooperate.
I'm pretty sure that since multiple people on mm list have already expressed confusion at this patch, this isn't a question of testing, but of lack of clarity in usage :-)
Promoting "testing" as a panacea for this issue misses a significant part of the real problem: that the intended semantics and room for allowed races is currently unclear, which is why there is a general sense of confusion around your proposed patch and what it solves. If more testing would help, then the benefit of your patch should be patently obvious -- but it isn't.
I have shown a testcase in my commit log. Below is the result without my patch:
[  601.811428] vmscan: protection 1048576 memcg /foo target memcg /foo
[  601.811429] vmscan:
[  601.811429] vmscan: protection 1048576 memcg /foo target memcg /foo
[  601.811430] vmscan:
[  601.811430] vmscan: protection 1048576 memcg /foo target memcg /foo
[  601.811431] vmscan:
[  602.452791] vmscan: protection 1048576 memcg /foo target memcg /foo
[  602.452795] vmscan:
[  602.452796] vmscan: protection 1048576 memcg /foo target memcg /foo
[  602.452805] vmscan:
[  602.452805] vmscan: protection 1048576 memcg /foo target memcg /foo
[  602.452806] vmscan:
[  602.452807] vmscan: protection 1048576 memcg /foo target memcg /foo
[  602.452808] vmscan:
Here's patch to print the above info.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b06868fc4926..7525665d7cec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2344,10 +2344,18 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		unsigned long lruvec_size;
 		unsigned long scan;
 		unsigned long protection;
+		struct mem_cgroup *target = sc->target_mem_cgroup;
 
 		lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
 		protection = mem_cgroup_protection(memcg,
 						   sc->memcg_low_reclaim);
+		if (memcg && memcg != root_mem_cgroup && target) {
+			pr_info("protection %lu memcg ", protection);
+			pr_cont_cgroup_path(memcg->css.cgroup);
+			pr_cont(" target memcg ");
+			pr_cont_cgroup_path(target->css.cgroup);
+			pr_info("\n");
+		}
 
 		if (protection) {
So my question is: do you think the protection in these logs is okay? Can you answer me?

Hint: what should the protection be if memcg is the target memcg?
I'm not debating whether your test case is correct or not, or whether the numbers are correct or not. The point is that 3 out of 4 people from mm list who have looked at this patch have no idea what it's trying to do, why it's important, or why these numbers *should* necessarily be wrong.
It's good that you have provided a way to demonstrate the effects of your changes, but one can't solve cognitive gaps with testing. If one could, then things like TDD would be effective on complex projects -- but they aren't[0].
We're really getting off in the weeds here, though. I just want to make it clear to you that if you think "more testing" is a solution to kernel ailments like this, you're going to be disappointed.
0: http://people.brunel.ac.uk/~csstmms/FucciEtAl_ESEM2016.pdf
On Fri, Apr 24, 2020 at 9:05 PM Chris Down chris@chrisdown.name wrote:
I'm not debating whether your test case is correct or not, or whether the numbers are correct or not. The point is that 3 out of 4 people from mm list who have looked at this patch have no idea what it's trying to do, why it's important, or why these numbers *should* necessarily be wrong.
It's good that you have provided a way to demonstrate the effects of your changes, but one can't solve cognitive gaps with testing. If one could, then things like TDD would be effective on complex projects -- but they aren't[0].
We're really getting off in the weeds here, though. I just want to make it clear to you that if you think "more testing" is a solution to kernel ailments like this, you're going to be disappointed.
0: http://people.brunel.ac.uk/~csstmms/FucciEtAl_ESEM2016.pdf
Hi Chris,
Please answer my question directly - what should the protection be if memcg is the target memcg? If you can't answer my question, I suggest revisiting your patch.
On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d275c72c4f8e..114cfe06bf60 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -344,12 +344,20 @@ static inline bool mem_cgroup_disabled(void)
 	return !cgroup_subsys_enabled(memory_cgrp_subsys);
 }
 
-static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
+static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
+						  struct mem_cgroup *memcg,
 						  bool in_low_reclaim)
I'd rename "root" to "target", maybe it will make the whole thing more clear.
I'll think a bit more about it, but at the first glance the patch looks sane to me.
Thanks!
On Fri, Apr 24, 2020 at 5:06 AM Roman Gushchin guro@fb.com wrote:
-static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
+static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
+						  struct mem_cgroup *memcg,
 						  bool in_low_reclaim)
I'd rename "root" to "target", maybe it will make the whole thing more clear.
That would make it better. I will change it.
I'll think a bit more about it, but at the first glance the patch looks sane to me.
Thanks!
On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
This patch is an improvement of a previous version [1], which was not easy to understand. The issue persists in the newest kernel, so I have to resend the fix. As the implementation has changed, I drop Roman's ack from the previous version.
Now that I understand the problem, I much prefer the previous version.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 745697906ce3..2bf91ae1e640 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 
 	if (!root)
 		root = root_mem_cgroup;
-	if (memcg == root)
+	if (memcg == root) {
+		/*
+		 * The cgroup is the reclaim root in this reclaim
+		 * cycle, and therefore not protected. But it may have
+		 * stale effective protection values from previous
+		 * cycles in which it was not the reclaim root - for
+		 * example, global reclaim followed by limit reclaim.
+		 * Reset these values for mem_cgroup_protection().
+		 */
+		memcg->memory.emin = 0;
+		memcg->memory.elow = 0;
 		return MEMCG_PROT_NONE;
+	}
 
 	usage = page_counter_read(&memcg->memory);
 	if (!usage)
Here's the explanation of this issue: memory.{low,min} won't take effect if the to-be-reclaimed memcg is the sc->target_mem_cgroup. That can also be seen in the implementation of mem_cgroup_protected(), see below:

	mem_cgroup_protected
		if (memcg == root) [2]
			return MEMCG_PROT_NONE;

But this rule is ignored in mem_cgroup_protection(), which reads memory.{emin, elow} as the protection regardless of which memcg it is.

How does this issue happen? Because mem_cgroup_protected() forgets to clear memory.{emin, elow} when the memcg is the target_mem_cgroup [2].

An example to illustrate this issue:

	root_mem_cgroup
	      /
	     A
	memory.max:     1024M
	memory.min:     512M
	memory.current: 800M ('current' must be greater than 'min')

Once kswapd starts to reclaim memcg A, it assigns 512M to memory.emin of A. Then kswapd stops. As a result, the memory values of A will be:

	memory.max:     1024M
	memory.min:     512M
	memory.current: 512M (approximately)
	memory.emin:    512M

Then a new workload starts to run in memcg A, and it will soon trigger memcg reclaim in A. As memcg A is the target_mem_cgroup of this reclaimer, the reclaimer returns directly without touching memory.{emin, elow} [2]. The memory values of A will then be:

	memory.max:     1024M
	memory.min:     512M
	memory.current: 1024M (approximately)
	memory.emin:    512M

This memory.emin will then be used in mem_cgroup_protection() to get the scan count, which is obviously a wrong scan count.

[1]. https://lore.kernel.org/linux-mm/20200216145249.6900-1-laoar.shao@gmail.com/

Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
Cc: Chris Down <chris@chrisdown.name>
Cc: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
As others have noted, it's fairly hard to understand the problem from the above changelog. How about the following:
A cgroup can have both memory protection and a memory limit to isolate it from its siblings in both directions - for example, to prevent it from being shrunk below 2G under high pressure from outside, but also from growing beyond 4G under low pressure.
9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") implemented proportional scan pressure so that multiple siblings in excess of their protection settings don't get reclaimed equally but instead in accordance to their unprotected portion.
During limit reclaim, this proportionality shouldn't apply of course: there is no competition, all pressure is from within the cgroup and should be applied as such. Reclaim should operate at full efficiency.
However, mem_cgroup_protected() never expected anybody to look at the effective protection values when it indicated that the cgroup is above its protection. As a result, a query during limit reclaim may return stale protection values that were calculated by a previous reclaim cycle in which the cgroup did have siblings.
When this happens, reclaim is unnecessarily hesitant and potentially slow to meet the desired limit. In theory this could lead to premature OOM kills, although it's not obvious this has occurred in practice.
On Fri, Apr 24, 2020 at 09:14:52AM -0400, Johannes Weiner wrote:
However, mem_cgroup_protected() never expected anybody to look at the effective protection values when it indicated that the cgroup is above its protection. As a result, a query during limit reclaim may return stale protection values that were calculated by a previous reclaim cycle in which the cgroup did have siblings.
Btw, I think there is opportunity to make this a bit less error prone.
We have a mem_cgroup_protected() that returns yes or no, essentially, but protection isn't a binary state anymore.
It's also been a bit iffy that it looks like a simple predicate function, but it indeed needs to run procedurally for each cgroup in order for the calculations throughout the tree to be correct.
It might be better to have a
mem_cgroup_calculate_protection()
that runs for every cgroup we visit and sets up the internal state; then have more self-explanatory query functions on top of that:
mem_cgroup_below_min() mem_cgroup_below_low() mem_cgroup_protection()
What do you guys think?
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e0f502b5fca6..dbd3f75d39b9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2615,14 +2615,15 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		unsigned long reclaimed;
 		unsigned long scanned;
 
-		switch (mem_cgroup_protected(target_memcg, memcg)) {
-		case MEMCG_PROT_MIN:
+		mem_cgroup_calculate_protection(target_memcg, memcg);
+
+		if (mem_cgroup_below_min(memcg)) {
 			/*
 			 * Hard protection.
 			 * If there is no reclaimable memory, OOM.
 			 */
 			continue;
-		case MEMCG_PROT_LOW:
+		} else if (mem_cgroup_below_low(memcg)) {
 			/*
 			 * Soft protection.
 			 * Respect the protection only as long as
@@ -2634,16 +2635,6 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 				continue;
 			}
 			memcg_memory_event(memcg, MEMCG_LOW);
-			break;
-		case MEMCG_PROT_NONE:
-			/*
-			 * All protection thresholds breached. We may
-			 * still choose to vary the scan pressure
-			 * applied based on by how much the cgroup in
-			 * question has exceeded its protection
-			 * thresholds (see get_scan_count).
-			 */
-			break;
 		}
 
 		reclaimed = sc->nr_reclaimed;
On Fri 24-04-20 09:44:38, Johannes Weiner wrote:
On Fri, Apr 24, 2020 at 09:14:52AM -0400, Johannes Weiner wrote:
What do you guys think?
Fully agreed. I have to say that I have considered the predicate with side effects really confusing.
On Fri, Apr 24, 2020 at 9:44 PM Johannes Weiner hannes@cmpxchg.org wrote:
On Fri, Apr 24, 2020 at 09:14:52AM -0400, Johannes Weiner wrote:
What do you guys think?
After my revisit of the memcg protection, I have another idea.
The emin and elow are not decided by the memcg (struct mem_cgroup); they are really decided by the reclaim context (struct shrink_control). So they should not be bound to struct mem_cgroup, while they really should be bound to struct shrink_control. IOW, we should move emin and elow from struct mem_cgroup into struct shrink_control, and these two members in shrink_control will be updated when a new memcg is to be shrunk. I haven't thought it through deeply, but I think this should be the right thing to do.
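A hypothetical sketch of the idea above: keep emin/elow in the reclaim context instead of in struct mem_cgroup, recomputing them whenever the walk moves on to a new memcg. All names, fields, and the calculation here are illustrative, not the kernel's.

```c
#include <assert.h>

/* Toy stand-in for the cgroup being reclaimed. */
struct memcg {
	unsigned long min, low, usage;
};

/* Per-reclaim-cycle state, one instance per reclaimer. */
struct shrink_control {
	struct memcg *target;		/* reclaim root of this cycle */
	unsigned long emin, elow;	/* protection of the current victim */
};

static void sc_update_protection(struct shrink_control *sc,
				 struct memcg *memcg)
{
	if (memcg == sc->target) {
		/* the reclaim root is never protected from itself */
		sc->emin = 0;
		sc->elow = 0;
		return;
	}
	/* toy calculation: cap configured protection by usage */
	sc->emin = memcg->min < memcg->usage ? memcg->min : memcg->usage;
	sc->elow = memcg->low < memcg->usage ? memcg->low : memcg->usage;
}
```

One property this layout buys by construction: a parallel reclaimer owns its own shrink_control, so it cannot overwrite the effective values another reclaimer is currently looking at.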
Thanks Yafang
On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
This patch is an improvement of a previous version[1], as the previous version is not easy to understand. This issue persists in the newest kernel, I have to resend the fix. As the implementation is changed, I drop Roman's ack from the previous version.
Now that I understand the problem, I much prefer the previous version.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 745697906ce3..2bf91ae1e640 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;
 
-	if (memcg == root)
+	if (memcg == root) {
+		/*
+		 * The cgroup is the reclaim root in this reclaim
+		 * cycle, and therefore not protected. But it may have
+		 * stale effective protection values from previous
+		 * cycles in which it was not the reclaim root - for
+		 * example, global reclaim followed by limit reclaim.
+		 * Reset these values for mem_cgroup_protection().
+		 */
+		memcg->memory.emin = 0;
+		memcg->memory.elow = 0;
 		return MEMCG_PROT_NONE;
+	}
Could you be more specific why you prefer this over the mem_cgroup_protection which doesn't change the effective value? Isn't it easier to simply ignore effective value for the reclaim roots?
[...]
As others have noted, it's fairly hard to understand the problem from the above changelog. How about the following:
A cgroup can have both memory protection and a memory limit to isolate it from its siblings in both directions - for example, to prevent it from being shrunk below 2G under high pressure from outside, but also from growing beyond 4G under low pressure.
9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") implemented proportional scan pressure so that multiple siblings in excess of their protection settings don't get reclaimed equally but instead in accordance to their unprotected portion.
During limit reclaim, this proportionality shouldn't apply of course: there is no competition, all pressure is from within the cgroup and should be applied as such. Reclaim should operate at full efficiency.
However, mem_cgroup_protected() never expected anybody to look at the effective protection values when it indicated that the cgroup is above its protection. As a result, a query during limit reclaim may return stale protection values that were calculated by a previous reclaim cycle in which the cgroup did have siblings.
This is better. Thanks!
When this happens, reclaim is unnecessarily hesitant and potentially slow to meet the desired limit. In theory this could lead to premature OOM kills, although it's not obvious this has occurred in practice.
I do not see how this would lead all the way to the OOM killer, but it certainly can lead to an unnecessary increase of the reclaim priority. The smaller the difference between the reclaim target and the protection, the more visible the effect would be. But if there are reclaimable pages then the reclaim should see them sooner or later.
On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
This patch is an improvement of a previous version[1], as the previous version is not easy to understand. This issue persists in the newest kernel, I have to resend the fix. As the implementation is changed, I drop Roman's ack from the previous version.
Now that I understand the problem, I much prefer the previous version.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 745697906ce3..2bf91ae1e640 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;
 
-	if (memcg == root)
+	if (memcg == root) {
+		/*
+		 * The cgroup is the reclaim root in this reclaim
+		 * cycle, and therefore not protected. But it may have
+		 * stale effective protection values from previous
+		 * cycles in which it was not the reclaim root - for
+		 * example, global reclaim followed by limit reclaim.
+		 * Reset these values for mem_cgroup_protection().
+		 */
+		memcg->memory.emin = 0;
+		memcg->memory.elow = 0;
 		return MEMCG_PROT_NONE;
+	}
Could you be more specific why you prefer this over the mem_cgroup_protection which doesn't change the effective value? Isn't it easier to simply ignore effective value for the reclaim roots?
Because now both mem_cgroup_protection() and mem_cgroup_protected() have to know about the reclaim root semantics, instead of just the one central place.
And the query function has to know additional rules about when the emin/elow values are uptodate or it could silently be looking at stale data, which isn't very robust.
"The effective protection values are uptodate after calling mem_cgroup_protected() inside the reclaim cycle - UNLESS the group you're looking at happens to be..."
It's much easier to make the rule: The values are uptodate after you called mem_cgroup_protected().
Or mem_cgroup_calculate_protection(), if we go with that later.
As others have noted, it's fairly hard to understand the problem from the above changelog. How about the following:
A cgroup can have both memory protection and a memory limit to isolate it from its siblings in both directions - for example, to prevent it from being shrunk below 2G under high pressure from outside, but also from growing beyond 4G under low pressure.
9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") implemented proportional scan pressure so that multiple siblings in excess of their protection settings don't get reclaimed equally but instead in accordance to their unprotected portion.
During limit reclaim, this proportionality shouldn't apply of course: there is no competition, all pressure is from within the cgroup and should be applied as such. Reclaim should operate at full efficiency.
However, mem_cgroup_protected() never expected anybody to look at the effective protection values when it indicated that the cgroup is above its protection. As a result, a query during limit reclaim may return stale protection values that were calculated by a previous reclaim cycle in which the cgroup did have siblings.
This is better. Thanks!
When this happens, reclaim is unnecessarily hesitant and potentially slow to meet the desired limit. In theory this could lead to premature OOM kills, although it's not obvious this has occurred in practice.
I do not see how this would lead all the way to the OOM killer, but it certainly can lead to an unnecessary increase of the reclaim priority. The smaller the difference between the reclaim target and the protection, the more visible the effect would be. But if there are reclaimable pages then the reclaim should see them sooner or later.
It would be a pretty extreme case, but not impossible AFAICS, because OOM is just a sampled state, not deterministic.
If memory.max is 64G and memory.low is 64G minus one page, this bug could cause limit reclaim to look at no more than SWAP_CLUSTER_MAX pages at priority 0. It's possible it wouldn't get through the full 64G worth of memory before giving up and declaring OOM.
Not that that would be a sensical configuration... My point is that OOM is defined as "I've looked at X pages and found nothing" and this bug can significantly lower X.
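To put numbers on this, here is a rough userspace model of the proportional scaling in get_scan_count(): the scan target is reduced to the unprotected fraction of usage, with SWAP_CLUSTER_MAX as a floor. The formula is simplified from the real one (which also distinguishes min from low protection); 4K pages are assumed.

```c
#include <assert.h>

#define SWAP_CLUSTER_MAX 32UL

/*
 * Toy version of proportional reclaim pressure: scale the scan
 * target by the unprotected fraction of usage, but never scan
 * fewer than SWAP_CLUSTER_MAX pages so some progress is made.
 */
static unsigned long scale_scan(unsigned long scan, unsigned long usage,
				unsigned long protection)
{
	if (!protection || !usage)
		return scan;
	scan = (unsigned long)((unsigned long long)scan *
			       (usage - protection) / usage);
	return scan < SWAP_CLUSTER_MAX ? SWAP_CLUSTER_MAX : scan;
}
```

With memory.max at 64G (16777216 4K pages) and stale protection one page below it, even a priority-0 scan of the full LRU collapses to the SWAP_CLUSTER_MAX floor of 32 pages, which is the situation described above.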
On Fri 24-04-20 11:10:13, Johannes Weiner wrote:
On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
This patch is an improvement of a previous version[1], as the previous version is not easy to understand. This issue persists in the newest kernel, I have to resend the fix. As the implementation is changed, I drop Roman's ack from the previous version.
Now that I understand the problem, I much prefer the previous version.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 745697906ce3..2bf91ae1e640 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;
 
-	if (memcg == root)
+	if (memcg == root) {
+		/*
+		 * The cgroup is the reclaim root in this reclaim
+		 * cycle, and therefore not protected. But it may have
+		 * stale effective protection values from previous
+		 * cycles in which it was not the reclaim root - for
+		 * example, global reclaim followed by limit reclaim.
+		 * Reset these values for mem_cgroup_protection().
+		 */
+		memcg->memory.emin = 0;
+		memcg->memory.elow = 0;
 		return MEMCG_PROT_NONE;
+	}
Could you be more specific why you prefer this over the mem_cgroup_protection which doesn't change the effective value? Isn't it easier to simply ignore effective value for the reclaim roots?
Because now both mem_cgroup_protection() and mem_cgroup_protected() have to know about the reclaim root semantics, instead of just the one central place.
Yes this is true but it is also potentially overwriting the state with a parallel reclaim, which can lead to surprising results because the parent's effective protection is used to define the protection distribution for its children. Let's have global and A's reclaim in parallel:

	A (low = 2G, usage = 3G, max = 3G, children_low_usage = 1.5G)
	|\
	| C (low = 1G, usage = 2.5G)
	B (low = 1G, usage = 0.5G)

For A's reclaim we have
	B.elow = B.low
	C.elow = C.low

For the global reclaim
	A.elow = A.low
	B.elow = min(B.usage, B.low)	; because children_low_usage <= A.elow
	C.elow = min(C.usage, C.low)

With the effective values resetting we have, for A's reclaim,
	A.elow = 0
	B.elow = B.low
	C.elow = C.low
[...]
and global reclaim could see the above and then B.elow = C.elow = 0 because children_low_usage > A.elow
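The overwrite effect can be spot-checked with a toy model (numbers in MB; the scaling rule below is my simplification of the real proportional distribution, for illustration only): a child keeps min(low, usage) while children_low_usage fits under the parent's elow, and is scaled down otherwise.

```c
#include <assert.h>

/*
 * Toy effective-low propagation: keep min(low, usage) when the
 * siblings' protected usage fits under the parent's elow, otherwise
 * scale it by parent_elow / children_low_usage. Simplified from the
 * kernel's actual distribution.
 */
static unsigned long child_elow(unsigned long low, unsigned long usage,
				unsigned long parent_elow,
				unsigned long children_low_usage)
{
	unsigned long prot = low < usage ? low : usage;

	if (children_low_usage <= parent_elow)
		return prot;
	return (unsigned long)((unsigned long long)prot *
			       parent_elow / children_low_usage);
}
```

Once a parallel target reclaim has reset A.elow to 0, the same formula hands both children zero protection, matching the surprising result described above.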
And the query function has to know additional rules about when the emin/elow values are uptodate or it could silently be looking at stale data, which isn't very robust.
"The effective protection values are uptodate after calling mem_cgroup_protected() inside the reclaim cycle - UNLESS the group you're looking at happens to be..."
It's much easier to make the rule: The values are uptodate after you called mem_cgroup_protected().
Or mem_cgroup_calculate_protection(), if we go with that later.
As others have noted, it's fairly hard to understand the problem from the above changelog. How about the following:
A cgroup can have both memory protection and a memory limit to isolate it from its siblings in both directions - for example, to prevent it from being shrunk below 2G under high pressure from outside, but also from growing beyond 4G under low pressure.
9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") implemented proportional scan pressure so that multiple siblings in excess of their protection settings don't get reclaimed equally but instead in accordance to their unprotected portion.
During limit reclaim, this proportionality shouldn't apply of course: there is no competition, all pressure is from within the cgroup and should be applied as such. Reclaim should operate at full efficiency.
However, mem_cgroup_protected() never expected anybody to look at the effective protection values when it indicated that the cgroup is above its protection. As a result, a query during limit reclaim may return stale protection values that were calculated by a previous reclaim cycle in which the cgroup did have siblings.
This is better. Thanks!
When this happens, reclaim is unnecessarily hesitant and potentially slow to meet the desired limit. In theory this could lead to premature OOM kills, although it's not obvious this has occurred in practice.
I do not see how this would lead all the way to the OOM killer, but it certainly can lead to an unnecessary increase of the reclaim priority. The smaller the difference between the reclaim target and the protection, the more visible the effect would be. But if there are reclaimable pages then the reclaim should see them sooner or later.
It would be a pretty extreme case, but not impossible AFAICS, because OOM is just a sampled state, not deterministic.
If memory.max is 64G and memory.low is 64G minus one page, this bug could cause limit reclaim to look at no more than SWAP_CLUSTER_MAX pages at priority 0. It's possible it wouldn't get through the full 64G worth of memory before giving up and declaring OOM.
Yes, my bad I didn't really realize that there won't be a full scan even under priority 0.
On Fri, Apr 24, 2020 at 06:21:03PM +0200, Michal Hocko wrote:
On Fri 24-04-20 11:10:13, Johannes Weiner wrote:
On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
This patch is an improvement of a previous version[1], as the previous version is not easy to understand. This issue persists in the newest kernel, I have to resend the fix. As the implementation is changed, I drop Roman's ack from the previous version.
Now that I understand the problem, I much prefer the previous version.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 745697906ce3..2bf91ae1e640 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, if (!root) root = root_mem_cgroup;
- if (memcg == root)
- if (memcg == root) {
/*
* The cgroup is the reclaim root in this reclaim
* cycle, and therefore not protected. But it may have
* stale effective protection values from previous
* cycles in which it was not the reclaim root - for
* example, global reclaim followed by limit reclaim.
* Reset these values for mem_cgroup_protection().
*/
memcg->memory.emin = 0;
return MEMCG_PROT_NONE;memcg->memory.elow = 0;
- }
Could you be more specific why you prefer this over the mem_cgroup_protection which doesn't change the effective value? Isn't it easier to simply ignore effective value for the reclaim roots?
Because now both mem_cgroup_protection() and mem_cgroup_protected() have to know about the reclaim root semantics, instead of just the one central place.
Yes this is true but it is also potentially overwriting the state with a parallel reclaim which can lead to surprising results
Checking in mem_cgroup_protection() doesn't avoid the fundamental race:
root
 `- A (low=2G, elow=2G, max=3G)
     `- A1 (low=2G, elow=2G)
If A does limit reclaim while global reclaim races, the memcg == root check in mem_cgroup_protection() will reliably calculate the "right" scan value for A, which has no pages, and the wrong scan value for A1 where the memory actually is.
I'm okay with fixing the case where a really old left-over value is used by target reclaim.
I don't see a point in special casing this one instance of a fundamental race condition at the expense of less robust code.
On Fri 24-04-20 12:51:03, Johannes Weiner wrote:
On Fri, Apr 24, 2020 at 06:21:03PM +0200, Michal Hocko wrote:
On Fri 24-04-20 11:10:13, Johannes Weiner wrote:
On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
This patch is an improvement of a previous version[1], as the previous version is not easy to understand. This issue persists in the newest kernel, I have to resend the fix. As the implementation is changed, I drop Roman's ack from the previous version.
Now that I understand the problem, I much prefer the previous version.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 745697906ce3..2bf91ae1e640 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, if (!root) root = root_mem_cgroup;
- if (memcg == root)
- if (memcg == root) {
/*
* The cgroup is the reclaim root in this reclaim
* cycle, and therefore not protected. But it may have
* stale effective protection values from previous
* cycles in which it was not the reclaim root - for
* example, global reclaim followed by limit reclaim.
* Reset these values for mem_cgroup_protection().
*/
memcg->memory.emin = 0;
return MEMCG_PROT_NONE;memcg->memory.elow = 0;
- }
Could you be more specific why you prefer this over the mem_cgroup_protection which doesn't change the effective value? Isn't it easier to simply ignore effective value for the reclaim roots?
Because now both mem_cgroup_protection() and mem_cgroup_protected() have to know about the reclaim root semantics, instead of just the one central place.
Yes this is true but it is also potentially overwriting the state with a parallel reclaim which can lead to surprising results
Checking in mem_cgroup_protection() doesn't avoid the fundamental race:
root
 `- A (low=2G, elow=2G, max=3G)
     `- A1 (low=2G, elow=2G)
If A does limit reclaim while global reclaim races, the memcg == root check in mem_cgroup_protection() will reliably calculate the "right" scan value for A, which has no pages, and the wrong scan value for A1 where the memory actually is.
I am sorry but I do not see how A1 would get a wrong scan value.
- Global reclaim
  - A.elow = 2G
  - A1.elow = min(A1.low, A1.usage)	; if (A.children_low_usage < A.elow)
- A reclaim
  - A.elow = stale/undefined
  - A1.elow = A1.low
if mem_cgroup_protection returns 0 for A's reclaim targeting A (assuming the check is there) then not a big deal as there are no pages there as you say.
Let's compare the GR (global reclaim) and AR (A reclaim) cases. GR(A1.elow) <= AR(A1.elow) by definition, right? For A1.low overcommitted we have

	min(A1.low, A1.usage) * A.elow / A.children_low_usage <= min(A1.low, A1.usage)

because A.elow <= A.children_low_usage.
so in both cases we have GR(A1.elow) <= AR(A1.elow), which means that racing reclaims will behave sanely because the protection against the external pressure is not violated. A is going to reclaim A1 less than the global reclaim would, but that should be OK.
Or what do I miss?
I'm okay with fixing the case where a really old left-over value is used by target reclaim.
I don't see a point in special casing this one instance of a fundamental race condition at the expense of less robust code.
I am definitely not calling to fragment the code. I do agree that having a special case in mem_cgroup_protection is quite non-intuitive. The existing code is quite hard to reason about in its current form as we can see. If we can fix all that in mem_cgroup_protected then no objections from me at all.
On Mon, Apr 27, 2020 at 4:25 PM Michal Hocko mhocko@kernel.org wrote:
On Fri 24-04-20 12:51:03, Johannes Weiner wrote:
On Fri, Apr 24, 2020 at 06:21:03PM +0200, Michal Hocko wrote:
On Fri 24-04-20 11:10:13, Johannes Weiner wrote:
On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
> This patch is an improvement of a previous version[1], as the previous
> version is not easy to understand.
> This issue persists in the newest kernel, I have to resend the fix. As
> the implementation is changed, I drop Roman's ack from the previous
> version.
Now that I understand the problem, I much prefer the previous version.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 745697906ce3..2bf91ae1e640 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;
 
-	if (memcg == root)
+	if (memcg == root) {
+		/*
+		 * The cgroup is the reclaim root in this reclaim
+		 * cycle, and therefore not protected. But it may have
+		 * stale effective protection values from previous
+		 * cycles in which it was not the reclaim root - for
+		 * example, global reclaim followed by limit reclaim.
+		 * Reset these values for mem_cgroup_protection().
+		 */
+		memcg->memory.emin = 0;
+		memcg->memory.elow = 0;
 		return MEMCG_PROT_NONE;
+	}
Could you be more specific why you prefer this over the mem_cgroup_protection which doesn't change the effective value? Isn't it easier to simply ignore effective value for the reclaim roots?
Because now both mem_cgroup_protection() and mem_cgroup_protected() have to know about the reclaim root semantics, instead of just the one central place.
Yes this is true but it is also potentially overwriting the state with a parallel reclaim which can lead to surprising results
Checking in mem_cgroup_protection() doesn't avoid the fundamental race:
root
 `- A (low=2G, elow=2G, max=3G)
     `- A1 (low=2G, elow=2G)
If A does limit reclaim while global reclaim races, the memcg == root check in mem_cgroup_protection() will reliably calculate the "right" scan value for A, which has no pages, and the wrong scan value for A1 where the memory actually is.
I am sorry but I do not see how A1 would get wrong scan value.
Global reclaim
- A.elow = 2G
- A1.elow = min(A1.low, A1.usage) ; if (A.children_low_usage < A.elow)
A reclaim.
- A.elow = stale/undefined
- A1.elow = A1.low
if mem_cgroup_protection returns 0 for A's reclaim targeting A (assuming the check is there) then not a big deal as there are no pages there as you say.
Let's compare the GR (global reclaim) and AR (A reclaim) cases. GR(A1.elow) <= AR(A1.elow) by definition, right? For A1.low overcommitted we have

	min(A1.low, A1.usage) * A.elow / A.children_low_usage <= min(A1.low, A1.usage)

because A.elow <= A.children_low_usage.
so in both cases we have GR(A1.elow) <= AR(A1.elow), which means that racing reclaims will behave sanely because the protection against the external pressure is not violated. A is going to reclaim A1 less than the global reclaim would, but that should be OK.
Or what do I miss?
I'm okay with fixing the case where a really old left-over value is used by target reclaim.
I don't see a point in special casing this one instance of a fundamental race condition at the expense of less robust code.
I am definitely not calling to fragment the code. I do agree that having a special case in mem_cgroup_protection is quite non-intuitive. The existing code is quite hard to reason about in its current form as we can see. If we can fix all that in mem_cgroup_protected then no objections from me at all.
Hi Michal,
Please take a look at my refactoring of the proportional memcg protection[1]. I think my new patchset can fix all of that in mem_cgroup_protected and make the existing code clear.
[1]. https://lore.kernel.org/linux-mm/20200425152418.28388-1-laoar.shao@gmail.com...
On Mon, Apr 27, 2020 at 10:25:24AM +0200, Michal Hocko wrote:
On Fri 24-04-20 12:51:03, Johannes Weiner wrote:
On Fri, Apr 24, 2020 at 06:21:03PM +0200, Michal Hocko wrote:
On Fri 24-04-20 11:10:13, Johannes Weiner wrote:
On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
> This patch is an improvement of a previous version[1], as the previous
> version is not easy to understand.
> This issue persists in the newest kernel, I have to resend the fix. As
> the implementation is changed, I drop Roman's ack from the previous
> version.
Now that I understand the problem, I much prefer the previous version.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 745697906ce3..2bf91ae1e640 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;
 
-	if (memcg == root)
+	if (memcg == root) {
+		/*
+		 * The cgroup is the reclaim root in this reclaim
+		 * cycle, and therefore not protected. But it may have
+		 * stale effective protection values from previous
+		 * cycles in which it was not the reclaim root - for
+		 * example, global reclaim followed by limit reclaim.
+		 * Reset these values for mem_cgroup_protection().
+		 */
+		memcg->memory.emin = 0;
+		memcg->memory.elow = 0;
 		return MEMCG_PROT_NONE;
+	}
Could you be more specific why you prefer this over the mem_cgroup_protection which doesn't change the effective value? Isn't it easier to simply ignore effective value for the reclaim roots?
Because now both mem_cgroup_protection() and mem_cgroup_protected() have to know about the reclaim root semantics, instead of just the one central place.
Yes this is true but it is also potentially overwriting the state with a parallel reclaim which can lead to surprising results
Checking in mem_cgroup_protection() doesn't avoid the fundamental race:
root
 `- A (low=2G, elow=2G, max=3G)
     `- A1 (low=2G, elow=2G)
If A does limit reclaim while global reclaim races, the memcg == root check in mem_cgroup_protection() will reliably calculate the "right" scan value for A, which has no pages, and the wrong scan value for A1 where the memory actually is.
I am sorry but I do not see how A1 would get wrong scan value.
I mistyped the example. If we're in limit reclaim in A, elow should look like this:
root
 `- A (low=2G, max=3G -> elow=0)
     `- A1 (low=0G -> elow=0)
But if global reclaim were to kick in, it could overwrite elow to this:
root
 `- A (low=2G, max=3G -> elow=2G)
     `- A1 (low=0G -> elow=2G)
(This is with the recursive memory.low semantics, of course.)
I'm okay with fixing the case where a really old left-over value is used by target reclaim.
I don't see a point in special casing this one instance of a fundamental race condition at the expense of less robust code.
I am definitely not calling to fragment the code. I do agree that having a special case in mem_cgroup_protection is quite non-intuitive. The existing code is quite hard to reason about in its current form as we can see. If we can fix all that in mem_cgroup_protected then no objections from me at all.
Agreed, sounds reasonable.
On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
This patch is an improvement of a previous version[1], as the previous version is not easy to understand. This issue persists in the newest kernel, I have to resend the fix. As the implementation is changed, I drop Roman's ack from the previous version.
Now that I understand the problem, I much prefer the previous version.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 745697906ce3..2bf91ae1e640 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;
 
-	if (memcg == root)
+	if (memcg == root) {
+		/*
+		 * The cgroup is the reclaim root in this reclaim
+		 * cycle, and therefore not protected. But it may have
+		 * stale effective protection values from previous
+		 * cycles in which it was not the reclaim root - for
+		 * example, global reclaim followed by limit reclaim.
+		 * Reset these values for mem_cgroup_protection().
+		 */
+		memcg->memory.emin = 0;
+		memcg->memory.elow = 0;
 		return MEMCG_PROT_NONE;
+	}
Could you be more specific why you prefer this over the mem_cgroup_protection which doesn't change the effective value? Isn't it easier to simply ignore effective value for the reclaim roots?
Hm, I think I like the new version better, because it feels "safer" in terms of preserving sane effective protection values for concurrent reclaimers.
[...]
As others have noted, it's fairly hard to understand the problem from the above changelog. How about the following:
A cgroup can have both memory protection and a memory limit to isolate it from its siblings in both directions - for example, to prevent it from being shrunk below 2G under high pressure from outside, but also from growing beyond 4G under low pressure.
9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") implemented proportional scan pressure so that multiple siblings in excess of their protection settings don't get reclaimed equally but instead in accordance to their unprotected portion.
During limit reclaim, this proportionality shouldn't apply of course: there is no competition, all pressure is from within the cgroup and should be applied as such. Reclaim should operate at full efficiency.
However, mem_cgroup_protected() never expected anybody to look at the effective protection values when it indicated that the cgroup is above its protection. As a result, a query during limit reclaim may return stale protection values that were calculated by a previous reclaim cycle in which the cgroup did have siblings.
This is better. Thanks!
+1
and I like the proposed renaming/cleanup. Thanks, Johannes!
When this happens, reclaim is unnecessarily hesitant and potentially slow to meet the desired limit. In theory this could lead to premature OOM kills, although it's not obvious this has occurred in practice.
I do not see how this would lead all the way to the OOM killer, but it certainly can lead to an unnecessary increase of the reclaim priority. The smaller the difference between the reclaim target and the protection, the more visible the effect would be. But if there are reclaimable pages then the reclaim should see them sooner or later.
I guess if all memory is protected by emin, the targeted reclaim will be unable to reclaim anything and OOM can be triggered.
Btw, I wonder if this case can be covered by a new memcg kselftest? I'm not sure it can be easily reproduced, but if it can, it would be the best demonstration of the problem and the fix. Yafang, would you like to try?
Thanks!
On Sat, Apr 25, 2020 at 12:22 AM Roman Gushchin guro@fb.com wrote:
On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
This patch is an improvement of a previous version[1], as the previous version is not easy to understand. This issue persists in the newest kernel, I have to resend the fix. As the implementation is changed, I drop Roman's ack from the previous version.
Now that I understand the problem, I much prefer the previous version.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 745697906ce3..2bf91ae1e640 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;
 
-	if (memcg == root)
+	if (memcg == root) {
+		/*
+		 * The cgroup is the reclaim root in this reclaim
+		 * cycle, and therefore not protected. But it may have
+		 * stale effective protection values from previous
+		 * cycles in which it was not the reclaim root - for
+		 * example, global reclaim followed by limit reclaim.
+		 * Reset these values for mem_cgroup_protection().
+		 */
+		memcg->memory.emin = 0;
+		memcg->memory.elow = 0;
 		return MEMCG_PROT_NONE;
+	}
Could you be more specific why you prefer this over the mem_cgroup_protection which doesn't change the effective value? Isn't it easier to simply ignore effective value for the reclaim roots?
Hm, I think I like the new version better, because it feels "safer" in terms of preserving sane effective protection values for concurrent reclaimers.
[...]
As others have noted, it's fairly hard to understand the problem from the above changelog. How about the following:
A cgroup can have both memory protection and a memory limit to isolate it from its siblings in both directions - for example, to prevent it from being shrunk below 2G under high pressure from outside, but also from growing beyond 4G under low pressure.
9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") implemented proportional scan pressure so that multiple siblings in excess of their protection settings don't get reclaimed equally but instead in accordance to their unprotected portion.
During limit reclaim, this proportionality shouldn't apply of course: there is no competition, all pressure is from within the cgroup and should be applied as such. Reclaim should operate at full efficiency.
However, mem_cgroup_protected() never expected anybody to look at the effective protection values when it indicated that the cgroup is above its protection. As a result, a query during limit reclaim may return stale protection values that were calculated by a previous reclaim cycle in which the cgroup did have siblings.
This is better. Thanks!
+1
and I like the proposed renaming/cleanup. Thanks, Johannes!
When this happens, reclaim is unnecessarily hesitant and potentially slow to meet the desired limit. In theory this could lead to premature OOM kills, although it's not obvious this has occurred in practice.
I do not see how this would lead all the way to the OOM killer, but it certainly can lead to an unnecessary increase of the reclaim priority. The smaller the difference between the reclaim target and the protection, the more visible the effect would be. But if there are reclaimable pages then the reclaim should see them sooner or later.
I guess if all memory is protected by emin and the targeted reclaim will be unable to reclaim anything, OOM can be triggered.
Btw, I wonder if this case can be covered by a new memcg kselftest? I'm not sure it can be easily reproducible, but if it can, it would be the best demonstration of a problem and the fix. Yafang, don't you want to try?
I tried to reproduce the premature OOM before I sent this fix, but I found that it is really not easy to reproduce. But if a new memcg kselftest is needed, I can try it again.
On Fri, Apr 24, 2020 at 9:14 PM Johannes Weiner hannes@cmpxchg.org wrote:
On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
This patch is an improvement of a previous version[1], as the previous version was not easy to understand. The issue persists in the newest kernel, so I have to resend the fix. As the implementation has changed, I drop Roman's ack from the previous version.
Now that I understand the problem, I much prefer the previous version.
Great news that this issue is understood by one more reviewer.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 745697906ce3..2bf91ae1e640 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;

-	if (memcg == root)
+	if (memcg == root) {
+		/*
+		 * The cgroup is the reclaim root in this reclaim
+		 * cycle, and therefore not protected. But it may have
+		 * stale effective protection values from previous
+		 * cycles in which it was not the reclaim root - for
+		 * example, global reclaim followed by limit reclaim.
+		 * Reset these values for mem_cgroup_protection().
+		 */
+		memcg->memory.emin = 0;
+		memcg->memory.elow = 0;
 		return MEMG_PROT_NONE;
+	}

 	usage = page_counter_read(&memcg->memory);
 	if (!usage)
Here's the explanation of this issue. memory.{low,min} won't take effect if the to-be-reclaimed memcg is the sc->target_mem_cgroup, which can also be seen in the implementation of mem_cgroup_protected(), see below:

mem_cgroup_protected()
	if (memcg == root) [2]
		return MEMCG_PROT_NONE;
But this rule is ignored in mem_cgroup_protection(), which reads memory.{emin, elow} as the protection regardless of which memcg it is.
How does this issue happen? In mem_cgroup_protected() we forget to clear memory.{emin, elow} if the memcg is the target_mem_cgroup [2].
An example to illustrate this issue:

    root_mem_cgroup
         /
        A

    memory.max:     1024M
    memory.min:     512M
    memory.current: 800M ('current' must be greater than 'min')

Once kswapd starts to reclaim memcg A, it assigns 512M to memory.emin of A. Then kswapd stops. As a result, the memory values of A will be:

    root_mem_cgroup
         /
        A

    memory.max:     1024M
    memory.min:     512M
    memory.current: 512M (approximately)
    memory.emin:    512M
Then a new workload starts to run in memcg A, and it will trigger memcg reclaim in A soon. As memcg A is the target_mem_cgroup of this reclaimer, it returns directly without touching memory.{emin, elow} [2]. The memory values of A will be:

    root_mem_cgroup
         /
        A

    memory.max:     1024M
    memory.min:     512M
    memory.current: 1024M (approximately)
    memory.emin:    512M

Then this memory.emin will be used in mem_cgroup_protection() to get the scan count, which is obviously a wrong scan count.
[1]. https://lore.kernel.org/linux-mm/20200216145249.6900-1-laoar.shao@gmail.com/
Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
Cc: Chris Down <chris@chrisdown.name>
Cc: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
As others have noted, it's fairly hard to understand the problem from the above changelog. How about the following:
A cgroup can have both memory protection and a memory limit to isolate it from its siblings in both directions - for example, to prevent it from being shrunk below 2G under high pressure from outside, but also from growing beyond 4G under low pressure.
9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") implemented proportional scan pressure so that multiple siblings in excess of their protection settings don't get reclaimed equally but instead in accordance to their unprotected portion.
During limit reclaim, this proportionality shouldn't apply of course: there is no competition, all pressure is from within the cgroup and should be applied as such. Reclaim should operate at full efficiency.
However, mem_cgroup_protected() never expected anybody to look at the effective protection values when it indicated that the cgroup is above its protection. As a result, a query during limit reclaim may return stale protection values that were calculated by a previous reclaim cycle in which the cgroup did have siblings.
When this happens, reclaim is unnecessarily hesitant and potentially slow to meet the desired limit. In theory this could lead to premature OOM kills, although it's not obvious this has occurred in practice.
Thanks a lot for your improvement on the change log.
--
Thanks
Yafang