When setting the low and high watermarks we use min_wmark_pages(zone). I guess this is to reduce the line length. But we forgot that this macro includes zone->watermark_boost. We need to reset zone->watermark_boost first. Otherwise the watermarks will be set inconsistently.
E.g. this could cause inconsistent values if the watermarks have been boosted, and then you change a sysctl which triggers __setup_per_zone_wmarks().
I strongly suspect this explains why I have seen slightly high watermarks. Suspicious-looking zoneinfo below - notice high-low != low-min.
Node 0, zone Normal pages free 74597 min 9582 low 34505 high 36900
https://unix.stackexchange.com/questions/525674/my-low-and-high-watermarks-s...
Signed-off-by: Alan Jenkins alan.christopher.jenkins@gmail.com Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Cc: stable@vger.kernel.org ---
Tested by compiler :-).
Ideally the commit message would be clear about what happens the *first* time __setup_per_zone_watermarks() is called. I guess that zone->watermark_boost is *usually* zero, or we would have noticed some wild problems :-). However I am not familiar with how the zone structures are allocated & initialized. Maybe there is a case where zone->watermark_boost could contain an arbitrary unitialized value at this point. Can we rule that out?
mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c02cff1ed56e..db9758cda6f8 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7606,9 +7606,9 @@ static void __setup_per_zone_wmarks(void) mult_frac(zone_managed_pages(zone), watermark_scale_factor, 10000));
+ zone->watermark_boost = 0; zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp; zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2; - zone->watermark_boost = 0;
spin_unlock_irqrestore(&zone->lock, flags); }
On 6/21/19 1:43 PM, Alan Jenkins wrote:
When setting the low and high watermarks we use min_wmark_pages(zone). I guess this is to reduce the line length. But we forgot that this macro includes zone->watermark_boost. We need to reset zone->watermark_boost first. Otherwise the watermarks will be set inconsistently.
E.g. this could cause inconsistent values if the watermarks have been boosted, and then you change a sysctl which triggers __setup_per_zone_wmarks().
I strongly suspect this explains why I have seen slightly high watermarks. Suspicious-looking zoneinfo below - notice high-low != low-min.
Node 0, zone Normal pages free 74597 min 9582 low 34505 high 36900
https://unix.stackexchange.com/questions/525674/my-low-and-high-watermarks-s...
Signed-off-by: Alan Jenkins alan.christopher.jenkins@gmail.com Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Cc: stable@vger.kernel.org
Nice catch, thanks!
Acked-by: Vlastimil Babka vbabka@suse.cz
Personally I would implement it a bit differently, see below. If you agree, it's fine if you keep the authorship of the whole patch.
Tested by compiler :-).
Ideally the commit message would be clear about what happens the *first* time __setup_per_zone_watermarks() is called. I guess that zone->watermark_boost is *usually* zero, or we would have noticed some wild problems :-). However I am not familiar with how the zone structures are allocated & initialized. Maybe there is a case where zone->watermark_boost could contain an arbitrary unitialized value at this point. Can we rule that out?
Dunno if there's some arch override, but generic_alloc_nodedata() uses kzalloc() so it's zeroed.
-----8<----- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d66bc8abe0af..3b2f0cedf78e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7624,6 +7624,7 @@ static void __setup_per_zone_wmarks(void)
for_each_zone(zone) { u64 tmp; + unsigned long wmark_min;
spin_lock_irqsave(&zone->lock, flags); tmp = (u64)pages_min * zone_managed_pages(zone); @@ -7642,13 +7643,13 @@ static void __setup_per_zone_wmarks(void)
min_pages = zone_managed_pages(zone) / 1024; min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL); - zone->_watermark[WMARK_MIN] = min_pages; + wmark_min = min_pages; } else { /* * If it's a lowmem zone, reserve a number of pages * proportionate to the zone's size. */ - zone->_watermark[WMARK_MIN] = tmp; + wmark_min = tmp; }
/* @@ -7660,8 +7661,9 @@ static void __setup_per_zone_wmarks(void) mult_frac(zone_managed_pages(zone), watermark_scale_factor, 10000));
- zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp; - zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2; + zone->_watermark[WMARK_MIN] = wmark_min; + zone->_watermark[WMARK_LOW] = wmark_min + tmp; + zone->_watermark[WMARK_HIGH] = wmark_min + tmp * 2; zone->watermark_boost = 0;
spin_unlock_irqrestore(&zone->lock, flags);
On Fri, Jun 21, 2019 at 02:09:31PM +0200, Vlastimil Babka wrote:
On 6/21/19 1:43 PM, Alan Jenkins wrote:
When setting the low and high watermarks we use min_wmark_pages(zone). I guess this is to reduce the line length. But we forgot that this macro includes zone->watermark_boost. We need to reset zone->watermark_boost first. Otherwise the watermarks will be set inconsistently.
E.g. this could cause inconsistent values if the watermarks have been boosted, and then you change a sysctl which triggers __setup_per_zone_wmarks().
I strongly suspect this explains why I have seen slightly high watermarks. Suspicious-looking zoneinfo below - notice high-low != low-min.
Node 0, zone Normal pages free 74597 min 9582 low 34505 high 36900
https://unix.stackexchange.com/questions/525674/my-low-and-high-watermarks-s...
Signed-off-by: Alan Jenkins alan.christopher.jenkins@gmail.com Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Cc: stable@vger.kernel.org
Nice catch, thanks!
Acked-by: Vlastimil Babka vbabka@suse.cz
Personally I would implement it a bit differently, see below. If you agree, it's fine if you keep the authorship of the whole patch.
Tested by compiler :-).
Ideally the commit message would be clear about what happens the *first* time __setup_per_zone_watermarks() is called. I guess that zone->watermark_boost is *usually* zero, or we would have noticed some wild problems :-). However I am not familiar with how the zone structures are allocated & initialized. Maybe there is a case where zone->watermark_boost could contain an arbitrary unitialized value at this point. Can we rule that out?
Dunno if there's some arch override, but generic_alloc_nodedata() uses kzalloc() so it's zeroed.
-----8<----- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d66bc8abe0af..3b2f0cedf78e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7624,6 +7624,7 @@ static void __setup_per_zone_wmarks(void) for_each_zone(zone) { u64 tmp;
unsigned long wmark_min;
spin_lock_irqsave(&zone->lock, flags); tmp = (u64)pages_min * zone_managed_pages(zone); @@ -7642,13 +7643,13 @@ static void __setup_per_zone_wmarks(void) min_pages = zone_managed_pages(zone) / 1024; min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
zone->_watermark[WMARK_MIN] = min_pages;
} else { /* * If it's a lowmem zone, reserve a number of pages * proportionate to the zone's size. */wmark_min = min_pages;
zone->_watermark[WMARK_MIN] = tmp;
}wmark_min = tmp;
/* @@ -7660,8 +7661,9 @@ static void __setup_per_zone_wmarks(void) mult_frac(zone_managed_pages(zone), watermark_scale_factor, 10000));
zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp;
zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
zone->_watermark[WMARK_MIN] = wmark_min;
zone->_watermark[WMARK_LOW] = wmark_min + tmp;
zone->watermark_boost = 0;zone->_watermark[WMARK_HIGH] = wmark_min + tmp * 2;
Do you think this could cause a race condition between __setup_per_zone_wmarks and pgdat_watermark_boosted which checks whether the watermark_boost of each zone is non-zero? pgdat_watermark_boosted is not called with a zone lock. Here is a probable case scenario: watermarks are boosted in steal_suitable_fallback(which happens under a zone lock). After that kswapd is woken up by wakeup_kswapd(zone,0,0,zone_idx(zone)) in rmqueue without holding a zone lock. Lets say someone modified min_kfree_bytes, this would lead to all the zone->watermark_boost being set to 0. This may cause pgdat_watermark_boosted to return false, which would not wakeup kswapd as intended by boosting the watermark. This behaviour is similar to waking up kswapd for a balanced node.
Also if kswapd was woken up successfully because of watermarks being boosted. In balance_pgdat, we use nr_boost_reclaim to count number of pages to reclaim because of boosting. nr_boost_reclaim is calculated as: nr_boost_reclaim = 0; for (i = 0; i <= classzone_idx; i++) { zone = pgdat->node_zones + i; if (!managed_zone(zone)) continue;
nr_boost_reclaim += zone->watermark_boost; zone_boosts[i] = zone->watermark_boost; } boosted = nr_boost_reclaim;
This is not under a zone_lock. This could lead to nr_boost_reclaim to be 0 if min_kfree_bytes is set to 0. Which would wake up kcompactd without reclaiming memory. kcompactd compaction might be spurious if the if the memory reclaim step is not happening?
Any thoughts?
spin_unlock_irqrestore(&zone->lock, flags);
On 6/21/19 4:07 PM, Bharath Vedartham wrote:
Do you think this could cause a race condition between __setup_per_zone_wmarks and pgdat_watermark_boosted which checks whether the watermark_boost of each zone is non-zero? pgdat_watermark_boosted is not called with a zone lock. Here is a probable case scenario: watermarks are boosted in steal_suitable_fallback(which happens under a zone lock). After that kswapd is woken up by wakeup_kswapd(zone,0,0,zone_idx(zone)) in rmqueue without holding a zone lock. Lets say someone modified min_kfree_bytes, this would lead to all the zone->watermark_boost being set to 0. This may cause pgdat_watermark_boosted to return false, which would not wakeup kswapd as intended by boosting the watermark. This behaviour is similar to waking up kswapd for a balanced node.
Not waking up kswapd shouldn't cause a significant trouble.
Also if kswapd was woken up successfully because of watermarks being boosted. In balance_pgdat, we use nr_boost_reclaim to count number of pages to reclaim because of boosting. nr_boost_reclaim is calculated as: nr_boost_reclaim = 0; for (i = 0; i <= classzone_idx; i++) { zone = pgdat->node_zones + i; if (!managed_zone(zone)) continue;
nr_boost_reclaim += zone->watermark_boost; zone_boosts[i] = zone->watermark_boost; } boosted = nr_boost_reclaim;
This is not under a zone_lock. This could lead to nr_boost_reclaim to be 0 if min_kfree_bytes is set to 0. Which would wake up kcompactd without reclaiming memory.
Setting min_kfree_bytes to 0 is asking for problems regardless of this check. Much more trouble than waking up kcompactd spuriously, which is just a few wasted cpu cycles.
kcompactd compaction might be spurious if the if the memory reclaim step is not happening?
Any thoughts?
Unless the races cause either some data corruption, or e.g. spurious allocation failures, I don't think they are worth adding new spinlock sections.
Thanks, Vlastimil
spin_unlock_irqrestore(&zone->lock, flags);
On Fri, Jun 21, 2019 at 12:43:25PM +0100, Alan Jenkins wrote:
When setting the low and high watermarks we use min_wmark_pages(zone). I guess this is to reduce the line length. But we forgot that this macro includes zone->watermark_boost. We need to reset zone->watermark_boost first. Otherwise the watermarks will be set inconsistently.
E.g. this could cause inconsistent values if the watermarks have been boosted, and then you change a sysctl which triggers __setup_per_zone_wmarks().
I strongly suspect this explains why I have seen slightly high watermarks. Suspicious-looking zoneinfo below - notice high-low != low-min.
Node 0, zone Normal pages free 74597 min 9582 low 34505 high 36900
https://unix.stackexchange.com/questions/525674/my-low-and-high-watermarks-s...
Signed-off-by: Alan Jenkins alan.christopher.jenkins@gmail.com Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Cc: stable@vger.kernel.org
Either way
Acked-by: Mel Gorman mgorman@techsingularity.net
linux-stable-mirror@lists.linaro.org