On Wed, Feb 08, 2012 at 04:14:46PM +0100, Marek Szyprowski wrote:
<SNIP>

+static int __reclaim_pages(struct zone *zone, gfp_t gfp_mask, int count)
+{
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	struct zonelist *zonelist = node_zonelist(0, gfp_mask);
+	int did_some_progress = 0;
+	int order = 1;
+	unsigned long watermark;
+
+	/*
+	 * Increase level of watermarks to force kswapd do his job
+	 * to stabilize at new watermark level.
+	 */
+	min_free_kbytes += count * PAGE_SIZE / 1024;
There is a risk of overflow here, although it is incredibly small, because the intermediate count * PAGE_SIZE can wrap before the division. A potentially nicer way of writing this would be
count << (PAGE_SHIFT - 10)
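For illustration, with 4K pages both forms add count * 4 to min_free_kbytes, but the shift avoids the intermediate product (a sketch only):

	/*
	 * count * PAGE_SIZE / 1024 equals count << (PAGE_SHIFT - 10),
	 * but on 32-bit the intermediate count * PAGE_SIZE can wrap an
	 * unsigned long once count approaches a million 4K pages; the
	 * shift computes the same number of kilobytes directly.
	 */
	min_free_kbytes += count << (PAGE_SHIFT - 10);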
+	setup_per_zone_wmarks();
Nothing prevents two or more processes updating the wmarks at the same time, which is racy and unpredictable. Today it is not much of a problem, but CMA makes this path hotter than it was, and you may see weirdness if two processes update the watermarks at the same time. Swap-over-NFS actually starts with a patch that serialises setup_per_zone_wmarks(); a sketch of that follows.
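For reference, a minimal sketch of that serialisation, with the existing body moved into a local helper (the lock and helper names here are assumed, not taken from that series):

	/* Serialise watermark updates against each other. */
	static DEFINE_MUTEX(wmarks_mutex);

	static void __setup_per_zone_wmarks(void)
	{
		/* ... existing watermark calculation ... */
	}

	void setup_per_zone_wmarks(void)
	{
		mutex_lock(&wmarks_mutex);
		__setup_per_zone_wmarks();
		mutex_unlock(&wmarks_mutex);
	}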
You also potentially have a BIG problem here if this happens:

	min_free_kbytes = 32768
	Process a: min_free_kbytes += 65536
	Process a: start direct reclaim
	           echo 16374 > /proc/sys/vm/min_free_kbytes
	Process a: exit direct reclaim
	Process a: min_free_kbytes -= 65536

min_free_kbytes now wraps negative (16374 - 65536 = -49162) and the machine hangs.
The damage is confined to CMA though, so I am not going to lose sleep over it, but you might want to consider at least preventing parallel updates to min_free_kbytes from proc; one way to do that is sketched below.
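A sketch, reusing the wmarks_mutex and __setup_per_zone_wmarks() names from the sketch above (the handler shape follows page_alloc.c of that era, but treat this as an assumption, not the merged fix):

	int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
		void __user *buffer, size_t *length, loff_t *ppos)
	{
		/*
		 * Hold the lock across both the write to min_free_kbytes
		 * and the watermark rebuild so a concurrent writer cannot
		 * leave the watermarks computed from a half-updated value.
		 */
		mutex_lock(&wmarks_mutex);
		proc_dointvec(table, write, buffer, length, ppos);
		if (write)
			__setup_per_zone_wmarks();
		mutex_unlock(&wmarks_mutex);
		return 0;
	}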
Right. This approach was definitely too hacky. What do you think about replacing it with the following code? (I assume the setup_per_zone_wmarks() serialization patch will be merged anyway, so I skipped it here.)
It's part of a larger series and the rest of that series is controversial. That single patch can be split out obviously so feel free to add it to your series and stick your Signed-off-by on the end of it.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 82f4fa5..bb9ae41 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -371,6 +371,13 @@ struct zone {
 	/* see spanned/present_pages for more description */
 	seqlock_t		span_seqlock;
 #endif
+#ifdef CONFIG_CMA
+	/*
+	 * CMA needs to increase watermark levels during the allocation
+	 * process to make sure that the system is not starved.
+	 */
+	unsigned long		min_cma_pages;
+#endif
 	struct free_area	free_area[MAX_ORDER];
 
 #ifndef CONFIG_SPARSEMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 824fb37..1ca52f0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5044,6 +5044,11 @@ void setup_per_zone_wmarks(void)
 		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
 		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
+#ifdef CONFIG_CMA
+		zone->watermark[WMARK_MIN] += zone->min_cma_pages;
+		zone->watermark[WMARK_LOW] += zone->min_cma_pages;
+		zone->watermark[WMARK_HIGH] += zone->min_cma_pages;
+#endif
 		setup_zone_migrate_reserve(zone);
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
This is better in that it is not vulnerable to parallel updates of min_free_kbytes. It would be slightly tidier to introduce something like cma_wmark_pages() that returns min_cma_pages if CONFIG_CMA and 0 otherwise. Use the helper to get rid of this ifdef CONFIG_CMA within setup_per_zone_wmarks(); a sketch is below.
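A minimal sketch of that helper, under the assumption it lives next to setup_per_zone_wmarks() (only the name comes from this review):

	#ifdef CONFIG_CMA
	/* Extra watermark pages reserved so CMA allocations do not
	 * starve the zone. */
	static unsigned long cma_wmark_pages(struct zone *zone)
	{
		return zone->min_cma_pages;
	}
	#else
	static inline unsigned long cma_wmark_pages(struct zone *zone)
	{
		return 0;
	}
	#endif

setup_per_zone_wmarks() can then drop the ifdef:

	zone->watermark[WMARK_MIN]  += cma_wmark_pages(zone);
	zone->watermark[WMARK_LOW]  += cma_wmark_pages(zone);
	zone->watermark[WMARK_HIGH] += cma_wmark_pages(zone);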
You'll still have the problem of kswapd not taking CMA pages properly into account when deciding whether to reclaim or not though.