On Mon, Oct 29, 2018 at 10:00:35AM +0100, Michal Hocko wrote:
> On Mon 29-10-18 16:17:52, Balbir Singh wrote:
> [...]
> > I wonder if alloc_pool_huge_page() should also trim out its logic
> > of __GFP_THISNODE for the same reasons as mentioned here. I like
> > that we round robin to alloc the pool pages, but __GFP_THISNODE
> > might be an overkill for that case as well.
>
> alloc_pool_huge_page uses __GFP_THISNODE for a different reason than
> THP. We really do want to allocate for a per-node pool. THP can fall
> back or use a different node.
Not really
static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
...
	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
...
	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
		if (page)
			break;
	}
The code just tries to distribute the pool pages round robin across
the allowed nodes.
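To make the suggestion concrete, here is a rough sketch of the trimmed
version (untested, and assuming alloc_fresh_huge_page() falls back
within nodes_allowed once __GFP_THISNODE is gone):

	gfp_t gfp_mask = htlb_alloc_mask(h);	/* __GFP_THISNODE dropped */

	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
		/*
		 * Still start at the round-robin node so the pool stays
		 * spread out, but let the allocator fall back within
		 * nodes_allowed instead of failing outright when the
		 * preferred node has no contiguous memory left.
		 */
		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
		if (page)
			break;
	}

The trade-off is that a page can then land on a node other than the
round-robin target, which the per-node pool accounting would have to
tolerate.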
> These hugetlb allocations might be disruptive and that is an expected
> behavior because this is an explicit requirement from an admin to
> pre-allocate large pages for the future use. __GFP_RETRY_MAYFAIL just
> underlines that requirement.
Yes, but in the absence of a particular node, for example via the
sysctl interface (as compaction does), I don't think it is a hard
requirement to get a page from a particular node. I agree we need
__GFP_RETRY_MAYFAIL; in any case, the real root cause for me is
should_continue_reclaim(), which keeps the task looping without making
forward progress.
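For reference, the gist of that check as I read mm/vmscan.c (a
simplified sketch, not the exact source): shrink_node() keeps
rescanning while the predicate below returns true, and for
__GFP_RETRY_MAYFAIL it is deliberately generous:

static bool should_continue_reclaim_sketch(gfp_t gfp_mask,
					   unsigned long nr_reclaimed,
					   unsigned long nr_scanned)
{
	if (gfp_mask & __GFP_RETRY_MAYFAIL) {
		/*
		 * Give up only once a full pass scanned nothing at all;
		 * as long as pages are being scanned this keeps
		 * returning true even if nothing is reclaimed, which is
		 * how the task can loop without making forward
		 * progress.
		 */
		return nr_reclaimed || nr_scanned;
	}
	/* otherwise stop as soon as a scan batch reclaims nothing */
	return nr_reclaimed;
}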
Using __GFP_THISNODE was again an example of misleading the allocator
in this case, IMHO.
Maybe the compaction logic could be improved as well; that might be a
shared goal with future changes, though.

I'll also send my RFC once my testing is done, assuming I can get the
problem to reproduce with the desired frequency.
Balbir Singh.