On Mon 29-10-18 20:42:53, Balbir Singh wrote:
On Mon, Oct 29, 2018 at 10:00:35AM +0100, Michal Hocko wrote:
[...]
These hugetlb allocations might be disruptive, and that is expected behavior because this is an explicit request from an admin to pre-allocate large pages for future use. __GFP_RETRY_MAYFAIL just underlines that requirement.
Yes, but when no particular node is specified, for example when the request comes via sysctl (as compaction does), I don't think it is a hard requirement to get a page from a particular node.
Again, this seems like a deliberate decision. You want your distribution to be as even as possible; otherwise the NUMA placement will be much less deterministic. At least that was the case for a long time. If you have different per-node preferences, just use NUMA-aware pre-allocation.
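For illustration, a minimal sketch of what NUMA-aware pre-allocation can look like through the per-node sysfs files, assuming a 2048 kB huge page size and the usual /sys/devices/system/node layout. The helper names nr_hugepages_path() and set_node_hugepages() are hypothetical and only serve the example.

```python
from pathlib import Path

# Standard location of the per-node sysfs attributes.
SYSFS_NODE_ROOT = Path("/sys/devices/system/node")


def nr_hugepages_path(node, page_kb=2048, root=SYSFS_NODE_ROOT):
    """Build the path of the per-node nr_hugepages file (hypothetical helper)."""
    return (root / f"node{node}" / "hugepages"
            / f"hugepages-{page_kb}kB" / "nr_hugepages")


def set_node_hugepages(node, count, page_kb=2048, root=SYSFS_NODE_ROOT):
    """Request `count` huge pages on one node and return the value read back.

    The value read back may be lower than `count` if the node could not
    satisfy the whole request.
    """
    path = nr_hugepages_path(node, page_kb, root)
    path.write_text(f"{count}\n")
    return int(path.read_text())
```

Writing to the real sysfs files requires root, e.g. set_node_hugepages(1, 512) to ask for 512 pages on node 1. The read-back is there because the pool may end up smaller than requested.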
I agree we need __GFP_RETRY_MAYFAIL. In any case, the real root cause for me is should_continue_reclaim(), which keeps the task looping without making forward progress.
This seems like a separate issue that would be better debugged on its own. Please open a new thread describing the problem and the state of the node.