On Wed, 10 Oct 2018, David Rientjes wrote:
I think "madvise vs mbind" is more an issue of "no-permission vs permission" required. And if the process ends up swapping out all other processes with their memory already allocated in the node, I think it is correct to require some permission, in which case an mbind looks like a better fit. MPOL_PREFERRED also looks like a first candidate for investigation, as it's already not black and white, allows spillover, and may in fact already do the right thing if set on top of MADV_HUGEPAGE.
We would never want to thrash the local node for hugepages because there is no guarantee that any swapping is useful. On COMPACT_SKIPPED due to low memory, we have very clear evidence that pageblocks are already sufficiently fragmented by unmovable pages such that compaction itself, even with abundant free memory, fails to free an entire pageblock due to the allocator's preference to fragment pageblocks of fallback migratetypes over returning remote free memory.
As I've stated, we do not want to reclaim pointlessly when compaction is unable to access the freed memory or there is no guarantee it can free an entire pageblock. Doing so allows thrashing of the local node, or remote nodes if __GFP_THISNODE is removed, and the hugepage still cannot be allocated. If this proposed mbind() that requires permissions is geared to me as the user, I'm afraid the details of what leads to the thrashing are not well understood because I certainly would never use this.
At the risk of beating a dead horse that has already been beaten, what are the plans for this patch when the merge window opens? It would be rather unfortunate for us to start incurring a 14% increase in access latency and 40% increase in fault latency. Would it be possible to test with my patch[*] that does not try reclaim to address the thrashing issue? If that is satisfactory, I don't have a strong preference if it is done with a hardcoded pageblock_order and __GFP_NORETRY check or a new __GFP_COMPACT_ONLY flag.
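To make the "hardcoded pageblock_order and __GFP_NORETRY check" variant concrete, here is a minimal userspace sketch of the decision it implies. It is not the patch itself and not kernel code: every MODEL_* name, the flag value, the order value of 9, and the function model_skip_reclaim() are hypothetical stand-ins chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration only; these are not the
 * kernel's definitions and the values are assumptions. */
#define MODEL_GFP_NORETRY     0x1u
#define MODEL_PAGEBLOCK_ORDER 9u   /* assumes 2MB pageblocks with 4KB pages */

/*
 * Returns true when the allocation should fail without attempting
 * reclaim: the request is at least pageblock-sized, the caller asked
 * not to retry hard, and compaction has already failed, so freeing
 * memory is not expected to make a whole pageblock available.
 */
static bool model_skip_reclaim(unsigned int gfp_flags, unsigned int order,
                               bool compaction_failed)
{
    if (!(gfp_flags & MODEL_GFP_NORETRY))
        return false;   /* caller is willing to work harder */
    if (order < MODEL_PAGEBLOCK_ORDER)
        return false;   /* smaller orders may still benefit from reclaim */
    return compaction_failed;
}
```

A __GFP_COMPACT_ONLY flag would express the same decision through a dedicated bit supplied by the caller rather than inferring it from the order and __GFP_NORETRY.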
I think the second issue of faulting remote thp by removing __GFP_THISNODE needs supporting evidence that shows some platforms benefit from this (and not with numa=fake on the command line :).