On Tue, 9 Oct 2018, Andrea Arcangeli wrote:
I think "madvise vs mbind" is more an issue of "no permission vs permission" required. And if the process ends up swapping out all other processes with their memory already allocated in the node, I think it is correct to require some permission, in which case an mbind looks like a better fit. MPOL_PREFERRED also looks like a first candidate for investigation, as it's already not black and white and allows spillover, and may in fact already do the right thing if set on top of MADV_HUGEPAGE.
We would never want to thrash the local node for hugepages because there is no guarantee that any swapping is useful. On COMPACT_SKIPPED due to low memory, we have very clear evidence that pageblocks are already sufficiently fragmented by unmovable pages such that compaction itself, even with abundant free memory, fails to free an entire pageblock due to the allocator's preference to fragment pageblocks of fallback migratetypes over returning remote free memory.
As I've stated, we do not want to reclaim pointlessly when compaction is unable to access the freed memory or there is no guarantee it can free an entire pageblock. Doing so allows thrashing of the local node, or of remote nodes if __GFP_THISNODE is removed, and the hugepage still cannot be allocated. If this proposed mbind() that requires permissions is geared toward me as the user, I'm afraid the details of what leads to the thrashing are not well understood, because I certainly would never use this.
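
For concreteness, the userspace side of the "MPOL_PREFERRED on top of MADV_HUGEPAGE" combination quoted above would look roughly like the sketch below. This is only an illustration of the existing madvise() and mbind() calls, not the proposed permission-gated mbind() mode, and it says nothing about how the kernel reclaims or compacts underneath. The helper name prefer_node_with_thp() and REGION_SIZE are made up for the example; MPOL_PREFERRED is defined locally with the value from <linux/mempolicy.h> and mbind is invoked through syscall() so that libnuma is not needed.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Value from the kernel uapi header <linux/mempolicy.h>, repeated here
 * so the sketch does not depend on libnuma's <numaif.h>. */
#ifndef MPOL_PREFERRED
#define MPOL_PREFERRED 1
#endif

/* Hypothetical size for a caller's mapping: 4MB, i.e. room for two 2MB
 * hugepages on x86-64. */
#define REGION_SIZE (4UL << 20)

/*
 * Hypothetical helper: ask for THP on [addr, addr + len) and prefer, but
 * do not require, memory from 'node'. addr must be page aligned.
 * Returns 0 on success or a negative errno value.
 */
static int prefer_node_with_thp(void *addr, size_t len, unsigned int node)
{
	unsigned long nodemask;

	/* This sketch uses a single-word nodemask only. */
	if (node >= 8 * sizeof(nodemask))
		return -EINVAL;
	nodemask = 1UL << node;

	/* Opt the range in to transparent hugepages. */
	if (madvise(addr, len, MADV_HUGEPAGE) != 0)
		return -errno;

	/* MPOL_PREFERRED allows spillover to other nodes. The kernel
	 * ignores the top bit of maxnode, hence the + 1. */
	if (syscall(SYS_mbind, (unsigned long)addr, (unsigned long)len,
		    (unsigned long)MPOL_PREFERRED, &nodemask,
		    8 * sizeof(nodemask) + 1, 0UL) != 0)
		return -errno;

	return 0;
}
```

A caller would mmap() an anonymous region of REGION_SIZE and pass it in before first touching the memory. Both calls can legitimately fail (THP not configured, non-NUMA kernel, mbind filtered by a sandbox), so the return value has to be checked rather than assumed.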