On Tue, Oct 09, 2018 at 03:17:30PM -0700, David Rientjes wrote:
causes workloads to severely regress both in fault and access latency when we know that direct reclaim is unlikely to make direct compaction free an entire pageblock. It's more likely than not that the reclaim was pointless and the allocation will still fail.
How do you know that? If all RAM is full of filesystem cache, but it's not heavily fragmented by slab or other unmovable objects, compaction will succeed every single time after reclaim frees 2M of cache like it's asked to do.
reclaim succeeds every time, compaction then succeeds every time.
Not doing reclaim after COMPACT_SKIPPED is returned simply makes compaction unable to compact memory once all nodes are filled by filesystem cache.
Certainly it's better not to invoke reclaim at all if __GFP_THISNODE is set, than swapping out heavy over the local node. Doing so however has the drawback of reducing the direct compaction effectiveness. I don't think it's true that reclaim is generally "pointless", it's just that invoking any reclaim backfired so bad if __GFP_THISNODE was set, than anything else (including weakining compaction effectiveness) was better.
If memory compaction were patched such that it can report that it could successfully free a page of the specified order if there were free pages at the end of the zone it could migrate to, reclaim might be helpful. But with the current implementation, I don't think that is reliably possible. These free pages could easily be skipped over by the migration scanner because of the presence of slab pages, for example, and unavailable to the freeing scanner.
Yes there's one case where reclaim is "pointless", but it happens once and then COMPACT_DEFERRED is returned and __GFP_NORETRY will skip reclaim then.
So you're right when we hit fragmentation there's one and only one "pointless" reclaim invocation. And immediately after we also exponentially backoff on the compaction invocations with the compaction deferred logic.
We could try optimize away such "pointless" reclaim event for sure, but it's probably an optimization that may just get lost in the noise and may not be measurable, because it only happens once when the first full fragmentation is encountered.
I really think that for this patch to be merged over my proposed change that it needs to be clearly demonstrated that reclaim was successful in that it freed memory that was subsequently available to the compaction freeing scanner and that enabled entire pageblocks to become free. That, in my experience, will very very seldom be successful because of internal slab fragmentation, compaction_alloc() cannot soak up pages from the reclaimed memory, and potentially thrash the zone completely pointlessly. The last point is the problem being reported here, but the other two are as legitimate.
I think the demonstration can already be inferred, because if hit full memory fragmentation after every reclaim, __GFP_NORETRY would have solved the "pathological THP allocation behavior" without requiring your change to __GFP_NORETRY that makes it behave like __GFP_COMPACT_ONLY for order == HPAGE_PMD_ORDER.
Anyway you can add a few statistic counters and verify in more accurate way how often a COMPACT_SKIPPED + reclaim cycle is followed by a COMPACT_DEFERRED.
I'd appreciate if Andrea can test this patch, have a rebuttal that we should still remove __GFP_THISNODE because we don't care about locality as much as forming a hugepage, we can make that change, and then merge this instead of causing such massive fault and access latencies.
I can certainly test, but from source review I'm already convinced it'll solve fine the "pathological THP allocation behavior", no argument about that. It's certainly better and more correct your patch than the current upstream (no security issues with lack of permissions for __GFP_THISNODE anymore either).
I expect your patch will run 100% equivalent to __GFP_COMPACT_ONLY alternative I posted, for our testcase that hit into the "pathological THP allocation behavior".
Your patch encodes __GFP_COMPACT_ONLY into the __GFP_NORETRY semantics and hardcodes the __GFP_COMPACT_ONLY for all orders = HPAGE_PMD_SIZE no matter which is the caller.
As opposed I let the caller choose and left __GFP_NORETRY semantics alone and orthogonal to the __GFP_COMPACT_ONLY semantics. I think letting the caller decide instead of hardcoding it for order 9 is better, because __GFP_COMPACT_ONLY made sense to be set only if __GFP_THISNODE was also set by the caller.
If a driver does an order 9 allocation with __GFP_THISNODE not set, your patch will prevent it to allocate remote THP if all remote nodes are full of cache (which is a reasonable common assumption as more THP are allocated over time eating in all free memory). My patch didn't alter that so I tend to prefer the __GFP_COMPACT_ONLY than the hardcoding __GFP_COMPACT_ONLY for all order 9 allocations regardless if __GFP_THISNODE is set or not.
Last but not the last, from another point of view, I thought calling remote compaction was a feature especially for MADV_HUGEPAGE. However if avoiding the 40% increase latency for MADV_HUGEPAGE was the primary motivation for __GFP_THISNODE, not sure how we can say that such an high allocation latency is only a concern for order 9 allocations and all other costly_order allocations (potentially with orders even higher than HPAGE_PMD_ORDER in fact) are ok to keep calling remote compaction and incur in the 40% higher latency.
Thanks, Andrea