On Mon, Oct 08, 2018 at 01:41:09PM -0700, David Rientjes wrote:
On Fri, 5 Oct 2018, Andrea Arcangeli wrote:
I tried to add just __GFP_NORETRY but it changes nothing. Try it yourself if you think that can resolve the swap storm and excessive reclaim CPU overhead... and see if it works. I didn't intend to reinvent the wheel with __GFP_COMPACT_ONLY, if __GFP_NORETRY would have worked. I tried adding __GFP_NORETRY first of course.
Reason why it doesn't help is: compaction fails because not enough free RAM, reclaim is invoked, compaction succeeds, THP is allocated to your lib user, compaction fails because not enough free RAM, reclaim is invoked etc.. compact_result is not COMPACT_DEFERRED, but COMPACT_SKIPPED.
See the part "reclaim is invoked" (with __GFP_THISNODE), is enough to still create the same heavy swap storm and unfairly penalize all apps with memory allocated in the local node like if your library had actually the kernel privilege to run mbind or mlock, which is not ok.
Only __GFP_COMPACT_ONLY truly can avoid reclaim, the moment reclaim can run with __GFP_THISNODE set, all bets are off and we're back to square one, no difference (at best marginal difference) with __GFP_NORETRY being set.
The page allocator is expecting __GFP_NORETRY for thp allocations per its comment:
/* * Checks for costly allocations with __GFP_NORETRY, which * includes THP page fault allocations */ if (costly_order && (gfp_mask & __GFP_NORETRY)) {
And that enables us to check compact_result to determine whether thp allocation should fail or continue to reclaim. I don't think it helps that some thp fault allocations use __GFP_NORETRY and others do not. I think that deserves a fix to alloc_hugepage_direct_gfpmask() or GFP_TRANSHUGE_LIGHT.
I am concerned that we may be trying to deal with this in terms of right and wrong when the range of workloads and their preferred semantics force us into a grey area.
The THP user without __GFP_NORETRY is "always"
always means that an application requesting THP will stall on allocation failure and directly reclaim pages and compact memory in an effort to allocate a THP immediately. This may be desirable for virtual machines that benefit heavily from THP use and are willing to delay the VM start to utilise them.
Removing __GFP_NORETRY in this instance is due to the tuning hinting that direct reclaim is prefectly acceptable. __GFP_NORETRY means that the kernel will not compact memory after a recent failure but if the caller is willing to reclaim memory, then it follows that retrying compaction is justified.
Is this the correct thing to do in 100% of cases? Probably not, there will be corner cases but it also does not necessarily follow that __GFP_NORETRY should be univeral.
What is missing here is an agreed upon set of reference test cases that can be used for the basis of evaluating patches like this. They should be somewhat representative of the target applications of virtual memory initialisation (forget about runtime at the moment as that is stacking problems), a simulator of the google workload and library and my test case of simply referencing an amount of memory larger than one node. That would cover the current discussion at least but more would be needed later. Otherwise we're going to endlessly whack-a-mole fixing one workload and hurting another. It might be overkill but otherwise this discussion risks going in circles.
The previous reference cases were ones that focused on either THP allocation success rates or the benefit of THP itself, neither of which are particularly useful in the current context.
Our library that uses MADV_HUGEPAGE only invokes direct compaction because we're on an older kernel: it does not attempt to do reclaim to make compaction happy so that it finds memory that it can migrate memory to. For reference, we use defrag setting of "defer+madvise". Everybody who does not use MADV_HUGEPAGE kicks off kswapd/kcompactd and fails, MADV_HUGEPAGE users do the same but also try direct compaction. That direct compaction uses a mode of MIGRATE_ASYNC so it normally fails because of need_resched() or spinlock contention.
These allocations always fail, MADV_HUGEPAGE or otherwise, without invoking direct reclaim.
I am agreeing with both you and Mel that it makes no sense to thrash the local node to make compaction happy and then hugepage-order memory available. I'm only differing with you on the mechanism to fail early: we never want to do attempt reclaim on thp allocations specifically because it leads to the behavior you are addressing.
My contention is that removing __GFP_THISNODE papers over the problem, especially in cases where remote memory is also fragmnented. It incurs a much higher (40%) fault latency and then incurs 13.9% greater access latency. It is not a good result, at least for Haswell, Naples, and Rome.
Ok, this is a good point. MADV_HUGEPAGE is hinting that huge pages are desired and in recent kernels, that also meant they would be local pages i.e. locality is more important than huge pages even if huge pages are desirable. With this change, it indicates that huge pages are more important than locality because that is what userspace hinted.
Which is better?
locality is more important if your workload fits in memory and the initialisation phase is not performance-critical. It would also be more important if your workload is larger than a node but the critical working set fits within a node while the other usages are like streaming readers and writers where the data is referenced once and can be safely reclaimed later. huge pages is more important if your workload is virtualised and benefits heavily due to reduced TLB cost from EPT. Similarly it's better if the workload has high special locality, particularly if it is also fitting within a cache where the remote cost is masked.
These are simple examples and even then we cannot detect which case applies in advance so it falls back to what does the hint mean? The name suggests that huge pages are desirable and the locality is a hint. Granted, if that means THP is going remote and incurring cost there, it might be worse overall for some workloads, particularly if the system is fragmented.
This goes back to my point that the MADV_HUGEPAGE hint should not make promises about locality and that introducing MADV_LOCAL for specialised libraries may be more appropriate with the initial semantic being how it treats MADV_HUGEPAGE regions.
To make a case that we should fault hugepages remotely as fallback, either for non-MADV_HUGEPAGE users who do not use direct compaction, or MADV_HUGEPAGE users who use direct compaction, we need numbers that suggest there is a performance benefit in terms of access latency to suggest that it is better; this is especially the case when the fault latency is 40% higher. On Haswell, Naples, and Rome, it is quite obvious that this patch works much harder to fault memory remotely that incurs a substantial performance penalty when it fails and in the cases where it succeeds we have much worse access latency.
For users who bind their applications to a subset of cores on a NUMA node, fallback is egregious: we are guaranteed to never have local access latency and in the case when the local node is fragmented and remote memory is not our fault latency goes through the roof when local native pages would have been much better.
I've brought numbers on how poor this patch performs, so I'm asking for a rebuttal that suggests it is better on some platforms. (1) On what platforms is it better to fault remote hugepages over local native pages? (2) What is the fault latency increase when remote nodes are fragmented as well?
On the flip side, we've also heard how some other applications are adversely affected such as virtual machine start-ups.
What you're asking for asking is an astonishing amount of work -- a multi platform study against a wide variety of workloads with the addition that some test should be able to start in a fragmented state that is reproducible.
What is being asked of you is to consider introducing a new madvise hint and having MADV_HUGEPAGE being about huge pages and introducing a new hint that is hinting about locality without the strictness of memory binding. That is significantly more tractable and you also presumably have access to a reasonable reproduction cases even though I doubt you can release it.
Like Mel said, your app just happens to fit in a local node, if the user of the lib is slightly different and allocates 16G on a system where each node is 4G, the post-fix MADV_HUGEPAGE will perform extremely better also for the lib user.
It won't if the remote memory is fragmented; the patch is premised on the argument that remote memory is never fragmented or under memory pressure otherwise it is multiplying the fault latency by the number of nodes. Sure, creating test cases where the local node is under heavy memory pressure yet remote nodes are mostly free will minimize the impact this has on fault latency. It still is a substantial increase, as I've measured on Haswell, but the access latency forever is the penalty. This patch cannot make access to remote memory faster.
Indeed not but remote costs are highly platform dependant so decisions made for haswell are not necessarily justified on EPYC or any later Intel processor either. Because these goalposts will keep changing, I think it's better to have more clearly defined madvise hints and rely less on implementation details that are subject to change. Otherwise we will get stuck in an endless debate about "is workload X more important than Y"?
No doubt this patch will have problems but it treats MADV_HUGEPAGE as a more stricter hint that huge pages are desirable.
<SNIP>
I'm still unconvinced about the __GFP_NORETRY arguments because I tried it already. However if you send a patch that fixes everything by only adding __GFP_NORETRY in every place you wish, I'd be glad to test it and verify if it actually solves the problem. Also note:
+#define __GFP_ONLY_COMPACT ((__force gfp_t)(___GFP_NORETRY | \
___GFP_ONLY_COMPACT))
If ___GFP_NORETRY would have been enough __GFP_ONLY_COMPACT would have defined to ___GFP_NORETRY alone. When I figured it wasn't enough I added ___GFP_ONLY_COMPACT to retain the __GFP_THISNODE in the only place it's safe to retain it (i.e. compaction).
We don't need __GFP_NORETRY anymore with this, it wouldn't be useful to continually compact memory in isolation if it already has failed. It seems strange to be using __GFP_DIRECT_RECLAIM | __GFP_ONLY_COMPACT, though, as I assume this would evolve into.
Are there any other proposed users for __GFP_ONLY_COMPACT beyond thp allocations? If not, we should just save the gfp bit and encode the logic directly into the page allocator.
Would you support this?
I don't think it's necessarily bad but it cannot distinguish between THP and hugetlbfs. Hugetlbfs users are typically more willing to accept high overheads as they may be required for the application to function. That's probably fixable but will still leave us in the state where MADV_HUGEPAGE is also a hint about locality. It'd still be interesting to hear if it fixes the VM initialisation issue but do note that if this patch is used as a replacement that hugetlbfs users may complain down the line.