On 2024/12/18 11:29, Johannes Weiner wrote:
On Wed, Dec 18, 2024 at 10:15:06AM +0800, Ge Yang wrote:
On 2024/12/17 23:55, Johannes Weiner wrote:
Hello Yangge,
On Tue, Dec 17, 2024 at 07:46:44PM +0800, yangge1116@126.com wrote:
From: yangge <yangge1116@126.com>
Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags in __compaction_suitable()") allowed compaction to proceed when the free pages required for compaction reside in CMA pageblocks, it is possible that __compaction_suitable() always returns true, which is not acceptable in some cases.
There are 4 NUMA nodes on my machine, and each NUMA node has 32GB of memory. I have configured 16GB of CMA memory on each NUMA node, and starting a 32GB virtual machine with device passthrough is extremely slow, taking almost an hour.
During start-up, the virtual machine calls pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory. Long-term GUP cannot allocate memory from the CMA area, so at most 16GB of non-CMA memory on a NUMA node can be used as virtual machine memory. Since there is 16GB of free CMA memory on the NUMA node, the order-0 watermark is always met for compaction, so __compaction_suitable() always returns true, even if the node is unable to allocate non-CMA memory for the virtual machine.
For costly allocations, because __compaction_suitable() always returns true, __alloc_pages_slowpath() can't exit at the appropriate place, resulting in excessively long virtual machine startup times.

Call trace:
__alloc_pages_slowpath
    if (compact_result == COMPACT_SKIPPED ||
        compact_result == COMPACT_DEFERRED)
        goto nopage; // should exit __alloc_pages_slowpath() from here
Other unmovable allocations, like dma_buf, which can be large in a Linux system, are also unable to allocate memory from CMA, and these allocations suffer from the same problems described above. In order to quickly fall back to a remote node, we should remove ALLOC_CMA in both __compaction_suitable() and __isolate_free_page() for unmovable allocations. After this fix, starting a 32GB virtual machine with device passthrough takes only a few seconds.
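The watermark argument in the changelog can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct zone_model, watermark_ok() and example_node() are hypothetical names, the 1000 free non-CMA pages and the 20000-page watermark are made-up numbers, and the real check in the kernel is considerably more involved. The one property it is meant to show is that free CMA pages count toward the watermark only when the caller is allowed to use CMA.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGES_PER_GIB 262144UL /* 1 GiB / 4 KiB pages */

/* Hypothetical, simplified zone state; all counts are in 4 KiB pages. */
struct zone_model {
    unsigned long free_pages;     /* all free pages, CMA included */
    unsigned long free_cma_pages; /* free pages inside CMA pageblocks */
    unsigned long watermark;      /* order-0 watermark plus compaction gap */
};

/*
 * Simplified order-0 watermark check. When the caller may not use CMA,
 * free CMA pages are treated as unusable and subtracted first.
 */
static bool watermark_ok(const struct zone_model *z, bool alloc_cma)
{
    unsigned long usable = z->free_pages;

    assert(z->free_cma_pages <= z->free_pages);
    if (!alloc_cma)
        usable -= z->free_cma_pages;
    return usable > z->watermark;
}

/* Node from the report: 16GB of free CMA, almost no free non-CMA memory. */
static struct zone_model example_node(void)
{
    struct zone_model z;

    z.free_cma_pages = 16UL * PAGES_PER_GIB;
    z.free_pages = z.free_cma_pages + 1000UL; /* assumed non-CMA remainder */
    z.watermark = 20000UL;                    /* assumed value */
    return z;
}
```

With the example node, watermark_ok(&z, true) passes because the free CMA pages alone exceed the watermark, while watermark_ok(&z, false) fails because only the non-CMA remainder is counted.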
The symptom is obviously bad, but I don't understand this fix.
The reason we do ALLOC_CMA is that, even for unmovable allocations, you can create space in non-CMA space by moving migratable pages over to CMA space. This is not a property we want to lose. But I also don't see how it would interfere with your scenario.
The __alloc_pages_slowpath() function was originally intended to exit at place 1, but because __compaction_suitable() always returns true, it exits at place 2 instead. This ultimately leads to a significantly longer execution time for __alloc_pages_slowpath().
Call trace:
__alloc_pages_slowpath
    if (compact_result == COMPACT_SKIPPED ||
        compact_result == COMPACT_DEFERRED)
        goto nopage; // place 1
    __alloc_pages_direct_reclaim() // Reclaim is very expensive
    __alloc_pages_direct_compact()
    if (gfp_mask & __GFP_NORETRY)
        goto nopage; // place 2
Every allocation goes through this slower path, which ultimately leads to significantly longer virtual machine startup times.
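The difference between the two exits can be sketched as a tiny decision model. Everything here is hypothetical (slowpath_model(), the MODEL_ and EXIT_ names, struct slowpath_outcome); it assumes the allocation never succeeds and it takes the result of the first compaction attempt as an input rather than computing it. It only illustrates the control flow described above: a skipped or deferred compaction lets a costly __GFP_NORETRY request leave at place 1 before any direct reclaim runs, and any other result pushes it through reclaim to place 2.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's compaction results. */
enum model_compact_result {
    MODEL_COMPACT_SKIPPED,
    MODEL_COMPACT_DEFERRED,
    MODEL_COMPACT_COMPLETE
};

enum model_exit {
    EXIT_PLACE_1, /* early "goto nopage", before direct reclaim */
    EXIT_PLACE_2, /* "goto nopage" after direct reclaim and compaction */
    EXIT_RETRY    /* request without __GFP_NORETRY keeps looping */
};

struct slowpath_outcome {
    enum model_exit exit;
    bool ran_direct_reclaim; /* the expensive step */
};

/* Model of one failing pass through the slow path. */
static struct slowpath_outcome
slowpath_model(enum model_compact_result first_compact,
               bool costly_order, bool noretry)
{
    struct slowpath_outcome out = { EXIT_RETRY, false };

    /* Place 1: compaction was judged not worthwhile, give up cheaply. */
    if (costly_order && noretry &&
        (first_compact == MODEL_COMPACT_SKIPPED ||
         first_compact == MODEL_COMPACT_DEFERRED)) {
        out.exit = EXIT_PLACE_1;
        return out;
    }

    /* Otherwise direct reclaim and a second compaction attempt run. */
    out.ran_direct_reclaim = true;

    /* Place 2: only now does a __GFP_NORETRY request give up. */
    if (noretry)
        out.exit = EXIT_PLACE_2;
    return out;
}
```

In this model, a suitability check that always reports compaction as worthwhile means the first argument is never MODEL_COMPACT_SKIPPED, so every failing costly request pays for direct reclaim.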
I still don't follow. Why do you want the allocation to fail?
pin_user_pages_remote(..., FOLL_LONGTERM, ...) first attempts to allocate THP only on the local node, and then falls back to remote NUMA nodes if the local allocation fails. For details, see alloc_pages_mpol().
static struct page *alloc_pages_mpol()
{
    page = __alloc_frozen_pages_noprof(__GFP_THISNODE,...); // 1, try to allocate THP only on local node

    if (page || !(gfp & __GFP_DIRECT_RECLAIM))
        return page;

    page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask); // 2, fall back to remote NUMA nodes
}
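The two-step strategy in the excerpt above can be mimicked in userspace as follows. This is a sketch under assumptions, not the kernel logic: alloc_thp_model(), MODEL_NO_NODE and the free_thp array (a per-node count of free THP-sized non-CMA blocks) are invented for illustration, and node selection in step 2 is reduced to a linear scan.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NO_NODE (-1)

/*
 * free_thp[i] is the assumed number of free THP-sized, non-CMA blocks
 * on node i. Returns the node that served the request, or MODEL_NO_NODE.
 */
static int alloc_thp_model(unsigned int *free_thp, size_t nr_nodes,
                           size_t local, bool direct_reclaim)
{
    size_t i;

    /* Step 1: local node only, mirroring the __GFP_THISNODE attempt. */
    if (free_thp[local] > 0) {
        free_thp[local]--;
        return (int)local;
    }

    /* Without direct reclaim the caller gets the step 1 result as-is. */
    if (!direct_reclaim)
        return MODEL_NO_NODE;

    /* Step 2: retry with every node eligible. */
    for (i = 0; i < nr_nodes; i++) {
        if (free_thp[i] > 0) {
            free_thp[i]--;
            return (int)i;
        }
    }
    return MODEL_NO_NODE;
}
```

The startup time then depends on how quickly step 1 fails once the local node has no non-CMA memory left, since step 2 cannot start until it does.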
The changelog says this is in order to fall back quickly to other nodes. But there is a full node walk in get_page_from_freelist() before the allocator even engages reclaim. There is something missing from the story still.
But regardless - surely you can see that we can't make the allocator generally weaker on large requests just because they happen to be optional in your specific case?
First, try to allocate THP on the local node as much as possible, and then fall back to a remote node if the local allocation fails. This is the default memory allocation strategy when starting virtual machines.
There is the compaction_suitable() check in should_compact_retry(), but that only applies when COMPACT_SKIPPED. IOW, it should only happen when compaction_suitable() just now returned false. IOW, a race condition. Which is why it's also not subject to limited retries.
What's the exact condition that traps the allocator inside the loop?
The should_compact_retry() function was not executed, and the slowness here was mainly due to the execution of __alloc_pages_direct_reclaim().
Ok.