On 08.02.25 22:34, Barry Song wrote:
On Sat, Feb 8, 2025 at 9:50 PM Ge Yang <yangge1116@126.com> wrote:
On 2025/1/28 17:58, Barry Song wrote:
On Sat, Jan 25, 2025 at 12:21 AM <yangge1116@126.com> wrote:
From: yangge <yangge1116@126.com>
Commit 60a60e32cf91 ("Revert "mm/cma.c: remove redundant cma_mutex lock"") simply reverts to the original method of using the cma_mutex to ensure that alloc_contig_range() runs sequentially. This change was made to avoid concurrent allocation failures. However, it can negatively impact performance when concurrent allocation of CMA memory is required.
Do we have some data?
Yes, I will add it in the next version, thanks.
To address this issue, we could introduce an API for concurrency settings, allowing users to decide whether their CMA area may perform concurrent memory allocations.
Who is the intended user of cma_set_concurrency?
We have some drivers that use cma_set_concurrency(), but they have not yet been merged into the mainline. The cma_alloc_mem() function in the mainline also supports concurrent allocation of CMA memory. By applying this patch, we can also achieve significant performance improvements in certain scenarios. I will provide performance data in the next version.
I also feel it is somewhat unsafe, since cma->concurr_alloc is not protected by any locks.
Ok, thanks.
Will a user setting cma->concurr_alloc = 1 encounter the original issue that commit 60a60e32cf91 was attempting to fix?
Yes, if a user encounters the issue described in commit 60a60e32cf91, they will not be able to set cma->concurr_alloc to 1.
The fact that a user hasn't encountered the problem yet doesn't mean they never will; most likely it just means the testing time hasn't been long enough.
Is it possible to implement a per-CMA lock or range lock that simultaneously improves performance and prevents the original issue that commit 60a60e32cf91 aimed to fix?
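For illustration only, a rough, untested sketch of the per-CMA-lock idea (the field name alloc_lock is invented for this example; this is not a real patch):

```c
/*
 * Hypothetical sketch: replace the global cma_mutex with a per-area
 * lock, so allocations from *different* CMA areas can proceed in
 * parallel while allocations within one area still serialize --
 * preserving the guarantee that commit 60a60e32cf91 restored.
 */
struct cma {
	/* ... existing fields ... */
	struct mutex alloc_lock;	/* hypothetical: serializes this area only */
};

static int cma_alloc_range_serialized(struct cma *cma, unsigned long pfn,
				      unsigned long count, gfp_t gfp)
{
	int ret;

	mutex_lock(&cma->alloc_lock);	/* was: mutex_lock(&cma_mutex) */
	ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
	mutex_unlock(&cma->alloc_lock);

	return ret;
}
```

A range lock (per pageblock or per allocation window) would allow even more parallelism within one large area, at the cost of extra bookkeeping.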
I strongly believe that cma->concurr_alloc is not the right approach. Let's not waste our time on this kind of hack or workaround. Instead, we should find a proper fix that remains transparent to users.
Fully agreed.
IIUC, the problem is that we find a pageblock that is already isolated. It might be sufficient to return -EAGAIN in that case from alloc_contig_range_noprof()->start_isolate_page_range() and retry in CMA. Ideally, we'd have a way to wait on some event (e.g., any pageblock transitioning from isolated -> !isolated).
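The retry-plus-wait idea could look roughly like this (untested sketch; the wait queue and the cma_range_busy() helper are invented names for illustration, and the real fix would live in mm/cma.c and mm/page_isolation.c):

```c
/* Hypothetical: woken whenever a pageblock leaves the isolated state. */
static DECLARE_WAIT_QUEUE_HEAD(cma_isolation_wq);

static int cma_alloc_range_wait(struct cma *cma, unsigned long pfn,
				unsigned long count, gfp_t gfp)
{
	int ret;

	for (;;) {
		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
		if (ret != -EAGAIN)
			break;	/* success, or a hard failure */
		/*
		 * Another allocator holds a pageblock in this range
		 * isolated. Sleep until some pageblock transitions
		 * isolated -> !isolated, then retry instead of failing.
		 */
		wait_event(cma_isolation_wq,
			   !cma_range_busy(cma, pfn, count));
	}

	return ret;
}

/*
 * The undo side (undo_isolate_page_range() or its caller) would pair
 * this with wake_up(&cma_isolation_wq).
 */
```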