On 23/04/2025 8:52 pm, Yabin Cui wrote:
On Tue, Apr 22, 2025 at 7:10 AM Leo Yan leo.yan@arm.com wrote:
On Tue, Apr 22, 2025 at 02:49:54PM +0200, Ingo Molnar wrote:
[...]
Hi Yabin,
I was wondering if this is just the opposite of PERF_PMU_CAP_AUX_NO_SG, and whether order 0 should be used by default for all devices to solve the issue you describe, because we already have PERF_PMU_CAP_AUX_NO_SG for devices that need contiguous pages. Then I found commit 5768402fd9c6 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically"), which explains that the current allocation strategy is an optimization.
Your change seems to decide that for certain devices we want to optimize for fragmentation rather than performance. If these are rarely used features, used specifically when looking at performance, should we not continue to optimize for performance? Or at least make it user-configurable?
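For context, the optimistic strategy from that commit can be sketched as follows. This is a simplified illustration, not the actual kernel code: the real allocator calls into the page allocator and retries, while `highest_ok` here merely simulates the largest order the allocator can currently satisfy.

```c
/*
 * Simplified sketch of optimistic high-order AUX allocation:
 * try the largest order that fits the pages still needed,
 * stepping down on failure.  "highest_ok" simulates the largest
 * order the (possibly fragmented) allocator can satisfy.
 */
static int ilog2_ul(unsigned long n)
{
    int r = -1;

    while (n) {
        n >>= 1;
        r++;
    }
    return r;
}

/* Returns the order actually used for this chunk, or -1 on failure. */
static int pick_order(unsigned long pages_left, int max_order,
                      int highest_ok)
{
    int order = ilog2_ul(pages_left);

    if (order > max_order)
        order = max_order;
    for (; order >= 0; order--)
        if (order <= highest_ok)    /* simulated allocation success */
            return order;
    return -1;                      /* even order 0 failed */
}
```

With this shape, a fragmented system simply degrades toward order-0 chunks instead of failing outright.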
So there seem to be 3 categories:
- Must have physically contiguous AUX buffers, it's a hardware ABI. (PERF_PMU_CAP_AUX_NO_SG for Intel BTS and PT.)
- Would be nice to have contiguous AUX buffers, for a bit more performance.
- Doesn't really care.
So we do have #1, and it appears Yabin's usecase is #3?
Yes, in my use case I care much more about being MM-friendly than about a little potential performance when using the PMU. It's not a rarely used feature. On Android, we collect ETM data periodically on internal user devices for AutoFDO optimization (for both userspace libraries and the kernel). Allocating a large chunk of contiguous AUX pages (4M for each CPU) periodically is almost unbearable: the kernel may need to kill many processes to fulfill the request, and that affects user experience even after the PMU is no longer in use.
I am totally fine with reusing PERF_PMU_CAP_AUX_NO_SG. If PMUs don't want to sacrifice performance for MM-friendliness, why support scatter-gather mode at all? If there are strong performance reasons to allocate contiguous AUX pages in scatter-gather mode, I hope max_order can be made configurable from userspace.
Currently, max_order is affected by aux_watermark. But aux_watermark also affects how frequently the PMU overflows the AUX buffer and notifies userspace, so it's not ideal to set aux_watermark to a single page. If we want to make max_order user-configurable, maybe we can add a one-bit field in perf_event_attr?
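The interaction described above can be sketched roughly like this. It is a simplification: the 4 KiB page size and the exact capping rule are illustrative assumptions, not the kernel's precise logic.

```c
/*
 * Sketch: aux_watermark caps the allocation order so no single
 * contiguous chunk exceeds the watermark.  SK_PAGE_SHIFT of 12
 * (4 KiB pages) is an assumption for illustration.
 */
#define SK_PAGE_SHIFT 12

static int ilog2_ul(unsigned long n)
{
    int r = -1;

    while (n) {
        n >>= 1;
        r++;
    }
    return r;
}

static int watermark_max_order(unsigned long nr_pages,
                               unsigned long watermark_bytes)
{
    int max_order = ilog2_ul(nr_pages);

    if (watermark_bytes) {
        int wm_order = ilog2_ul(watermark_bytes >> SK_PAGE_SHIFT);

        if (wm_order < 0)
            wm_order = 0;   /* watermark below one page: order 0 */
        if (wm_order < max_order)
            max_order = wm_order;
    }
    return max_order;
}
```

This is why forcing small chunks via a one-page watermark is unattractive: the same knob also drives the wakeup granularity.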
In Yabin's case, the AUX buffer works as a bounce buffer: the hardware trace data is copied by a driver from a lower-level contiguous buffer into the AUX buffer.
In this case we cannot benefit much from contiguous AUX buffers.
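A minimal illustration of the bounce-buffer pattern (not the actual driver code; `aux_pages[]` is a hypothetical array of per-page kernel mappings) shows why contiguity brings no benefit here: the copy walks the buffer page by page either way.

```c
#include <string.h>

#define SK_PAGE_SIZE 4096UL

/*
 * Hypothetical bounce-buffer copy into an AUX area addressed as an
 * array of per-page mappings.  Whether the underlying pages are
 * physically contiguous makes no difference to this code path.
 */
static void copy_to_aux(void **aux_pages, unsigned long head,
                        const char *src, unsigned long len)
{
    while (len) {
        unsigned long pg    = head / SK_PAGE_SIZE;
        unsigned long off   = head % SK_PAGE_SIZE;
        unsigned long chunk = SK_PAGE_SIZE - off;

        if (chunk > len)
            chunk = len;
        memcpy((char *)aux_pages[pg] + off, src, chunk);
        src  += chunk;
        head += chunk;
        len  -= chunk;
    }
}
```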
Thanks, Leo
Hi Yabin,
So after doing some testing, it looks like there is zero difference in overhead between max_order=0 and ensuring the buffer is one contiguous allocation for Arm SPE, and TRBE would be exactly the same. This makes sense because we're vmapping pages individually anyway, regardless of the base allocation.
It seems the performance benefit of the optimistically large allocations applies only to devices that need extra buffer management beyond normal virtual memory mappings. Can we add a new capability, PERF_PMU_CAP_AUX_PREFER_LARGE, and apply it to Intel PT and BTS? Then the old max_order=0 behavior (from before the optimistic large allocs change) becomes the default again, and PREFER_LARGE is just for those two devices. Other and new devices would get the more memory-friendly allocations by default, as it's unlikely they'll benefit from anything different.
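The proposed policy could look roughly like this. A sketch only: PERF_PMU_CAP_AUX_PREFER_LARGE does not exist yet, the bit values below are made up for illustration, and the additional requirement that NO_SG allocations succeed as one single chunk is ignored here.

```c
/*
 * Illustrative capability bits -- not the real kernel values, and
 * SK_CAP_AUX_PREFER_LARGE is only a proposed capability.
 */
#define SK_CAP_AUX_NO_SG        (1UL << 0)
#define SK_CAP_AUX_PREFER_LARGE (1UL << 1)

static int ilog2_ul(unsigned long n)
{
    int r = -1;

    while (n) {
        n >>= 1;
        r++;
    }
    return r;
}

static int aux_default_max_order(unsigned long pmu_caps,
                                 unsigned long nr_pages)
{
    /* Hardware ABI: must be one physically contiguous buffer. */
    if (pmu_caps & SK_CAP_AUX_NO_SG)
        return ilog2_ul(nr_pages);
    /* Opt-in for PMUs that measurably benefit (Intel PT, BTS). */
    if (pmu_caps & SK_CAP_AUX_PREFER_LARGE)
        return ilog2_ul(nr_pages);
    /* Default: MM-friendly order-0 allocations. */
    return 0;
}
```

Under this policy, PMUs that set neither flag would get order-0 pages by default and never trigger large contiguous requests.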
Thanks James
On 4/28/25 14:26, James Clark wrote:
On 23/04/2025 8:52 pm, Yabin Cui wrote:
[...]
Hi Yabin,
So after doing some testing, it looks like there is zero difference in overhead between max_order=0 and ensuring the buffer is one contiguous allocation for Arm SPE, and TRBE would be exactly the same. This makes sense because we're vmapping pages individually anyway, regardless of the base allocation.
Right, that makes sense.
It seems the performance benefit of the optimistically large allocations applies only to devices that need extra buffer management beyond normal virtual memory mappings. Can we add a new capability, PERF_PMU_CAP_AUX_PREFER_LARGE, and apply it to Intel PT and BTS? Then
s/PERF_PMU_CAP_AUX_PREFER_LARGE/PERF_PMU_CAP_AUX_PREFER_CONT/ ?
the old max_order=0 behavior (from before the optimistic large allocs change) becomes the default again, and PREFER_LARGE is just for those two devices. Other and new devices would get the more memory-friendly allocations by default, as it's unlikely they'll benefit from anything different.
Agreed. AFAICS PERF_PMU_CAP_AUX_NO_SG is just redundant, as the default allocation was always contiguous before the optimistic large allocs change. Hence replacing it with a new and better-named capability that prefers contiguous allocation seems right.
Thanks James
CoreSight mailing list -- coresight@lists.linaro.org To unsubscribe send an email to coresight-leave@lists.linaro.org