On 23/04/2025 8:52 pm, Yabin Cui wrote:
On Tue, Apr 22, 2025 at 7:10 AM Leo Yan leo.yan@arm.com wrote:
On Tue, Apr 22, 2025 at 02:49:54PM +0200, Ingo Molnar wrote:
[...]
Hi Yabin,
I was wondering if this is just the opposite of PERF_PMU_CAP_AUX_NO_SG, and whether order 0 should be used by default for all devices to solve the issue you describe, because we already have PERF_PMU_CAP_AUX_NO_SG for devices that need contiguous pages. Then I found commit 5768402fd9c6 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically"), which explains that the current allocation strategy is an optimization.
Your change seems to decide that for certain devices we want to optimize for fragmentation rather than performance. If these are rarely used features, used specifically when looking at performance, should we not continue to optimize for performance? Or at least make it user-configurable?
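For context, the optimistic strategy from that commit can be sketched as follows. This is a simplified illustration, not the actual kernel code: the real allocator calls into the page allocator and retries, while `highest_ok` here merely simulates the largest order the allocator can currently satisfy.

```c
/*
 * Simplified sketch of optimistic high-order AUX allocation:
 * try the largest order that fits the pages still needed,
 * stepping down on failure.  "highest_ok" simulates the largest
 * order the (possibly fragmented) allocator can satisfy.
 */
static int ilog2_ul(unsigned long n)
{
    int r = -1;

    while (n) {
        n >>= 1;
        r++;
    }
    return r;
}

/* Returns the order actually used for this chunk, or -1 on failure. */
static int pick_order(unsigned long pages_left, int max_order,
                      int highest_ok)
{
    int order = ilog2_ul(pages_left);

    if (order > max_order)
        order = max_order;
    for (; order >= 0; order--)
        if (order <= highest_ok)    /* simulated allocation success */
            return order;
    return -1;                      /* even order 0 failed */
}
```

With this shape, a fragmented system simply degrades toward order-0 chunks instead of failing outright.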
So there seem to be 3 categories:
- Must have physically contiguous AUX buffers, it's a hardware ABI. (PERF_PMU_CAP_AUX_NO_SG for Intel BTS and PT.)
- Would be nice to have contiguous AUX buffers, for a bit more performance.
- Doesn't really care.
So we do have #1, and it appears Yabin's usecase is #3?
Yes, in my use case I care much more about being MM-friendly than about a little potential performance when using the PMU. It's not a rarely used feature. On Android, we collect ETM data periodically on internal user devices for AutoFDO optimization (for both userspace libraries and the kernel). Allocating a large chunk of contiguous AUX pages (4M for each CPU) periodically is almost unbearable: the kernel may need to kill many processes to fulfill the request, and that affects user experience even after the PMU is no longer in use.
I am totally fine with reusing PERF_PMU_CAP_AUX_NO_SG. If PMUs don't want to sacrifice performance for MM-friendliness, why support scatter-gather mode at all? If there are strong performance reasons to allocate contiguous AUX pages in scatter-gather mode, I hope max_order can be made configurable from userspace.
Currently, max_order is affected by aux_watermark. But aux_watermark also affects how frequently the PMU overflows the AUX buffer and notifies userspace, so it's not ideal to set aux_watermark to a single page. If we want to make max_order user-configurable, maybe we can add a one-bit field in perf_event_attr?
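The interaction described above can be sketched roughly like this. It is a simplification: the 4 KiB page size and the exact capping rule are illustrative assumptions, not the kernel's precise logic.

```c
/*
 * Sketch: aux_watermark caps the allocation order so no single
 * contiguous chunk exceeds the watermark.  SK_PAGE_SHIFT of 12
 * (4 KiB pages) is an assumption for illustration.
 */
#define SK_PAGE_SHIFT 12

static int ilog2_ul(unsigned long n)
{
    int r = -1;

    while (n) {
        n >>= 1;
        r++;
    }
    return r;
}

static int watermark_max_order(unsigned long nr_pages,
                               unsigned long watermark_bytes)
{
    int max_order = ilog2_ul(nr_pages);

    if (watermark_bytes) {
        int wm_order = ilog2_ul(watermark_bytes >> SK_PAGE_SHIFT);

        if (wm_order < 0)
            wm_order = 0;   /* watermark below one page: order 0 */
        if (wm_order < max_order)
            max_order = wm_order;
    }
    return max_order;
}
```

This is why forcing small chunks via a one-page watermark is unattractive: the same knob also drives the wakeup granularity.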
In Yabin's case, the AUX buffer works as a bounce buffer: the hardware trace data is copied by a driver from a lower-level contiguous buffer into the AUX buffer.
In this case we cannot benefit much from contiguous AUX buffers.
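A minimal illustration of the bounce-buffer pattern (not the actual driver code; `aux_pages[]` is a hypothetical array of per-page kernel mappings) shows why contiguity brings no benefit here: the copy walks the buffer page by page either way.

```c
#include <string.h>

#define SK_PAGE_SIZE 4096UL

/*
 * Hypothetical bounce-buffer copy into an AUX area addressed as an
 * array of per-page mappings.  Whether the underlying pages are
 * physically contiguous makes no difference to this code path.
 */
static void copy_to_aux(void **aux_pages, unsigned long head,
                        const char *src, unsigned long len)
{
    while (len) {
        unsigned long pg    = head / SK_PAGE_SIZE;
        unsigned long off   = head % SK_PAGE_SIZE;
        unsigned long chunk = SK_PAGE_SIZE - off;

        if (chunk > len)
            chunk = len;
        memcpy((char *)aux_pages[pg] + off, src, chunk);
        src  += chunk;
        head += chunk;
        len  -= chunk;
    }
}
```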
Thanks, Leo
Hi Yabin,
So after doing some testing, it looks like there is zero difference in overhead between max_order=0 and ensuring the buffer is one contiguous allocation for Arm SPE, and TRBE would be exactly the same. This makes sense because we're vmapping pages individually anyway, regardless of the base allocation.
It seems the performance benefit of the optimistically large allocations applies only to devices that need extra buffer management beyond normal virtual memory mappings. Can we add a new capability, PERF_PMU_CAP_AUX_PREFER_LARGE, and apply it to Intel PT and BTS? Then the old max_order=0 behavior (from before the optimistic large allocs change) becomes the default again, and PREFER_LARGE is just for those two devices. Other and new devices would get the more memory-friendly allocations by default, as it's unlikely they'll benefit from anything different.
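The proposed policy could look roughly like this. A sketch only: PERF_PMU_CAP_AUX_PREFER_LARGE does not exist yet, the bit values below are made up for illustration, and the additional requirement that NO_SG allocations succeed as one single chunk is ignored here.

```c
/*
 * Illustrative capability bits -- not the real kernel values, and
 * SK_CAP_AUX_PREFER_LARGE is only a proposed capability.
 */
#define SK_CAP_AUX_NO_SG        (1UL << 0)
#define SK_CAP_AUX_PREFER_LARGE (1UL << 1)

static int ilog2_ul(unsigned long n)
{
    int r = -1;

    while (n) {
        n >>= 1;
        r++;
    }
    return r;
}

static int aux_default_max_order(unsigned long pmu_caps,
                                 unsigned long nr_pages)
{
    /* Hardware ABI: must be one physically contiguous buffer. */
    if (pmu_caps & SK_CAP_AUX_NO_SG)
        return ilog2_ul(nr_pages);
    /* Opt-in for PMUs that measurably benefit (Intel PT, BTS). */
    if (pmu_caps & SK_CAP_AUX_PREFER_LARGE)
        return ilog2_ul(nr_pages);
    /* Default: MM-friendly order-0 allocations. */
    return 0;
}
```

Under this policy, PMUs that set neither flag would get order-0 pages by default and never trigger large contiguous requests.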
Thanks James
On 4/28/25 14:26, James Clark wrote:
On 23/04/2025 8:52 pm, Yabin Cui wrote:
[...]
Hi Yabin,
So after doing some testing, it looks like there is zero difference in overhead between max_order=0 and ensuring the buffer is one contiguous allocation for Arm SPE, and TRBE would be exactly the same. This makes sense because we're vmapping pages individually anyway, regardless of the base allocation.
Right, that makes sense.
It seems the performance benefit of the optimistically large allocations applies only to devices that need extra buffer management beyond normal virtual memory mappings. Can we add a new capability, PERF_PMU_CAP_AUX_PREFER_LARGE, and apply it to Intel PT and BTS? Then
s/PERF_PMU_CAP_AUX_PREFER_LARGE/PERF_PMU_CAP_AUX_PREFER_CONT/ ?
the old max_order=0 behavior (from before the optimistic large allocs change) becomes the default again, and PREFER_LARGE is just for those two devices. Other and new devices would get the more memory-friendly allocations by default, as it's unlikely they'll benefit from anything different.
Agreed. AFAICS PERF_PMU_CAP_AUX_NO_SG is just redundant, as the default allocation was always contiguous before the optimistic large allocs change. Hence replacing it with a new and better-named capability that prefers contiguous allocation seems right.
Thanks James
CoreSight mailing list -- coresight@lists.linaro.org To unsubscribe send an email to coresight-leave@lists.linaro.org