This series adds thread-stack and synthesized callchain support for Arm
CoreSight. It derives from an older series [1] but is heavily rewritten.
CS ETM previously kept last-branch state in a per-trace-queue buffer.
That effectively makes the state per CPU, while the call/return history
belongs to a thread. This series moves branch tracking to the common
thread-stack code.
The series records CoreSight branches with thread_stack__event(), uses
thread_stack__br_sample() for last branch entries, and flushes thread
stacks after decoder resets.
A decoder reset between AUX trace buffers is treated as a global trace
discontinuity, so all thread stacks are flushed. This avoids carrying
stale call/return history across the discontinuity.
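As a rough illustration of why the state is keyed by thread and why a
decoder reset flushes every stack, here is a minimal user-space sketch.
It is not the perf implementation: struct toy_thread, toy_call(),
toy_return() and toy_flush_all() are hypothetical names, and the real
code goes through the common thread-stack helpers instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_DEPTH 16

/* Hypothetical per-thread call stack (perf keeps this per struct thread). */
struct toy_thread {
	int tid;
	size_t depth;
	uint64_t ret_addr[TOY_MAX_DEPTH];
};

/* Record a call: push the address the callee is expected to return to. */
static void toy_call(struct toy_thread *t, uint64_t ret_addr)
{
	assert(t->depth <= TOY_MAX_DEPTH);
	if (t->depth < TOY_MAX_DEPTH)
		t->ret_addr[t->depth++] = ret_addr;
}

/* Record a return: pop only if the target matches the top of the stack. */
static int toy_return(struct toy_thread *t, uint64_t target)
{
	if (t->depth && t->ret_addr[t->depth - 1] == target) {
		t->depth--;
		return 1;
	}
	return 0;
}

/* Trace discontinuity: every thread's history is stale, so drop all of it. */
static void toy_flush_all(struct toy_thread *threads, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		threads[i].depth = 0;
}
```

Because each thread owns its stack, a return seen for one thread never
consumes a frame pushed by another thread that happened to run on the
same CPU.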
One limitation remains for instructions emulated by the kernel. In that
case the exception return address may not match the return address
stored in the thread stack, because execution after the exception return
can resume one instruction ahead. The stack can still recover when a
later return matches an upper caller. Emulated instructions are not a
common target for performance callchain analysis, and supporting them
would require extending the common thread-stack path to accept both the
real target address and an adjusted address for stack matching, so this
series leaves that extra complexity out.
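The limitation and the recovery path can be pictured with a toy return
matcher. This is a simplified sketch under assumptions, not the
thread-stack code: struct toy_stack, toy_push() and toy_pop_to() are
hypothetical, and a 4-byte skew stands in for an exception return that
lands one A64 instruction past the recorded address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DEPTH 8

/* Hypothetical stack of expected return addresses for one thread. */
struct toy_stack {
	size_t cnt;
	uint64_t ret_addr[TOY_DEPTH];
};

static void toy_push(struct toy_stack *s, uint64_t ret_addr)
{
	assert(s->cnt < TOY_DEPTH);
	s->ret_addr[s->cnt++] = ret_addr;
}

/*
 * Handle a return to 'target'. Search from the top of the stack downwards
 * for a frame whose return address matches and unwind to it. Returns the
 * number of frames popped, or 0 when nothing matches and the stack is
 * left untouched.
 */
static size_t toy_pop_to(struct toy_stack *s, uint64_t target)
{
	for (size_t i = s->cnt; i > 0; i--) {
		if (s->ret_addr[i - 1] == target) {
			size_t popped = s->cnt - (i - 1);

			s->cnt = i - 1;
			return popped;
		}
	}
	return 0;
}
```

With frames expecting returns to 0x1000 and 0x2000, a return to 0x2004
matches nothing and leaves both frames in place; the later return to
0x1000 then unwinds both, which is the recovery described above.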
The series has been tested on an Orion6 board:
  perf test 136 -vvv
    136: CoreSight synthesized callchain:
    --- start ---
    test child forked, pid 3539
    ---- end(0) ----
    136: CoreSight synthesized callchain : Ok
  perf script --itrace=g16i10il64
    callchain_test 17468 [005] 1031003.229943: 10 instructions:
        aaaac32507c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
        ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
        ffff90bd233c call_init+0x9c (inlined)
        ffff90bd233c __libc_start_main_impl+0x9c (inlined)
        aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
    callchain_test 17468 [005] 1031003.229943: 10 instructions:
        aaaac3250774 do_svc+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
        aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
        aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
        aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
        ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
        ffff90bd233c call_init+0x9c (inlined)
        ffff90bd233c __libc_start_main_impl+0x9c (inlined)
        aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
    callchain_test 17468 [005] 1031003.229944: 10 instructions:
        ffff800080010c20 vectors+0x420 ([kernel.kallsyms])
        aaaac3250784 do_svc+0x1c (/home/kernel/leoy/test_cs_callchain/callchain_test)
        aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
        aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
        aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
        ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
        ffff90bd233c call_init+0x9c (inlined)
        ffff90bd233c __libc_start_main_impl+0x9c (inlined)
        aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
Note that the test fails on a Juno board because of many discontinuity
packets (mainly NO_SYNC elements). These are likely caused by FIFO
overflow on the path.
[1] https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@lina…
Signed-off-by: Leo Yan <leo.yan(a)arm.com>
---
Changes in v11:
- Rebase on latest perf-tools-next.
- Verified with "perf test coresight"; no regressions.
- Link to v10: https://lore.kernel.org/r/20260617-b4-arm_cs_callchain_support_v1-v10-0-e8b…
Changes in v10:
- Change to syscall(SYS_gettid) for build failure on x86 (James).
- Extracted sample thread stack into cs_etm__sample_branch_stack().
- Link to v9: https://lore.kernel.org/r/20260616-b4-arm_cs_callchain_support_v1-v9-0-f8fa…
Changes in v9:
- Added patch 01 to fix a thread leak during trace queue init (sashiko).
- Added check in instruction and branch samples in
cs_etm__add_stack_event() (sashiko).
- Released frontend_thread properly in cs_etm__context() (sashiko).
- Refined cs_etm__flush_all_stack() to use switch (sashiko).
- Gathered James' review tags.
- Rebased on the latest perf-tools-next.
- Link to v8: https://lore.kernel.org/r/20260611-b4-arm_cs_callchain_support_v1-v8-0-7379…
Changes in v8:
- Updated test_arm_coresight_disasm.sh to pass "--itrace=b" and updated
examples in arm-cs-trace-disasm.py (James).
- Removed static annotation in callchain workload and renamed functions
with prefix "callchain_" to reduce naming conflict (James).
- For callchain test pre-condition check, removed the aarch64 check and
added the root permission check (James).
- Resolved the shellcheck errors (James).
- Link to v7: https://lore.kernel.org/r/20260611-b4-arm_cs_callchain_support_v1-v7-0-1ba7…
Changes in v7:
- Rebased on the latest perf-tools-next.
- Used struct_size() for allocating the callchain struct (James).
- Added a helper cs_etm__packet_has_taken_branch() (James).
- Minor improvements for the callchain test (used record-ctl FIFO and
reworked the validation callstack push / pop).
- Link to v6: https://lore.kernel.org/r/20260526-b4-arm_cs_callchain_support_v1-v6-0-f9f4…
Changes in v6:
- Heavily rewrote the patches after restarting the work 6 years later.
- Changed to use the common thread-stack for branch stack and callchain
management.
- Added a callchain test.
- Link to v5: https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@lina…
Changes in v5:
- Addressed Mike's suggestion for performance improvement for function
cs_etm__instr_addr() for quick calculation for non T32;
- Removed the patch 'perf cs-etm: Synchronize instruction sample with
the thread stack' (Mike);
- Fixed the branch sample and thread stack handling for the case where
an exception is taken when accessing the branch target address; the
related patches are 01, 02, 07;
- Fixed the thread stack handling for instruction emulation and single
step with patches 08, 09.
- Link to v4: https://lore.kernel.org/linux-arm-kernel/20200203020716.31832-1-leo.yan@lin…
---
Leo Yan (9):
perf cs-etm: Fix thread leaks on trace queue init failure
perf cs-etm: Filter synthesized branch samples
perf cs-etm: Decode ETE exception packets
perf cs-etm: Refactor instruction size handling
perf cs-etm: Use thread-stack for last branch entries
perf cs-etm: Flush thread stacks after decoder reset
perf cs-etm: Support call indentation
perf cs-etm: Synthesize callchains for instruction samples
perf test: Add Arm CoreSight callchain test
tools/perf/Documentation/perf-test.txt | 6 +-
tools/perf/scripts/python/arm-cs-trace-disasm.py | 9 +-
tools/perf/tests/builtin-test.c | 1 +
tools/perf/tests/shell/coresight/callchain.sh | 172 ++++++++++
.../shell/coresight/test_arm_coresight_disasm.sh | 4 +-
tools/perf/tests/tests.h | 1 +
tools/perf/tests/workloads/Build | 2 +
tools/perf/tests/workloads/callchain.c | 33 ++
tools/perf/util/cs-etm.c | 377 +++++++++++++--------
9 files changed, 454 insertions(+), 151 deletions(-)
---
base-commit: f6e5090f63b0a9f4c4c42c82348ade4132495ee7
change-id: 20260521-b4-arm_cs_callchain_support_v1-2c2a70719bcc
Best regards,
--
Leo Yan <leo.yan(a)arm.com>
On Tue, Jun 30, 2026 at 04:57:34PM -0700, Namhyung Kim wrote:
[...]
> Hmm.. it's not applying anymore.. Please rebase.
Thanks for the reminder. I'll rebase it and resend today.
On Thu, Jul 02, 2026 at 09:56:05AM +0800, Jie Gan wrote:
> static void funnel_platform_remove(struct platform_device *pdev)
> {
> struct funnel_drvdata *drvdata = dev_get_drvdata(&pdev->dev);
>
> if (WARN_ON(!drvdata))
> return;
>
> - funnel_remove(&pdev->dev);
> + /*
> + * Resume the device so its clocks are enabled again, balancing the
> + * clk_disable_unprepare() that devm runs when the driver detaches.
> + * Then mark it suspended and drop the usage count taken here.
> + */
> pm_runtime_get_sync(&pdev->dev);
> + funnel_remove(&pdev->dev);
> pm_runtime_disable(&pdev->dev);
> + pm_runtime_set_suspended(&pdev->dev);
> + pm_runtime_put_noidle(&pdev->dev);
LGTM. Thanks for writing up the comment. Please proceed.
On Wed, Jul 01, 2026 at 02:05:02PM +0800, Jie Gan wrote:
> After probe, pm_runtime_put() allows the device to suspend and the
> runtime suspend callback disables the same clocks. During remove the
> device is left runtime suspended, so pm_runtime_disable() freezes it
> with the clocks already disabled. The devm cleanup that runs afterwards
> calls clk_disable_unprepare() a second time, underflowing the clock
> enable refcount.
Thanks for fixing the issue.
The problem is that if the device has already been runtime suspended and
its clock has been disabled, then when the device is removed the devm
cleanup disables the clock again, resulting in a clock count underflow.
> diff --git a/drivers/hwtracing/coresight/coresight-funnel.c b/drivers/hwtracing/coresight/coresight-funnel.c
> index 0abc11f0690c..4c5b94640e6a 100644
> --- a/drivers/hwtracing/coresight/coresight-funnel.c
> +++ b/drivers/hwtracing/coresight/coresight-funnel.c
> @@ -334,6 +334,7 @@ static void funnel_platform_remove(struct platform_device *pdev)
> return;
>
> funnel_remove(&pdev->dev);
> + pm_runtime_get_sync(&pdev->dev);
> pm_runtime_disable(&pdev->dev);
Let's use the funnel driver for the discussion. Once we agree on the
approach, we can apply the same change to the other CoreSight platform
drivers.
How about the following teardown?
 static void funnel_platform_remove(struct platform_device *pdev)
 {
 	struct funnel_drvdata *drvdata = dev_get_drvdata(&pdev->dev);
+	int ret;
 	if (WARN_ON(!drvdata))
 		return;
+	ret = pm_runtime_get_sync(&pdev->dev);
+	if (ret < 0)
+		dev_warn(&pdev->dev, "failed to resume before remove: %d\n", ret);
+
 	funnel_remove(&pdev->dev);
+
 	pm_runtime_disable(&pdev->dev);
+	pm_runtime_set_suspended(&pdev->dev);
+	pm_runtime_put_noidle(&pdev->dev);
 }
The idea is to first resume the device with pm_runtime_get_sync(), then
perform the remove (which is then safe even if it needs to access or
clean up hardware state), and finally clean up the runtime PM state. I
mainly referred to drivers/iio/adc/stm32-adc.c.
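To see why this ordering balances the clock count, here is a small
user-space model. It is only a sketch: struct toy_dev and the toy_*
helpers are hypothetical stand-ins that mimic the counting behaviour of
the runtime PM and clk calls; they are not the kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_dev {
	int clk_enable_cnt;	/* models the clock enable refcount */
	int usage_cnt;		/* models the runtime PM usage count */
	bool suspended;		/* models the runtime PM status */
	bool pm_enabled;	/* false after toy_pm_disable() */
};

/* State after probe's pm_runtime_put(): suspended, clock already disabled. */
static void toy_init_suspended(struct toy_dev *d)
{
	d->clk_enable_cnt = 0;
	d->usage_cnt = 0;
	d->suspended = true;
	d->pm_enabled = true;
}

/* Like pm_runtime_get_sync(): take a usage count and resume if needed. */
static void toy_get_sync(struct toy_dev *d)
{
	d->usage_cnt++;
	if (d->pm_enabled && d->suspended) {
		d->clk_enable_cnt++;	/* resume callback enables the clock */
		d->suspended = false;
	}
}

static void toy_pm_disable(struct toy_dev *d)
{
	d->pm_enabled = false;
}

/* Like pm_runtime_set_suspended(): status update only, no callback runs. */
static void toy_set_suspended(struct toy_dev *d)
{
	d->suspended = true;
}

/* Like pm_runtime_put_noidle(): drop the usage count without idling. */
static void toy_put_noidle(struct toy_dev *d)
{
	assert(d->usage_cnt > 0);
	d->usage_cnt--;
}

/* devm cleanup after remove(): the unconditional clock disable. */
static void toy_devm_clk_release(struct toy_dev *d)
{
	d->clk_enable_cnt--;
}

/* Teardown without resuming first: the clock count goes negative. */
static int toy_remove_unbalanced(struct toy_dev *d)
{
	toy_pm_disable(d);
	toy_devm_clk_release(d);
	return d->clk_enable_cnt;
}

/* Teardown in the proposed order: resume, disable PM, fix up the state. */
static int toy_remove_balanced(struct toy_dev *d)
{
	toy_get_sync(d);
	toy_pm_disable(d);
	toy_set_suspended(d);
	toy_put_noidle(d);
	toy_devm_clk_release(d);
	return d->clk_enable_cnt;
}
```

In this model the unbalanced path ends with a clock count of -1, while
the proposed order ends at 0 with the usage count dropped and the status
left as suspended.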
Thanks,
Leo
On Tue, Jun 30, 2026 at 04:42:39PM +0800, Jie Gan wrote:
[...]
> As Suzuki mentioned in the other thread, I think it would be better to add
> separate compatibles in the of_match_table to distinguish between Aggregator
> TraceNoC and Interconnect TraceNoC when probing with the platform driver.
> This would allow us to allocate an ATID only for Aggregator TraceNoC during
> probe, which is consistent with our original design.
Makes sense to me!
Hi Namhyung,
On Mon, Jun 29, 2026 at 05:36:48PM -0700, Namhyung Kim wrote:
[...]
> Will you send a new version or want to merge this? It seems there are
> some remaining comments from Sashiko.
I prefer to merge this series.
Sashiko reported several critical issues in the common code; they are on
my to-do list.
Thanks,
Leo
On 30/06/2026 02:03, Jie Gan wrote:
>
>
> On 6/29/2026 10:28 PM, Leo Yan wrote:
>> On Mon, Jun 29, 2026 at 10:08:17AM +0800, Jie Gan wrote:
>>
>> [...]
>>
>>> Can I fix the issue by adding "arm,primecell-periphid" property. That's
>>> would be the best temp solution as it avoids breaking the original
>>> design of
>>> both the TraceNoC AMBA driver and interconnect TraceNoC platform driver.
>>
>> Before proceeding with the "arm,primecell-periphid" property, could you
>> clarify a bit:
>>
>> - For an interconnect TraceNoC, what would be the consequence of
>> enabling ATID? Would it simply be a no-op, or are there any side
>> effects? Or is the concern that the trace IDs could be exhausted?
>>
>
> TPDM0(or ATB source) -> interconnect TraceNoC0 -> Aggregator TraceNoc ->
> sink
> TPDM1(or ATB source) -> interconnect TraceNoC1 -> Aggregator TraceNoc ->
> sink
>
> We only have one Aggregator TraceNoC and many interconnect TraceNoC
> devices for one platform. All interconnect TraceNoC devices are
> connected to Aggregator TraceNoC devices in the topology, so the itnoc
> doesnt need an ATID.
>
> That's the design purpose from hardware perspective.
>
>
>> - How can you guarantee that a interconnect TraceNoC will never
>> require ATID in the future?
>>
>
> The interconnect TraceNoC is primarily introduced to reduce routing
> complexity in the hardware design. It is typically deployed as an
> intermediate TraceNoC that connects to an Aggregator TraceNoC (AG
> TraceNoC).
You can always distinguish one from the other by checking the
"compatibles", or even by adding a custom data field to the of_device_id
table for the platform driver. Personally, I think it is better to
keep things away from the AMBA framework when we can get everything
from the platform driver.
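A minimal user-space sketch of that idea follows. Everything in it is
an assumption for illustration: struct toy_of_device_id only mirrors
the compatible/data pairing of the kernel's struct of_device_id, the
compatible strings are placeholders rather than real bindings, and in a
driver the lookup would normally come from device_get_match_data().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* User-space stand-in for the compatible/data pair in struct of_device_id. */
struct toy_of_device_id {
	const char *compatible;
	const void *data;
};

/* Hypothetical per-variant data: only the aggregator needs an ATID. */
struct toy_tnoc_cfg {
	bool needs_atid;
};

static const struct toy_tnoc_cfg toy_agg_cfg = { .needs_atid = true };
static const struct toy_tnoc_cfg toy_itnoc_cfg = { .needs_atid = false };

/* Placeholder compatible strings, not taken from any binding. */
static const struct toy_of_device_id toy_match[] = {
	{ .compatible = "vendor,example-aggregator-tnoc", .data = &toy_agg_cfg },
	{ .compatible = "vendor,example-itnoc", .data = &toy_itnoc_cfg },
	{ .compatible = NULL }	/* sentinel */
};

/* Return the variant data for a compatible string, or NULL if unknown. */
static const struct toy_tnoc_cfg *toy_match_data(const char *compatible)
{
	for (const struct toy_of_device_id *id = toy_match; id->compatible; id++) {
		if (!strcmp(id->compatible, compatible))
			return id->data;
	}
	return NULL;
}
```

Probe code could then allocate an ATID only when the matched entry has
needs_atid set, without relying on the AMBA peripheral ID.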
Cheers
Suzuki
>
> For example, a modem subsystem may contain many TPDM devices. Directly
> connecting every TPDM to the AG TraceNoC would result in significant
> wiring complexity. Instead, an itnoc is placed within the modem
> subsystem to locally aggregate the TPDM connections. All TPDMs first
> connect to the itnoc, and the itnoc then connects to the system-level AG
> TraceNoC.
>
> From a hardware perspective, there is no fundamental difference between
> an itnoc and an AG TraceNoC. They use the same TraceNoC hardware
> implementation and share the same AMBA bus type. The distinction is
> purely functional: an itnoc is used for local trace aggregation within a
> subsystem, whereas an AG TraceNoC serves as the top-level aggregation
> point for the SoC.
>
> Thanks,
> Jie
>
>>> The TraceNoC device here must be treated as an AMBA device and I am
>>> continuing to investigate the issue with our hardware team.
>>
>>> We aim to fix it from hardware perspetive for existing platforms if
>>> possible
>>> and ensure it is fixed in future platforms.
>>
>> I'm concerned that all of use end up repeatedly fixing similar issues
>> whenever hardware configurations change or modules are reused in
>> different topologies.
>>
>> For example, if future platforms may require ATID support for an
>> interconnect TraceNoC, then the issue will pop up again.
>>
>> Thanks,
>> Leo
>
Hi Jie,
On Tue, Jun 30, 2026 at 09:03:52AM +0800, Jie Gan wrote:
[...]
> > - How can you guarantee that a interconnect TraceNoC will never
> > require ATID in the future?
> From a hardware perspective, there is no fundamental difference between an
> itnoc and an AG TraceNoC. They use the same TraceNoC hardware implementation
> and share the same AMBA bus type. The distinction is purely functional: an
> itnoc is used for local trace aggregation within a subsystem, whereas an AG
> TraceNoC serves as the top-level aggregation point for the SoC.
I'm still not convinced that adding "arm,primecell-periphid" is the
right approach.
From the description above, I'd expect either the hardware to expose
bits in a register to distinguish these two module types, or as I
suggested earlier, to use a DT property to indicate the module type (or
whether ATID is required).
Or have you tried to detect the last tnoc on a path and allocate an ID
for it? (You can retrieve csdev->path).
Thanks,
Leo
On Mon, Jun 29, 2026 at 10:08:17AM +0800, Jie Gan wrote:
[...]
> Can I fix the issue by adding "arm,primecell-periphid" property. That's
> would be the best temp solution as it avoids breaking the original design of
> both the TraceNoC AMBA driver and interconnect TraceNoC platform driver.
Before proceeding with the "arm,primecell-periphid" property, could you
clarify a bit:
- For an interconnect TraceNoC, what would be the consequence of
enabling ATID? Would it simply be a no-op, or are there any side
effects? Or is the concern that the trace IDs could be exhausted?
- How can you guarantee that an interconnect TraceNoC will never
require ATID in the future?
> The TraceNoC device here must be treated as an AMBA device and I am
> continuing to investigate the issue with our hardware team.
> We aim to fix it from hardware perspetive for existing platforms if possible
> and ensure it is fixed in future platforms.
I'm concerned that all of us will end up repeatedly fixing similar
issues whenever hardware configurations change or modules are reused in
different topologies.
For example, if future platforms may require ATID support for an
interconnect TraceNoC, then the issue will pop up again.
Thanks,
Leo