This series adds thread-stack and synthesized callchain support for Arm CoreSight. It derives from an older series [1] but has been heavily rewritten.
CS ETM previously kept last-branch state in a per-trace-queue buffer. That effectively makes the state per CPU, while the call/return history belongs to a thread. This series moves branch tracking to the common thread-stack code.
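To illustrate why per-CPU state does not fit call/return history, here is a minimal sketch. It is not code from perf: the toy_branch structure and both helper functions are hypothetical names invented for this example. It computes the call depth of the same event stream twice, once keyed by CPU and once keyed by thread.

```c
#include <stddef.h>

/* Hypothetical toy event: a call or a return seen on a CPU for a thread. */
struct toy_branch {
	int cpu;
	int tid;
	int delta;	/* +1 for a call, -1 for a return */
};

/* Depth keyed by CPU: every thread scheduled on that CPU shares the state. */
int toy_depth_per_cpu(const struct toy_branch *ev, size_t n, int cpu)
{
	int depth = 0;

	for (size_t i = 0; i < n; i++)
		if (ev[i].cpu == cpu)
			depth += ev[i].delta;
	return depth;
}

/* Depth keyed by thread: the history follows the thread, not the CPU. */
int toy_depth_per_thread(const struct toy_branch *ev, size_t n, int tid)
{
	int depth = 0;

	for (size_t i = 0; i < n; i++)
		if (ev[i].tid == tid)
			depth += ev[i].delta;
	return depth;
}
```

With two threads interleaved on one CPU, the per-CPU depth mixes both histories, while the per-thread depth stays meaningful for each thread.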
The series records CoreSight branches with thread_stack__event(), uses thread_stack__br_sample() for last branch entries, and flushes thread stacks after decoder resets.
A decoder reset between AUX trace buffers is treated as a global trace discontinuity, so all thread stacks are flushed. This avoids carrying stale call/return history across the discontinuity.
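The flush-on-reset policy can be sketched as follows. This is a toy model under stated assumptions, not the perf thread-stack code: toy_trace, toy_call(), toy_decoder_reset() and toy_depth() are hypothetical names, and the fixed table sizes are chosen only to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_THREADS	4
#define TOY_STACK_MAX	16

/* Hypothetical per-thread stack of return addresses. */
struct toy_thread_stack {
	uint64_t ret_addr[TOY_STACK_MAX];
	size_t cnt;
};

/* Hypothetical trace session holding one stack per thread. */
struct toy_trace {
	struct toy_thread_stack ts[TOY_NR_THREADS];
};

/* Record a call for thread 'tid'; entries beyond the limit are dropped. */
void toy_call(struct toy_trace *t, int tid, uint64_t ret_addr)
{
	struct toy_thread_stack *ts = &t->ts[tid];

	if (ts->cnt < TOY_STACK_MAX)
		ts->ret_addr[ts->cnt++] = ret_addr;
}

/*
 * A decoder reset is a global discontinuity: no thread's history can be
 * trusted afterwards, so every stack is emptied, not only the stack of
 * the thread that was running.
 */
void toy_decoder_reset(struct toy_trace *t)
{
	for (int i = 0; i < TOY_NR_THREADS; i++)
		t->ts[i].cnt = 0;
}

size_t toy_depth(const struct toy_trace *t, int tid)
{
	return t->ts[tid].cnt;
}
```

Flushing only the current thread would leave other threads with frames recorded before the gap, which is the stale history the series wants to avoid.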
One limitation remains for instructions emulated by the kernel. In that case the exception return address may not match the return address stored in the thread stack, because execution after the exception return can resume one instruction ahead. The stack can still recover when a later return matches an upper caller. Emulated instructions are not a common target for performance callchain analysis, and supporting them would require extending the common thread-stack path to accept both the real target address and an adjusted address for stack matching, so this series leaves that extra complexity out.
The series has been tested on an Orion6 board:
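The mismatch and the later recovery can be sketched with a toy return-address stack. This is an illustration only and does not reproduce the matching logic in perf's thread-stack code: emu_stack, emu_push() and emu_return() are hypothetical names, and the addresses in the usage note are made up.

```c
#include <stddef.h>
#include <stdint.h>

#define EMU_STACK_MAX	16

/* Hypothetical stack of expected return addresses for one thread. */
struct emu_stack {
	uint64_t ret_addr[EMU_STACK_MAX];
	size_t cnt;
};

/* Record the address a call or exception is expected to return to. */
void emu_push(struct emu_stack *s, uint64_t ret_addr)
{
	if (s->cnt < EMU_STACK_MAX)
		s->ret_addr[s->cnt++] = ret_addr;
}

/*
 * Handle a return to 'to_addr'. Search from the top of the stack for an
 * entry whose stored return address matches. On a match, pop that entry
 * and everything above it and return 1. If nothing matches, leave the
 * stack unchanged and return 0.
 */
int emu_return(struct emu_stack *s, uint64_t to_addr)
{
	for (size_t i = s->cnt; i-- > 0; ) {
		if (s->ret_addr[i] == to_addr) {
			s->cnt = i;
			return 1;
		}
	}
	return 0;
}
```

For example, push 0x1008 for a call from main, then push 0x2000 for an exception taken on an emulated instruction. The exception returns to 0x2004, one instruction ahead, so emu_return() finds no match and the stale entry stays. When the function later returns to 0x1008, the search matches the upper caller and pops both entries, so the stack is consistent again.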
  perf test 136 -vvv
  136: CoreSight synthesized callchain:
  --- start ---
  test child forked, pid 3539
  ---- end(0) ----
  136: CoreSight synthesized callchain : Ok
  perf script --itrace=g16i10il64

  callchain_test 17468 [005] 1031003.229943: 10 instructions:
          aaaac32507c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
          ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
          ffff90bd233c call_init+0x9c (inlined)
          ffff90bd233c __libc_start_main_impl+0x9c (inlined)
          aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)

  callchain_test 17468 [005] 1031003.229943: 10 instructions:
          aaaac3250774 do_svc+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
          ffff90bd233c call_init+0x9c (inlined)
          ffff90bd233c __libc_start_main_impl+0x9c (inlined)
          aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)

  callchain_test 17468 [005] 1031003.229944: 10 instructions:
          ffff800080010c20 vectors+0x420 ([kernel.kallsyms])
          aaaac3250784 do_svc+0x1c (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
          ffff90bd233c call_init+0x9c (inlined)
          ffff90bd233c __libc_start_main_impl+0x9c (inlined)
          aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
Note that the test fails on the Juno board because the trace contains many discontinuity packets (mainly NO_SYNC elements). These are likely caused by FIFO overflow on the trace path.
[1] https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@linar...
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
Changes in v7:
- Rebased on the latest perf-tools-next.
- Used struct_size() to allocate the callchain struct (James).
- Added a helper cs_etm__packet_has_taken_branch() (James).
- Minor improvements for the callchain test (used record-ctl FIFO and
  reworked the validation callstack push / pop).
- Link to v6: https://lore.kernel.org/r/20260526-b4-arm_cs_callchain_support_v1-v6-0-f9f49...

Changes in v6:
- Heavily rewrote the patches, as the work was restarted after 6 years.
- Changed to use the common thread-stack for branch stack and callchain
  management.
- Added a callchain test.
- Link to v5: https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@linar...

Changes in v5:
- Addressed Mike's suggestion to speed up cs_etm__instr_addr() with a
  quick calculation for non-T32 instructions;
- Removed the patch 'perf cs-etm: Synchronize instruction sample with
  the thread stack' (Mike);
- Fixed the case where an exception is taken on accessing the branch
  target address, for branch sample and thread stack handling; the
  related patches are 01, 02, 07;
- Fixed the thread stack handling for instruction emulation and single
  step with patches 08, 09.
- Link to v4: https://lore.kernel.org/linux-arm-kernel/20200203020716.31832-1-leo.yan@lina...

Changes in v4:
- Split out separate patch set for instruction samples fixing.
- Rebased on latest perf/core branch.
- Link to v3: https://lore.kernel.org/linux-arm-kernel/20191005091614.11635-1-leo.yan@lina...

---
Leo Yan (8):
      perf cs-etm: Filter synthesized branch samples
      perf cs-etm: Decode ETE exception packets
      perf cs-etm: Refactor instruction size handling
      perf cs-etm: Use thread-stack for last branch entries
      perf cs-etm: Flush thread stacks after decoder reset
      perf cs-etm: Support call indentation
      perf cs-etm: Synthesize callchains for instruction samples
      perf test: Add Arm CoreSight callchain test

 tools/perf/Documentation/perf-test.txt        |   6 +-
 tools/perf/tests/builtin-test.c               |   1 +
 tools/perf/tests/shell/coresight/callchain.sh | 168 ++++++++++++
 tools/perf/tests/tests.h                      |   1 +
 tools/perf/tests/workloads/Build              |   2 +
 tools/perf/tests/workloads/callchain.c        |  24 ++
 tools/perf/util/cs-etm.c                      | 351 +++++++++++++++-----------
 7 files changed, 410 insertions(+), 143 deletions(-)
---
base-commit: 7336514f41e75d44782fee7e0990d4195a3d3161
change-id: 20260521-b4-arm_cs_callchain_support_v1-2c2a70719bcc

Best regards,