This series adds thread-stack and synthesized callchain support for Arm
CoreSight. It is based on an older series [1] but has been heavily
rewritten.
CS ETM previously kept last-branch state in a per-trace-queue buffer.
That effectively makes the state per CPU, while the call/return history
belongs to a thread. This series moves branch tracking to the common
thread-stack code.
The series records CoreSight branches with thread_stack__event(), uses
thread_stack__br_sample() for last branch entries, and flushes thread
stacks after decoder resets.
A decoder reset between AUX trace buffers is treated as a global trace
discontinuity, so all thread stacks are flushed. This avoids carrying
stale call/return history across the discontinuity.
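As a rough illustration of the per-thread state and the global flush described above, here is a minimal toy model. It is not the perf thread-stack code: every toy_* name, the fixed table sizes and the array layout are assumptions made only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_THREADS 4
#define TOY_MAX_DEPTH   16

/* Hypothetical toy model, not the perf thread-stack API. */
struct toy_thread_stack {
	int tid;                          /* owning thread ID */
	size_t depth;                     /* number of valid entries */
	uint64_t ret_addr[TOY_MAX_DEPTH]; /* return addresses, oldest first */
};

static struct toy_thread_stack toy_stacks[TOY_MAX_THREADS];
static size_t toy_nr_stacks;

/* Find or create the stack for a thread; NULL if the table is full. */
static struct toy_thread_stack *toy_stack_get(int tid)
{
	for (size_t i = 0; i < toy_nr_stacks; i++)
		if (toy_stacks[i].tid == tid)
			return &toy_stacks[i];
	if (toy_nr_stacks == TOY_MAX_THREADS)
		return NULL;
	toy_stacks[toy_nr_stacks].tid = tid;
	toy_stacks[toy_nr_stacks].depth = 0;
	return &toy_stacks[toy_nr_stacks++];
}

/* Record a call taken by 'tid': remember where it will return to. */
static void toy_call(int tid, uint64_t ret_addr)
{
	struct toy_thread_stack *ts = toy_stack_get(tid);

	assert(ts && ts->depth < TOY_MAX_DEPTH);
	ts->ret_addr[ts->depth++] = ret_addr;
}

/* Global discontinuity: drop the call history of every thread. */
static void toy_flush_all(void)
{
	for (size_t i = 0; i < toy_nr_stacks; i++)
		toy_stacks[i].depth = 0;
}
```

Because the history is keyed by thread ID rather than by CPU or trace queue, a thread that migrates keeps its own call history, while toy_flush_all() models the point where nothing recorded before the discontinuity can be trusted any more.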
One limitation remains for instructions emulated by the kernel. In that
case the exception return address may not match the return address
stored in the thread stack, because the address after the exception
return can be one instruction ahead. The stack can still recover when a
later return matches an upper caller. Emulated instructions are not a
common target for performance callchain analysis, and supporting them
would require extending the common thread-stack path to accept both the
real target address and an adjusted address for stack matching, so this
series leaves that extra complexity out.
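The mismatch and the later recovery can be sketched with a toy return matcher. This is a simplified model under stated assumptions, not the common thread-stack implementation: toy_return() and the addresses used with it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical toy model of return matching. Search from the newest
 * entry downwards for one whose stored return address equals the
 * observed return target. On a match, the matched entry and everything
 * above it are popped. Without a match the stack is left unchanged, so
 * a stale entry stays until a later return matches an older caller.
 */
static size_t toy_return(const uint64_t *ret_addr, size_t depth,
			 uint64_t target)
{
	assert(ret_addr || depth == 0);

	for (size_t i = depth; i > 0; i--) {
		if (ret_addr[i - 1] == target)
			return i - 1; /* new depth after unwinding */
	}
	return depth; /* no match: keep the entries as they are */
}
```

For example, with stored return addresses { 0x1000, 0x2000, 0x3000 }, an exception return that lands one instruction later at 0x3004 matches nothing and leaves the depth at 3. A subsequent return to 0x2000 then unwinds to depth 1, discarding the stale 0x3000 entry along the way.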
The series has been tested on an Orion6 board:
  perf test 136 -vvv
  136: CoreSight synthesized callchain:
  --- start ---
  test child forked, pid 3539
  ---- end(0) ----
  136: CoreSight synthesized callchain : Ok

  perf script --itrace=g16i10il64

  callchain_test 17468 [005] 1031003.229943: 10 instructions:
          aaaac32507c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
          ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
          ffff90bd233c call_init+0x9c (inlined)
          ffff90bd233c __libc_start_main_impl+0x9c (inlined)
          aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)

  callchain_test 17468 [005] 1031003.229943: 10 instructions:
          aaaac3250774 do_svc+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
          ffff90bd233c call_init+0x9c (inlined)
          ffff90bd233c __libc_start_main_impl+0x9c (inlined)
          aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)

  callchain_test 17468 [005] 1031003.229944: 10 instructions:
          ffff800080010c20 vectors+0x420 ([kernel.kallsyms])
          aaaac3250784 do_svc+0x1c (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
          ffff90bd233c call_init+0x9c (inlined)
          ffff90bd233c __libc_start_main_impl+0x9c (inlined)
          aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
Note that the test fails on the Juno board because of many discontinuity
packets (mainly NO_SYNC elements). These are likely caused by a FIFO
overflow on the path.
[1] https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@lina…
Signed-off-by: Leo Yan <leo.yan(a)arm.com>
---
Changes in v8:
- Updated test_arm_coresight_disasm.sh to pass "--itrace=b" and updated
examples in arm-cs-trace-disasm.py (James).
- Removed static annotation in callchain workload and renamed functions
with prefix "callchain_" to reduce naming conflict (James).
- For callchain test pre-condition check, removed the aarch64 check and
added the root permission check (James).
- Resolved the shellcheck errors (James).
- Link to v7: https://lore.kernel.org/r/20260611-b4-arm_cs_callchain_support_v1-v7-0-1ba7…
Changes in v7:
- Rebased on the latest perf-tools-next.
- Used struct_size() for allocating the callchain struct (James).
- Added a helper cs_etm__packet_has_taken_branch() (James).
- Minor improvements for the callchain test (used record-ctl FIFO and
reworked the validation callstack push / pop).
- Link to v6: https://lore.kernel.org/r/20260526-b4-arm_cs_callchain_support_v1-v6-0-f9f4…
Changes in v6:
- Heavily rewrote the patches, since the work was restarted after 6 years.
- Changed to use the common thread-stack for branch stack and callchain
management.
- Added a callchain test.
- Link to v5: https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@lina…
Changes in v5:
- Addressed Mike's suggestion to improve the performance of
cs_etm__instr_addr() with a quick calculation for non-T32;
- Removed the patch 'perf cs-etm: Synchronize instruction sample with
the thread stack' (Mike);
- Fixed the issue where an exception is taken when accessing the branch
target address, for the branch sample and thread stack handling; the
related patches are 01, 02, 07;
- Fixed the thread stack handling for instruction emulation and single
step in patches 08, 09.
- Link to v4: https://lore.kernel.org/linux-arm-kernel/20200203020716.31832-1-leo.yan@lin…
Changes in v4:
- Split out separate patch set for instruction samples fixing.
- Rebased on latest perf/core branch.
- Link to v3: https://lore.kernel.org/linux-arm-kernel/20191005091614.11635-1-leo.yan@lin…
---
Leo Yan (8):
perf cs-etm: Filter synthesized branch samples
perf cs-etm: Decode ETE exception packets
perf cs-etm: Refactor instruction size handling
perf cs-etm: Use thread-stack for last branch entries
perf cs-etm: Flush thread stacks after decoder reset
perf cs-etm: Support call indentation
perf cs-etm: Synthesize callchains for instruction samples
perf test: Add Arm CoreSight callchain test
tools/perf/Documentation/perf-test.txt | 6 +-
tools/perf/scripts/python/arm-cs-trace-disasm.py | 9 +-
tools/perf/tests/builtin-test.c | 1 +
tools/perf/tests/shell/coresight/callchain.sh | 172 ++++++++++
.../shell/coresight/test_arm_coresight_disasm.sh | 4 +-
tools/perf/tests/tests.h | 1 +
tools/perf/tests/workloads/Build | 2 +
tools/perf/tests/workloads/callchain.c | 33 ++
tools/perf/util/cs-etm.c | 351 ++++++++++++---------
9 files changed, 430 insertions(+), 149 deletions(-)
---
base-commit: 7336514f41e75d44782fee7e0990d4195a3d3161
change-id: 20260521-b4-arm_cs_callchain_support_v1-2c2a70719bcc
Best regards,
--
Leo Yan <leo.yan(a)arm.com>
This series adds thread-stack and synthesized callchain support for Arm
CoreSight. It is based on an older series [1] but has been heavily
rewritten.
CS ETM previously kept last-branch state in a per-trace-queue buffer.
That effectively makes the state per CPU, while the call/return history
belongs to a thread. This series moves branch tracking to the common
thread-stack code.
The series records CoreSight branches with thread_stack__event(), uses
thread_stack__br_sample() for last branch entries, and flushes thread
stacks after decoder resets.
A decoder reset between AUX trace buffers is treated as a global trace
discontinuity, so all thread stacks are flushed. This avoids carrying
stale call/return history across the discontinuity.
One limitation remains for instructions emulated by the kernel. In that
case the exception return address may not match the return address
stored in the thread stack, because the address after the exception
return can be one instruction ahead. The stack can still recover when a
later return matches an upper caller. Emulated instructions are not a
common target for performance callchain analysis, and supporting them
would require extending the common thread-stack path to accept both the
real target address and an adjusted address for stack matching, so this
series leaves that extra complexity out.
The series has been tested on an Orion6 board:
  perf test 136 -vvv
  136: CoreSight synthesized callchain:
  --- start ---
  test child forked, pid 3539
  ---- end(0) ----
  136: CoreSight synthesized callchain : Ok

  perf script --itrace=g16i10il64

  callchain_test 17468 [005] 1031003.229943: 10 instructions:
          aaaac32507c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
          ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
          ffff90bd233c call_init+0x9c (inlined)
          ffff90bd233c __libc_start_main_impl+0x9c (inlined)
          aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)

  callchain_test 17468 [005] 1031003.229943: 10 instructions:
          aaaac3250774 do_svc+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
          ffff90bd233c call_init+0x9c (inlined)
          ffff90bd233c __libc_start_main_impl+0x9c (inlined)
          aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)

  callchain_test 17468 [005] 1031003.229944: 10 instructions:
          ffff800080010c20 vectors+0x420 ([kernel.kallsyms])
          aaaac3250784 do_svc+0x1c (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
          ffff90bd233c call_init+0x9c (inlined)
          ffff90bd233c __libc_start_main_impl+0x9c (inlined)
          aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
Note that the test fails on the Juno board because of many discontinuity
packets (mainly NO_SYNC elements). These are likely caused by a FIFO
overflow on the path.
[1] https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@lina…
Signed-off-by: Leo Yan <leo.yan(a)arm.com>
---
Changes in v7:
- Rebased on the latest perf-tools-next.
- Used struct_size() for allocating the callchain struct (James).
- Added a helper cs_etm__packet_has_taken_branch() (James).
- Minor improvements for the callchain test (used record-ctl FIFO and
reworked the validation callstack push / pop).
- Link to v6: https://lore.kernel.org/r/20260526-b4-arm_cs_callchain_support_v1-v6-0-f9f4…
Changes in v6:
- Heavily rewrote the patches, since the work was restarted after 6 years.
- Changed to use the common thread-stack for branch stack and callchain
management.
- Added a callchain test.
- Link to v5: https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@lina…
Changes in v5:
- Addressed Mike's suggestion to improve the performance of
cs_etm__instr_addr() with a quick calculation for non-T32;
- Removed the patch 'perf cs-etm: Synchronize instruction sample with
the thread stack' (Mike);
- Fixed the issue where an exception is taken when accessing the branch
target address, for the branch sample and thread stack handling; the
related patches are 01, 02, 07;
- Fixed the thread stack handling for instruction emulation and single
step in patches 08, 09.
- Link to v4: https://lore.kernel.org/linux-arm-kernel/20200203020716.31832-1-leo.yan@lin…
Changes in v4:
- Split out separate patch set for instruction samples fixing.
- Rebased on latest perf/core branch.
- Link to v3: https://lore.kernel.org/linux-arm-kernel/20191005091614.11635-1-leo.yan@lin…
---
Leo Yan (8):
perf cs-etm: Filter synthesized branch samples
perf cs-etm: Decode ETE exception packets
perf cs-etm: Refactor instruction size handling
perf cs-etm: Use thread-stack for last branch entries
perf cs-etm: Flush thread stacks after decoder reset
perf cs-etm: Support call indentation
perf cs-etm: Synthesize callchains for instruction samples
perf test: Add Arm CoreSight callchain test
tools/perf/Documentation/perf-test.txt | 6 +-
tools/perf/tests/builtin-test.c | 1 +
tools/perf/tests/shell/coresight/callchain.sh | 168 ++++++++++++
tools/perf/tests/tests.h | 1 +
tools/perf/tests/workloads/Build | 2 +
tools/perf/tests/workloads/callchain.c | 24 ++
tools/perf/util/cs-etm.c | 351 +++++++++++++++-----------
7 files changed, 410 insertions(+), 143 deletions(-)
---
base-commit: 7336514f41e75d44782fee7e0990d4195a3d3161
change-id: 20260521-b4-arm_cs_callchain_support_v1-2c2a70719bcc
Best regards,
--
Leo Yan <leo.yan(a)arm.com>
Fix thread tracking when decoding Coresight trace and add a new test for
it.
The new test is added as a Perf test workload instead of a custom binary
with its own build system, but this requires a new feature in Perf test
to pass in control pipes which can enable and disable events. This
scopes the recording to just the workload and helps to reduce the amount
of data recorded in tracing tests.
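To make the idea of scoping a recording with control pipes concrete, here is a minimal sketch of a sender that brackets a workload with enable and disable commands. It assumes the newline-terminated "enable" and "disable" strings that perf record accepts on its --control descriptor; whether the new Perf test option uses exactly the same protocol is an assumption, and toy_ctl_send(), toy_scoped_record() and toy_workload() are hypothetical names.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: send one control command over an already-open
 * descriptor. Returns 0 on success, -1 on a short or failed write.
 */
static int toy_ctl_send(int ctl_fd, const char *cmd)
{
	size_t len = strlen(cmd);
	ssize_t n = write(ctl_fd, cmd, len);

	return (n == (ssize_t)len) ? 0 : -1;
}

/* Stand-in workload: a short loop whose result is deliberately unused. */
static void toy_workload(void)
{
	volatile unsigned int sink = 0;

	for (unsigned int i = 0; i < 1000; i++)
		sink += i;
}

/* Run 'work' with events enabled only for its duration. */
static int toy_scoped_record(int ctl_fd, void (*work)(void))
{
	assert(work);

	if (toy_ctl_send(ctl_fd, "enable\n"))
		return -1;
	work();
	return toy_ctl_send(ctl_fd, "disable\n");
}
```

Calling toy_scoped_record(ctl_fd, toy_workload) with the write end of the control pipe keeps everything outside the workload out of the recording, which is where the reduction in recorded data comes from. A fuller version would also wait for an acknowledgement before starting the workload.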
With this new feature we can re-write all of the Coresight tests to make
use of it and remove the remaining binaries which fixes the following
issues:
* They didn't work in out-of-source builds
* A lot of the tests unnecessarily required root and didn't skip
without it
* They were mainly qualitative tests which didn't look for specific
behavior
Most importantly, the long build and runtime have been reduced. On a
Radxa Orion O6, unroll_loop_thread.c took 37s to compile, which is longer
than the entire Perf build. Now the build time is negligible and the
before and after test runtimes for all the Coresight tests are:
         | N1SDP   | Orion O6
-----------------------------------
  Before | 4m 0s   | 14m 49s
  After  | 26s     | 56s
-----------------------------------
Signed-off-by: James Clark <james.clark(a)linaro.org>
---
Changes in v5:
- Forgot to include this change:
  - Test for actual length of expected raw dump (Leo)
- Link to v4: https://lore.kernel.org/r/20260609-james-cs-context-tracking-fix-v4-0-44f9f…
Changes in v4:
- Rename workload-ctl to record-ctl and improve docs (Leo)
- Use new packet argument everywhere in
cs_etm__synth_instruction_sample() (Sashiko)
- Test for actual length of expected raw dump (Leo)
- Use -fno-inline instead of keyword (Leo)
- Don't test any brace or call lines in deterministic test
- Make sure context switch loop test does cleanup on failure (Sashiko)
- Remove undef int overflows in workloads (Sashiko)
- Link to v3: https://lore.kernel.org/r/20260603-james-cs-context-tracking-fix-v3-0-c3929…
Changes in v3:
- Minor sashiko comments
- Close some more pipes
- Fix warning messages
- Error handling improvements
- Pass packet into cs_etm__synth_instruction_sample()
- Fixup stale comment (Leo)
- Link to v2: https://lore.kernel.org/r/20260602-james-cs-context-tracking-fix-v2-0-85b5c…
Changes in v2:
- Add --workload-ctl option to Perf test
- Re-write all the Coresight tests and speed them up
- Pass packet to memory access function so frontend can use either the
previous or current packet's EL
- Link to v1: https://lore.kernel.org/r/20260526-james-cs-context-tracking-fix-v1-0-ebd60…
---
James Clark (19):
perf cs-etm: Queue context packets for frontend
perf test: Add workload-ctl option
perf test: Add a workload that forces context switches
perf test cs-etm: Test process attribution
perf test: Add deterministic workload
perf test cs-etm: Replace unroll loop thread with deterministic decode test
perf test cs-etm: Remove asm_pure_loop test
perf test cs-etm: Replace memcpy test with raw dump stress test
perf test: Add named_threads workload
perf test cs-etm: Test decoding for concurrent threads test
perf test cs-etm: Remove duplicate branch tests
perf test cs-etm: Skip if not root
perf test cs-etm: Reduce snapshot size
perf test cs-etm: Speed up basic test
perf test cs-etm: Remove unused Coresight workloads
perf test cs-etm: Make disassembly test use kcore
perf test cs-etm: Add all branch instructions to test
perf test cs-etm: Speed up disassembly test
perf test cs-etm: Move existing tests to coresight folder
Documentation/trace/coresight/coresight-perf.rst | 78 +------
MAINTAINERS | 2 -
tools/perf/Documentation/perf-test.txt | 24 ++-
tools/perf/Makefile.perf | 14 +-
tools/perf/scripts/python/arm-cs-trace-disasm.py | 20 +-
tools/perf/tests/builtin-test.c | 187 +++++++++++++++-
tools/perf/tests/shell/coresight/Makefile | 29 ---
.../perf/tests/shell/coresight/Makefile.miniconfig | 14 --
tools/perf/tests/shell/coresight/asm_pure_loop.sh | 22 --
.../tests/shell/coresight/asm_pure_loop/.gitignore | 1 -
.../tests/shell/coresight/asm_pure_loop/Makefile | 34 ---
.../shell/coresight/asm_pure_loop/asm_pure_loop.S | 30 ---
.../tests/shell/coresight/concurrent_threads.sh | 45 ++++
.../tests/shell/coresight/context_switch_thread.sh | 69 ++++++
tools/perf/tests/shell/coresight/deterministic.sh | 72 +++++++
.../tests/shell/coresight/memcpy_thread/.gitignore | 1 -
.../tests/shell/coresight/memcpy_thread/Makefile | 33 ---
.../shell/coresight/memcpy_thread/memcpy_thread.c | 80 -------
.../tests/shell/coresight/memcpy_thread_16k_10.sh | 22 --
.../perf/tests/shell/coresight/raw_dump_stress.sh | 65 ++++++
.../shell/{ => coresight}/test_arm_coresight.sh | 43 ++--
.../{ => coresight}/test_arm_coresight_disasm.sh | 23 +-
.../tests/shell/coresight/thread_loop/.gitignore | 1 -
.../tests/shell/coresight/thread_loop/Makefile | 33 ---
.../shell/coresight/thread_loop/thread_loop.c | 85 --------
.../shell/coresight/thread_loop_check_tid_10.sh | 23 --
.../shell/coresight/thread_loop_check_tid_2.sh | 23 --
.../shell/coresight/unroll_loop_thread/.gitignore | 1 -
.../shell/coresight/unroll_loop_thread/Makefile | 33 ---
.../unroll_loop_thread/unroll_loop_thread.c | 75 -------
.../tests/shell/coresight/unroll_loop_thread_10.sh | 22 --
tools/perf/tests/shell/lib/coresight.sh | 134 ------------
tools/perf/tests/tests.h | 3 +
tools/perf/tests/workloads/Build | 4 +
tools/perf/tests/workloads/context_switch_loop.c | 110 ++++++++++
tools/perf/tests/workloads/deterministic.c | 39 ++++
tools/perf/tests/workloads/named_threads.c | 109 ++++++++++
tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 21 +-
tools/perf/util/cs-etm.c | 236 ++++++++++++---------
tools/perf/util/cs-etm.h | 8 +-
40 files changed, 926 insertions(+), 942 deletions(-)
---
base-commit: 351a37f2fda4db668cff8ba12f2992d73dccdaea
change-id: 20260515-james-cs-context-tracking-fix-754998bae7ed
Best regards,
--
James Clark <james.clark(a)linaro.org>
Fix thread tracking when decoding Coresight trace and add a new test for
it.
The new test is added as a Perf test workload instead of a custom binary
with its own build system, but this requires a new feature in Perf test
to pass in control pipes which can enable and disable events. This
scopes the recording to just the workload and helps to reduce the amount
of data recorded in tracing tests.
With this new feature we can re-write all of the Coresight tests to make
use of it and remove the remaining binaries which fixes the following
issues:
* They didn't work in out-of-source builds
* A lot of the tests unnecessarily required root and didn't skip
without it
* They were mainly qualitative tests which didn't look for specific
behavior
Most importantly, the long build and runtime have been reduced. On a
Radxa Orion O6, unroll_loop_thread.c took 37s to compile, which is longer
than the entire Perf build. Now the build time is negligible and the
before and after test runtimes for all the Coresight tests are:
         | N1SDP   | Orion O6
-----------------------------------
  Before | 4m 0s   | 14m 49s
  After  | 26s     | 56s
-----------------------------------
Signed-off-by: James Clark <james.clark(a)linaro.org>
---
Changes in v4:
- Rename workload-ctl to record-ctl and improve docs (Leo)
- Use new packet argument everywhere in
cs_etm__synth_instruction_sample() (Sashiko)
- Test for actual length of expected raw dump (Leo)
- Use -fno-inline instead of keyword (Leo)
- Don't test any brace or call lines in deterministic test
- Make sure context switch loop test does cleanup on failure (Sashiko)
- Remove undef int overflows in workloads (Sashiko)
- Link to v3: https://lore.kernel.org/r/20260603-james-cs-context-tracking-fix-v3-0-c3929…
Changes in v3:
- Minor sashiko comments
- Close some more pipes
- Fix warning messages
- Error handling improvements
- Pass packet into cs_etm__synth_instruction_sample()
- Fixup stale comment (Leo)
- Link to v2: https://lore.kernel.org/r/20260602-james-cs-context-tracking-fix-v2-0-85b5c…
Changes in v2:
- Add --workload-ctl option to Perf test
- Re-write all the Coresight tests and speed them up
- Pass packet to memory access function so frontend can use either the
previous or current packet's EL
- Link to v1: https://lore.kernel.org/r/20260526-james-cs-context-tracking-fix-v1-0-ebd60…
---
James Clark (19):
perf cs-etm: Queue context packets for frontend
perf test: Add workload-ctl option
perf test: Add a workload that forces context switches
perf test cs-etm: Test process attribution
perf test: Add deterministic workload
perf test cs-etm: Replace unroll loop thread with deterministic decode test
perf test cs-etm: Remove asm_pure_loop test
perf test cs-etm: Replace memcpy test with raw dump stress test
perf test: Add named_threads workload
perf test cs-etm: Test decoding for concurrent threads test
perf test cs-etm: Remove duplicate branch tests
perf test cs-etm: Skip if not root
perf test cs-etm: Reduce snapshot size
perf test cs-etm: Speed up basic test
perf test cs-etm: Remove unused Coresight workloads
perf test cs-etm: Make disassembly test use kcore
perf test cs-etm: Add all branch instructions to test
perf test cs-etm: Speed up disassembly test
perf test cs-etm: Move existing tests to coresight folder
Documentation/trace/coresight/coresight-perf.rst | 78 +------
MAINTAINERS | 2 -
tools/perf/Documentation/perf-test.txt | 24 ++-
tools/perf/Makefile.perf | 14 +-
tools/perf/scripts/python/arm-cs-trace-disasm.py | 20 +-
tools/perf/tests/builtin-test.c | 187 +++++++++++++++-
tools/perf/tests/shell/coresight/Makefile | 29 ---
.../perf/tests/shell/coresight/Makefile.miniconfig | 14 --
tools/perf/tests/shell/coresight/asm_pure_loop.sh | 22 --
.../tests/shell/coresight/asm_pure_loop/.gitignore | 1 -
.../tests/shell/coresight/asm_pure_loop/Makefile | 34 ---
.../shell/coresight/asm_pure_loop/asm_pure_loop.S | 30 ---
.../tests/shell/coresight/concurrent_threads.sh | 45 ++++
.../tests/shell/coresight/context_switch_thread.sh | 69 ++++++
tools/perf/tests/shell/coresight/deterministic.sh | 72 +++++++
.../tests/shell/coresight/memcpy_thread/.gitignore | 1 -
.../tests/shell/coresight/memcpy_thread/Makefile | 33 ---
.../shell/coresight/memcpy_thread/memcpy_thread.c | 80 -------
.../tests/shell/coresight/memcpy_thread_16k_10.sh | 22 --
.../perf/tests/shell/coresight/raw_dump_stress.sh | 47 ++++
.../shell/{ => coresight}/test_arm_coresight.sh | 43 ++--
.../{ => coresight}/test_arm_coresight_disasm.sh | 23 +-
.../tests/shell/coresight/thread_loop/.gitignore | 1 -
.../tests/shell/coresight/thread_loop/Makefile | 33 ---
.../shell/coresight/thread_loop/thread_loop.c | 85 --------
.../shell/coresight/thread_loop_check_tid_10.sh | 23 --
.../shell/coresight/thread_loop_check_tid_2.sh | 23 --
.../shell/coresight/unroll_loop_thread/.gitignore | 1 -
.../shell/coresight/unroll_loop_thread/Makefile | 33 ---
.../unroll_loop_thread/unroll_loop_thread.c | 75 -------
.../tests/shell/coresight/unroll_loop_thread_10.sh | 22 --
tools/perf/tests/shell/lib/coresight.sh | 134 ------------
tools/perf/tests/tests.h | 3 +
tools/perf/tests/workloads/Build | 4 +
tools/perf/tests/workloads/context_switch_loop.c | 110 ++++++++++
tools/perf/tests/workloads/deterministic.c | 39 ++++
tools/perf/tests/workloads/named_threads.c | 109 ++++++++++
tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 21 +-
tools/perf/util/cs-etm.c | 236 ++++++++++++---------
tools/perf/util/cs-etm.h | 8 +-
40 files changed, 908 insertions(+), 942 deletions(-)
---
base-commit: 351a37f2fda4db668cff8ba12f2992d73dccdaea
change-id: 20260515-james-cs-context-tracking-fix-754998bae7ed
Best regards,
--
James Clark <james.clark(a)linaro.org>
On 08/06/2026 17:55, Kuan-Wei Chiu wrote:
> Hi Suzuki,
>
> On Fri, Apr 03, 2026 at 04:57:59PM +0800, Kuan-Wei Chiu wrote:
>> Hi Suzuki,
>>
>> On Mon, Feb 02, 2026 at 09:33:59AM +0000, Suzuki K Poulose wrote:
>>> Hello
>>>
>>> On 02/02/2026 05:09, Kuan-Wei Chiu wrote:
>>>> On Tue, Dec 02, 2025 at 09:26:19AM +0000, James Clark wrote:
>>>>>
>>>>>
>>>>> On 02/12/2025 8:26 am, Kuan-Wei Chiu wrote:
>>>>>> The cntr_val_show() function was intended to print the values of all
>>>>>> counters using a loop. However, due to a buffer overwrite issue with
>>>>>> sprintf(), it effectively only displayed the value of the last counter.
>>>>>>
>>>>>> The companion function, cntr_val_store(), allows users to modify a
>>>>>> specific counter selected by 'cntr_idx'. To maintain consistency
>>>>>> between read and write operations and to align with the ETM4x driver
>>>>>> behavior, modify cntr_val_show() to report only the value of the
>>>>>> currently selected counter.
>>>>>>
>>>>>> This change removes the loop and the "counter %d:" prefix, printing
>>>>>> only the hexadecimal value. It also adopts sysfs_emit() for standard
>>>>>> sysfs output formatting.
>>>>>>
>>>>>> Fixes: a939fc5a71ad ("coresight-etm: add CoreSight ETM/PTM driver")
>>>>>> Cc: stable(a)vger.kernel.org
>>>>>> Signed-off-by: Kuan-Wei Chiu <visitorckw(a)gmail.com>
>>>>>> ---
>>>>>> Build test only.
>>>>>>
>>>>>> Changes in v3:
>>>>>> - Switch format specifier to %#x to include the 0x prefix.
>>>>>> - Add Cc stable
>>>>>>
>>>>>> v2: https://lore.kernel.org/lkml/20251201095228.1905489-1-visitorckw@gmail.com/
>>>>>>
>>>>>> .../hwtracing/coresight/coresight-etm3x-sysfs.c | 15 ++++-----------
>>>>>> 1 file changed, 4 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/hwtracing/coresight/coresight-etm3x-sysfs.c b/drivers/hwtracing/coresight/coresight-etm3x-sysfs.c
>>>>>> index 762109307b86..b3c67e96a82a 100644
>>>>>> --- a/drivers/hwtracing/coresight/coresight-etm3x-sysfs.c
>>>>>> +++ b/drivers/hwtracing/coresight/coresight-etm3x-sysfs.c
>>>>>> @@ -717,26 +717,19 @@ static DEVICE_ATTR_RW(cntr_rld_event);
>>>>>>  static ssize_t cntr_val_show(struct device *dev,
>>>>>>  			     struct device_attribute *attr, char *buf)
>>>>>>  {
>>>>>> -	int i, ret = 0;
>>>>>>  	u32 val;
>>>>>>  	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
>>>>>>  	struct etm_config *config = &drvdata->config;
>>>>>>  	if (!coresight_get_mode(drvdata->csdev)) {
>>>>>>  		spin_lock(&drvdata->spinlock);
>>>>>> -		for (i = 0; i < drvdata->nr_cntr; i++)
>>>>>> -			ret += sprintf(buf, "counter %d: %x\n",
>>>>>> -				i, config->cntr_val[i]);
>>>>>> +		val = config->cntr_val[config->cntr_idx];
>>>>>>  		spin_unlock(&drvdata->spinlock);
>>>>>> -		return ret;
>>>>>> -	}
>>>>>> -
>>>>>> -	for (i = 0; i < drvdata->nr_cntr; i++) {
>>>>>> -		val = etm_readl(drvdata, ETMCNTVRn(i));
>>>>>> -		ret += sprintf(buf, "counter %d: %x\n", i, val);
>>>>>> +	} else {
>>>>>> +		val = etm_readl(drvdata, ETMCNTVRn(config->cntr_idx));
>>>>>>  	}
>>>>>> -	return ret;
>>>>>> +	return sysfs_emit(buf, "%#x\n", val);
>>>>>>  }
>>>>>>  static ssize_t cntr_val_store(struct device *dev,
>>>>>
>>>>> Reviewed-by: James Clark <james.clark(a)linaro.org>
>>>>>
>>>> Thanks for the review!
>>>> Is there anything else I need to do for this fix to land?
>>>
>>> Thanks for the patch, I will queue this for the next release (v7.1).
>>>
>> Just a gentle ping.
>>
>> Since the v7.1 merge window is presumably opening in about a week, I
>> noticed this patch isn't in linux-next yet and wanted to send a quick
>> reminder. Thanks.
>>
> This patch still applies cleanly on top of linux-next.
> I suspect this patch may have fallen through the cracks.
> Would you still be willing to pick it up?
Apologies, it did. I will pick this up. If we have sufficient fixes,
I might send it as fixes for v7.2; otherwise, I will queue it for v7.3.
Once again, apologies.
Suzuki
>
> Regards,
> Kuan-Wei
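For readers following the thread, the overwrite described in the quoted commit message can be reproduced in user space. The sketch below is an illustration under stated assumptions, not driver code: toy_show_overwrite() and toy_show_selected() are hypothetical names, and snprintf() stands in for the kernel-only sysfs_emit().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Buggy pattern: every iteration formats at the start of buf, so only
 * the last counter survives, while the returned length is the sum of
 * all the strings that were written and then overwritten.
 */
static int toy_show_overwrite(char *buf, const unsigned int *val, int nr)
{
	int ret = 0;

	for (int i = 0; i < nr; i++)
		ret += sprintf(buf, "counter %d: %x\n", i, val[i]);
	return ret;
}

/* Behaviour after the fix, in spirit: report only the selected counter. */
static int toy_show_selected(char *buf, size_t size,
			     const unsigned int *val, int idx)
{
	assert(idx >= 0);
	return snprintf(buf, size, "%#x\n", val[idx]);
}
```

With three counters, toy_show_overwrite() leaves only the "counter 2" line in the buffer but reports the combined length of all three lines, so the reported length no longer describes the buffer contents. toy_show_selected() returns exactly the length of the single line it wrote.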
This series adds thread-stack and synthesized callchain support for Arm
CoreSight. It is based on an older series [1] but has been heavily
rewritten.
CS ETM previously kept last-branch state in a per-trace-queue buffer.
That effectively makes the state per CPU, while the call/return history
belongs to a thread. This series moves branch tracking to the common
thread-stack code.
The series records CoreSight branches with thread_stack__event(), uses
thread_stack__br_sample() for last branch entries, and flushes thread
stacks after decoder resets.
A decoder reset between AUX trace buffers is treated as a global trace
discontinuity, so all thread stacks are flushed. This avoids carrying
stale call/return history across the discontinuity.
One limitation remains for instructions emulated by the kernel. In that
case the exception return address may not match the return address
stored in the thread stack, because the address after the exception
return can be one instruction ahead. The stack can still recover when a
later return matches an upper caller. Emulated instructions are not a
common target for performance callchain analysis, and supporting them
would require extending the common thread-stack path to accept both the
real target address and an adjusted address for stack matching, so this
series leaves that extra complexity out.
The series has been tested on an Orion6 board:
  perf test 150 -vvv
  150: Check Arm CoreSight synthesized callchain:
  --- start ---
  test child forked, pid 13528
  Test callchain push: PASS
  Test callchain pop: PASS
  ---- end(0) ----
  150: Check Arm CoreSight synthesized callchain : Ok

  perf script --itrace=g16i10il64

  callchain_test 17468 [005] 1031003.229943: 10 instructions:
          aaaac32507c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
          ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
          ffff90bd233c call_init+0x9c (inlined)
          ffff90bd233c __libc_start_main_impl+0x9c (inlined)
          aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)

  callchain_test 17468 [005] 1031003.229943: 10 instructions:
          aaaac3250774 do_svc+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
          ffff90bd233c call_init+0x9c (inlined)
          ffff90bd233c __libc_start_main_impl+0x9c (inlined)
          aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)

  callchain_test 17468 [005] 1031003.229944: 10 instructions:
          ffff800080010c20 vectors+0x420 ([kernel.kallsyms])
          aaaac3250784 do_svc+0x1c (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
          ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
          ffff90bd233c call_init+0x9c (inlined)
          ffff90bd233c __libc_start_main_impl+0x9c (inlined)
          aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
Note that the test fails on the Juno board because of many discontinuity
packets (mainly NO_SYNC elements). These are likely caused by a FIFO
overflow on the path.
[1] https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@lina…
Signed-off-by: Leo Yan <leo.yan(a)arm.com>
---
Leo Yan (8):
perf cs-etm: Decode ETE exception packets
perf cs-etm: Refactor instruction size handling
perf cs-etm: Use thread-stack for last branch entries
perf cs-etm: Flush thread stacks after decoder reset
perf cs-etm: Support call indentation
perf cs-etm: Filter synthesized branch samples
perf cs-etm: Synthesize callchains for instruction samples
perf test: Add Arm CoreSight callchain test
.../tests/shell/test_arm_coresight_callchain.sh | 235 ++++++++++++++++
tools/perf/util/cs-etm.c | 309 ++++++++++++---------
2 files changed, 408 insertions(+), 136 deletions(-)
---
base-commit: bd2a5be1fe731bc7548205dd148db75f1d588da2
change-id: 20260521-b4-arm_cs_callchain_support_v1-2c2a70719bcc
Best regards,
--
Leo Yan <leo.yan(a)arm.com>
Hi Greg
Please find the updates for the CoreSight self-hosted tracing subsystem
targeting Linux v7.2.
Kindly pull,
Suzuki
---
The following changes since commit 7fd2df204f342fc17d1a0bfcd474b24232fb0f32:
Linux 7.1-rc2 (2026-05-03 14:21:25 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/coresight/linux.git tags/coresight-next-v7.2
for you to fetch changes up to 98495b5a4d77dd22e106f462b76e1093a55b29a7:
coresight: ultrasoc-smb: Fix OOB write in smb_sync_perf_buffer() (2026-06-04 09:56:13 +0100)
----------------------------------------------------------------
coresight: Self-hosted tracing updates for Linux v7.2
Updates for the CoreSight self-hosted tracing subsystem include:
- Better power management for components based on the CPU PM, including
support for components on the trace path for CPUs. Add support for
save/restore for TRBE
- Miscellaneous fixes to the drivers
* Fix overflow when the buffer size is > 2GB for tmc-etr
* Ultrasoc SMB Perf buffer OOB access
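As a rough illustration of the class of bug the tmc-etr item refers to (this is not the driver code, which is not shown here): a byte count above 2 GiB does not fit in a signed 32-bit integer and wraps negative. The helper name `as_s32` is hypothetical and exists only for this sketch.

```python
import ctypes

GIB = 1 << 30


def as_s32(value):
    """Truncate value to a signed 32-bit integer, like a 32-bit C int."""
    return ctypes.c_int32(value).value


# 2 GiB is one past INT32_MAX, so it wraps to the most negative value.
print(as_s32(2 * GIB))                 # -2147483648

# A 64-bit type holds the same kind of size without loss.
print(ctypes.c_int64(3 * GIB).value)   # 3221225472
```

Keeping such sizes in a 64-bit or unsigned long type throughout the calculation is the usual way to avoid this kind of wraparound.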
Signed-off-by: Suzuki K Poulose <suzuki.poulose(a)arm.com>
----------------------------------------------------------------
James Clark (1):
coresight: ete: Always save state on power down
Jie Gan (3):
coresight: fix missing error code when trace ID is invalid
coresight: Fix source not disabled on idr_alloc_u32 failure
coresight: platform: defer connection counter increment until alloc succeeds
Junrui Luo (1):
coresight: ultrasoc-smb: Fix OOB write in smb_sync_perf_buffer()
Leo Yan (28):
coresight: tmc: Fix overflow when calculating is bigger than 2GiB
coresight: etm4x: Correct TRCVMIDCCTLR1 save and restore
coresight: Handle helper enable failure properly
coresight: Extract device init into coresight_init_device()
coresight: Populate CPU ID into coresight_device
coresight: Remove .cpu_id() callback from source ops
coresight: Take hotplug lock in enable_source_store() for Sysfs mode
coresight: perf: Retrieve path and source from event data
coresight: Take a reference on csdev
coresight: Move per-CPU source pointer to core layer
coresight: Take per-CPU source reference during AUX setup
coresight: Register CPU PM notifier in core layer
coresight: etm4x: Hook CPU PM callbacks
coresight: etm4x: Remove redundant checks in PM save and restore
coresight: syscfg: Use IRQ-safe spinlock to protect active variables
coresight: Disable source helpers in coresight_disable_path()
coresight: Control path with range
coresight: Use helpers to fetch first and last nodes
coresight: Introduce coresight_enable_source() helper
coresight: Save active path for system tracers
coresight: etm4x: Set active path on target CPU
coresight: etm3x: Set active path on target CPU
coresight: sysfs: Use source's path pointer for path control
coresight: Control path during CPU idle
coresight: Add PM callbacks for sink device
coresight: sysfs: Increment refcount only for software source
coresight: Move CPU hotplug callbacks to core layer
coresight: sysfs: Validate CPU online status for per-CPU sources
Runyu Xiao (1):
coresight: etb10: restore atomic_t for shared reading state
Yabin Cui (1):
coresight: trbe: Save and restore state across CPU low power state
Yingchao Deng (1):
coresight: cti: Fix DT filter signals silently ignored
drivers/hwtracing/coresight/coresight-catu.c | 2 +-
drivers/hwtracing/coresight/coresight-core.c | 574 ++++++++++++++++++---
drivers/hwtracing/coresight/coresight-cti-core.c | 9 +-
.../hwtracing/coresight/coresight-cti-platform.c | 1 +
drivers/hwtracing/coresight/coresight-etb10.c | 6 +-
drivers/hwtracing/coresight/coresight-etm-perf.c | 289 ++++++-----
drivers/hwtracing/coresight/coresight-etm3x-core.c | 73 +--
drivers/hwtracing/coresight/coresight-etm4x-core.c | 216 +++-----
drivers/hwtracing/coresight/coresight-platform.c | 12 +-
drivers/hwtracing/coresight/coresight-priv.h | 8 +-
drivers/hwtracing/coresight/coresight-syscfg.c | 38 +-
drivers/hwtracing/coresight/coresight-syscfg.h | 2 +
drivers/hwtracing/coresight/coresight-sysfs.c | 135 ++---
drivers/hwtracing/coresight/coresight-tmc-etr.c | 4 +-
drivers/hwtracing/coresight/coresight-trbe.c | 61 ++-
drivers/hwtracing/coresight/ultrasoc-smb.c | 1 +
include/linux/coresight.h | 27 +-
include/linux/cpuhotplug.h | 2 +-
18 files changed, 939 insertions(+), 521 deletions(-)
Fix thread tracking when decoding Coresight trace and add a new test for
it.
The new test is added as a Perf test workload instead of a custom binary
with its own build system, but this requires a new feature in Perf test
to pass in control pipes which can enable and disable events. This
scopes the recording to just the workload and helps to reduce the amount
of data recorded in tracing tests.
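The idea of scoping a recording with control pipes can be sketched as follows. This is a minimal sketch, not the series' implementation: the cover letter does not say how the new option hands descriptors to the workload, so `send_ctl()` and `traced_region()` are hypothetical helpers. The `enable` and `disable` command strings and the `ack` reply follow the protocol documented for `perf record --control`.

```python
import os


def send_ctl(ctl_fd, command, ack_fd=None):
    """Write one newline-terminated command; optionally wait for the ack line."""
    os.write(ctl_fd, (command + "\n").encode())
    if ack_fd is None:
        return None
    reply = b""
    # Read byte by byte so a queued second reply is not consumed early.
    while not reply.endswith(b"\n"):
        chunk = os.read(ack_fd, 1)
        if not chunk:  # the other side closed the pipe
            break
        reply += chunk
    return reply.decode().strip()


def traced_region(ctl_fd, ack_fd, work):
    """Enable events, run work(), then disable events again."""
    send_ctl(ctl_fd, "enable", ack_fd)
    try:
        return work()
    finally:
        send_ctl(ctl_fd, "disable", ack_fd)
```

A workload would wrap only the code of interest in `traced_region()`, so that setup and teardown are not recorded.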
With this new feature we can rewrite all of the Coresight tests to make
use of it and remove the remaining binaries, which fixes the following
issues:
* They didn't work in out of source builds
* A lot of the tests unnecessarily required root and didn't skip
without it
* They were mainly qualitative tests which didn't look for specific
behavior
Most importantly, the long build time and runtime have been reduced. On a
Radxa Orion O6, unroll_loop_thread.c took 37s to compile, which is longer
than the entire Perf build. Now the build time is negligible, and the
before and after test runtimes for all the Coresight tests are:
       | N1SDP  | Orion O6
-----------------------------------
Before | 4m 0s  | 14m 49s
After  | 26s    | 56s
-----------------------------------
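For reference, the speedups implied by the runtimes in the table can be worked out as below. The numbers are taken from the table; the helper names are only for this sketch.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + seconds


def speedup(before_s, after_s):
    """Ratio of the old runtime to the new runtime."""
    return before_s / after_s


n1sdp = speedup(to_seconds(4, 0), to_seconds(0, 26))     # 240s -> 26s
orion = speedup(to_seconds(14, 49), to_seconds(0, 56))   # 889s -> 56s
print(f"N1SDP: {n1sdp:.1f}x, Orion O6: {orion:.1f}x")
# prints "N1SDP: 9.2x, Orion O6: 15.9x"
```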
Signed-off-by: James Clark <james.clark(a)linaro.org>
---
Changes in v2:
- Add --workload-ctl option to Perf test
- Re-write all the Coresight tests and speed them up
- Pass packet to memory access function so frontend can use either the
previous or current packet's EL
- Link to v1: https://lore.kernel.org/r/20260526-james-cs-context-tracking-fix-v1-0-ebd60…
---
James Clark (18):
perf cs-etm: Queue context packets for frontend
perf test: Add workload-ctl option
perf test: Add a workload that forces context switches
perf test cs-etm: Test process attribution
perf test: Add deterministic workload
perf test cs-etm: Replace unroll loop thread with deterministic decode test
perf test cs-etm: Remove asm_pure_loop test
perf test cs-etm: Replace memcpy test with raw dump stress test
perf test: Add named_threads workload
perf test cs-etm: Test decoding for concurrent threads test
perf test cs-etm: Remove duplicate branch tests
perf test cs-etm: Reduce snapshot size
perf test cs-etm: Speed up basic test
perf test cs-etm: Remove unused Coresight workloads
perf test cs-etm: Make disassembly test use kcore
perf test cs-etm: Add all branch instructions to test
perf test cs-etm: Speed up disassembly test
perf test cs-etm: Move existing tests to coresight folder
Documentation/trace/coresight/coresight-perf.rst | 78 +------
MAINTAINERS | 2 -
tools/perf/Documentation/perf-test.txt | 18 +-
tools/perf/Makefile.perf | 14 +-
tools/perf/scripts/python/arm-cs-trace-disasm.py | 20 +-
tools/perf/tests/builtin-test.c | 187 ++++++++++++++++-
tools/perf/tests/shell/coresight/Makefile | 29 ---
.../perf/tests/shell/coresight/Makefile.miniconfig | 14 --
tools/perf/tests/shell/coresight/asm_pure_loop.sh | 22 --
.../tests/shell/coresight/asm_pure_loop/.gitignore | 1 -
.../tests/shell/coresight/asm_pure_loop/Makefile | 34 ---
.../shell/coresight/asm_pure_loop/asm_pure_loop.S | 30 ---
.../tests/shell/coresight/concurrent_threads.sh | 45 ++++
.../tests/shell/coresight/context_switch_thread.sh | 69 ++++++
tools/perf/tests/shell/coresight/deterministic.sh | 71 +++++++
.../tests/shell/coresight/memcpy_thread/.gitignore | 1 -
.../tests/shell/coresight/memcpy_thread/Makefile | 33 ---
.../shell/coresight/memcpy_thread/memcpy_thread.c | 80 -------
.../tests/shell/coresight/memcpy_thread_16k_10.sh | 22 --
.../perf/tests/shell/coresight/raw_dump_stress.sh | 54 +++++
.../shell/{ => coresight}/test_arm_coresight.sh | 43 ++--
.../{ => coresight}/test_arm_coresight_disasm.sh | 17 +-
.../tests/shell/coresight/thread_loop/.gitignore | 1 -
.../tests/shell/coresight/thread_loop/Makefile | 33 ---
.../shell/coresight/thread_loop/thread_loop.c | 85 --------
.../shell/coresight/thread_loop_check_tid_10.sh | 23 --
.../shell/coresight/thread_loop_check_tid_2.sh | 23 --
.../shell/coresight/unroll_loop_thread/.gitignore | 1 -
.../shell/coresight/unroll_loop_thread/Makefile | 33 ---
.../unroll_loop_thread/unroll_loop_thread.c | 75 -------
.../tests/shell/coresight/unroll_loop_thread_10.sh | 22 --
tools/perf/tests/shell/lib/coresight.sh | 134 ------------
tools/perf/tests/tests.h | 3 +
tools/perf/tests/workloads/Build | 4 +
tools/perf/tests/workloads/context_switch_loop.c | 95 +++++++++
tools/perf/tests/workloads/deterministic.c | 39 ++++
tools/perf/tests/workloads/named_threads.c | 109 ++++++++++
tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 21 +-
tools/perf/util/cs-etm.c | 233 +++++++++++++--------
tools/perf/util/cs-etm.h | 8 +-
40 files changed, 892 insertions(+), 934 deletions(-)
---
base-commit: 5f0ca6b80b12bab1ce06839cdffb6148bb650ff4
change-id: 20260515-james-cs-context-tracking-fix-754998bae7ed
Best regards,
--
James Clark <james.clark(a)linaro.org>