Hello,
Sorry for the delay. I appreciate your posts.
I have recorded a different program now ("ping 8.8.8.8"), and it seems
that decoding the trace using the "ping" ELF file gives no issues. I
cannot explain why "ls" is the only corrupt trace (I re-recorded, same
results). Perhaps the image is indeed wrong.
I will check it further.
Thank you very much!
>
> On Thu, Sep 20, 2018 at 1:28 AM, Mike Leach <mike.leach(a)linaro.org> wrote:
>
>> Hi Mike,
>>
>> I have looked into this issue further, found my previous assumption to
>> be wrong, and unfortunately have come to the conclusion that the
>> generated trace is somehow wrong / corrupt, or the supplied image is
>> not what was run when the trace was generated.
>>
>> If you look at the attached analysis of the trace generated from the
>> ls_api.cs data [analysis001.txt], this is at the very start of the
>> traced image.
>>
>> The first few packets [raw packets (0)] show the sync and start at
>> 00000000004003f0 <_start>:
>> followed by the first 'E' atom that marks the branch to 0x41a158. The
>> next two 'E' atoms get us to 0x41a028.
>>
>> At this point we get an exception packet, followed by a preferred
>> return address packet [ raw packets (2)].
>> This return address is 0x400630.
>>
>> The rules from the ETM architecture specification 4.0-4.4 p6-242 state:-
>>
>> "The Exception packet contains an address. This means that execution
>> has continued from the target of the most
>> recent P0 element, up to, but not including, that address, and a trace
>> analyzer must analyze each instruction in this
>> range."
>>
>> Thus the decoder is required to analyze from the previous P0 element -
>> the 'E' atom that marked the branch to 0x41a028, until the preferred
>> return address.
>> This return address is actually lower than the range start, which
>> results in the huge range seen here, and also in the example you
>> described. The decoder effectively runs off the end of the memory
>> image before it stops.
>>
>> The trace should be indicating an address after, but relatively close
>> to, 0x41a028 - as otherwise an atom would have been emitted by the
>> cbnz at 0x41a054.
>>
>> If I examine the start of the perf_ls.cs decode, I see the same 3 'E'
>> atoms followed by the odd data fault exception.
>>
>> So for the first few branches at least, the perf and api captures go
>> in the same direction.
>>
>> Given that it is unlikely that the generated trace packets are
>> incorrect, it seems more likely that the 'ls' image being used for
>> decode is not what generated this trace. Since we have to analyze
>> opcodes to follow the 'E' and 'N' atoms, decode relies on accurate
>> memory images being fed into the decoder. The only actual addresses we
>> have explicitly stated in the trace are the start, 0x4003f0, and the
>> exception return address 0x400630. The others are synthesized from the
>> supplied image.
>>
>> There may be a case for checking when decoding the exception packet
>> that the address is not behind the current location and throwing an
>> error, but beyond that I do not at this point believe that the decoder
>> is at fault.
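The sanity check suggested in the paragraph above might look something like the following sketch (hypothetical helper name, not actual OpenCSD code): the exception packet's preferred return address should never fall behind the target of the most recent P0 element.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sanity check, not actual OpenCSD code: the exception
 * packet's preferred return address marks the end of a range that starts
 * at the target of the most recent P0 element, so an address behind that
 * target cannot describe a valid forward range. */
bool exception_return_addr_plausible(uint64_t last_p0_target,
                                     uint64_t preferred_ret_addr)
{
    return preferred_ret_addr >= last_p0_target;
}
```

With the values from this trace (last P0 target 0x41a028, preferred return address 0x400630) the check fails, so a decoder could raise a packet error instead of walking off the end of the image.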
>>
>> Regards
>>
>> Mike
>>
>>
>>
>> On 18 September 2018 at 19:32, Mike Leach <mike.leach(a)linaro.org> wrote:
>> > Hi Mike,
>> >
>> > I've looked further at this today, and can see a location where a
>> > large block appears in both the api and perf trace data on decode
>> > using the library test program.
>> >
>> > There does appear to be an issue if the decoder is in a "waiting for
>> > address" state, i.e. it has lost track (usually because an area of
>> > memory is unavailable), and an exception packet is seen - the exception
>> > address appears to be used twice, both to complete an address range
>> > and as an exception return - hence in this case the improbable large
>> > block. I need to look into this in more detail and fix it up.
>> >
>> > However, before this point I am seeing that the api and perf decodes
>> > have already diverged, which perhaps suggests an issue elsewhere too.
>> > I do need to look deeper into this as well.
>> > I am not 100% certain that using the ls.bin as a full memory image at
>> > 0x400000 is necessarily working in the snapshot tests - there might be
>> > another offset needed to access the correct opcodes for the trace.
>> >
>> > I'll let you know if I make further progress.
>> >
>> >
>> > On 17 September 2018 at 16:53, Mike Leach <mike.leach(a)linaro.org>
>> wrote:
>> >> Hi Mike,
>> >>
>> >> I've looked at the data you supplied.
>> >>
>> >> I created test snapshot directories so that I could run each of the
>> >> trace data files through the trc_pkt_lister test program (the attached
>> >> .tgz file contains these, plus the results).
>> >>
>> >> Now the two trace files are different sizes - this is explained by the
>> >> fact that the api trace run had cycle counts switched on, whereas the
>> >> perf run did not - plus the perf run turned off the trace while in
>> >> kernel calls - the api left the trace on, though filtering out the
>> >> kernel - but a certain amount of sync packets have come through adding
>> >> to the size.
>> >>
>> >> Now looking at the results I cannot see the 0x4148f4 location in
>> >> either trace dump (perf_ls2.ppl and api_ls2.ppl in the .tgz).
>> >>
>> >> There are no obvious differences I could detect in the results, though
>> >> they are difficult to compare given the difference in output.
>> >>
>> >> The effect you are seeing does look like some sort of runaway - with
>> >> the decoder searching for opcodes - possibly in a section of the ls
>> >> binary file that does not contain executable code - till it happens
>> >> upon something that looks like an opcode.
>> >>
>> >> At this point I cannot explain the difference you and I are seeing
>> >> given the data provided. Can you look at the snapshot results, and see
>> >> if there is anything there? You can re-run the tests I ran if you
>> >> rename ls to ls.bin and put it one level up from the ss-perf or ss-api
>> >> snapshot directories, where the file is referenced.
>> >>
>> >> Regards
>> >>
>> >> Mike
>> >>
>> >>
>> >>
>> >>
>> >> On 17 September 2018 at 13:44, Mike Bazov <mike(a)perception-point.io>
>> wrote:
>> >>> Greetings,
>> >>>
>> >>> I recorded the program "ls" (statically linked, so a single
>> >>> executable serves as the memory access image).
>> >>>
>> >>> I recorded the program using perf, and then extracted the actual raw
>> >>> trace data from the perf.data file using a little tool I wrote. I can
>> >>> use OpenCSD to fully decode the trace produced by perf.
>> >>>
>> >>> I also recorded the "ls" util using an API I wrote from kernel mode.
>> >>> I published the API here as an [RFC]. Basically, I start recording
>> >>> and stop recording whenever the __process__ of my interest is
>> >>> scheduled in.
>> >>> This post is not so much a request for review of my API, but I do
>> >>> have some issues with the trace that is produced by this API, and I'm
>> >>> not quite sure why.
>> >>>
>> >>> I use OpenCSD directly in my code, and register a decoder callback
>> >>> for every generic trace element. When my callback is called, I simply
>> >>> print the element's string representation (e.g.
>> >>> OCSD_GEN_TRC_ELEM_INSTR_RANGE).
>> >>>
>> >>> Now, the weird thing is that the perf and API runs produce the same
>> >>> generic elements up to a certain element:
>> >>>
>> >>> OCSD_GEN_TRC_ELEM_TRACE_ON()
>> >>> ...
>> >>> ...
>> >>> ... same elements...
>> >>> ... same elements...
>> >>> ... same elements...
>> >>> ...
>> >>> ...
>> >>>
>> >>> And eventually they diverge from each other. I assume the perf trace
>> >>> is going in the right direction, but my trace simply starts going
>> >>> nuts. The last __common__ generic element is the following:
>> >>>
>> >>> OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec range=0x4148f4:[0x414910]
>> >>> (ISA=A64) E iBR A64:ret )
>> >>>
>> >>> After this element, the perf trace goes down a different route, and
>> >>> the API trace right afterwards produces a very weird instruction
>> >>> range element:
>> >>>
>> >>> OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec range=0x414910:[0x498a20]
>> >>> (ISA=A64) E --- )
>> >>>
>> >>> There is no way this 0x498a20 address was reached, and I cannot see
>> >>> any proof of it in the trace itself (using ptm2human). It seems that
>> >>> the decoder keeps decoding and disassembling opcodes until it reaches
>> >>> 0x498a20... my memory callback (called when the decoder needs memory
>> >>> that isn't present) is called for the address 0x498a20. From then on,
>> >>> the trace just goes down a very weird path. I can't explain the
>> >>> address branches that are taken from here on.
>> >>>
>> >>>
>> >>> Any ideas on how to approach this? Input from OpenCSD experts would
>> >>> be appreciated.
>> >>> I have attached the perf and API traces, and the "ls" executable,
>> >>> which is loaded at address 0x400000. I also attached the ETMv4 config
>> >>> for each trace (trace ID, etc.). There is no need to create multiple
>> >>> decoders for different trace IDs; there's only a single ID for a
>> >>> single decoder.
>> >>>
>> >>> Thanks,
>> >>> Mike.
>> >>>
>> >>> _______________________________________________
>> >>> CoreSight mailing list
>> >>> CoreSight(a)lists.linaro.org
>> >>> https://lists.linaro.org/mailman/listinfo/coresight
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Mike Leach
>> >> Principal Engineer, ARM Ltd.
>> >> Manchester Design Centre. UK
>> >
>> >
>> >
>> > --
>> > Mike Leach
>> > Principal Engineer, ARM Ltd.
>> > Manchester Design Centre. UK
>>
>>
>>
>> --
>> Mike Leach
>> Principal Engineer, ARM Ltd.
>> Manchester Design Centre. UK
>>
>
>
Updated library:
- fixes bug with generic exception packets being set with wrong
exception number in ETMv4
- updates docs with latest AutoFDO instructions, and record.sh script
--
Mike Leach
Principal Engineer, ARM Ltd.
Manchester Design Centre. UK
Greetings,
I recorded the program "ls" (statically linked, so a single
executable serves as the memory access image).
I recorded the program using perf, and then extracted the actual raw trace
data from the perf.data file using a little tool I wrote. I can use OpenCSD
to fully decode the trace produced by perf.
I also recorded the "ls" util using an API I wrote from kernel mode. I
published the API here as an [RFC]. Basically, I start recording and stop
recording whenever the __process__ of my interest is scheduled in.
This post is not so much a request for review of my API, but I do have
some issues with the trace that is produced by this API, and I'm not quite
sure why.
I use OpenCSD directly in my code, and register a decoder callback for
every generic trace element. When my callback is called, I simply print the
element's string representation (e.g. OCSD_GEN_TRC_ELEM_INSTR_RANGE).
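As a rough self-contained model of that arrangement (illustrative names only, the real OpenCSD API types and signatures differ), the decoder side invokes one registered callback per generated element:

```c
#include <stdint.h>

/* Minimal model of a generic-element callback: the decoder invokes a
 * user-registered function once per generated element. All names here are
 * illustrative only - the real OpenCSD signatures differ. */
typedef struct {
    const char *name;      /* e.g. "OCSD_GEN_TRC_ELEM_INSTR_RANGE" */
    uint64_t    st_addr;   /* range start address */
    uint64_t    en_addr;   /* range end address */
} gen_elem_t;

typedef void (*gen_elem_cb_t)(const gen_elem_t *elem, void *ctx);

static gen_elem_cb_t g_cb;
static void *g_ctx;

void register_gen_elem_cb(gen_elem_cb_t cb, void *ctx)
{
    g_cb = cb;
    g_ctx = ctx;
}

/* Called by the (modelled) decoder for each element it produces. */
void emit_elem(const gen_elem_t *elem)
{
    if (g_cb)
        g_cb(elem, g_ctx);
}

/* Example callback that just counts invocations. */
static int g_count;
static void count_cb(const gen_elem_t *elem, void *ctx)
{
    (void)elem;
    (void)ctx;
    g_count++;
}
```

In the real setup the callback would print the element string instead of counting, but the control flow - one callback call per generic element - is the same.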
Now, the weird thing is that the perf and API runs produce the same generic
elements up to a certain element:
OCSD_GEN_TRC_ELEM_TRACE_ON()
...
...
... same elements...
... same elements...
... same elements...
...
...
And eventually they diverge from each other. I assume the perf trace is
going in the right direction, but my trace simply starts going nuts. The
last __common__ generic element is the following:
OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec range=0x4148f4:[0x414910] (ISA=A64) E
iBR A64:ret )
After this element, the perf trace goes down a different route, and the API
trace right afterwards produces a very weird instruction range element:
OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec range=0x414910:[0x498a20] (ISA=A64) E
--- )
There is no way this 0x498a20 address was reached, and I cannot see any
proof of it in the trace itself (using ptm2human). It seems that the
decoder keeps decoding and disassembling opcodes until it reaches 0x498a20...
my memory callback (called when the decoder needs memory that isn't
present) is called for the address 0x498a20. From then on, the trace
just goes down a very weird path. I can't explain the address branches that
are taken from here on.
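The memory callback behaviour described here can be modelled with a self-contained sketch (hypothetical types and names, not the OpenCSD interface): returning fewer bytes than requested, here zero, tells the decoder the address is not backed by the supplied image, which is what happens for an address like 0x498a20 beyond the image end.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical memory-access callback model: the decoder asks for opcode
 * bytes at an address; returning fewer bytes than requested (here 0)
 * signals that the memory is not available. */
typedef struct {
    const uint8_t *bytes;  /* image contents */
    uint64_t       base;   /* load address, e.g. 0x400000 */
    uint64_t       size;   /* image size in bytes */
} mem_image_t;

size_t mem_cb(const mem_image_t *img, uint64_t addr, uint8_t *buf, size_t len)
{
    if (addr < img->base || addr >= img->base + img->size)
        return 0;                        /* not mapped: decoder must stop */
    uint64_t avail = img->base + img->size - addr;
    if ((uint64_t)len > avail)
        len = (size_t)avail;             /* truncated read at image end */
    memcpy(buf, img->bytes + (addr - img->base), len);
    return len;
}
```

A runaway decode shows up as a request far past the image end taking the "not mapped" path, exactly as observed for 0x498a20.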
Any ideas on how to approach this? Input from OpenCSD experts would be
appreciated.
I have attached the perf and API traces, and the "ls" executable, which is
loaded at address 0x400000. I also attached the ETMv4 config for each
trace (trace ID, etc.). There is no need to create multiple decoders for
different trace IDs; there's only a single ID for a single decoder.
Thanks,
Mike.
Hi,
While attending the Hardware Trace on Linux talk at Linaro Connect, the topic of ARMv7 support for trace decoding came up.
What is the current state and are there patches available somewhere?
--
Stefan
This patch set explores using CoreSight tracing data for postmortem
debugging. When a kernel panic happens, the CoreSight panic kdump support
can save on-chip tracing data and tracer metadata into DRAM, and later
relies on kdump and the crash/perf tools to recover the tracing data for
"offline" analysis.
Compared with v4 and earlier series, this patch series has been heavily
refactored after investigating Intel PT's kdump support. Intel PT calls a
single function to stop tracing in an emergency when a kernel panic
occurs; in that function it reuses the perf machinery to dump trace data
into the ring buffer, and the crash tool later extracts the trace data
from the perf ring buffer.
This patch series follows the same approach to stop ETM trace in perf
mode. So far the work focuses on supporting CoreSight kdump with perf
mode; SysFS mode support can be added later if a clearer requirement
emerges.
Compared with the previous series, this patch series also simplifies the
handling of tracer metadata. The old series introduced an extra data
structure and two doubly linked lists to maintain the CoreSight kdump
components; one list was used to track tracer metadata and the other to
track dump buffers, and these two lists were later used to retrieve
metadata and trace data buffers from the vmcore file. This patch series
instead relies directly on CoreSight driver global variables to retrieve
the related info; e.g. for perf mode we can rely on the per-CPU pointer
'ctx_handle' to get perf ring buffer info, and 'csdev_src' is the per-CPU
tracer device structure for metadata.
The crash extension program has now been enhanced to parse these data
structures in the kernel and use them to extract metadata and dump trace
data [1]; the crash extension program is also updated to build with the
OpenCSD decoder, which simplifies the decoding process compared with
previously needing perf to help decode the trace data.
This patch series has been verified on a 96Boards DB410c with the steps
below; 'long_loop' is a very simple program that only executes a large
number of loop iterations, so it generates a large number of branch
instructions.
Enable trace on the target board:
$ perf record -e cs_etm/@825000.etf/ --per-thread ./long_loop &
$ sleep 3
$ echo c > /proc/sysrq-trigger
Use crash tool for post analysis:
$ crash vmcore vmlinux
crash> extend arm_cs_dump.so
crash> arm_cs_dump -o out
[1] https://git.linaro.org/people/leo.yan/crash.git/log/?h=arm_cs_dump_etm_perf
Changes from v4:
* Support for CoreSight ETM with perf mode;
* Add API for crash stop;
* Simplified the implementation by removing kdump-dedicated data
structures and functions;
Changes from v3:
* Following Mathieu's suggestion, reworked the panic kdump framework,
using a kdump array to maintain source and sink device handlers;
* Following Mathieu's suggestion, optimized the panic notifier to
first dump the panicking CPU's tracing data and then dump the other
CPUs' tracing data;
* Refined the doc to reflect these implementation changes;
* Changed the ETMv4 driver to add the source device handler at probe phase;
* Refactored the crash extension program to reflect the kernel changes.
Changes from v2:
* Added the two documentation patches.
* Following Mathieu's suggestion, reworked the panic kdump framework and
removed the useless flag "PRE_PANIC".
* As per review comments, changed to add and delete kdump node operations
in the sink enable/disable functions;
* Following Mathieu's suggestion, handled kdump node
addition/deletion/updating separately for the sysFS interface and perf
method.
Changes from v1:
* Added support to dump ETMv4 metadata.
* Wrote the 'crash' extension csdump.so and rely on it to generate a
'perf'-format compatible file.
* Refactored the panic dump driver to support pre & post panic dump.
Changes from RFC:
* Followed Mathieu's suggestion to use a general framework to support the
dump functionality.
* Changed to use perf to analyse the trace data.
Leo Yan (6):
doc: Add Coresight documentation directory
doc: Add documentation for Coresight panic kdump
coresight: etm4x: Save ID values in config structure
coresight: tmc: Update latest value for page index and offset
coresight: etm-perf: Add interface to stop etm trace
arm64: smp: Stop CoreSight trace for kdump
.../trace/{ => coresight}/coresight-cpu-debug.txt | 0
.../trace/coresight/coresight-panic-kdump.txt | 99 ++++++++++++++++++++++
Documentation/trace/{ => coresight}/coresight.txt | 0
MAINTAINERS | 5 +-
arch/arm64/kernel/smp.c | 5 ++
drivers/hwtracing/coresight/Kconfig | 10 +++
drivers/hwtracing/coresight/coresight-etm-perf.c | 10 +++
drivers/hwtracing/coresight/coresight-etm4x.c | 7 ++
drivers/hwtracing/coresight/coresight-etm4x.h | 8 ++
drivers/hwtracing/coresight/coresight-tmc-etf.c | 8 ++
include/linux/coresight.h | 6 ++
11 files changed, 156 insertions(+), 2 deletions(-)
rename Documentation/trace/{ => coresight}/coresight-cpu-debug.txt (100%)
create mode 100644 Documentation/trace/coresight/coresight-panic-kdump.txt
rename Documentation/trace/{ => coresight}/coresight.txt (100%)
--
2.7.4
Coresight architecture defines CLAIM tags for a device to negotiate
control of the components (external agent vs self-hosted). Each device
has a pair of registers (CLAIMSET & CLAIMCLR) for managing the CLAIM
tags. However, the protocol for the CLAIM tags is IMPLEMENTATION DEFINED.
PSCI has recommendations for the use of the CLAIM tags to negotiate
controls for external agent vs self-hosted use, as defined in
ARM DEN 0022D, Section "6.8.1 Debug and Trace save and restore".
This series implements the recommended protocol by PSCI.
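A self-contained simulation of that handshake (assuming bit 0 for the external agent tag and bit 1 for self-hosted, following the PSCI recommendation; this is a sketch, not the actual driver code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Modelled CLAIM tag registers: writes to SET set bits, writes to CLR
 * clear bits, reads return the current tag value. Bit assignments follow
 * the PSCI recommendation: bit 0 = external agent, bit 1 = self-hosted.
 * This is a simulation, not driver code. */
#define CLAIM_EXTERNAL    (1u << 0)
#define CLAIM_SELF_HOSTED (1u << 1)

typedef struct { uint32_t tags; } claim_regs_t;

static void claimset_write(claim_regs_t *r, uint32_t v) { r->tags |= v; }
static void claimclr_write(claim_regs_t *r, uint32_t v) { r->tags &= ~v; }
static uint32_t claim_read(claim_regs_t *r) { return r->tags; }

/* Try to claim the device for self-hosted use; back off if an external
 * agent already holds its tag. */
bool claim_device(claim_regs_t *r)
{
    claimset_write(r, CLAIM_SELF_HOSTED);
    if (claim_read(r) & CLAIM_EXTERNAL) {
        claimclr_write(r, CLAIM_SELF_HOSTED);  /* lost the race */
        return false;
    }
    return true;
}

void disclaim_device(claim_regs_t *r)
{
    claimclr_write(r, CLAIM_SELF_HOSTED);
}
```

Setting the self-hosted tag first and then re-reading the tags lets a self-hosted agent detect that an external agent holds the device and back off cleanly.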
There were two options for the implementation.
1) Have the claim/disclaim operations performed from the generic
coresight driver - unfortunately this doesn't work for ETM devices,
as they need cross-CPU calls to access the CLAIM registers. It also
makes error recovery and reference counting complex.
2) Have the claim/disclaim operations performed from the device
specific drivers. The disadvantage is that the calls are sprinkled
in each driver, but this makes the operation much simpler.
This series implements the method (2). The first part of the series
prepares different drivers to handle errors from the lower layer
and clean up the state. The second part of the series updates the
existing drivers to claim/disclaim the devices as necessary.
Tested with a hacked coresight driver which modifies the external
claim tag via sysfs handle.
Applies on coresight/next in Mathieu's tree.
Changes since V1:
- Handle errors in the enabling path and disable only the components
that were enabled in the iteration.
- Fix build break on arm32 (etm3x)
- Update commit description for "coresight: Add support for CLAIM tag protocol"
Suzuki K Poulose (14):
coresight: Handle failures in enabling a trace path
coresight: tmc-etr: Refactor for handling errors
coresight: tmc-etr: Handle errors enabling CATU
coresight: tmc-etb/etf: Prepare to handle errors enabling
coresight: etm4x: Add support for handling errors
coresight: etm3: Add support for handling errors
coresight: etb10: Handle errors enabling the device
coresight: dynamic-replicator: Handle multiple connections
coresight: Add support for CLAIM tag protocol
coresight: etmx: Claim devices before use
coresight: funnel: Claim devices before use
coresight: catu: Claim device before use
coresight: dynamic-replicator: Claim device for use
coresight: tmc: Claim device before use
drivers/hwtracing/coresight/coresight-catu.c | 6 ++
.../coresight/coresight-dynamic-replicator.c | 79 ++++++++++----
drivers/hwtracing/coresight/coresight-etb10.c | 18 +++-
drivers/hwtracing/coresight/coresight-etm3x.c | 56 +++++++---
drivers/hwtracing/coresight/coresight-etm4x.c | 51 ++++++---
drivers/hwtracing/coresight/coresight-funnel.c | 26 ++++-
drivers/hwtracing/coresight/coresight-priv.h | 7 ++
drivers/hwtracing/coresight/coresight-tmc-etf.c | 95 +++++++++++------
drivers/hwtracing/coresight/coresight-tmc-etr.c | 80 +++++++++-----
drivers/hwtracing/coresight/coresight.c | 118 +++++++++++++++++++--
include/linux/coresight.h | 20 ++++
11 files changed, 434 insertions(+), 122 deletions(-)
--
2.7.4
This patch series contains two fixes for ring buffer updating in the
tmc-etf driver. The first patch fixes the byte-address alignment setting
for RRP; the second patch fixes an issue where trace data was discarded
due to barrier packets being filled in the same place, and instead keeps
the trace data complete by inserting extra barrier packets.
This patch series has been rebased on CoreSight next branch:
https://git.linaro.org/kernel/coresight.git/log/?h=next with latest
commit 3733ca5a6578 ("coresight: tmc: Refactor loops in etb dump").
Changes from v1:
* Rebased on CoreSight next branch (Sept 11th, 2018);
* Added checking 'lost || to_read > handle->size' to set 'barrier_sz'.
Leo Yan (2):
coresight: tmc: Fix byte-address alignment for RRP
coresight: tmc: Fix writing barrier packets for ring buffer
drivers/hwtracing/coresight/coresight-tmc-etf.c | 41 +++++++++++++++++--------
1 file changed, 29 insertions(+), 12 deletions(-)
--
2.7.4
We do not enable scatter-gather mode in the TMC-ETR by default,
to prevent malfunction on systems where the ETR may not be
properly connected to the memory subsystem to allow simultaneous
READ/WRITE transactions when used in SG mode. Instead we whitelist
the platforms where we know it is safe to use the mode.
All revisions of Juno have a proper ETR connection, hence whitelist
them all.
Cc: Mathieu Poirier <mathieu.poirier(a)linaro.org>
Cc: Mike Leach <mike.leach(a)linaro.org>
Cc: Sudeep Holla <sudeep.holla(a)arm.com>
Cc: Liviu Dudau <liviu.dudau(a)arm.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi(a)arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose(a)arm.com>
---
arch/arm64/boot/dts/arm/juno-base.dtsi | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/boot/dts/arm/juno-base.dtsi b/arch/arm64/boot/dts/arm/juno-base.dtsi
index ce56a4a..3596e5d 100644
--- a/arch/arm64/boot/dts/arm/juno-base.dtsi
+++ b/arch/arm64/boot/dts/arm/juno-base.dtsi
@@ -199,6 +199,7 @@
clocks = <&soc_smc50mhz>;
clock-names = "apb_pclk";
power-domains = <&scpi_devpd 0>;
+ arm,scatter-gather;
port {
etr_in_port: endpoint {
slave-mode;
--
2.7.4