Hello Mathieu
We would also like to trace minor page faults in the kernel when a user thread hits a page fault.
I am using the following perf command to start the trace for a given thread.
perf record -e cs_etm/@8008046000.etf/ --per-thread --pid 2231 &
The captured trace contains only user-space activity; I do not see any kernel trace.
Is this the correct perf command?
Regards, Reza
Hi Thierry,
I see you have also sent this mail to Mathieu, who has answered some of the points and CC'ed the Linaro CoreSight mailing list.
I'll give you my spin on a couple of things here.....
> -----Original Message-----
> From: Thierry Laviron
> Sent: 29 June 2017 15:45
> To: Mike Leach
> Subject: Using Coresight in SysFS mode on Juno board
>
> Hi Mike,
>
>
>
> I am currently trying to get trace data using the CoreSight system in SysFS
> mode on my Juno r2 board.
>
>
>
> I found some documentation on how to use it in the
> Documentation/trace/coresight.txt file of the perf-opencsd-4.11 branch of the
> OpenCSD repository.
>
>
>
> This document says that I can retrieve the trace data from /dev/ using dd, for
> example in my case that would be
>
> root@juno-debian:~# dd if=/dev/20070000.etr of=~/cstrace.bin
>
>
>
> However, I am assuming this produces a dump of the memory buffer as it was
> when I stopped trace collection,
>
> And that I do not have the full trace data generated (because it does not fit
> in the buffer).
>
> I would like to be able to capture a continuous stream of data from the ETR, but
> did not find how I should do that.
>
It is not possible to read trace while still collecting it - the process you are tracing must be stopped while trace is saved. Perf can achieve this as it is integrated into the kernel, but this is difficult to achieve from the sysfs interface.
As Mathieu says, you need to limit the amount of trace to the application you are tracing - but even so, the rate of trace collection can easily overflow buffers.
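For reference, a minimal stop-and-dump sequence over sysfs looks something like the sketch below. The ETR name is taken from your mail; the ETM device name is an assumption - list /sys/bus/coresight/devices/ on your board to find the real ones.

  echo 1 > /sys/bus/coresight/devices/20070000.etr/enable_sink
  echo 1 > /sys/bus/coresight/devices/22040000.etm/enable_source
  <run the workload to be traced>
  echo 0 > /sys/bus/coresight/devices/22040000.etm/enable_source
  dd if=/dev/20070000.etr of=~/cstrace.bin

The dd is only meaningful once enable_source has been cleared and trace collection has stopped.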
>
>
> I am writing a C program. Can I open read access to the ETR buffer like this?
>
> open("/dev/20070000.etr", O_RDONLY);
>
>
>
> and then read its contents, to write somewhere else (e.g. to a file on disk)?
>
>
>
> As a second step, I am also trying to filter the trace generated. I found some
> useful documentation in
>
> Documentation/ABI/testing/sysfs-bus-coresight-devices-etm4x
>
> However, while this is very useful for understanding the purpose of the
> different files that appear in the
>
> /sys/bus/coresight/devices/<mmap>.etm/ folders, I am not sure of the format
> to use when writing to them.
>
>
>
> For example, I want to use the Context ID comparator, so the ETM traces only
> the process I am interested in.
>
> I assume I need to write the PID of my process in ctxid_pid, probably write 0x1
> in ctxid_idx to activate it, and leave 0x0 in ctxid_mask
>
> according to the ETM v4.3 architecture specification.
>
> But I feel that I am missing something else, as it seems the ETM is not taking
> the filter into account.
>
i) You will need to have enabled PID=>context ID tracking in your kernel (CONFIG_PID_IN_CONTEXTIDR).
ii) You need to set up the ViewInst event resource selector to select a context ID event to start and stop the trace, in addition to setting the context ID comparators.
Additionally you will need some address range enabled as well - though by default the etm drivers set up the full address range under sysfs.
The hardware registers needed for all this are described in the ETM TRM, but at present I don't know of any docs that map the sysfs names onto the relevant HW registers.
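As a rough sketch (file names per the sysfs-bus-coresight-devices-etm4x doc; the exact sequence below is my reading of it, so verify it against that doc and the TRM):

  cd /sys/bus/coresight/devices/<mmap>.etm/
  echo 0 > ctxid_idx        # select context ID comparator 0
  echo <PID> > ctxid_pid    # context ID value the comparator should match
  cat ctxid_masks           # confirm no mask bits exclude the comparison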
Regards
Mike
>
>
> If there is more relevant documentation on this that I have not found, I would
> appreciate if you could point me to it.
>
> If not, and what I am trying to do will not work, I would welcome some advice
> on how to do it properly.
>
>
>
> Thanks in advance.
>
>
>
> Best regards,
>
>
>
> Thierry Laviron
On 11 July 2017 at 09:25, Etemadi, Mohammad <mohammad.etemadi@intel.com> wrote:
> Hello Mathieu
>
>
>
> In our platform we have a few clusters, each with its own funnel and ETF.
> There is no ETR. Each cluster has 4 ETMs.
>
> How can I enable trace for all the clusters? Do the following commands
> only enable trace in one cluster?
>
>
>
>
>
> perf record -e cs_etm/@8008010000.etf/ --per-thread uname
>
> perf record -e cs_etm/@8008030000.etf/ --per-thread uname
>
> perf record -e cs_etm/@8008050000.etf/ --per-thread uname
>
> perf record -e cs_etm/@8008070000.etf/ --per-thread uname
Hi Reza,
I'm adding the CoreSight mailing list; there are a lot of knowledgeable
people there who can help you if I'm not around or don't know the
answer to your questions. I suggest you CC the list when you need
information.
From the above description I deduce the platform doesn't have a common
sink for all the tracers available on the board - instead it has an
ETF for each cluster. When I wrote the CS drivers I could foresee
this kind of topology would show up one day but simply didn't have any
HW to test on. As such it was written with the assumption that all
tracers have a common sink. Yours is the second platform I've been
told about where my initial assumptions no longer hold.
Unfortunately there is no way to enable traces for all clusters with
the current implementation. I have plans to work on it though... For
now you will need to use the taskset utility to confine an application
to specific processors. So something like:
# perf record -e cs_etm/@8008010000.etf/ --per-thread taskset 0x$MASK uname
will do the trick.
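If each cluster's ETF can only collect trace from that cluster, the same
pattern repeats per cluster with a CPU mask matching that cluster -
something like the following, where the CPU-to-cluster mapping is an
assumption you'd need to check against your device tree:

# perf record -e cs_etm/@8008010000.etf/ --per-thread taskset 0xf uname
# perf record -e cs_etm/@8008030000.etf/ --per-thread taskset 0xf0 uname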
Thanks,
Mathieu
>
>
>
>
>
> Regards, Reza
On 29 June 2017 at 05:46, Leo Yan <leo.yan@linaro.org> wrote:
> Hi Mathieu, Mike,
Good morning Leo,
>
> Guodong and I have been planning to enable coresight on Hikey960, but we
> are not quite sure whether you have a requirement for this or not.
I currently don't have the bandwidth to work on this.
> Guodong told me that so far the community has not had much input on
> coresight enabling for Hikey960, so I want to check whether you are
> interested in coresight work on this platform; if you think there is
> a strong requirement, we can start the related enabling with Hisilicon.
I would be delighted to have CS support on Hikey960 - current
platforms are well supported but the passage of time can't be ignored.
>
> Mathieu/Chunyan previously put much effort into enabling Hikey; as you
> can see, for Hikey960 we still have very poor documentation for the
> coresight module (see the section below). So if you think this is
> important for your work, Guodong and I will sync with Hisilicon on
> coresight enabling ASAP (we depend heavily on Hisilicon to provide
> the clock and coresight topology information). From the Hikey
> experience this took a very long time, but we can summarize the
> checkpoints from that experience and speed this up a bit (if
> necessary, I'm glad to work in the Hisilicon lab to enable it).
>
> If you think this platform is redundant with others, I will still send
> this to Hisilicon and treat it as a low-priority task.
I don't think it's redundant at all...
Before you start implementing anything I'd like to see the CoreSight
topology for this board. Newer designs are getting more creative and
there may be cases we haven't anticipated in the initial design. If
that's the case I'll spot them right away and offer ways to address
the problems.
Regards,
Mathieu
>
> -----
> 2.7.2 CoreSight Debugging
> The Hi3660 has a powerful debug system that integrates an ARM
> CoreSight system. The CoreSight system supports the following features:
> - Top-level CoreSight and local CoreSight in each cluster. The local
> CoreSight contains the A73 CoreSight and A53 CoreSight.
> - Intrusive debugging (debug) and non-intrusive debugging (trace)
> A73 and A53 support both debug and trace.
> - Software debugging and traditional JTAG debugging
>
> Thanks,
> Leo Yan
Good day Thierry,
On 29 June 2017 at 03:09, Thierry Laviron <Thierry.Laviron@arm.com> wrote:
> Hi Mathieu,
>
>
>
> I am currently trying to get trace data using the CoreSight system in SysFS
> mode on my Juno r2 board.
>
>
>
> I found some documentation on how to use it in the
> Documentation/trace/coresight.txt file of the perf-opencsd-4.11 branch of
> the OpenCSD repository.
>
>
>
> This document says that I can retrieve the trace data from /dev/ using dd,
> for example in my case that would be
>
> root@juno-debian:~# dd if=/dev/20070000.etr of=~/cstrace.bin
>
>
>
> However, I am assuming this produces a dump of the memory buffer as it was
> when I stopped trace collection,
That is correct.
>
> And that I do not have the full trace data generated (because it does not
> fit in the buffer).
Also correct. If there was a buffer overflow then you'll only get the
latest trace data.
>
> I would like to be able to capture a continuous stream of data from the ETR,
> but did not find how I should do that.
>
Currently the only way to do that is to use coresight from the perf
interface (see HOWTO.md on github).
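A sketch of that flow, reusing the ETR from your mail (the traced binary
is a placeholder):

# perf record -e cs_etm/@20070000.etr/ --per-thread ./my_app

Perf stops the process while the buffer is drained, which is what makes
continuous collection possible.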
>
>
> I am writing a C program. Can I open read access to the ETR buffer like
> this?
>
> open("/dev/20070000.etr", O_RDONLY);
So simply have a read() or a select() blocking on the file descriptor,
waiting for trace data to be produced and consuming it as it is
generated?
>
>
>
> and then read its contents, or pipe it somewhere else (e.g. to a file on
> disk)?
Unfortunately no.
>
>
>
> If there is more relevant documentation on this that I have not found, I
> would appreciate if you could point me to it.
>
> If not, and what I am trying to do will not work, I would welcome some
> advice on how to do it properly.
You are raising an interesting scenario that hasn't occurred before.
When operating from sysFS the problem is to program the tracers to
reduce the amount of trace generated. Otherwise userspace can't
possibly cope and you'd end up with buffer overflows. But let's
assume you have that part covered; there is still the problem of when
to move trace data from the ETR buffer (contiguous or SG list) to the
buffer conveyed by read()/select(). That is a tedious problem that
currently doesn't have a solution.
As I said earlier this is a compelling use case. As such I am copying
the coresight mailing list along with Mike and Suzuki. Someone might
have some interest in working on this or some thoughts on how to
address the issue. It's even better if you want to offer a solution -
we'll be happy to provide help and support.
Thanks,
Mathieu
>
>
>
> Thanks in advance.
>
>
>
> Best regards,
>
>
>
> Thierry Laviron
On Fri, 26 May 2017 14:12:21 +0100 Mike Leach wrote:
> Hi,
>
> Tried out Sebastian's patches and got results similar to Kim's, with a
> couple of differences, and some interesting results if you look at the
> disassembly of the resulting routines.
>
> So as per the AutoFDO instructions I built a sort program with no
> optimisations and debug:
> gcc -g sort.c -o sort
> This I profiled on Juno with 3000 iterations.
>
> The resulting disassembly of the bubble_sort routine is in
> bubble-sorts-disass.txt and the dump gcov profile is below...
> --------------------------------
> bubble_sort total:33987051 head:0
> 0: 0
> 1: 0
> 2: 2839
> 3: 2839
> 4: 2839
> 4.1: 8522673
> 4.2: 8519834
> 5: 8517035
> 6: 2104748
> 7: 2104748
> 8: 2104748
> 9: 2104748
> 13: 0
> -------------------------------
> So in my view the swap lines (6:-9:) - see attached sort.c - run
> less often than the enclosing loop (2:-4:, 4.1:-5:), which is what
> Kim observed with the intel version.
> The synthesized LBR records looked reasonable from comparison with the
> disassembly too.
>
> Trying out the O3 and O3-autofdo builds from this profile resulted in
> O3 running marginally faster, but both were faster than the
> unoptimised debug build.
>
> So now look at the disassemblies from the -O3 and -autofdo-O3 versions
> of the sort routine [bubble-sorts-disass.txt again]. Both appear to
> define a bubble_sort routine, but embed the same / similar code into
> sort_array.
> Unsurprisingly the O3 version is considerably more compact - hence it
> runs faster. I have no idea what the autofdo version is up to, but
> I cannot see how the massive expansion of the routine with compare and
> jump tables is going to help.
>
> So perhaps:-
> 1) the LBR stacks are still not correct - though code and dump
> inspection might suggest otherwise - are there features in the intel
> LBR we are not yet synthesizing?
> 2) There is some adverse interaction with the profiles we are
> generating and the autofdo code generation.
> 3) The amount of coverage in the file is affecting the process - looking
> at the gcov above, we only have trace from the bubble sort routine. I
> did reduce the number of iterations to get more of the program
> captured in coverage but this did not seem to make a difference.
> Mike
Apologies for the delay in replying to this.
Some further thoughts on this.
1) This is not an apples-to-apples comparison. The baseline code will most likely have different optimizations applied for x86-64, which will give rise to different code paths and so different profiles. Also, is someone here able to comment on the extent to which the optimizations applied by the "autofdo-O3" compiler are machine-independent?
I assume that the work done to create that flow has been done on an x86 version of the compiler, and it might be that regressions exist in the A64 compiler that do not exist in x86: I don't know. For example, the unrolling done for the sort.c example might not be a suitable optimization for the target CPU.
This isn't a real-world code example. Bubble sort is sorting random data, so at its heart is an unpredictable compare-and-swap check, and a small inner-loop. The unrolled code, on the other hand, contains many unpredictable branches. It would be better to reproduce this experiment, if not on real-world code then at least on a more sensible benchmark.
2) AIUI, "perf inject --itrace" on the ETM uses systematic block-based sampling to break the trace into LBR records. (That is, after N trace block records it creates a sample with an LBR attached, where a trace block represents a sequence of instructions between two waypoints.) E.g. "perf inject --itrace=il64"
Conversely, also AIUI, the reference method for doing this with Intel PT samples based on a reconstructed view of time. (That is, every N reconstructed clock periods, it creates a sample with an LBR attached.) E.g. "perf inject --itrace=i100usle".
Time-based sampling will generate more samples from code hot spots, where a hot spot is defined as where *time* is spent in the program. The ETM flow will also favour hot spots, obviously, because these will appear more in the trace. However, because the sampling is not time-based, each *range* is as likely to be sampled as any other range.
E.g. if there is a short code sequence that executes in 10 clock periods and a long sequence that executes in 100 clock periods, and both appear equally often in the code, then using time-based sampling the former will appear 10x less often than the latter, but using systematic block-based sampling they appear at the same rate.
Furthermore, from a cursory look at the Intel PT code, it looks to me like the Intel PT perf driver walks through each block, instruction by instruction. If I understand this correctly, then that means that even if sampling were systematic and instruction-based rather than time-based (e.g. would "--itrace=i64i" do this on PT?), then the population for sampling is instructions rather than blocks, and again won't match what cs-etm.c is doing.
E.g. if the short code sequence is 10 instructions and the long sequence is 100 instructions, then with systematic instruction-based sampling the former block will appear 10x less often in the code, whereas with systematic block-based sampling, they appear at the same rate.
One could hack the Intel PT inject tool to implement the same kind of block-based sampling, and see what effect this has (assuming there is a good reason why the ETM inject doesn't implement the time-based sampling -- I've not investigated this). If you have such a sample you can also use the profile_diff tool from AutoFDO to compare the shape of the samples.
Now, the extent to which this affects the compiler I do not know. E.g. both sampling schemes are OK for telling a compiler which branches are taken, but if the compiler thinks the samples are time-based and so represent code hotspots, then systematic block-based sampling would be misleading.
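A sketch of that comparison flow, combining commands that appear
elsewhere in this thread (file names are placeholders; profile_diff
usage is assumed from the AutoFDO repo):

# perf inject -i perf-etm.data -o inj-etm.data --itrace=il64 --strip
# perf inject -i perf-pt.data -o inj-pt.data --itrace=i100usle --strip
# create_gcov --binary=sort-O3 --profile=inj-etm.data --gcov=etm.gcov -gcov_version=1
# create_gcov --binary=sort-O3 --profile=inj-pt.data --gcov=pt.gcov -gcov_version=1
# profile_diff etm.gcov pt.gcov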
Mike.
> On 25 May 2017 at 05:12, Kim Phillips <kim.phillips@arm.com> wrote:
> > On Wed, 24 May 2017 12:48:04 -0500
> > Sebastian Pop <sebpop@gmail.com> wrote:
> >
> >> On Wed, May 24, 2017 at 11:36 AM, Mathieu Poirier
> >> <mathieu.poirier@linaro.org> wrote:
> >> > Are the instructions in the autoFDO section of the HOWTO.md on GitHub
> >> > sufficient to test this or is there another way?
> >>
> >> Here is how I tested it (supposing that perf.data contains an ETM trace):
> >>
> >> # perf inject -i perf.data -o inj --itrace=il64 --strip
> >> # perf report -i inj -D &> dump
> >>
> >> and I inspected the addresses from the last branch stack in the output dump
> >> with the addresses of the disassembled program from:
> >>
> >> # objdump -d sort
> >
> > Re-running the AutoFDO process with these two patches continues to
> > make the resultant executable perform worse, however:
> >
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 5306 ms
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 5304 ms
> > $ taskset -c 2 ./sort-O3-autofdo
> > Bubble sorting array of 30000 elements
> > 5851 ms
> > $ taskset -c 2 ./sort-O3-autofdo
> > Bubble sorting array of 30000 elements
> > 5889 ms
> > $ taskset -c 2 ./sort-O3-autofdo
> > Bubble sorting array of 30000 elements
> > 5888 ms
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 5318 ms
> >
> > The gcov file generated from the inj.data (no matter whether it's
> > --itrace=il64 or --itrace=i100usle) still looks wrong:
> >
> > $ ~/git/autofdo/dump_gcov -gcov_version=1 sort-O3.gcov
> > sort_array total:19309128 head:0
> > 0: 0
> > 1: 0
> > 5: 0
> > 6: 0
> > 7.1: 0
> > 7.3: 0
> > 8.3: 0
> > 15: 2
> > 16: 2
> > 17: 2
> > 10: start total:0
> > 1: 0
> > 11: bubble_sort total:19309119
> > 2: 1566
> > 4: 6266668
> > 5: 6071341
> > 7: 6266668
> > 9: 702876
> > 12: stop total:3
> > 2: 0
> > 3: 1
> > 4: 1
> > 5: 1
> > main total:1 head:0
> > 0: 0
> > 2: 0
> > 4: 1
> > 1: cmd_line total:0
> > 3: 0
> > 4: 0
> > 5: 0
> > 6: 0
> >
> > Whereas the one generated by the intel-pt run looks correct, showing the
> > swap (11: bubble_sort 7,8) as executed fewer times:
> >
> > kim@juno sort-etm$ ~/git/autofdo/dump_gcov -gcov_version=1 ../sort-O3.gcov
> > sort_array total:105658 head:0
> > 0: 0
> > 5: 0
> > 6: 0
> > 7.1: 0
> > 7.3: 0
> > 8.3: 0
> > 16: 0
> > 17: 0
> > 1: printf total:0
> > 2: 0
> > 10: start total:0
> > 1: 0
> > 11: bubble_sort total:105658
> > 2: 14
> > 4: 28740
> > 5: 28628
> > 7: 9768
> > 8: 9768
> > 9: 28740
> > 12: stop total:0
> > 2: 0
> > 3: 0
> > 4: 0
> > 5: printf total:0
> > 2: 0
> > 15: printf total:0
> > 2: 0
> >
> > I have to run the 'perf inject' on the x86 host because of the
> > aforementioned:
> >
> > 0x350 [0x50]: failed to process type: 1
> >
> > problem when trying to run it natively on the aarch64 target.
> >
> > However, it doesn't matter whether I run the create_gcov - like so btw:
> >
> > ~/git/autofdo/create_gcov --binary=sort-O3 --profile=inj.data --gcov=sort-O3.gcov -gcov_version=1
> >
> > on the x86 host or the aarch64 target: I still get the same (negative
> > performance) results.
> >
> > As Sebastian asked, if I take the intel-pt sourced inject
> > generated .gcov onto the target and rebuild sort, the performance
> > improves:
> >
> > $ gcc -g -O3 -fauto-profile=../sort-O3.gcov ./sort.c -o ./sort-O3-autofdo
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 5309 ms
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 5310 ms
> > $ taskset -c 2 ./sort-O3-autofdo
> > Bubble sorting array of 30000 elements
> > 4443 ms
> > $ taskset -c 2 ./sort-O3-autofdo
> > Bubble sorting array of 30000 elements
> > 4443 ms
> >
> > And if I take the ETM-generated gcov and use that to build a new x86_64
> > binary, it indeed performs worse on x86_64 also:
> >
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 1502 ms
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 1500 ms
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 1501 ms
> > $ taskset -c 2 ./sort-O3-autofdo-etmgcov
> > Bubble sorting array of 30000 elements
> > 1907 ms
> > $ taskset -c 2 ./sort-O3-autofdo-etmgcov
> > Bubble sorting array of 30000 elements
> > 1893 ms
> > $ taskset -c 2 ./sort-O3-autofdo-etmgcov
> > Bubble sorting array of 30000 elements
> > 1907 ms
> >
> > Kim
> > _______________________________________________
> > CoreSight mailing list
> > CoreSight@lists.linaro.org
> > https://lists.linaro.org/mailman/listinfo/coresight
>
>
>
> --
> Mike Leach
> Principal Engineer, ARM Ltd.
> Blackburn Design Centre. UK
<snip>
Adds a call to the decode library to activate the barrier packet
detection option.
Adds additional per-trace-source info to associate the CS trace ID with
the incoming stream and dump ID info.
Adds a compile-time option to dump raw trace data and packed trace
frames for debugging trace issues.
Updates for v2:
Per: mpoirier...
1/3 Update comment to explain FSYNC 4x flag.
2/3 Change to use struct list_head as base of list for trace IDs.
Merge in change to "RESET DECODER" message from v1 3/3 patch.
3/3 Create init_raw func to combine conditionally compiled code into
single block.
Mike Leach (3):
perf: cs-etm: Activate barrier packet option in decoder.
perf: cs-etm: Add channel context item to track packet sources.
perf: cs-etm: Add options to log raw trace data for debug.
tools/perf/Makefile.config | 6 ++
tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 122 +++++++++++++++++++++++-
2 files changed, 123 insertions(+), 5 deletions(-)
--
2.7.4
Adds a call to the decode library to activate the barrier packet
detection option.
Adds additional per-trace-source info to associate the CS trace ID with
the incoming stream and dump ID info.
Adds a compile-time option to dump raw trace data and packed trace
frames for debugging trace issues.
Mike Leach (3):
perf: cs-etm: Activate barrier packet option in decoder.
perf: cs-etm: Add channel context item to track packet sources.
perf: cs-etm: Add options to log raw trace data for debug.
tools/perf/Makefile.config | 6 ++
tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 108 ++++++++++++++++++++++--
2 files changed, 109 insertions(+), 5 deletions(-)
--
2.7.4