This is a series of fixes addressing the issues in the way we handle
- Self-Hosted trace filter control register for ETM/ETE
- AUX buffer and event handling of TRBE at overflow.
The use of TRUNCATED flag on an IRQ for the TRBE driver is
something that needs to be rexamined. Please see Patch 3 for
more details.
Suzuki K Poulose (5):
coresight: etm4x: Save restore TRFCR_EL1
coresight: etm4x: Use Trace Filtering controls dynamically
coresight: trbe: Keep TRBE disabled on overflow irq
coresight: trbe: Move irq_work queue processing
coresight: trbe: Prohibit tracing while handling an IRQ
.../coresight/coresight-etm4x-core.c | 109 ++++++++++++++----
drivers/hwtracing/coresight/coresight-etm4x.h | 7 +-
drivers/hwtracing/coresight/coresight-trbe.c | 91 ++++++++++-----
3 files changed, 149 insertions(+), 58 deletions(-)
--
2.24.1
HI James,
Agreed the header is missing from the install. Looking in more detail
the header should also be part of ocsd_c_api_types.h
so that it gets pulled in by any program including the main api header.
I'll look at an update to fix both these issues.
Normally these should come through the CS mailing list, or as an issue
on the git hub project.
Thanks
Mike
On Fri, 16 Jul 2021 at 14:56, James Clark <james.clark(a)arm.com> wrote:
>
> To be able to instantiate the OCSD_BUILTIN_DCD_ETE decoder the header
> trc_pkt_types_ete.h is required, so install it alongside the headers for
> the other versions.
>
> Signed-off-by: James Clark <james.clark(a)arm.com>
> ---
> decoder/build/linux/rctdl_c_api_lib/makefile | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/decoder/build/linux/rctdl_c_api_lib/makefile b/decoder/build/linux/rctdl_c_api_lib/makefile
> index a0bd5a3..7b4055d 100644
> --- a/decoder/build/linux/rctdl_c_api_lib/makefile
> +++ b/decoder/build/linux/rctdl_c_api_lib/makefile
> @@ -113,6 +113,8 @@ install_inc:
> $(INSTALL) --mode=0644 $(INST_INC_SRC)/etmv3/trc_pkt_types_etmv3.h $(INST_INC_DST)/etmv3/
> $(INSTALL) -d --mode=0755 $(INST_INC_DST)/etmv4
> $(INSTALL) --mode=0644 $(INST_INC_SRC)/etmv4/trc_pkt_types_etmv4.h $(INST_INC_DST)/etmv4/
> + $(INSTALL) -d --mode=0755 $(INST_INC_DST)/ete
> + $(INSTALL) --mode=0644 $(INST_INC_SRC)/ete/trc_pkt_types_ete.h $(INST_INC_DST)/ete/
> $(INSTALL) -d --mode=0755 $(INST_INC_DST)/c_api
> $(INSTALL) --mode=0644 $(INST_INC_SRC)/c_api/ocsd_c_api_types.h $(INST_INC_DST)/c_api/
> $(INSTALL) --mode=0644 $(INST_INC_SRC)/c_api/opencsd_c_api.h $(INST_INC_DST)/c_api/
> --
> 2.28.0
>
--
Mike Leach
Principal Engineer, ARM Ltd.
Manchester Design Centre. UK
On 15/07/2021 04:09, Anshuman Khandual wrote:
> A small nit. Paragraphs in the commit message do not seem to be aligned
> properly to a maximum 75 characters width.
>
> On 7/12/21 5:08 PM, Suzuki K Poulose wrote:
>> When the TRBE generates an IRQ, we stop the TRBE, collect the trace
>> and then reprogram the TRBE with the updated buffer pointers in case
>> of a spurious IRQ. We might also leave the TRBE disabled, on an
>> overflow interrupt, without touching the ETE. This means the
>> the ETE is only disabled when the event is disabled later (via irq_work).
>> This is incorrect, as the ETE trace is still ON without actually being
>> captured and may be routed to the ATB.
>
> I had an assumption that when the TRBE is stopped, ETE would also stop
> implicitly given that the trace packets are not being accepted anymore.
> But if that assumption does not always hold true, then yes trace must
> be stopped upon a TRBE IRQ.
No, the ETE never stops, until it is stopped. The ETE doesn't care who
is the consumer of the trace. Be it TRBE or ATB or any other sink.
>
>>
>> So, we move the CPU into trace prohibited state (for all exception
>> levels) upon entering the IRQ handler. The state is restored before
>> enabling the TRBE back. Otherwise the trace remains prohibited.
>> Since, the ETM/ETE driver controls the TRFCR_EL1 per session,
>> (from commit "coresight: etm4x: Use Trace Filtering controls dynamically")
>
> commit SHA ID ?
>
The patch is in this series, not committed yet.
>> the tracing can be restored/enabled back when the event is rescheduled
>> in.
>
> Makes sense.
>
>>
>> Fixes: 3fbf7f011f24 ("coresight: sink: Add TRBE driver")
>> Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
>> Cc: Mathieu Poirier <mathieu.poirier(a)linaro.org>
>> Cc: Mike Leach <mike.leach(a)linaro.org>
>> Cc: Leo Yan <leo.yan(a)linaro.org>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose(a)arm.com>
>> ---
>> drivers/hwtracing/coresight/coresight-trbe.c | 43 ++++++++++++++++++--
>> 1 file changed, 40 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/hwtracing/coresight/coresight-trbe.c b/drivers/hwtracing/coresight/coresight-trbe.c
>> index c0c264264427..e4d88e0de2a8 100644
>> --- a/drivers/hwtracing/coresight/coresight-trbe.c
>> +++ b/drivers/hwtracing/coresight/coresight-trbe.c
>> @@ -83,6 +83,31 @@ struct trbe_drvdata {
>> struct platform_device *pdev;
>> };
>>
>> +static inline void write_trfcr(u64 val)
>> +{
>> + write_sysreg_s(val, SYS_TRFCR_EL1);
>> + isb();
>> +}
>> +
>
> There is another instance of write_trfcr() in coresight-etm4x-core.c and
> some other writes into SYS_TRFCR_EL1 elsewhere. write_trfcr() should be
> factored out and moved to a common place.
Agreed, but I couldn't find a right candidate for this. Welcome
to suggestions. May be we could add something like:
asm/self-hosted.h
>
>> +/*
>> + * Prohibit the CPU tracing at all ELs, in preparation to collect
>> + * the trace buffer.
>> + *
>> + * Returns the original value of the trfcr for restoring later.
>> + */
>> +static u64 cpu_prohibit_tracing(void)
>> +{
>> + u64 trfcr = read_sysreg_s(SYS_TRFCR_EL1);
>> +
>> + write_trfcr(trfcr & ~(TRFCR_ELx_ExTRE | TRFCR_ELx_E0TRE));
>> + return trfcr;
>> +}
>
> This also should be factored out along with etm4x_prohibit_trace()
> usage and moved to a common header instead.
>
>> +
>> +static void cpu_restore_tracing(u64 trfcr)
>> +{
>> + write_trfcr(trfcr);
>> +}
>> +
>> static int trbe_alloc_node(struct perf_event *event)
>> {
>> if (event->cpu == -1)
>> @@ -681,7 +706,7 @@ static int arm_trbe_disable(struct coresight_device *csdev)
>> return 0;
>> }
>>
>> -static void trbe_handle_spurious(struct perf_output_handle *handle)
>> +static void trbe_handle_spurious(struct perf_output_handle *handle, u64 trfcr)
>> {
>> struct trbe_buf *buf = etm_perf_sink_config(handle);
>>
>> @@ -691,6 +716,7 @@ static void trbe_handle_spurious(struct perf_output_handle *handle)
>> trbe_drain_and_disable_local();
>> return;
>> }
>
> A small comment here would be great because this will be the only
> IRQ handler path, where it actually restores the tracing back.
Agreed
>
>> + cpu_restore_tracing(trfcr);
>> trbe_enable_hw(buf);
>> }
>>
>> @@ -760,7 +786,18 @@ static irqreturn_t arm_trbe_irq_handler(int irq, void *dev)
>> struct perf_output_handle **handle_ptr = dev;
>> struct perf_output_handle *handle = *handle_ptr;
>> enum trbe_fault_action act;
>> - u64 status;
>> + u64 status, trfcr;
>> +
>> + /*
>> + * Prohibit the tracing, while we process this. We turn
>> + * things back right, if we get to enabling the TRBE
>> + * back again. Otherwise, the tracing still remains
>> + * prohibited, until the perf event state changes
>> + * or another event is scheduled. This ensures that
>> + * the trace is not generated when it cannot be
>> + * captured.
>> + */
>
> Right.
>
> But a small nit though. Please keep the comments here formatted and
> aligned with the existing ones.
>
ok
>> + trfcr = cpu_prohibit_tracing();
>>
>> /*
>> * Ensure the trace is visible to the CPUs and
>> @@ -791,7 +828,7 @@ static irqreturn_t arm_trbe_irq_handler(int irq, void *dev)
>> trbe_handle_overflow(handle);
>> break;
>> case TRBE_FAULT_ACT_SPURIOUS:
>> - trbe_handle_spurious(handle);
>> + trbe_handle_spurious(handle, trfcr);
>> break;
>> case TRBE_FAULT_ACT_FATAL:
>> trbe_stop_and_truncate_event(handle);
>>
>
> But stopping the trace (even though from a sink IRQ handler) is a source
> device action. Should not this be done via a new coresight_ops_source
> callback instead ?
It is a valid point. But that has limitations.
Here is the list:
* Stopping the source is a heavy hammer, especially if we
are about to continue the trace soon. (e.g, spurious
interrupt and possibly soon for FILL events with reworking
the flags)
* Stopping the source, via source_ops() is doing things
under the driving mode of the session, perf vs sysfs.
We only support perf though, but if there is another
user.
* This is agnostic to the mode (as above), the TRBE driver
doesn't need to be taught, how to find the path and
stop the current session for the given mode.
* If the tracing is enabled in kernel mode, the ETE still
generates the trace until we trigger the longer route
for disabling, which is not nice.
Suzuki
On 15/07/2021 04:23, Anshuman Khandual wrote:
>
>
> On 7/12/21 5:08 PM, Suzuki K Poulose wrote:
>> The TRBE driver issues the irq_work_run() to ensure that
>> any pending disable event callback has been processed. But this
>> is done before we mark the AUX buffer as TRUNCATED, making
>> the call pointless. Move the call after the current handle
>> has been closed.
>
> So there is a possibility that a disable event gets missed before
> the buffer is marked TRUNCATED ? But even then would not another
No, it is the other way around. A disable event is triggered
in response to the TRUNCATED flag.
> disable event be triggered again subsequently ? Just trying to
> understand what is the actual problem because of the current
> placement of irq_work_run() ?
That begs the question, What is the purpose of irq_work_run() ?
See below.
>
>>
>> Fixes: 3fbf7f011f24 ("coresight: sink: Add TRBE driver")
>> Reported-by: Tamas Zsoldos <Tamas.Zsoldos(a)arm.com>
>> Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
>> Cc: Mathieu Poirier <mathieu.poirier(a)linaro.org>
>> Cc: Mike Leach <mike.leach(a)linaro.org>
>> Cc: Peter Zijlstra (Intel) <peterz(a)infradead.org>
>> Cc: Leo Yan <leo.yan(a)linaro.org>
>> Cc: Will Deacon <will(a)kernel.org>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose(a)arm.com>
>> ---
>> drivers/hwtracing/coresight/coresight-trbe.c | 14 ++++++++------
>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/hwtracing/coresight/coresight-trbe.c b/drivers/hwtracing/coresight/coresight-trbe.c
>> index ec38cf17b693..c0c264264427 100644
>> --- a/drivers/hwtracing/coresight/coresight-trbe.c
>> +++ b/drivers/hwtracing/coresight/coresight-trbe.c
>> @@ -723,6 +723,14 @@ static void trbe_handle_overflow(struct perf_output_handle *handle)
>> perf_aux_output_flag(handle, PERF_AUX_FLAG_CORESIGHT_FORMAT_RAW |
>> PERF_AUX_FLAG_TRUNCATED);
>> perf_aux_output_end(handle, size);
>> +
>> + /*
>> + * Ensure perf callbacks have completed. Since we
>> + * always TRUNCATE the buffer on IRQ, the event
>> + * is scheduled to be disabled. Make sure that happens
>> + * as soon as possible.
>> + */
>> + irq_work_run();
>> }
>>
>> static bool is_perf_trbe(struct perf_output_handle *handle)
>> @@ -777,12 +785,6 @@ static irqreturn_t arm_trbe_irq_handler(int irq, void *dev)
>> if (!is_perf_trbe(handle))
>> return IRQ_NONE;
>>
>> - /*
>> - * Ensure perf callbacks have completed, which may disable
>> - * the trace buffer in response to a TRUNCATION flag.
>> - */
This comment here explains what the irq_work_run() is supposed to do.
This clearly indicates that the event will be disabled in response
to TRUNCATION. Now, we haven't updated the buffer flags yet here.
It happens *after* this call, in handle_overflow(). So, this is placed
incorrectly. The purpose of this patch is to make sure that we actually
complete the callbacks in response to the TRUNCATED flag we set with
this IRQ.
>> - irq_work_run();
>> -
>> act = trbe_get_fault_act(status);
>> switch (act) {
>> case TRBE_FAULT_ACT_WRAP:
>>
Suzuki
On 15/07/2021 04:54, Anshuman Khandual wrote:
>
>
> On 7/12/21 5:08 PM, Suzuki K Poulose wrote:
>> When an AUX buffer is marked TRUNCATED, the kernel will disable
>> the event (via irq_work) and can only be re-enabled by the
>> userspace tool.
>
> Will it be renabled via normal perf event enable callback OR an
> explicit perf event restart is required upon the TRUNCATED flag ?
The perf event moves to a DISABLED state. At this state an
explicit restart from the userspace tool is required, via
PERF_EVENT_IOC_ENABLE.
>
>>
>> Also, since we *always* mark the buffer as TRUNCATED (which is
>> needs to be reconsidered, see below), we need not re-enable the
>> TRBE as the event is going to be now disabled. This follows the
>> SPE driver behavior.
>>
>> Without this change, we could leave the event disabled for
>> ever under certain conditions. i.e, when we try to re-enable
>> in the IRQ handler, if there is no space left in the buffer,
>> we "TRUNCATE" the buffer and create a record of size 0.
>> The perf tool skips such records and the event will remain
>> disabled forever.
>
> Why ? Should not the user space tool explicitly start back the
> tracing event after detecting zero sized record with buffer
> marked with TRUNCATE flag ? What prevents it to restart the
> event ?
The perf tool discards a 0 sized packet. While I agree that
this is something that should be fixed in perf tool too, this
deviation from the "expected driver" behavior (see SPE driver
which this one was inspired from) will break the existing
perf tools (not an ABI break, but a functional issue
which is not nice and generally discouraged in the kernel).
>
>>
>> Regarding the use of TRUNCATED flag:
>> With FILL mode, the TRBE stops collection when the buffer is
>> full, raising the IRQ. Now, since the IRQ is asynchronous,
>> we could loose some amount of trace, after the buffer was
>> full until the IRQ is handled. Also the last packet may
>> have been trimmed. The decoder can figure this out and
>> cope with this. The side effect of this approach is:
>>
>> We always disable the event when there is an IRQ, even
>> when the other half of the ring buffer is free ! This
>> is not ideal.
>>
>> Now, we should switch to using PARTIAL to indicate that there
>> was potentially some partial trace packets in the buffer and
>> some data was lost. We should use TRUNCATED only when there
>> is absolutely no space in the ring buffer. This change would
>> also require some additional changes in the CoreSight PMU
>> framework to allow, sinks to "close" the handle (rather
>> than the PMU driver closing the handle upon event_stop).
>> So, until that is sorted, let us keep the TRUNCATED flag
>> and the rework can be addressed separately.
>
> But I guess this is a separate problem all together.
Yes, it is. As mentioned above, we need changes to the
coresight PMU framework to be able to use the available
buffer. And the message already is clear, that this
is fixing the "odd" behavior.
>
>>
>> Fixes: 3fbf7f011f24 ("coresight: sink: Add TRBE driver")
>> Reported-by: Tamas Zsoldos <Tamas.Zsoldos(a)arm.com>
>> Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
>> Cc: Mathieu Poirier <mathieu.poirier(a)linaro.org>
>> Cc: Mike Leach <mike.leach(a)linaro.org>
>> Cc: Leo Yan <leo.yan(a)linaro.org>
>> Cc: Peter Zijlstra (Intel) <peterz(a)infradead.org>
>> Cc: Will Deacon <will(a)kernel.org>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose(a)arm.com>
>> ---
>> drivers/hwtracing/coresight/coresight-trbe.c | 34 +++++++-------------
>> 1 file changed, 12 insertions(+), 22 deletions(-)
>>
>> diff --git a/drivers/hwtracing/coresight/coresight-trbe.c b/drivers/hwtracing/coresight/coresight-trbe.c
>> index 176868496879..ec38cf17b693 100644
>> --- a/drivers/hwtracing/coresight/coresight-trbe.c
>> +++ b/drivers/hwtracing/coresight/coresight-trbe.c
>> @@ -696,10 +696,8 @@ static void trbe_handle_spurious(struct perf_output_handle *handle)
>>
>> static void trbe_handle_overflow(struct perf_output_handle *handle)
>> {
>> - struct perf_event *event = handle->event;
>> struct trbe_buf *buf = etm_perf_sink_config(handle);
>> unsigned long offset, size;
>> - struct etm_event_data *event_data;
>>
>> offset = get_trbe_limit_pointer() - get_trbe_base_pointer();
>> size = offset - PERF_IDX2OFF(handle->head, buf);
>> @@ -709,30 +707,22 @@ static void trbe_handle_overflow(struct perf_output_handle *handle)
>> /*
>> * Mark the buffer as truncated, as we have stopped the trace
>> * collection upon the WRAP event, without stopping the source.
>> + *
>> + * We don't re-enable the TRBE here, as the event is
>> + * bound to be disabled due to the TRUNCATED flag.
>> + * This is not ideal, as we could use the available space in
>> + * the ring buffer and continue the tracing.
>> + *
>> + * TODO: Revisit the use of TRUNCATED flag and may be instead use
>> + * PARTIAL, to indicate trace may contain partial packets.
>> + * And TRUNCATED can be used only if we do not have enough space
>> + * in the buffer. This would need additional changes in
>> + * etm_event_stop() to allow the sinks to leave a closed
>> + * aux_handle.
>> */
>> perf_aux_output_flag(handle, PERF_AUX_FLAG_CORESIGHT_FORMAT_RAW |
>> PERF_AUX_FLAG_TRUNCATED);
>> perf_aux_output_end(handle, size);
>> - event_data = perf_aux_output_begin(handle, event);
>> - if (!event_data) {
>> - /*
>> - * We are unable to restart the trace collection,
>> - * thus leave the TRBE disabled. The etm-perf driver
>> - * is able to detect this with a disconnected handle
>> - * (handle->event = NULL).
>> - */
>> - trbe_drain_and_disable_local();
>> - *this_cpu_ptr(buf->cpudata->drvdata->handle) = NULL;
>> - return;
>> - }
>> - buf->trbe_limit = compute_trbe_buffer_limit(handle);
>> - buf->trbe_write = buf->trbe_base + PERF_IDX2OFF(handle->head, buf);
>> - if (buf->trbe_limit == buf->trbe_base) {
>> - trbe_stop_and_truncate_event(handle);
>> - return;
>> - }
>> - *this_cpu_ptr(buf->cpudata->drvdata->handle) = handle;
>> - trbe_enable_hw(buf);
>> }
>
> The change here just stops the event restart after handling the IRQ
> and marking the buffer with PERF_AUX_FLAG_TRUNCATED which helps the
> event from being disabled for ever (still need to understand how !).
The real issue is unnecessary starting of the event with new buffer
after we have "TRUNCATED" the buffer. This can lead to occassionally
hitting "0" sized buffer, because the "irq_work_run()" could kick
in and disable the event, leaving no trace generated (because we
were tracing only userspace). So, we now have a 0 sized record
with the event in disabled state, which the perf tool ignores.
>
> I guess it might be better to separate out the problems with using
> PERF_AUX_FLAG_TRUNCATED and related aspects, in a subsequent patch
> which just updates the code with the above TODO comment ?
The problems with the driver using TRUNCATED flag must be addressed
in a different patch. This patch is only to comply with the behavior,
of "TRUNCATED" indicates the event is disabled. So do not start a
session.
Does that help ?
Suzuki
Hi,
I'd like to see if anyone is currently using piped mode in perf. For example:
perf record -o - ls > stdio.data
cat stdio.data | ./perf report -i -
This invokes a slightly different code path where there is no auxtrace index
and no random file access available. Because of this, the outcome of decoding
can be very subtly different. For example only one auxtrace buffer could be read
at the point where decoding starts. Whereas with normal mode, all the buffers
are queued in advance because of the availability of the index and random
access.
A few of the changes I've proposed recently have added even more dependency on
normal mode and won't work in piped mode either. For example "perf cs-etm:
Support TRBE (unformatted decoding)" and "perf cs-etm: Split Coresight decode
by aux records".
The reason that I made these changes in this way was because it was much more
difficult to juggle the order of events and make sure nothing regressed,
with the knowledge that there are already inaccuracies in piped mode and not
knowing if it was even used.
The question is: whether we continue to support this for Coresight as it
becomes more and more divergent to piped mode.
There are some possible outcomes that I can think of:
* Drop the mode completely
* Add a warning that it's only best effort in that mode and continue with the divergence
* Fix it up first to make it behave the same as normal mode.
* After fixing it up, require that all future decoding behaves identically in both modes
* Then we can more comfortably re-do my above two changes
Thanks
James
On Tue, Jul 13, 2021 at 02:56:06PM +0200, Peter Zijlstra wrote:
> On Sun, Jul 11, 2021 at 06:40:57PM +0800, Leo Yan wrote:
> > AUX ring buffer is required to separate the data store and aux_head
> > store, since the function CS_LOCK() has contained memory barrier mb(),
> > mb() is a more conservative barrier than smp_wmb() on Arm32/Arm64, thus
> > it's needless to add any explicit barrier anymore.
> >
> > Add comment to make clear for the barrier usage for ETF.
> >
> > Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
> > ---
> > drivers/hwtracing/coresight/coresight-tmc-etf.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/drivers/hwtracing/coresight/coresight-tmc-etf.c b/drivers/hwtracing/coresight/coresight-tmc-etf.c
> > index 45b85edfc690..9a42ee689921 100644
> > --- a/drivers/hwtracing/coresight/coresight-tmc-etf.c
> > +++ b/drivers/hwtracing/coresight/coresight-tmc-etf.c
> > @@ -553,6 +553,12 @@ static unsigned long tmc_update_etf_buffer(struct coresight_device *csdev,
> > if (buf->snapshot)
> > handle->head += to_read;
> >
> > + /*
> > + * AUX ring buffer requires to use memory barrier to separate the trace
> > + * data store and aux_head store, because CS_LOCK() contains mb() which
> > + * gives more heavy barrier than smp_wmb(), it's not necessary to
> > + * explicitly invoke any barrier.
> > + */
> > CS_LOCK(drvdata->base);
>
> 'more heavy' is not a correctness argument :-)
>
> The argument to make here is that CS_LOCK() ensures completion /
> visibility of the hardware buffer.
Will correct for this, thanks for reminding!
On Tue, Jul 13, 2021 at 10:07:03AM +0300, Adrian Hunter wrote:
[...]
> > +/*
> > + * In the compat mode kernel runs in 64-bit and perf tool runs in 32-bit mode,
> > + * 32-bit perf tool cannot access 64-bit value atomically, which might lead to
> > + * the issues caused by the below sequence on multiple CPUs: when perf tool
> > + * accesses either the load operation or the store operation for 64-bit value,
> > + * on some architectures the operation is divided into two instructions, one
> > + * is for accessing the low 32-bit value and another is for the high 32-bit;
> > + * thus these two user operations can give the kernel chances to access the
> > + * 64-bit value, and thus leads to the unexpected load values.
> > + *
> > + * kernel (64-bit) user (32-bit)
> > + *
> > + * if (LOAD ->aux_tail) { --, LOAD ->aux_head_lo
> > + * STORE $aux_data | ,--->
> > + * FLUSH $aux_data | | LOAD ->aux_head_hi
> > + * STORE ->aux_head --|-------` smp_rmb()
> > + * } | LOAD $data
> > + * | smp_mb()
> > + * | STORE ->aux_tail_lo
> > + * `----------->
> > + * STORE ->aux_tail_hi
> > + *
> > + * For this reason, it's impossible for the perf tool to work correctly when
> > + * the AUX head or tail is bigger than 4GB (more than 32 bits length); and we
> > + * can not simply limit the AUX ring buffer to less than 4GB, the reason is
> > + * the pointers can be increased monotonically (e.g in snapshot mode), whatever
>
> At least for Intel PT, in snapshot mode, the head is always an offset
> into the buffer, so never more than 4GB for a 32-bit perf tool. So maybe
> leave out "(e.g in snapshot mode)"
Sure, will leave out "(e.g in snapshot mode)".
Thanks,
Leo
Hi Russell,
On Mon, Jul 12, 2021 at 03:44:11PM +0100, Russell King (Oracle) wrote:
> On Sun, Jul 11, 2021 at 06:41:05PM +0800, Leo Yan wrote:
> > When perf runs in compat mode (kernel in 64-bit mode and the perf is in
> > 32-bit mode), the 64-bit value atomicity in the user space cannot be
> > assured, E.g. on some architectures, the 64-bit value accessing is split
> > into two instructions, one is for the low 32-bit word accessing and
> > another is for the high 32-bit word.
>
> Does this apply to 32-bit ARM code on aarch64? I would not have thought
> it would, as the structure member is a __u64 and
> compat_auxtrace_mmap__read_head() doesn't seem to be marking anything
> as packed, so the compiler _should_ be able to use a LDRD instruction
> to load the value.
I think essentially your question is relevant to the memory model.
For 32-bit Arm application on aarch64, in the Armv8 architecture
reference manual ARM DDI 0487F.c, chapter "E2.2.1
Requirements for single-copy atomicity" describes:
"LDM, LDC, LDRD, STM, STC, STRD, PUSH, POP, RFE, SRS, VLDM, VLDR, VSTM,
and VSTR instructions are executed as a sequence of word-aligned word
accesses. Each 32-bit word access is guaranteed to be single-copy
atomic. The architecture does not require subsequences of two or more
word accesses from the sequence to be single-copy atomic."
So I think LDRD/STRD instruction cannot promise the atomicity for
loading or storing two words in 32-bit Arm.
And another thought is the functions compat_auxtrace_mmap__read_head()
is a general function, I avoid to write it with any architecture
specific instructions.
> Is this a problem noticed on non-ARM architectures?
No, actually we just concluded the potential issue based on the analysis
for the weak memory model.
Thanks,
Leo