This patch series refines the memory barriers for the AUX ring buffer.
Patches 01 ~ 04 address barrier usage in the kernel. The first patch
adds a comment clarifying how to use the barriers between the data
store and the aux_head store; this asks the driver to make sure the
data is visible. Patches 02 ~ 04 refine the drivers' barriers after
the data store.
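The ordering between the data store and the aux_head store can be illustrated with a small user-space sketch. It is an analogy written with C11 atomics, not the kernel code: struct fake_aux_rb, FAKE_AUX_SIZE and both functions are hypothetical names, and the release/acquire pair stands in for the kernel barriers the series discusses.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_AUX_SIZE 64	/* hypothetical buffer size */

/* Hypothetical stand-in for an AUX ring buffer; not a kernel structure. */
struct fake_aux_rb {
	unsigned char data[FAKE_AUX_SIZE];
	_Atomic uint64_t head;
};

/* Producer: store the data first, then publish the new head. */
static void fake_aux_publish(struct fake_aux_rb *rb, const unsigned char *src,
			     size_t len)
{
	uint64_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
	size_t i;

	for (i = 0; i < len; i++)
		rb->data[(head + i) % FAKE_AUX_SIZE] = src[i];

	/*
	 * Release ordering: the data stores above cannot be reordered
	 * after the head store, so a consumer that observes the new
	 * head also observes the data.
	 */
	atomic_store_explicit(&rb->head, head + len, memory_order_release);
}

/* Consumer: read the head with acquire ordering before reading the data. */
static uint64_t fake_aux_read_head(struct fake_aux_rb *rb)
{
	return atomic_load_explicit(&rb->head, memory_order_acquire);
}
```

If the head were published before the data became visible, a consumer could read up to the new head and pick up stale bytes; that is the situation the comment in patch 01 asks drivers to avoid.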
Patch 05 uses WRITE_ONCE() for updating aux_tail.
Patches 06 ~ 09 drop the legacy __sync functions, remove duplicate
code, and clean up the build and feature test now that
SYNC_COMPARE_AND_SWAP is no longer used.
Patch 10 introduces a new global variable to indicate that the kernel
runs in 64-bit mode, which can be used to confirm whether perf runs in
compat mode; patch 11 introduces variant functions for accessing the
AUX head/tail, which resolve the atomicity issue for reading the head
pointer and return an error if the tail is bigger than 4GB.
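A rough user-space sketch of what such compat variants could look like follows. The cover letter only states the goals, so the technique here is an assumption: the head is re-read until its upper 32 bits are stable, and a tail that does not fit in 32 bits is rejected. struct fake_mmap_page and both function names are hypothetical, and C11 fences stand in for the kernel-style barriers.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the aux_head/aux_tail fields of the mmap page. */
struct fake_mmap_page {
	volatile uint64_t aux_head;
	volatile uint64_t aux_tail;
};

/*
 * A 32-bit build may split a 64-bit load into two 32-bit loads, so the
 * value can be torn while the kernel updates it. Read the head three
 * times and retry until the upper halves of the first and last reads
 * agree, then return the middle read.
 */
static uint64_t sketch_compat_read_head(struct fake_mmap_page *pc)
{
	const uint64_t mask = (uint64_t)UINT32_MAX << 32;
	uint64_t first, second, last;

	do {
		first = pc->aux_head;
		atomic_thread_fence(memory_order_acquire);
		second = pc->aux_head;
		atomic_thread_fence(memory_order_acquire);
		last = pc->aux_head;
	} while ((first & mask) != (last & mask));

	return second;
}

/*
 * Only a 32-bit store is safe in this sketch, so refuse any tail with
 * bits set above 4GB and let the caller bail out.
 */
static int sketch_compat_write_tail(struct fake_mmap_page *pc, uint64_t tail)
{
	const uint64_t mask = (uint64_t)UINT32_MAX << 32;

	if (tail & mask)
		return -1;

	/* Finish reading the trace data before publishing the new tail. */
	atomic_thread_fence(memory_order_seq_cst);
	pc->aux_tail = tail;
	return 0;
}
```

Returning an error rather than truncating the tail lets the tool stop cleanly instead of handing the kernel a wrong position.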
Have tested the patches on the Arm64 Juno platform.
Changes from v3:
- Removed the inappropriate paragraph in the commit log for patch "perf
auxtrace: Drop legacy __sync functions" (Adrian);
- Added new patch to remove feature-sync-compare-and-swap test (Adrian);
- The patch "perf auxtrace: Use WRITE_ONCE() for updating aux_tail"
is a standalone and simple change, so moved it ahead in the patch set
for better ordering;
- Minor improvements to the commit logs of the last two patches.
Changes from v2:
- Removed auxtrace_mmap__read_snapshot_head(), which duplicates the
code of auxtrace_mmap__read_head();
- Cleaned up the build for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT (Adrian);
- Added global variable "kernel_is_64_bit" (Adrian);
- Added compat variants compat_auxtrace_mmap__{read_head|write_tail}
(Adrian).
Leo Yan (11):
perf/ring_buffer: Add comment for barriers on AUX ring buffer
coresight: tmc-etr: Add barrier after updating AUX ring buffer
coresight: tmc-etf: Add comment for store ordering
perf/x86: Add barrier after updating bts
perf auxtrace: Use WRITE_ONCE() for updating aux_tail
perf auxtrace: Drop legacy __sync functions
perf auxtrace: Remove auxtrace_mmap__read_snapshot_head()
perf: Cleanup for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT
tools: Remove feature-sync-compare-and-swap feature detection
perf env: Set flag for kernel is 64-bit mode
perf auxtrace: Add compat_auxtrace_mmap__{read_head|write_tail}
arch/x86/events/intel/bts.c | 3 +
.../hwtracing/coresight/coresight-tmc-etf.c | 6 +
.../hwtracing/coresight/coresight-tmc-etr.c | 8 ++
kernel/events/ring_buffer.c | 9 ++
tools/build/Makefile.feature | 1 -
tools/build/feature/Makefile | 4 -
tools/build/feature/test-all.c | 4 -
.../feature/test-sync-compare-and-swap.c | 15 ---
tools/perf/Makefile.config | 4 -
tools/perf/util/auxtrace.c | 19 ++-
tools/perf/util/auxtrace.h | 109 ++++++++++++++----
tools/perf/util/env.c | 17 ++-
tools/perf/util/env.h | 1 +
13 files changed, 136 insertions(+), 64 deletions(-)
delete mode 100644 tools/build/feature/test-sync-compare-and-swap.c
--
2.25.1
The AUX bounce buffer is allocated with dma_alloc_coherent(). In the
low-level architecture code, e.g. for Arm64, the memory is mapped with
the attribute "Normal non-cacheable"; this can be concluded from the
definition of pgprot_dmacoherent() in arch/arm64/include/asm/pgtable.h.
Later, when the AUX bounce buffer is accessed, the non-cacheable mapping
makes access inefficient because every load instruction must reach out
to DRAM.
This patch changes to allocate pages with alloc_pages_node(), so the
driver can access the memory through the cacheable mapping in the kernel
linear virtual address space. Because load instructions can then fetch
data from cache lines rather than always reading from DRAM, the driver
can boost memory copying performance. With the cacheable mapping, the
driver uses dma_sync_single_for_cpu() to invalidate cache lines before
reading the bounce buffer, which avoids reading stale trace data.
Measuring the duration of tmc_update_etr_buffer() with the ftrace
function_graph tracer shows a significant performance improvement for
copying 4MiB of data from the bounce buffer:
# echo tmc_etr_get_data_flat_buf > set_graph_notrace // avoid noise
# echo tmc_update_etr_buffer > set_graph_function
# echo function_graph > current_tracer
before:
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) | tmc_update_etr_buffer() {
...
2) # 8148.320 us | }
after:
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) | tmc_update_etr_buffer() {
...
2) # 2463.980 us | }
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
---
Changes from v1:
Set "flat_buf->daddr" to 0 when mapping the DMA region fails; and dropped
the unexpected if condition change in tmc_etr_free_flat_buf().
.../hwtracing/coresight/coresight-tmc-etr.c | 56 ++++++++++++++++---
1 file changed, 49 insertions(+), 7 deletions(-)
diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
index acdb59e0e661..888b0f929d33 100644
--- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
+++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
@@ -21,6 +21,7 @@
struct etr_flat_buf {
struct device *dev;
+ struct page *pages;
dma_addr_t daddr;
void *vaddr;
size_t size;
@@ -600,6 +601,7 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
{
struct etr_flat_buf *flat_buf;
struct device *real_dev = drvdata->csdev->dev.parent;
+ ssize_t aligned_size;
/* We cannot reuse existing pages for flat buf */
if (pages)
@@ -609,11 +611,18 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
if (!flat_buf)
return -ENOMEM;
- flat_buf->vaddr = dma_alloc_coherent(real_dev, etr_buf->size,
- &flat_buf->daddr, GFP_KERNEL);
- if (!flat_buf->vaddr) {
- kfree(flat_buf);
- return -ENOMEM;
+ aligned_size = PAGE_ALIGN(etr_buf->size);
+ flat_buf->pages = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
+ get_order(aligned_size));
+ if (!flat_buf->pages)
+ goto fail_alloc_pages;
+
+ flat_buf->vaddr = page_address(flat_buf->pages);
+ flat_buf->daddr = dma_map_page(real_dev, flat_buf->pages, 0,
+ aligned_size, DMA_FROM_DEVICE);
+ if (dma_mapping_error(real_dev, flat_buf->daddr)) {
+ flat_buf->daddr = 0;
+ goto fail_dma_map_page;
}
flat_buf->size = etr_buf->size;
@@ -622,6 +631,12 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
etr_buf->mode = ETR_MODE_FLAT;
etr_buf->private = flat_buf;
return 0;
+
+fail_dma_map_page:
+ __free_pages(flat_buf->pages, get_order(aligned_size));
+fail_alloc_pages:
+ kfree(flat_buf);
+ return -ENOMEM;
}
static void tmc_etr_free_flat_buf(struct etr_buf *etr_buf)
@@ -630,15 +645,20 @@ static void tmc_etr_free_flat_buf(struct etr_buf *etr_buf)
if (flat_buf && flat_buf->daddr) {
struct device *real_dev = flat_buf->dev->parent;
+ ssize_t aligned_size = PAGE_ALIGN(etr_buf->size);
- dma_free_coherent(real_dev, flat_buf->size,
- flat_buf->vaddr, flat_buf->daddr);
+ dma_unmap_page(real_dev, flat_buf->daddr, aligned_size,
+ DMA_FROM_DEVICE);
+ __free_pages(flat_buf->pages, get_order(aligned_size));
}
kfree(flat_buf);
}
static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
{
+ struct etr_flat_buf *flat_buf = etr_buf->private;
+ struct device *real_dev = flat_buf->dev->parent;
+
/*
* Adjust the buffer to point to the beginning of the trace data
* and update the available trace data.
@@ -648,6 +668,28 @@ static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
etr_buf->len = etr_buf->size;
else
etr_buf->len = rwp - rrp;
+
+ if (etr_buf->offset + etr_buf->len > etr_buf->size) {
+ int len1, len2;
+
+ /*
+ * If trace data is wrapped around, sync AUX bounce buffer
+ * for two chunks: "len1" is for the trace data length at
+ * the tail of bounce buffer, and "len2" is the length from
+ * the start of the buffer after wrapping around.
+ */
+ len1 = etr_buf->size - etr_buf->offset;
+ len2 = etr_buf->len - len1;
+ dma_sync_single_for_cpu(real_dev,
+ flat_buf->daddr + etr_buf->offset,
+ len1, DMA_FROM_DEVICE);
+ dma_sync_single_for_cpu(real_dev, flat_buf->daddr,
+ len2, DMA_FROM_DEVICE);
+ } else {
+ dma_sync_single_for_cpu(real_dev,
+ flat_buf->daddr + etr_buf->offset,
+ etr_buf->len, DMA_FROM_DEVICE);
+ }
}
static ssize_t tmc_etr_get_data_flat_buf(struct etr_buf *etr_buf,
--
2.25.1
The current code syncs the buffer range [offset, offset+len) and doesn't
consider the case where the trace data wraps around; in this case
'offset+len' is bigger than 'etr_buf->size'. Thus it syncs beyond the
end of the memory buffer, and it also fails to sync the data at the
start of the memory.
This patch corrects the memory sync ranges: when it detects the
wrap-around case, it splits the sync into two chunks. One chunk is the
tail of the buffer and the other chunk begins at the start of the buffer
after wrapping around.
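The split can be checked in isolation with a small user-space sketch. struct sync_chunks and split_sync_range() are hypothetical helpers written for illustration; they only mirror the len1/len2 arithmetic described above and do not call any kernel API.

```c
#include <stddef.h>

/* Hypothetical result type: up to two ranges that need syncing. */
struct sync_chunks {
	size_t off1, len1;	/* chunk starting at the read offset */
	size_t off2, len2;	/* wrapped chunk at buffer start; len2 == 0 if none */
};

/*
 * Split [offset, offset + len) into ranges that stay inside a buffer
 * of 'size' bytes. Assumes offset < size and len <= size.
 */
static struct sync_chunks split_sync_range(size_t offset, size_t len,
					   size_t size)
{
	struct sync_chunks c = { offset, len, 0, 0 };

	if (offset + len > size) {
		/* Wrapped: tail of the buffer first, then the start. */
		c.len1 = size - offset;
		c.len2 = len - c.len1;
	}
	return c;
}
```

For example, with a 4096-byte buffer, an offset of 3072 and a length of 2048, the first chunk covers the last 1024 bytes and the second chunk covers the first 1024 bytes.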
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
---
.../hwtracing/coresight/coresight-tmc-etr.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
index 888b0f929d33..a1afefcbf175 100644
--- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
+++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
@@ -780,7 +780,23 @@ static void tmc_etr_sync_sg_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
else
etr_buf->len = ((w_offset < r_offset) ? etr_buf->size : 0) +
w_offset - r_offset;
- tmc_sg_table_sync_data_range(table, r_offset, etr_buf->len);
+
+ if (r_offset + etr_buf->len > etr_buf->size) {
+ int len1, len2;
+
+ /*
+ * If trace data is wrapped around, sync AUX bounce buffer
+ * for two chunks: "len1" is for the trace data length at
+ * the tail of bounce buffer, and "len2" is the length from
+ * the start of the buffer after wrapping around.
+ */
+ len1 = etr_buf->size - r_offset;
+ len2 = etr_buf->len - len1;
+ tmc_sg_table_sync_data_range(table, r_offset, len1);
+ tmc_sg_table_sync_data_range(table, 0, len2);
+ } else {
+ tmc_sg_table_sync_data_range(table, r_offset, etr_buf->len);
+ }
}
static const struct etr_buf_operations etr_sg_buf_ops = {
--
2.25.1
Hi Adrian,
On Sat, Jul 10, 2021 at 03:36:53PM +0300, Adrian Hunter wrote:
> On 4/07/21 10:16 am, Leo Yan wrote:
> > Since the __sync functions have been dropped, This patch removes unused
> > build and checking for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT in perf tool.
> >
> > Note, there have a test for SYNC_COMPARE_AND_SWAP and the test file is
> > located in build/feature/test-sync-compare-and-swap.c. Since there
> > still has several components using the sync functions, it's deliberately
> > to not be removed.
>
> I don't quite follow that. If they aren't using the feature test
> macro, then why keep the feature test?
There are files still using the __sync_xxx_compare_and_swap() functions,
e.g. in the folder tools/testing/selftests/bpf. On the other hand,
after dropping the __sync functions from perf, no Makefile checks the
feature 'feature-sync-compare-and-swap' any more. So it's safe to remove
the feature test.
Sorry for the confusion. Will drop the feature test in the new patch set.
Thanks,
Leo
On Sat, Jul 10, 2021 at 03:34:24PM +0300, Adrian Hunter wrote:
> On 4/07/21 10:16 am, Leo Yan wrote:
> > The main purpose for using __sync built-in functions is to support
> > compat mode for 32-bit perf with 64-bit kernel. But using these
> > built-in functions might cause couple potential issues.
> >
> > Firstly, __sync functions originally support Intel Itanium processor [1]
> > but it cannot promise to support all 32-bit archs. Now these
> > functions have become the legacy functions.
> >
> > As Peter also pointed out the logic issue in the function
> > auxtrace_mmap__write_tail(), it does a cmpxchg with 0 values to load
> > old_tail, and then executes a further cmpxchg with old_tail to write
> > the new tail. If consider the aux_tail might be assigned to '0' in the
> > middle of loops, this can introduce mess for AUX buffer if the kernel
> > fetches the temporary value '0'.
>
> That is not exactly true. The definition of __sync_*_compare_and_swap is
> "if the current value of *ptr is oldval, then write newval into *ptr"
> so replacing zero with zero won't make any difference, but it will return
> the old value in any case. Probably better to leave out that paragraph.
Okay, I admit the paragraph is not right; will drop it to avoid
confusion. Thanks for the review!
Leo
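The semantics Adrian describes can be reproduced with the GCC __sync builtins in user space. This is an illustrative sketch with hypothetical helper names, not the perf code: a compare-and-swap of 0 with 0 acts as a load, and a second compare-and-swap against the loaded value performs the store.

```c
#include <stdint.h>

/*
 * Load via CAS(0, 0): if *ptr is 0 it is replaced with 0 (no change),
 * otherwise nothing is written. Either way the old value is returned.
 */
static uint64_t load_via_cas(uint64_t *ptr)
{
	return __sync_val_compare_and_swap(ptr, 0, 0);
}

/* Store via CAS: retry until the value read is the value replaced. */
static void store_via_cas(uint64_t *ptr, uint64_t newval)
{
	uint64_t old;

	do {
		old = __sync_val_compare_and_swap(ptr, 0, 0);
	} while (!__sync_bool_compare_and_swap(ptr, old, newval));
}
```

Because the first compare-and-swap only ever writes 0 over 0, no temporary 0 value becomes visible to another reader.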
This patchset represents the second phase of CoreSight configuration
management.
1) API updated to allow dynamic load and unload of configurations and
features. Dependency management between loaded sets is added.
2) New configuration and feature sets can be added using a loadable module.
An example in /samples/coresight is provided to demonstrate this.
3) Resource management API is added. This allows the system to ensure that
loaded configurations and features are only loaded onto devices that can
support them.
Further - it ensures that configurations with multiple features cannot over
allocate resources.
4) configfs can be used to activate a configuration which will then be used
when controlling tracing using sysfs.
5) Resource management is added to ETMv4 configurations. This allows current
and future features and configurations to be defined in terms of resources
used as well as registers to be programmed.
Defining features in this way allows the resource management to operate
correctly.
The perf event parsing is also adjusted to allow the ETM resources requested
on the command line (e.g. address filters, etc) to be correctly handled
using resource management alongside the complex configurations such as
autofdo.
Applies to coresight/next - which is 5.13-rc1 + initial Coresight configuration
patchset.
To follow in future revisions / sets:-
a) load of additional config and features by configfs
b) ECT and CTI and other Coresight components support for configuration and
features.
Mike Leach (8):
coresight: syscfg: Update API to allow dynamic load and unload
coresight: syscfg: Update load API for config loadable modules
coresight: syscfg: Example CoreSight configuration loadable module
coresight: configfs: Allow configfs to activate configuration.
coresight: syscfg: Add API to check and validate device resources.
coresight: etm4x: syscfg: Add resource management to etm4x.
coresight: etm4x: Update perf event resource handling.
coresight: etm4x: Update configuration example.
MAINTAINERS | 1 +
.../hwtracing/coresight/coresight-cfg-afdo.c | 38 +-
.../coresight/coresight-cfg-preload.c | 9 +-
.../hwtracing/coresight/coresight-config.c | 71 ++-
.../hwtracing/coresight/coresight-config.h | 45 +-
.../hwtracing/coresight/coresight-etm4x-cfg.c | 533 ++++++++++++++++++
.../hwtracing/coresight/coresight-etm4x-cfg.h | 196 ++++++-
.../coresight/coresight-etm4x-core.c | 250 +++-----
.../coresight/coresight-syscfg-configfs.c | 87 +++
.../coresight/coresight-syscfg-configfs.h | 4 +
.../hwtracing/coresight/coresight-syscfg.c | 390 +++++++++++--
.../hwtracing/coresight/coresight-syscfg.h | 38 +-
include/linux/coresight.h | 2 +
samples/Kconfig | 9 +
samples/Makefile | 1 +
samples/coresight/Makefile | 4 +
samples/coresight/coresight-cfg-sample.c | 73 +++
17 files changed, 1511 insertions(+), 240 deletions(-)
create mode 100644 samples/coresight/Makefile
create mode 100644 samples/coresight/coresight-cfg-sample.c
--
2.17.1
This patch series refines the memory barriers for the AUX ring buffer.
Patches 01 ~ 04 address barrier usage in the kernel. The first patch
adds a comment clarifying how to use the barriers between the data
store and the aux_head store; this asks the driver to make sure the
data is visible. Patches 02 ~ 04 refine the drivers' barriers after
the data store.
Patches 05 ~ 07 drop the legacy __sync functions, remove duplicate
code, and clean up the build now that SYNC_COMPARE_AND_SWAP is no
longer used.
Patch 08 uses WRITE_ONCE() for updating aux_tail.
Since atomicity of 64-bit values is not guaranteed on 32-bit perf, the
last two patches try to fix up the perf tool when it runs in compat
mode. Patch 09 introduces a new global variable to indicate that the
kernel runs in 64-bit mode, which can be used to confirm whether perf
runs in compat mode; patch 10 introduces variant functions for accessing
the AUX head/tail, which resolve the atomicity issue for reading the head
pointer and, for the tail write overflow issue, return an error to notify
the tool to exit.
Have tested the patches on the Arm64 Juno platform.
Changes from v2:
- Removed auxtrace_mmap__read_snapshot_head(), which duplicates the
code of auxtrace_mmap__read_head();
- Cleaned up the build for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT (Adrian);
- Added global variable "kernel_is_64_bit" (Adrian);
- Added compat variants compat_auxtrace_mmap__{read_head|write_tail}
(Adrian).
Leo Yan (10):
perf/ring_buffer: Add comment for barriers on AUX ring buffer
coresight: tmc-etr: Add barrier after updating AUX ring buffer
coresight: tmc-etf: Add comment for store ordering
perf/x86: Add barrier after updating bts
perf auxtrace: Drop legacy __sync functions
perf auxtrace: Remove auxtrace_mmap__read_snapshot_head()
perf: Cleanup for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT
perf auxtrace: Use WRITE_ONCE() for updating aux_tail
perf env: Set kernel bit mode
perf auxtrace: Add compat_auxtrace_mmap__{read_head|write_tail}
arch/x86/events/intel/bts.c | 3 +
.../hwtracing/coresight/coresight-tmc-etf.c | 6 +
.../hwtracing/coresight/coresight-tmc-etr.c | 8 ++
kernel/events/ring_buffer.c | 9 ++
tools/perf/Makefile.config | 4 -
tools/perf/util/auxtrace.c | 19 ++-
tools/perf/util/auxtrace.h | 109 ++++++++++++++----
tools/perf/util/env.c | 17 ++-
tools/perf/util/env.h | 1 +
9 files changed, 136 insertions(+), 40 deletions(-)
--
2.25.1