After gcc commit a459ee44c0a74b0df0485ed7a56683816c02aae9
Author: Kyrylo Tkachov <kyrylo.tkachov(a)arm.com>
aarch64: Improve size heuristic for cpymem expansion
the following benchmarks grew in size by more than 1%:
- 458.sjeng grew in size by 4% from 105780 to 109944 bytes
- 459.GemsFDTD grew in size by 2% from 247504 to 251468 bytes
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: GCC + Glibc + GNU Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -Os -flto
- Hardware: APM Mustang 8x X-Gene1
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_apm/gnu-master-aarch64-spec2k6-Os_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Reproduce builds:
<cut>
mkdir investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9
cd investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach a459ee44c0a74b0df0485ed7a56683816c02aae9
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 8f95e3c04d659d541ca4937b3df2f1175a1c5f05
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit a459ee44c0a74b0df0485ed7a56683816c02aae9
Author: Kyrylo Tkachov <kyrylo.tkachov(a)arm.com>
Date: Wed Sep 29 11:21:45 2021 +0100
aarch64: Improve size heuristic for cpymem expansion
Similar to my previous patch for setmem this one does the same for the cpymem expansion.
We count the number of ops emitted and compare it against the alternative of just calling
the library function when optimising for size.
For the code:
void
cpy_127 (char *out, char *in)
{
__builtin_memcpy (out, in, 127);
}
void
cpy_128 (char *out, char *in)
{
__builtin_memcpy (out, in, 128);
}
we now emit a call to memcpy (with an extra MOV-immediate instruction for the size) instead of:
cpy_127(char*, char*):
ldp q0, q1, [x1]
stp q0, q1, [x0]
ldp q0, q1, [x1, 32]
stp q0, q1, [x0, 32]
ldp q0, q1, [x1, 64]
stp q0, q1, [x0, 64]
ldr q0, [x1, 96]
str q0, [x0, 96]
ldr q0, [x1, 111]
str q0, [x0, 111]
ret
cpy_128(char*, char*):
ldp q0, q1, [x1]
stp q0, q1, [x0]
ldp q0, q1, [x1, 32]
stp q0, q1, [x0, 32]
ldp q0, q1, [x1, 64]
stp q0, q1, [x0, 64]
ldp q0, q1, [x1, 96]
stp q0, q1, [x0, 96]
ret
which is a clear code size win. Speed optimisation heuristics remain unchanged.
2021-09-29 Kyrylo Tkachov <kyrylo.tkachov(a)arm.com>
* config/aarch64/aarch64.c (aarch64_expand_cpymem): Count number of
emitted operations and adjust heuristic for code size.
2021-09-29 Kyrylo Tkachov <kyrylo.tkachov(a)arm.com>
* gcc.target/aarch64/cpymem-size.c: New test.
---
gcc/config/aarch64/aarch64.c | 36 ++++++++++++++++++--------
gcc/testsuite/gcc.target/aarch64/cpymem-size.c | 29 +++++++++++++++++++++
2 files changed, 54 insertions(+), 11 deletions(-)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index ac17c1c88fb..a9a1800af53 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -23390,7 +23390,8 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
}
/* Expand cpymem, as if from a __builtin_memcpy. Return true if
- we succeed, otherwise return false. */
+ we succeed, otherwise return false, indicating that a libcall to
+ memcpy should be emitted. */
bool
aarch64_expand_cpymem (rtx *operands)
@@ -23407,11 +23408,13 @@ aarch64_expand_cpymem (rtx *operands)
unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
- /* Inline up to 256 bytes when optimizing for speed. */
+ /* Try to inline up to 256 bytes. */
unsigned HOST_WIDE_INT max_copy_size = 256;
- if (optimize_function_for_size_p (cfun))
- max_copy_size = 128;
+ bool size_p = optimize_function_for_size_p (cfun);
+
+ if (size > max_copy_size)
+ return false;
int copy_bits = 256;
@@ -23421,13 +23424,14 @@ aarch64_expand_cpymem (rtx *operands)
|| !TARGET_SIMD
|| (aarch64_tune_params.extra_tuning_flags
& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
- {
- copy_bits = 128;
- max_copy_size = max_copy_size / 2;
- }
+ copy_bits = 128;
- if (size > max_copy_size)
- return false;
+ /* Emit an inline load+store sequence and count the number of operations
+ involved. We use a simple count of just the loads and stores emitted
+ rather than rtx_insn count as all the pointer adjustments and reg copying
+ in this function will get optimized away later in the pipeline. */
+ start_sequence ();
+ unsigned nops = 0;
base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
dst = adjust_automodify_address (dst, VOIDmode, base, 0);
@@ -23456,7 +23460,8 @@ aarch64_expand_cpymem (rtx *operands)
cur_mode = V4SImode;
aarch64_copy_one_block_and_progress_pointers (&src, &dst, cur_mode);
-
+ /* A single block copy is 1 load + 1 store. */
+ nops += 2;
n -= mode_bits;
/* Emit trailing copies using overlapping unaligned accesses - this is
@@ -23471,7 +23476,16 @@ aarch64_expand_cpymem (rtx *operands)
n = n_bits;
}
}
+ rtx_insn *seq = get_insns ();
+ end_sequence ();
+
+ /* A memcpy libcall in the worst case takes 3 instructions to prepare the
+ arguments + 1 for the call. */
+ unsigned libcall_cost = 4;
+ if (size_p && libcall_cost < nops)
+ return false;
+ emit_insn (seq);
return true;
}
diff --git a/gcc/testsuite/gcc.target/aarch64/cpymem-size.c b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c
new file mode 100644
index 00000000000..4d488b74301
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-Os" } */
+
+#include <stdlib.h>
+
+/*
+** cpy_127:
+** mov x2, 127
+** b memcpy
+*/
+void
+cpy_127 (char *out, char *in)
+{
+ __builtin_memcpy (out, in, 127);
+}
+
+/*
+** cpy_128:
+** mov x2, 128
+** b memcpy
+*/
+void
+cpy_128 (char *out, char *in)
+{
+ __builtin_memcpy (out, in, 128);
+}
+
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
</cut>
Hi Arthur,
Thanks for looking into this!
The flags to compile regexec.c were:
-O3 --target=aarch64-linux-gnu -fgnu89-inline
Clang was configured with (on x86_64-linux-gnu host):
cmake -G Ninja ../llvm/llvm '-DLLVM_ENABLE_PROJECTS=clang;lld' -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=True -DCMAKE_INSTALL_PREFIX=../llvm-install -DLLVM_TARGETS_TO_BUILD=AArch64
Please let me know if the above doesn’t work for you.
Regards,
--
Maxim Kuvyrkov
https://www.linaro.org
> On 29 Sep 2021, at 20:47, Arthur Eubanks <aeubanks(a)google.com> wrote:
>
> Do you know the flags passed to Clang to compile the sources? I tried compiling the preprocessed sources but ran into the below, and couldn't find the flags in any of the logs.
>
> In file included from regexec.c:93:
> In file included from ./perl.h:384:
> In file included from /home/tcwg-buildslave/workspace/tcwg_bmk_0/abe/builds/destdir/x86_64-pc-linux-gnu/aarch64-linux-gnu/libc/usr/include/sys/types.h:144:
> /home/tcwg-buildslave/workspace/tcwg_bmk_0/llvm-install/lib/clang/14.0.0/include/stddef.h:46:27: error: typedef redefinition with different types ('unsigned long' vs 'unsigned long long')
> typedef long unsigned int size_t;
> ^
> 1 error generated.
>
>
>
> And yeah just moving the code around could cause major performance regressions, I've had other patches do the same for various benchmarks, there's not much we can do about that if that's actually the root cause. If I can compile the file I can check if the optimization actually created worse IR or not.
>
>
> On Wed, Sep 29, 2021 at 5:59 AM Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org> wrote:
> Hi Arthur,
>
> Pre-processed source is in the save-temps tarballs linked below; S_regmatch() is in regexec.i .
>
> The save-temps also have .s assembly file for before and after your patch, and the only code-gen difference is in S_reginclass() function — see the attached screenshot #1.
>
> Looking into profile of S_regmatch(), some of the extra cycles come from hot loop starting with “cbz w19,...” getting misaligned — before your patch it was starting at "2bce10", and after it starts at "2bce6c”.
>
> Maybe the added instructions in S_reginclass() pushed the loop in S_regmatch() in an unfortunate way?
>
> --
> Maxim Kuvyrkov
> https://www.linaro.org
>
>> On 27 Sep 2021, at 20:05, Arthur Eubanks <aeubanks(a)google.com> wrote:
>>
>> Could I get the source file with S_regmatch()?
>>
>> On Mon, Sep 27, 2021 at 6:07 AM Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org> wrote:
>> Hi Arthur,
>>
>> Your patch seems to be slowing down 400.perlbench by 6% — due to slow down of its hot function S_regmatch() by 14%.
>>
>> Could you take a look if this is easily fixable, please?
>>
>> Regards,
>>
>> --
>> Maxim Kuvyrkov
>> https://www.linaro.org
>>
>> > On 24 Sep 2021, at 15:07, ci_notify(a)linaro.org wrote:
>> >
>> > After llvm commit e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
>> > Author: Arthur Eubanks <aeubanks(a)google.com>
>> >
>> > [SimplifyCFG] Ignore free instructions when computing cost for folding branch to common dest
>> >
>> > the following benchmarks slowed down by more than 2%:
>> > - 400.perlbench slowed down by 6% from 9730 to 10312 perf samples
>> > - 400.perlbench:[.] S_regmatch slowed down by 14% from 3660 to 4188 perf samples
>> >
>> > Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
>> >
>> > For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
>> > - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
>> > - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
>> > - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
>> >
>> > Configuration:
>> > - Benchmark: SPEC CPU2006
>> > - Toolchain: Clang + Glibc + LLVM Linker
>> > - Version: all components were built from their tip of trunk
>> > - Target: aarch64-linux-gnu
>> > - Compiler flags: -O3
>> > - Hardware: NVidia TX1 4x Cortex-A57
>> >
>> > This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
>
> <2021-09-29_15-44-27.png><2021-09-29_15-53-20.png>
Progress
* UM-2 [QEMU upstream maintainership]
+ Worked through my code-review backlog
+ Noticed that we never got round to making our emulated GICv3
support having redistributors in more than one contiguous region;
this prevents using more than 123 CPUs with the virt board. Sent
out a patchset which adds the necessary handling.
+ Generally trying to tie off loose ends pre-holiday :-)
-- PMM