Hi Arthur,
Thanks for looking into this!
The flags to compile regexec.c were: -O3 --target=aarch64-linux-gnu -fgnu89-inline
Clang was configured with (on x86_64-linux-gnu host): cmake -G Ninja ../llvm/llvm '-DLLVM_ENABLE_PROJECTS=clang;lld' -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=True -DCMAKE_INSTALL_PREFIX=../llvm-install -DLLVM_TARGETS_TO_BUILD=AArch64
Please let me know if the above doesn’t work for you.
Regards,
-- Maxim Kuvyrkov https://www.linaro.org
On 29 Sep 2021, at 20:47, Arthur Eubanks aeubanks@google.com wrote:
Do you know the flags passed to Clang to compile the sources? I tried compiling the preprocessed sources but ran into the below, and couldn't find the flags in any of the logs.
In file included from regexec.c:93: In file included from ./perl.h:384: In file included from /home/tcwg-buildslave/workspace/tcwg_bmk_0/abe/builds/destdir/x86_64-pc-linux-gnu/aarch64-linux-gnu/libc/usr/include/sys/types.h:144: /home/tcwg-buildslave/workspace/tcwg_bmk_0/llvm-install/lib/clang/14.0.0/include/stddef.h:46:27: error: typedef redefinition with different types ('unsigned long' vs 'unsigned long long') typedef long unsigned int size_t; ^ 1 error generated.
And yeah just moving the code around could cause major performance regressions, I've had other patches do the same for various benchmarks, there's not much we can do about that if that's actually the root cause. If I can compile the file I can check if the optimization actually created worse IR or not.
On Wed, Sep 29, 2021 at 5:59 AM Maxim Kuvyrkov maxim.kuvyrkov@linaro.org wrote: Hi Arthur,
Pre-processed source is in the save-temps tarballs linked below; S_regmatch() is in regexec.i .
The save-temps also have .s assembly file for before and after your patch, and the only code-gen difference is in S_reginclass() function — see the attached screenshot #1.
Looking into profile of S_regmatch(), some of the extra cycles come from hot loop starting with “cbz w19,...” getting misaligned — before your patch it was starting at "2bce10", and after it starts at "2bce6c”.
Maybe the added instructions in S_reginclass() pushed the loop in S_regmatch() in an unfortunate way?
-- Maxim Kuvyrkov https://www.linaro.org
On 27 Sep 2021, at 20:05, Arthur Eubanks aeubanks@google.com wrote:
Could I get the source file with S_regmatch()?
On Mon, Sep 27, 2021 at 6:07 AM Maxim Kuvyrkov maxim.kuvyrkov@linaro.org wrote: Hi Arthur,
Your patch seems to be slowing down 400.perlbench by 6% — due to slow down of its hot function S_regmatch() by 14%.
Could you take a look if this is easily fixable, please?
Regards,
-- Maxim Kuvyrkov https://www.linaro.org
On 24 Sep 2021, at 15:07, ci_notify@linaro.org wrote:
After llvm commit e7249e4acf3cf9438d6d9e02edecebd5b622a4dc Author: Arthur Eubanks aeubanks@google.com
[SimplifyCFG] Ignore free instructions when computing cost for folding branch to common dest
the following benchmarks slowed down by more than 2%:
- 400.perlbench slowed down by 6% from 9730 to 10312 perf samples
- 400.perlbench:[.] S_regmatch slowed down by 14% from 3660 to 4188 perf samples
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
<2021-09-29_15-44-27.png><2021-09-29_15-53-20.png>
Thanks for the flags, I can now reproduce.
I've basically come to the same conclusion as you. There's only one new instance of this optimization triggering throughout the whole file, in S_reginclass(). It doesn't look out of place, and S_regmatch() is identical before and after the patch. So it's likely alignment as you've already mentioned.
There's not much I can do with that information. To further confirm if this is the case, you could try passing -mllvm -align-all-functions=8 (or -align-all-blocks=4, or something along those lines) at head vs with my patch reverted and see how the performance is.
On Fri, Oct 1, 2021 at 2:41 AM Maxim Kuvyrkov maxim.kuvyrkov@linaro.org wrote:
Hi Arthur,
Thanks for looking into this!
The flags to compile regexec.c were: -O3 --target=aarch64-linux-gnu -fgnu89-inline
Clang was configured with (on x86_64-linux-gnu host): cmake -G Ninja ../llvm/llvm '-DLLVM_ENABLE_PROJECTS=clang;lld' -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=True -DCMAKE_INSTALL_PREFIX=../llvm-install -DLLVM_TARGETS_TO_BUILD=AArch64
Please let me know if the above doesn’t work for you.
Regards,
-- Maxim Kuvyrkov https://www.linaro.org
On 29 Sep 2021, at 20:47, Arthur Eubanks aeubanks@google.com wrote:
Do you know the flags passed to Clang to compile the sources? I tried
compiling the preprocessed sources but ran into the below, and couldn't find the flags in any of the logs.
In file included from regexec.c:93: In file included from ./perl.h:384: In file included from
/home/tcwg-buildslave/workspace/tcwg_bmk_0/abe/builds/destdir/x86_64-pc-linux-gnu/aarch64-linux-gnu/libc/usr/include/sys/types.h:144:
/home/tcwg-buildslave/workspace/tcwg_bmk_0/llvm-install/lib/clang/14.0.0/include/stddef.h:46:27: error: typedef redefinition with different types ('unsigned long' vs 'unsigned long long')
typedef long unsigned int size_t; ^ 1 error generated.
And yeah just moving the code around could cause major performance
regressions, I've had other patches do the same for various benchmarks, there's not much we can do about that if that's actually the root cause. If I can compile the file I can check if the optimization actually created worse IR or not.
On Wed, Sep 29, 2021 at 5:59 AM Maxim Kuvyrkov <
maxim.kuvyrkov@linaro.org> wrote:
Hi Arthur,
Pre-processed source is in the save-temps tarballs linked below;
S_regmatch() is in regexec.i .
The save-temps also have .s assembly file for before and after your
patch, and the only code-gen difference is in S_reginclass() function — see the attached screenshot #1.
Looking into profile of S_regmatch(), some of the extra cycles come from
hot loop starting with “cbz w19,...” getting misaligned — before your patch it was starting at "2bce10", and after it starts at "2bce6c”.
Maybe the added instructions in S_reginclass() pushed the loop in
S_regmatch() in an unfortunate way?
-- Maxim Kuvyrkov https://www.linaro.org
On 27 Sep 2021, at 20:05, Arthur Eubanks aeubanks@google.com wrote:
Could I get the source file with S_regmatch()?
On Mon, Sep 27, 2021 at 6:07 AM Maxim Kuvyrkov <
maxim.kuvyrkov@linaro.org> wrote:
Hi Arthur,
Your patch seems to be slowing down 400.perlbench by 6% — due to slow
down of its hot function S_regmatch() by 14%.
Could you take a look if this is easily fixable, please?
Regards,
-- Maxim Kuvyrkov https://www.linaro.org
On 24 Sep 2021, at 15:07, ci_notify@linaro.org wrote:
After llvm commit e7249e4acf3cf9438d6d9e02edecebd5b622a4dc Author: Arthur Eubanks aeubanks@google.com
[SimplifyCFG] Ignore free instructions when computing cost for
folding branch to common dest
the following benchmarks slowed down by more than 2%:
- 400.perlbench slowed down by 6% from 9730 to 10312 perf samples
- 400.perlbench:[.] S_regmatch slowed down by 14% from 3660 to 4188
perf samples
Below reproducer instructions can be used to re-build both
"first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed
source and assembly files at:
- First_bad save-temps:
https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
- Last_good save-temps:
https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
- Baseline save-temps:
https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is work-in-progress, and we welcome feedback and
suggestions at linaro-toolchain@lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
<2021-09-29_15-44-27.png><2021-09-29_15-53-20.png>
On 1 Oct 2021, at 23:37, Arthur Eubanks aeubanks@google.com wrote:
Thanks for the flags, I can now reproduce.
I've basically come to the same conclusion as you. There's only one new instance of this optimization triggering throughout the whole file, in S_reginclass(). It doesn't look out of place,
I was looking at assembly of S_reginclass() and couldn’t figure out the purpose of the extra tst/b.ne instructions. Sure it’s a tiny code-size increase, but their appearance seems silly (or, quite likely, I just don’t understand something).
Do you know why they are now generated?
and S_regmatch() is identical before and after the patch. So it's likely alignment as you've already mentioned.
There's not much I can do with that information. To further confirm if this is the case, you could try passing -mllvm -align-all-functions=8 (or -align-all-blocks=4, or something along those lines) at head vs with my patch reverted and see how the performance is.
Thanks,
-- Maxim Kuvyrkov https://www.linaro.org
On Fri, Oct 1, 2021 at 2:41 AM Maxim Kuvyrkov maxim.kuvyrkov@linaro.org wrote: Hi Arthur,
Thanks for looking into this!
The flags to compile regexec.c were: -O3 --target=aarch64-linux-gnu -fgnu89-inline
Clang was configured with (on x86_64-linux-gnu host): cmake -G Ninja ../llvm/llvm '-DLLVM_ENABLE_PROJECTS=clang;lld' -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=True -DCMAKE_INSTALL_PREFIX=../llvm-install -DLLVM_TARGETS_TO_BUILD=AArch64
Please let me know if the above doesn’t work for you.
Regards,
-- Maxim Kuvyrkov https://www.linaro.org
On 29 Sep 2021, at 20:47, Arthur Eubanks aeubanks@google.com wrote:
Do you know the flags passed to Clang to compile the sources? I tried compiling the preprocessed sources but ran into the below, and couldn't find the flags in any of the logs.
In file included from regexec.c:93: In file included from ./perl.h:384: In file included from /home/tcwg-buildslave/workspace/tcwg_bmk_0/abe/builds/destdir/x86_64-pc-linux-gnu/aarch64-linux-gnu/libc/usr/include/sys/types.h:144: /home/tcwg-buildslave/workspace/tcwg_bmk_0/llvm-install/lib/clang/14.0.0/include/stddef.h:46:27: error: typedef redefinition with different types ('unsigned long' vs 'unsigned long long') typedef long unsigned int size_t; ^ 1 error generated.
And yeah just moving the code around could cause major performance regressions, I've had other patches do the same for various benchmarks, there's not much we can do about that if that's actually the root cause. If I can compile the file I can check if the optimization actually created worse IR or not.
On Wed, Sep 29, 2021 at 5:59 AM Maxim Kuvyrkov maxim.kuvyrkov@linaro.org wrote: Hi Arthur,
Pre-processed source is in the save-temps tarballs linked below; S_regmatch() is in regexec.i .
The save-temps also have .s assembly file for before and after your patch, and the only code-gen difference is in S_reginclass() function — see the attached screenshot #1.
Looking into profile of S_regmatch(), some of the extra cycles come from hot loop starting with “cbz w19,...” getting misaligned — before your patch it was starting at "2bce10", and after it starts at "2bce6c”.
Maybe the added instructions in S_reginclass() pushed the loop in S_regmatch() in an unfortunate way?
-- Maxim Kuvyrkov https://www.linaro.org
On 27 Sep 2021, at 20:05, Arthur Eubanks aeubanks@google.com wrote:
Could I get the source file with S_regmatch()?
On Mon, Sep 27, 2021 at 6:07 AM Maxim Kuvyrkov maxim.kuvyrkov@linaro.org wrote: Hi Arthur,
Your patch seems to be slowing down 400.perlbench by 6% — due to slow down of its hot function S_regmatch() by 14%.
Could you take a look if this is easily fixable, please?
Regards,
-- Maxim Kuvyrkov https://www.linaro.org
On 24 Sep 2021, at 15:07, ci_notify@linaro.org wrote:
After llvm commit e7249e4acf3cf9438d6d9e02edecebd5b622a4dc Author: Arthur Eubanks aeubanks@google.com
[SimplifyCFG] Ignore free instructions when computing cost for folding branch to common dest
the following benchmarks slowed down by more than 2%:
- 400.perlbench slowed down by 6% from 9730 to 10312 perf samples
- 400.perlbench:[.] S_regmatch slowed down by 14% from 3660 to 4188 perf samples
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
<2021-09-29_15-44-27.png><2021-09-29_15-53-20.png>
On Fri, Oct 1, 2021 at 1:53 PM Maxim Kuvyrkov maxim.kuvyrkov@linaro.org wrote:
On 1 Oct 2021, at 23:37, Arthur Eubanks aeubanks@google.com wrote:
Thanks for the flags, I can now reproduce.
I've basically come to the same conclusion as you. There's only one new
instance of this optimization triggering throughout the whole file, in S_reginclass(). It doesn't look out of place,
I was looking at assembly of S_reginclass() and couldn’t figure out the purpose of the extra tst/b.ne instructions. Sure it’s a tiny code-size increase, but their appearance seems silly (or, quite likely, I just don’t understand something).
Do you know why they are now generated?
It's duplicated from .LBB11_29 due to the new instance of the branch folding that happens with my patch. We might end up doing one less branch if we end up getting to .LBB11_36, so the new instructions don't seem useless.
before my patch .LBB11_13: ldr str b 11_29 .LBB11_29: tst b.eq 11_36 ldrb tst b.eq 11_33
after my patch .LBB11_13: tst str b.ne 11_30 b 11_36 .LBB11_29: tst b.eq 11_36 ; fallthrough to 11_30 .LBB11_30: ldrb tst b.eq 11_33
and S_regmatch() is identical before and after the patch. So it's likely
alignment as you've already mentioned.
There's not much I can do with that information. To further confirm if this is the case, you could try passing -mllvm
-align-all-functions=8 (or -align-all-blocks=4, or something along those lines) at head vs with my patch reverted and see how the performance is.
Thanks,
-- Maxim Kuvyrkov https://www.linaro.org
On Fri, Oct 1, 2021 at 2:41 AM Maxim Kuvyrkov maxim.kuvyrkov@linaro.org
wrote:
Hi Arthur,
Thanks for looking into this!
The flags to compile regexec.c were: -O3 --target=aarch64-linux-gnu -fgnu89-inline
Clang was configured with (on x86_64-linux-gnu host): cmake -G Ninja ../llvm/llvm '-DLLVM_ENABLE_PROJECTS=clang;lld'
-DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=True -DCMAKE_INSTALL_PREFIX=../llvm-install -DLLVM_TARGETS_TO_BUILD=AArch64
Please let me know if the above doesn’t work for you.
Regards,
-- Maxim Kuvyrkov https://www.linaro.org
On 29 Sep 2021, at 20:47, Arthur Eubanks aeubanks@google.com wrote:
Do you know the flags passed to Clang to compile the sources? I tried
compiling the preprocessed sources but ran into the below, and couldn't find the flags in any of the logs.
In file included from regexec.c:93: In file included from ./perl.h:384: In file included from
/home/tcwg-buildslave/workspace/tcwg_bmk_0/abe/builds/destdir/x86_64-pc-linux-gnu/aarch64-linux-gnu/libc/usr/include/sys/types.h:144:
/home/tcwg-buildslave/workspace/tcwg_bmk_0/llvm-install/lib/clang/14.0.0/include/stddef.h:46:27: error: typedef redefinition with different types ('unsigned long' vs 'unsigned long long')
typedef long unsigned int size_t; ^ 1 error generated.
And yeah just moving the code around could cause major performance
regressions, I've had other patches do the same for various benchmarks, there's not much we can do about that if that's actually the root cause. If I can compile the file I can check if the optimization actually created worse IR or not.
On Wed, Sep 29, 2021 at 5:59 AM Maxim Kuvyrkov <
maxim.kuvyrkov@linaro.org> wrote:
Hi Arthur,
Pre-processed source is in the save-temps tarballs linked below;
S_regmatch() is in regexec.i .
The save-temps also have .s assembly file for before and after your
patch, and the only code-gen difference is in S_reginclass() function — see the attached screenshot #1.
Looking into profile of S_regmatch(), some of the extra cycles come
from hot loop starting with “cbz w19,...” getting misaligned — before your patch it was starting at "2bce10", and after it starts at "2bce6c”.
Maybe the added instructions in S_reginclass() pushed the loop in
S_regmatch() in an unfortunate way?
-- Maxim Kuvyrkov https://www.linaro.org
On 27 Sep 2021, at 20:05, Arthur Eubanks aeubanks@google.com wrote:
Could I get the source file with S_regmatch()?
On Mon, Sep 27, 2021 at 6:07 AM Maxim Kuvyrkov <
maxim.kuvyrkov@linaro.org> wrote:
Hi Arthur,
Your patch seems to be slowing down 400.perlbench by 6% — due to slow
down of its hot function S_regmatch() by 14%.
Could you take a look if this is easily fixable, please?
Regards,
-- Maxim Kuvyrkov https://www.linaro.org
On 24 Sep 2021, at 15:07, ci_notify@linaro.org wrote:
After llvm commit e7249e4acf3cf9438d6d9e02edecebd5b622a4dc Author: Arthur Eubanks aeubanks@google.com
[SimplifyCFG] Ignore free instructions when computing cost for
folding branch to common dest
the following benchmarks slowed down by more than 2%:
- 400.perlbench slowed down by 6% from 9730 to 10312 perf samples
- 400.perlbench:[.] S_regmatch slowed down by 14% from 3660 to
4188 perf samples
Below reproducer instructions can be used to re-build both
"first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed
source and assembly files at:
- First_bad save-temps:
https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
- Last_good save-temps:
https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
- Baseline save-temps:
https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is work-in-progress, and we welcome feedback
and suggestions at linaro-toolchain@lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
<2021-09-29_15-44-27.png><2021-09-29_15-53-20.png>
linaro-toolchain@lists.linaro.org