Successfully identified regression in *gcc* in CI configuration tcwg_bmk_gnu_tk1/gnu-release-arm-spec2k6-O3_LTO. So far, this commit has regressed CI configurations:
- tcwg_bmk_gnu_tk1/gnu-release-arm-spec2k6-O3_LTO

Culprit:
<cut>
commit c7207339a7dbce5b68f872064e624dcf1639ba46
Author: Wilco Dijkstra <wdijkstr@arm.com>
Date: Mon Oct 14 12:21:14 2019 +0000
[ARM] Switch to default sched pressure algorithm
Currently the Arm backend selects the alternative sched pressure algorithm. The issue is that this doesn't take register pressure into account, and so it causes significant additional spilling on Arm where there are only 14 allocatable registers. Building SPEC2006 showed significant codesize gains with the default pressure algorithm, so switch back to that. PR77308 shows ~800 fewer instructions.
SPECINT2006 is ~0.6% faster on Cortex-A57 together with the other DImode patches. Overall SPEC codesize is 1.1% smaller.
gcc/
* config/arm/arm.c (arm_option_override): Don't override sched pressure algorithm.

From-SVN: r276960
</cut>
Results regressed to (for first_bad == c7207339a7dbce5b68f872064e624dcf1639ba46)
# reset_artifacts: -10
# build_abe binutils: -9
# build_abe stage1 -- --set gcc_override_configure=--with-mode=arm --set gcc_override_configure=--disable-libsanitizer: -8
# build_abe linux: -7
# build_abe glibc: -6
# build_abe stage2 -- --set gcc_override_configure=--with-mode=arm --set gcc_override_configure=--disable-libsanitizer: -5
# true: 0
# benchmark -O3_LTO_marm -- artifacts/build-c7207339a7dbce5b68f872064e624dcf1639ba46/results_id: 1
# 410.bwaves,bwaves_base.default regressed by 108
# 454.calculix,calculix_base.default regressed by 105
# 482.sphinx3,sphinx_livepretend_base.default regressed by 104
# 436.cactusADM,cactusADM_base.default regressed by 116
# 444.namd,namd_base.default regressed by 103
# 435.gromacs,gromacs_base.default regressed by 106

from (for last_good == 7bd8bec53f0e43c7a7852c54650746e65324514b)
# reset_artifacts: -10
# build_abe binutils: -9
# build_abe stage1 -- --set gcc_override_configure=--with-mode=arm --set gcc_override_configure=--disable-libsanitizer: -8
# build_abe linux: -7
# build_abe glibc: -6
# build_abe stage2 -- --set gcc_override_configure=--with-mode=arm --set gcc_override_configure=--disable-libsanitizer: -5
# true: 0
# benchmark -O3_LTO_marm -- artifacts/build-7bd8bec53f0e43c7a7852c54650746e65324514b/results_id: 1
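The "regressed by N" figures can be turned into slowdown percentages with a few lines of Python. This is only a sketch: it assumes N is the first_bad result expressed as a percentage of the last_good baseline, which matches the "up to 16% on 436.cactusADM" reading given later in this thread for the value 116. The function name and the regular expression are illustrative and are not part of the CI scripts.

```python
import re

# Lines copied from the "Results regressed to" block of this report.
REPORT = """\
# 410.bwaves,bwaves_base.default regressed by 108
# 454.calculix,calculix_base.default regressed by 105
# 482.sphinx3,sphinx_livepretend_base.default regressed by 104
# 436.cactusADM,cactusADM_base.default regressed by 116
# 444.namd,namd_base.default regressed by 103
# 435.gromacs,gromacs_base.default regressed by 106
"""

# "# <benchmark>,<binary> regressed by <value>"
LINE_RE = re.compile(
    r"^#\s*(?P<bench>[^,]+),(?P<binary>\S+)\s+regressed by\s+(?P<value>\d+)\s*$"
)


def parse_regressions(text):
    """Return {benchmark: slowdown in percent}.

    Assumption: the reported value is a percentage of the baseline,
    so 116 is read as a 16% slowdown.
    """
    slowdowns = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            slowdowns[match.group("bench")] = int(match.group("value")) - 100
    return slowdowns


slowdowns = parse_regressions(REPORT)
worst = max(slowdowns, key=slowdowns.get)

# Print the benchmarks from largest to smallest slowdown.
for bench, pct in sorted(slowdowns.items(), key=lambda item: -item[1]):
    print(f"{bench}: +{pct}%")
```

Under that assumption, all six regressed benchmarks fall between 3% and 16% slower.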
Artifacts of last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-release-ar...
Results ID of last_good: tk1_32/tcwg_bmk_gnu_tk1/bisect-gnu-release-arm-spec2k6-O3_LTO/1468
Artifacts of first_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-release-ar...
Results ID of first_bad: tk1_32/tcwg_bmk_gnu_tk1/bisect-gnu-release-arm-spec2k6-O3_LTO/1469
Build top page/logs: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-release-ar...
Configuration details:
Reproduce builds:
<cut>
mkdir investigate-gcc-c7207339a7dbce5b68f872064e624dcf1639ba46
cd investigate-gcc-c7207339a7dbce5b68f872064e624dcf1639ba46

git clone https://git.linaro.org/toolchain/jenkins-scripts

mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-release-ar... --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-release-ar... --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-release-ar... --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

cd gcc

# Reproduce first_bad build
git checkout --detach c7207339a7dbce5b68f872064e624dcf1639ba46
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach 7bd8bec53f0e43c7a7852c54650746e65324514b
../artifacts/test.sh

cd ..
</cut>
History of pending regressions and results: https://git.linaro.org/toolchain/ci/base-artifacts.git/log/?h=linaro-local/c...
Artifacts: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-release-ar...
Build log: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-release-ar...
Full commit (up to 1000 lines):
<cut>
commit c7207339a7dbce5b68f872064e624dcf1639ba46
Author: Wilco Dijkstra <wdijkstr@arm.com>
Date: Mon Oct 14 12:21:14 2019 +0000
[ARM] Switch to default sched pressure algorithm
Currently the Arm backend selects the alternative sched pressure algorithm. The issue is that this doesn't take register pressure into account, and so it causes significant additional spilling on Arm where there are only 14 allocatable registers. Building SPEC2006 showed significant codesize gains with the default pressure algorithm, so switch back to that. PR77308 shows ~800 fewer instructions.
SPECINT2006 is ~0.6% faster on Cortex-A57 together with the other DImode patches. Overall SPEC codesize is 1.1% smaller.
gcc/
* config/arm/arm.c (arm_option_override): Don't override sched pressure algorithm.

From-SVN: r276960
---
 gcc/ChangeLog        | 5 +++++
 gcc/config/arm/arm.c | 5 -----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index c2cbd4274ca..f07a0e61e6b 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2019-10-14 Wilco Dijkstra wdijkstr@arm.com
+
+ * config/arm/arm.c (arm_option_override): Don't override sched
+ pressure algorithm.
+
 2019-10-14 Richard Biener rguenther@suse.de
 
 PR tree-optimization/92069
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 39e1a1ef9a2..394b1dd1902 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -3555,11 +3555,6 @@ arm_option_override (void)
                          global_options.x_param_values,
                          global_options_set.x_param_values);
 
-  /* Use the alternative scheduling-pressure algorithm by default. */
-  maybe_set_param_value (PARAM_SCHED_PRESSURE_ALGORITHM, SCHED_PRESSURE_MODEL,
-                         global_options.x_param_values,
-                         global_options_set.x_param_values);
-
   /* Look through ready list and all of queue for instructions
      relevant for L2 auto-prefetcher. */
   int param_sched_autopref_queue_depth;
</cut>
Hi Wilco,
This report was sent out accidentally; it is for an old patch.

Still, it appears that your patch regresses the code speed of several SPEC2k6 benchmarks, by up to 16% on 436.cactusADM, when compiled with "-marm -O3 -flto". It may be worth looking for low-hanging fruit to get some of the performance back.
-- Maxim Kuvyrkov https://www.linaro.org
On Jul 12, 2021, at 9:23 AM, ci_notify@linaro.org wrote:
> [...]
Hi Maxim,
That sounds rather strange; huge differences due to scheduling are very rare. Which microarchitecture was this run on? I can try running it on trunk and see what difference it makes with those options.
Cheers, Wilco

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
[CC: Richard S.]
Hi Wilco,
We use Nvidia TK1s (Cortex-A15) for benchmarking on 32-bit ARM.
LTO tends to increase function sizes due to additional inlining, which enlarges scheduling regions, which gives the 1st scheduler more opportunities for inter-block instruction moves, which in turn increases register pressure.
SCHED_PRESSURE_MODEL handles cases with high register pressure well, and switching it off caused a few additional spills in the hot blocks, which caused the slow-down.
It may be worthwhile to bring SCHED_PRESSURE_MODEL back when LTO is enabled.
-- Maxim Kuvyrkov https://www.linaro.org
On 12 Jul 2021, at 13:25, Wilco Dijkstra Wilco.Dijkstra@arm.com wrote:
> [...]
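The effect described above, instruction moves raising register pressure, can be illustrated with a toy model. The sketch below is not GCC's scheduler or either of its pressure algorithms: it only counts how many values are live at once in a straight-line instruction order. The helper names are made up for illustration, and the limit of 14 is the allocatable-register count quoted in the commit message.

```python
ALLOCATABLE_REGS = 14  # figure quoted in the commit message for 32-bit Arm


def peak_pressure(schedule):
    """Peak number of simultaneously live values in a straight-line schedule.

    Each instruction is (dest, [sources]). A value is live from its
    definition until its last use; values that are never used are treated
    as live-out and stay live to the end.
    """
    last_use = {}
    for index, (_dest, sources) in enumerate(schedule):
        for src in sources:
            last_use[src] = index

    live, peak = set(), 0
    for index, (dest, sources) in enumerate(schedule):
        # A source that dies here frees its register before dest is written.
        for src in sources:
            if last_use[src] == index:
                live.discard(src)
        live.add(dest)
        peak = max(peak, len(live))
    return peak


def loads_and_adds(count, hoist_loads):
    """Build 'count' load/accumulate pairs, interleaved or with loads hoisted."""
    start = [("acc0", [])]
    loads = [(f"t{i}", []) for i in range(count)]
    adds = [(f"acc{i + 1}", [f"acc{i}", f"t{i}"]) for i in range(count)]
    if hoist_loads:
        return start + loads + adds
    interleaved = [insn for pair in zip(loads, adds) for insn in pair]
    return start + interleaved


interleaved_peak = peak_pressure(loads_and_adds(16, hoist_loads=False))
hoisted_peak = peak_pressure(loads_and_adds(16, hoist_loads=True))

for name, peak in (("interleaved", interleaved_peak), ("hoisted", hoisted_peak)):
    verdict = "spills likely" if peak > ALLOCATABLE_REGS else "fits"
    print(f"{name}: peak {peak} live values ({verdict})")
```

In this model the interleaved order never needs more than two live values, while hoisting all sixteen loads ahead of their uses needs seventeen, more than the 14 available registers. A pressure-aware scheduler tries to limit exactly this kind of reordering.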
Hi Maxim,
> We use Nvidia TK1s (Cortex-A15) for benchmarking on 32-bit ARM.

That's a bit old; I used Cortex-A57 as the closest to that.

> LTO tends to increase function sizes due to additional inlining, which enlarges scheduling regions, which gives the 1st scheduler more opportunities for inter-block instruction moves, which in turn increases register pressure.

I don't think this is related to LTO: I see large differences with plain -O2 as well.

> SCHED_PRESSURE_MODEL handles cases with high register pressure well, and switching it off caused a few additional spills in the hot blocks, which caused the slow-down.
>
> It may be worthwhile to bring SCHED_PRESSURE_MODEL back when LTO is enabled.
A quick run shows that on trunk --param sched-pressure-algorithm=2 is indeed faster for FP. However, turning off pre-regalloc scheduling is better overall, since it gives a 1% gain on INT and 0.5% on FP as well as significant codesize reductions.

So the best way forward for 32-bit Arm is to turn off pre-regalloc scheduling, as it just causes lots of spilling.
Cheers, Wilco
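For anyone wanting to compare the three configurations discussed in this thread, the sketch below builds the corresponding GCC command lines. Several things in it are assumptions rather than facts from the thread: the compiler name `arm-linux-gnueabihf-gcc`, the file names, and the reading of "turning off pre-regalloc scheduling" as `-fno-schedule-insns` (the option controlling the first scheduling pass). The base flags and `--param sched-pressure-algorithm=2` are the ones quoted above. It only prints the commands; it does not run the compiler or the benchmarks.

```python
import shlex

# Options quoted earlier in the thread for the regressed configuration.
BASE_FLAGS = ["-marm", "-O3", "-flto"]

# Extra flags per configuration. The variant names are illustrative only.
VARIANTS = {
    "default": [],
    "pressure-model": ["--param", "sched-pressure-algorithm=2"],
    # Assumption: pre-regalloc scheduling is disabled with -fno-schedule-insns.
    "no-pre-ra-sched": ["-fno-schedule-insns"],
}


def compile_command(variant, source, output,
                    compiler="arm-linux-gnueabihf-gcc"):
    """Return the argv list for one configuration (compiler name is assumed)."""
    return [compiler, *BASE_FLAGS, *VARIANTS[variant], source, "-o", output]


for name in VARIANTS:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(compile_command(name, "bench.c", f"bench-{name}")))
```

Building the same sources three ways and timing them on the same board would separate the effect of the pressure algorithm from the effect of the first scheduling pass as a whole.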
linaro-toolchain@lists.linaro.org