After gcc commit 4a960d548b7d7d942f316c5295f6d849b74214f5
Author: Aldy Hernandez <aldyh@redhat.com>
Avoid invalid loop transformations in jump threading registry.
the following benchmarks slowed down by more than 2%:
- 471.omnetpp slowed down by 8% from 6348 to 6828 perf samples
The reproducer instructions below can be used to re-build both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm...
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm...
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm...

Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: GCC + Glibc + GNU Linker
- Version: all components were built from their tip of trunk
- Target: arm-linux-gnueabihf
- Compiler flags: -O3 -marm
- Hardware: NVidia TK1 4x Cortex-A15
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_tk1/gnu-master-arm-spec2k6-O3

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm...
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm...
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm...
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm...

Reproduce builds:
<cut>
mkdir investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5
cd investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm... --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm... --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm... --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/

cd gcc

# Reproduce first_bad build
git checkout --detach 4a960d548b7d7d942f316c5295f6d849b74214f5
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach 29c92857039d0a105281be61c10c9e851aaeea4a
../artifacts/test.sh

cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 4a960d548b7d7d942f316c5295f6d849b74214f5
Author: Aldy Hernandez <aldyh@redhat.com>
Date:   Thu Sep 23 10:59:24 2021 +0200
Avoid invalid loop transformations in jump threading registry.
My upcoming improvements to the forward jump threader make it thread more aggressively. In investigating some "regressions", I noticed that it has always allowed threading through empty latches and across loop boundaries. As we have discussed recently, this should be avoided until after loop optimizations have run their course.
Note that this wasn't much of a problem before because DOM/VRP couldn't find these opportunities, but with a smarter solver, we trip over them more easily.
Because the forward threader doesn't have an independent localized cost model like the new threader (profitable_path_p), it is difficult to catch these things at discovery. However, we can catch them at registration time, with the added benefit that all the threaders (forward and backward) can share the handcuffs.
This patch is an adaptation of what we do in the backward threader, but it is not meant to catch everything we do there, as some of the restrictions there are due to limitations of the different block copiers (for example, the generic copier does not re-use existing threading paths).
We could ideally remove the now redundant bits in profitable_path_p, but I would prefer not to for two reasons. First, the backward threader uses profitable_path_p as it discovers paths to avoid discovering paths in unprofitable directions. Second, I would like to merge all the forward cost restrictions into the profitability class in the backward threader, not the other way around. Alas, that reshuffling will have to wait for the next release.
As usual, there are quite a few tests that needed adjustments. It seems we were quite happily threading improper scenarios. With most of them, as can be seen in pr77445-2.c, we're merely shifting the threading to after loop optimizations.
Tested on x86-64 Linux.
gcc/ChangeLog:
* tree-ssa-threadupdate.c (jt_path_registry::cancel_invalid_paths): New.
(jt_path_registry::register_jump_thread): Call cancel_invalid_paths.
* tree-ssa-threadupdate.h (class jt_path_registry): Add cancel_invalid_paths.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/20030714-2.c: Adjust.
* gcc.dg/tree-ssa/pr66752-3.c: Adjust.
* gcc.dg/tree-ssa/pr77445-2.c: Adjust.
* gcc.dg/tree-ssa/ssa-dom-thread-18.c: Adjust.
* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust.
* gcc.dg/vect/bb-slp-16.c: Adjust.
---
 gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c        |  7 ++-
 gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c         | 19 ++++---
 gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c         |  4 +-
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c |  4 +-
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c  |  4 +-
 gcc/testsuite/gcc.dg/vect/bb-slp-16.c             |  7 ---
 gcc/tree-ssa-threadupdate.c                       | 67 ++++++++++++++++++-----
 gcc/tree-ssa-threadupdate.h                       |  1 +
 8 files changed, 78 insertions(+), 35 deletions(-)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
index eb663f2ff5b..9585ff11307 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
@@ -32,7 +32,8 @@ get_alias_set (t)
     }
 }
 
-/* There should be exactly three IF conditionals if we thread jumps
-   properly. */
-/* { dg-final { scan-tree-dump-times "if " 3 "dom2"} } */
+/* There should be exactly 4 IF conditionals if we thread jumps
+   properly.  There used to be 3, but one thread was crossing
+   loops. */
+/* { dg-final { scan-tree-dump-times "if " 4 "dom2"} } */
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
index e1464e21170..922a331b217 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-dce2" } */
+/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3" } */
 
 extern int status, pt;
 extern int count;
@@ -32,10 +32,15 @@ foo (int N, int c, int b, int *a)
   pt--;
 }
 
-/* There are 4 jump threading opportunities, all of which will be
-   realized, which will eliminate testing of FLAG, completely. */
-/* { dg-final { scan-tree-dump-times "Registering jump" 4 "thread1"} } */
+/* There are 2 jump threading opportunities (which don't cross loops),
+   all of which will be realized, which will eliminate testing of
+   FLAG, completely. */
+/* { dg-final { scan-tree-dump-times "Registering jump" 2 "thread1"} } */
 
-/* There should be no assignments or references to FLAG, verify they're
-   eliminated as early as possible. */
-/* { dg-final { scan-tree-dump-not "if .flag" "dce2"} } */
+/* We used to remove references to FLAG by DCE2, but this was
+   depending on early threaders threading through loop boundaries
+   (which we shouldn't do).  However, the late threading passes, which
+   run after loop optimizations , can successfully eliminate the
+   references to FLAG.  Verify that ther are no references by the late
+   threading passes. */
+/* { dg-final { scan-tree-dump-not "if .flag" "thread3"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
index f9fc212f49e..01a0f1f197d 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
@@ -123,8 +123,8 @@ enum STATES FMS( u8 **in , u32 *transitions) {
    aarch64 has the highest CASE_VALUES_THRESHOLD in GCC.  It's high enough
    to change decisions in switch expansion which in turn can expose new
    jump threading opportunities.  Skip the later tests on aarch64. */
-/* { dg-final { scan-tree-dump "Jumps threaded: 1[1-9]" "thread1" } } */
-/* { dg-final { scan-tree-dump-times "Invalid sum" 4 "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 9" "thread1" } } */
+/* { dg-final { scan-tree-dump-times "Invalid sum" 1 "thread1" } } */
 /* { dg-final { scan-tree-dump-not "optimizing for size" "thread1" } } */
 /* { dg-final { scan-tree-dump-not "optimizing for size" "thread2" } } */
 /* { dg-final { scan-tree-dump-not "optimizing for size" "thread3" { target { ! aarch64*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
index 60d4f76f076..2d78d045516 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
@@ -21,5 +21,7 @@ condition.
 
    All the cases are picked up by VRP1 as jump threads. */
-/* { dg-final { scan-tree-dump-times "Registering jump" 6 "thread1" } } */
+
+/* There used to be 6 jump threads found by thread1, but they all
+   depended on threading through distinct loops in ethread. */
 /* { dg-final { scan-tree-dump-times "Threaded" 2 "vrp1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
index e3d4b311c03..16abcde5053 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
@@ -1,8 +1,8 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
 
-/* { dg-final { scan-tree-dump "Jumps threaded: 18" "thread1" } } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 8" "thread3" { target { ! aarch64*-*-* } } } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 12" "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 5" "thread3" { target { ! aarch64*-*-* } } } } */
 /* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2" } } */
 
 /* aarch64 has the highest CASE_VALUES_THRESHOLD in GCC.  It's high enough
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
index 664e93e9b60..e68a9b62535 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
@@ -1,8 +1,5 @@
 /* { dg-require-effective-target vect_int } */
 
-/* See note below as to why we disable threading. */
-/* { dg-additional-options "-fdisable-tree-thread1" } */
-
 #include <stdarg.h>
 #include "tree-vect.h"
 
@@ -30,10 +27,6 @@ main1 (int dummy)
       *pout++ = *pin++ + a;
       *pout++ = *pin++ + a;
       *pout++ = *pin++ + a;
-      /* In some architectures like ppc64, jump threading may thread
-         the iteration where i==0 such that we no longer optimize the
-         BB.  Another alternative to disable jump threading would be
-         to wrap the read from `i' into a function returning i. */
       if (arr[i] = i)
         a = i;
       else
diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index baac11280fa..2b9b8f81274 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -2757,6 +2757,58 @@ fwd_jt_path_registry::update_cfg (bool may_peel_loop_headers)
   return retval;
 }
 
+bool
+jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
+{
+  gcc_checking_assert (!path.is_empty ());
+  edge taken_edge = path[path.length () - 1]->e;
+  loop_p loop = taken_edge->src->loop_father;
+  bool seen_latch = false;
+  bool path_crosses_loops = false;
+
+  for (unsigned int i = 0; i < path.length (); i++)
+    {
+      edge e = path[i]->e;
+
+      if (e == NULL)
+        {
+          // NULL outgoing edges on a path can happen for jumping to a
+          // constant address.
+          cancel_thread (&path, "Found NULL edge in jump threading path");
+          return true;
+        }
+
+      if (loop->latch == e->src || loop->latch == e->dest)
+        seen_latch = true;
+
+      // The first entry represents the block with an outgoing edge
+      // that we will redirect to the jump threading path.  Thus we
+      // don't care about that block's loop father.
+      if ((i > 0 && e->src->loop_father != loop)
+          || e->dest->loop_father != loop)
+        path_crosses_loops = true;
+
+      if (flag_checking && !m_backedge_threads)
+        gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0);
+    }
+
+  if (cfun->curr_properties & PROP_loop_opts_done)
+    return false;
+
+  if (seen_latch && empty_block_p (loop->latch))
+    {
+      cancel_thread (&path, "Threading through latch before loop opts "
+                     "would create non-empty latch");
+      return true;
+    }
+  if (path_crosses_loops)
+    {
+      cancel_thread (&path, "Path crosses loops");
+      return true;
+    }
+  return false;
+}
+
 /* Register a jump threading opportunity.  We queue up all the jump
    threading opportunities discovered by a pass and update the CFG
    and SSA form all at once.
@@ -2776,19 +2828,8 @@ jt_path_registry::register_jump_thread (vec<jump_thread_edge *> *path)
       return false;
     }
 
-  /* First make sure there are no NULL outgoing edges on the jump threading
-     path.  That can happen for jumping to a constant address. */
-  for (unsigned int i = 0; i < path->length (); i++)
-    {
-      if ((*path)[i]->e == NULL)
-        {
-          cancel_thread (path, "Found NULL edge in jump threading path");
-          return false;
-        }
-
-      if (flag_checking && !m_backedge_threads)
-        gcc_assert (((*path)[i]->e->flags & EDGE_DFS_BACK) == 0);
-    }
+  if (cancel_invalid_paths (*path))
+    return false;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     dump_jump_thread_path (dump_file, *path, true);
diff --git a/gcc/tree-ssa-threadupdate.h b/gcc/tree-ssa-threadupdate.h
index 8b48a671212..d68795c9f27 100644
--- a/gcc/tree-ssa-threadupdate.h
+++ b/gcc/tree-ssa-threadupdate.h
@@ -75,6 +75,7 @@ protected:
   unsigned long m_num_threaded_edges;
 private:
   virtual bool update_cfg (bool peel_loop_headers) = 0;
+  bool cancel_invalid_paths (vec<jump_thread_edge *> &path);
   jump_thread_path_allocator m_allocator;
   // True if threading through back edges is allowed.  This is only
   // allowed in the generic copier in the backward threader.
</cut>
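To make the terminology in the commit message above concrete, here is a minimal sketch of the kind of jump thread that the new cancel_invalid_paths () check rejects. It is purely illustrative: the function names are made up, and the testcase is not taken from the commit, the GCC testsuite, or 471.omnetpp.

<cut>
/* FLAG does not change inside the loop, so the outcome of the in-loop
   test is already known on the path coming from the block in front of
   the loop.  Threading that opportunity would copy blocks along a path
   that starts outside the loop and ends inside it, i.e. the path
   "crosses loops".  The related case, threading through an empty loop
   latch, would copy statements into the latch and leave the loop
   without the simple latch that the loop optimizers expect.  */

extern void do_this (int);
extern void do_that (int);

void
example (int flag, int n)
{
  if (flag)
    do_this (0);

  for (int i = 0; i < n; i++)
    {
      if (flag)   /* Outcome known when entering the loop from above.  */
        do_this (i);
      else
        do_that (i);
    }
}
</cut>

As the commit message notes, such threads are not lost outright; they are merely deferred to the threading passes that run after the loop optimizers.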
Hi Aldy,
Your patch seems to slow down 471.omnetpp by 8% at -O3. Could you please take a look and see whether this is something that could be easily fixed?
Regards,
-- Maxim Kuvyrkov https://www.linaro.org
Also, it slightly increases code size of 450.soplex at -Os -flto:
https://lists.linaro.org/pipermail/linaro-toolchain/2021-September/007883.ht...
-- Maxim Kuvyrkov https://www.linaro.org
On 27 Sep 2021, at 15:53, Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org> wrote:
Hi Aldy,
Your patch seems to slow down 471.omnetpp by 8% at -O3. Could you please take a look and see whether this is something that could be easily fixed?
Regards,
-- Maxim Kuvyrkov https://www.linaro.org
[CCing Jeff and list for broader audience]
On 9/27/21 2:53 PM, Maxim Kuvyrkov wrote:
Hi Aldy,
Your patch seems to slow down 471.omnetpp by 8% at -O3. Could you please take a look and see whether this is something that could be easily fixed?
First of all, thanks for chasing this down. It's incredibly useful to have these types of bug reports.
Jeff and I have been discussing the repercussions of adjusting the loop-crossing restrictions in the various threaders. He has seen some regressions on embedded targets where disallowing certain corner cases of loop-crossing threads causes all sorts of grief.
Out of curiosity, does the attached (untested) patch fix the regression?
Aldy
- bool cancel_invalid_paths (vec<jump_thread_edge *> &path); jump_thread_path_allocator m_allocator; // True if threading through back edges is allowed. This is only // allowed in the generic copier in the backward threader.
</cut>
On 27 Sep 2021, at 16:52, Aldy Hernandez aldyh@redhat.com wrote:
[CCing Jeff and list for broader audience]
On 9/27/21 2:53 PM, Maxim Kuvyrkov wrote:
Hi Aldy, Your patch seems to slow down 471.omnetpp by 8% at -O3. Could you please take a look if this is something that could be easily fixed?
First of all, thanks for chasing this down. It's incredibly useful to have these types of bug reports.
Thanks, Aldy, this is music to my ears :-).
We have built this automated benchmarking CI that bisects code-speed and code-size regressions down to a single commit. It is still a work in progress, and I’m forwarding these reports to the patch authors whose patches caused regressions. If the GCC community finds these useful, we can also set up posting to one of GCC’s mailing lists.
Jeff and I have been discussing the repercussions of adjusting the loop crossing restrictions in the various threaders. He's seen some regressions in embedded targets when disallowing certain corner cases of loop crossing threads causes all sorts of grief.
Out of curiosity, does the attached (untested) patch fix the regression?
I’ll test the patch and will follow up.
Regards,
-- Maxim Kuvyrkov https://www.linaro.org
Aldy
Regards,
Maxim Kuvyrkov https://www.linaro.org
<jeff.txt>
On 9/27/21 11:39 AM, Maxim Kuvyrkov via Gcc wrote:
On 27 Sep 2021, at 16:52, Aldy Hernandez aldyh@redhat.com wrote:
[CCing Jeff and list for broader audience]
On 9/27/21 2:53 PM, Maxim Kuvyrkov wrote:
Hi Aldy, Your patch seems to slow down 471.omnetpp by 8% at -O3. Could you please take a look if this is something that could be easily fixed?
First of all, thanks for chasing this down. It's incredibly useful to have these types of bug reports.
Thanks, Aldy, this is music to my ears :-).
We have built this automated benchmarking CI that bisects code-speed and code-size regressions down to a single commit. It is still a work in progress, and I’m forwarding these reports to the patch authors whose patches caused regressions. If the GCC community finds these useful, we can also set up posting to one of GCC’s mailing lists.
I second that this sort of thing is incredibly useful. I don't suppose it's easy to do the reverse?... let patch authors know when they've caused a significant improvement? :-) That would be much less common, I suspect, so perhaps not worth it :-)
It's certainly very useful when we are making a wholesale change to a pass which we think is beneficial, but aren't sure.
And a follow-up question... Sometimes we have no good way of determining the widespread run-time effects of a change. You seem to be running SPEC/other things continuously then? Does it run like once a day/some-time-period, and if you note a regression, narrow it down? Regardless, I think it could be very useful to be able to see the results of anything you do run at whatever frequency it happens.
On 27 Sep 2021, at 19:02, Andrew MacLeod amacleod@redhat.com wrote:
On 9/27/21 11:39 AM, Maxim Kuvyrkov via Gcc wrote:
On 27 Sep 2021, at 16:52, Aldy Hernandez aldyh@redhat.com wrote:
[CCing Jeff and list for broader audience]
On 9/27/21 2:53 PM, Maxim Kuvyrkov wrote:
Hi Aldy, Your patch seems to slow down 471.omnetpp by 8% at -O3. Could you please take a look if this is something that could be easily fixed?
First of all, thanks for chasing this down. It's incredibly useful to have these types of bug reports.
Thanks, Aldy, this is music to my ears :-).
We have built this automated benchmarking CI that bisects code-speed and code-size regressions down to a single commit. It is still a work in progress, and I’m forwarding these reports to the patch authors whose patches caused regressions. If the GCC community finds these useful, we can also set up posting to one of GCC’s mailing lists.
I second that this sort of thing is incredibly useful. I don't suppose it's easy to do the reverse?... let patch authors know when they've caused a significant improvement? :-) That would be much less common, I suspect, so perhaps not worth it :-)
We do this occasionally, when identifying a regression in a patch-revert commit :-). Seriously, though, it's an easy enough change to the metric, but we are maxing out our benchmarking capacity with the current configuration matrix.
It's certainly very useful when we are making a wholesale change to a pass which we think is beneficial, but aren't sure.
And a followup question... Sometimes we have no good way of determining the widespread run-time effects of a change. You seem to be running SPEC/other things continuously then?
We continuously run SPEC CPU2006 on {arm,aarch64}-{-Os/-O2/-O3}-{no LTO/LTO} matrix for GNU and LLVM toolchains.
In the GNU toolchain we track master branches and latest-release branches of Binutils, GCC and Glibc — and detect code-speed and code-size regressions across all toolchain components.
Does it run like once a day/some-time-period, and if you note a regression, narrow it down?
Configurations that track master branches have 3-day intervals. Configurations that track release branches — 6 days. If a regression is detected it is narrowed down to component first — binutils, gcc or glibc — and then the commit range of the component is bisected down to a specific commit. All. Done. Automatically.
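(Illustrative sketch only, not the actual TCWG jenkins-scripts: once the offending component and its commit range are known, the final narrowing step can be driven with "git bisect run" against a benchmark threshold. The helpers build_toolchain and run_benchmark, the revision variables, and BASELINE_SAMPLES below are hypothetical.)
<cut>
# Hypothetical reproduction of the narrowing step; not the real CI scripts.
cd gcc
git bisect start $FIRST_BAD_REV $LAST_GOOD_REV   # endpoints identified by the CI

cat > /tmp/check-regression.sh <<'EOF'
#!/bin/bash
# git bisect run convention: exit 0 = good, 1 = bad, 125 = skip this revision.
build_toolchain || exit 125            # hypothetical helper: build GCC at the revision under test
samples=$(run_benchmark 471.omnetpp)   # hypothetical helper: prints perf samples for the benchmark
# Mark the revision bad if it is more than 2% slower than the baseline.
threshold=$(( BASELINE_SAMPLES * 102 / 100 ))   # BASELINE_SAMPLES exported beforehand
[ "$samples" -le "$threshold" ]
EOF
chmod +x /tmp/check-regression.sh

git bisect run /tmp/check-regression.sh
git bisect reset
</cut>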
I will make a presentation on this CI at the next GNU Tools Cauldron.
Regardless, I think it could be very useful to be able to see the results of anything you do run at whatever frequency it happens.
Thanks!
-- Maxim Kuvyrkov https://www.linaro.org
On 9/29/21 7:59 AM, Maxim Kuvyrkov wrote:
Does it run like once a day/some-time-period, and if you note a regression, narrow it down?
Configurations that track master branches have 3-day intervals. Configurations that track release branches — 6 days. If a regression is detected it is narrowed down to component first — binutils, gcc or glibc — and then the commit range of the component is bisected down to a specific commit. All. Done. Automatically.
I will make a presentation on this CI at the next GNU Tools Cauldron.
Regardless, I think it could be very useful to be able to see the results of anything you do run at whatever frequency it happens.
Thanks!
--
One more follow-on question: is this information/summary of the results for every 3-day interval of master published anywhere? I.e., to a web page or posted somewhere? That seems like it could be useful, especially with a +/- differential from the previous run (which you obviously calculate to determine if there is a regression).
Anyway, I like it!
Andrew
On 29 Sep 2021, at 21:21, Andrew MacLeod amacleod@redhat.com wrote:
On 9/29/21 7:59 AM, Maxim Kuvyrkov wrote:
Does it run like once a day/some-time-period, and if you note a regression, narrow it down?
Configurations that track master branches have 3-day intervals. Configurations that track release branches — 6 days. If a regression is detected it is narrowed down to component first — binutils, gcc or glibc — and then the commit range of the component is bisected down to a specific commit. All. Done. Automatically.
I will make a presentation on this CI at the next GNU Tools Cauldron.
Regardless, I think it could be very useful to be able to see the results of anything you do run at whatever frequency it happens.
Thanks!
--
One more follow-on question: is this information/summary of the results for every 3-day interval of master published anywhere? I.e., to a web page or posted somewhere? That seems like it could be useful, especially with a +/- differential from the previous run (which you obviously calculate to determine if there is a regression).
It’s our next big improvement — to provide a dashboard with current performance numbers and historical stats. Performance summary information is publicly available as artifacts in jenkins jobs (e.g., [1]), but one needs to know exactly where to look.
We plan to implement the dashboard before the end of the year.
We also have raw perf.data files and benchmark executables stashed for detailed inspection. I /think/ we can publish these for SPEC CPU2xxx benchmarks — they are all based on open-source software. For other benchmarks (EEMBC, CoreMark Pro) we can’t publish much beyond time/size metrics.
[1] https://ci.linaro.org/view/tcwg_bmk_ci_gnu/job/tcwg_bmk_ci_gnu-build-tcwg_bm...
Regards,
-- Maxim Kuvyrkov https://www.linaro.org
On Wed, 29 Sep 2021, Maxim Kuvyrkov via Gcc wrote:
Configurations that track master branches have 3-day intervals. Configurations that track release branches — 6 days. If a regression is detected it is narrowed down to component first — binutils, gcc or glibc — and then the commit range of the component is bisected down to a specific commit. All. Done. Automatically.
I will make a presentation on this CI at the next GNU Tools Cauldron.
Yes, please! :-)
On Fri, 1 Oct 2021, Maxim Kuvyrkov via Gcc wrote:
It’s our next big improvement — to provide a dashboard with current performance numbers and historical stats.
Awesome. And then we can even link from gcc.gnu.org.
Gerald
Hi,
On Fri, Oct 01 2021, Gerald Pfeifer wrote:
On Wed, 29 Sep 2021, Maxim Kuvyrkov via Gcc wrote:
Configurations that track master branches have 3-day intervals. Configurations that track release branches — 6 days. If a regression is detected it is narrowed down to component first — binutils, gcc or glibc — and then the commit range of the component is bisected down to a specific commit. All. Done. Automatically.
I will make a presentation on this CI at the next GNU Tools Cauldron.
Yes, please! :-)
On Fri, 1 Oct 2021, Maxim Kuvyrkov via Gcc wrote:
It’s our next big improvement — to provide a dashboard with current performance numbers and historical stats.
Awesome. And then we can even link from gcc.gnu.org.
You all are aware of the openSUSE LNT periodic SPEC benchmarker, right? Martin may explain better how to move around it, but the two most interesting result pages are:
- https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report and
- https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/branch
Martin
On 8 Oct 2021, at 13:22, Martin Jambor mjambor@suse.cz wrote:
Hi,
On Fri, Oct 01 2021, Gerald Pfeifer wrote:
On Wed, 29 Sep 2021, Maxim Kuvyrkov via Gcc wrote:
Configurations that track master branches have 3-day intervals. Configurations that track release branches — 6 days. If a regression is detected it is narrowed down to component first — binutils, gcc or glibc — and then the commit range of the component is bisected down to a specific commit. All. Done. Automatically.
I will make a presentation on this CI at the next GNU Tools Cauldron.
Yes, please! :-)
On Fri, 1 Oct 2021, Maxim Kuvyrkov via Gcc wrote:
It’s our next big improvement — to provide a dashboard with current performance numbers and historical stats.
Awesome. And then we can even link from gcc.gnu.org.
You all are aware of the openSUSE LNT periodic SPEC benchmarker, right? Martin may explain better how to move around it, but the two most interesting result pages are:
Hi Martin,
The novel part of TCWG CI is that it bisects “regressions” down to a single commit, thus pin-pointing the interesting commit, and can send out notifications to patch authors.
We do generate a fair amount of benchmarking data for AArch64 and AArch32, and I want to have it plotted somewhere. I have started to put together an LNT instance to do that, but after a couple of days I couldn't figure out the setup. Could you share the configuration of your LNT instance? Or, perhaps, make it open to the community so that others can upload the results?
Thanks,
-- Maxim Kuvyrkov https://www.linaro.org
On 10/11/21 13:05, Maxim Kuvyrkov wrote:
On 8 Oct 2021, at 13:22, Martin Jambor mjambor@suse.cz wrote:
Hi,
On Fri, Oct 01 2021, Gerald Pfeifer wrote:
On Wed, 29 Sep 2021, Maxim Kuvyrkov via Gcc wrote:
Configurations that track master branches have 3-day intervals. Configurations that track release branches — 6 days. If a regression is detected it is narrowed down to component first — binutils, gcc or glibc — and then the commit range of the component is bisected down to a specific commit. All. Done. Automatically.
I will make a presentation on this CI at the next GNU Tools Cauldron.
Yes, please! :-)
On Fri, 1 Oct 2021, Maxim Kuvyrkov via Gcc wrote:
It’s our next big improvement — to provide a dashboard with current performance numbers and historical stats.
Awesome. And then we can even link from gcc.gnu.org.
You all are aware of the openSUSE LNT periodic SPEC benchmarker, right? Martin may explain better how to move around it, but the two most interesting result pages are:
Hi Martin,
The novel part of TCWG CI is that it bisects “regressions” down to a single commit, thus pin-pointing the interesting commit, and can send out notifications to patch authors.
Hello Maxim.
We do generate a fair number of benchmarking data for AArch64 and AArch32, and I want to have them plotted somewhere. I have started to put together an LNT instance to do that, but after a couple of days I couldn't figure out the setup. Could you share the configuration of your LNT instance? Or, perhaps, make it open to the community so that others can upload the results?
Sure, I would be more than happy to share our LNT configuration. Note that we don't use the vanilla version, because it does not support git revisions (so we use $timestamp.$hash), and the modified LNT GUI can interpret that.
As Martin mentioned, the useful latest_runs_report page was upstreamed by me: https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report
These pages:
- https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/branch
- https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/options
- https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/tuning
do rely on a special naming scheme for machines, e.g. benzen.spec2006.gcc-10.Ofast_generic, and a custom modification of LNT generates them. I can share it with you as well.
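(A minimal sketch of the naming convention above, assuming a host.suite.compiler.tuning layout; the variable names are illustrative, not Martin's actual setup.)
<cut>
# Hypothetical illustration of the machine-name convention.
host=benzen; suite=spec2006; compiler=gcc-10; tuning=Ofast_generic
machine="${host}.${suite}.${compiler}.${tuning}"
echo "LNT machine name: ${machine}"    # benzen.spec2006.gcc-10.Ofast_generic

# A report page can recover the individual fields by splitting on '.':
IFS=. read -r host suite compiler tuning <<< "${machine}"
echo "host=${host} suite=${suite} compiler=${compiler} tuning=${tuning}"
</cut>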
@Maxim: Please write me a private email and I can share all the details you need.
About the public LNT instance, we are likely not willing to share it right now.
Cheers, Martin
Thanks,
-- Maxim Kuvyrkov https://www.linaro.org
On 9/27/2021 7:52 AM, Aldy Hernandez wrote:
[CCing Jeff and list for broader audience]
On 9/27/21 2:53 PM, Maxim Kuvyrkov wrote:
Hi Aldy,
Your patch seems to slow down 471.omnetpp by 8% at -O3. Could you please take a look if this is something that could be easily fixed?
First of all, thanks for chasing this down. It's incredibly useful to have these types of bug reports.
Jeff and I have been discussing the repercussions of adjusting the loop crossing restrictions in the various threaders. He's seen some regressions in embedded targets when disallowing certain corner cases of loop crossing threads causes all sorts of grief.
Out of curiosity, does the attached (untested) patch fix the regression?
And just a note, that patch doesn't seem to fix the regressions on visium or rl78. I haven't checked any of the other regressing targets yet.
jeff
linaro-toolchain@lists.linaro.org