[TCWG CI] 464.h264ref slowed down by 7% after llvm: [PassManager] `buildModuleOptimizationPipeline()`: schedule `LoopDeletion` pass run before vectorization passes

6 Nov 2021

After llvm commit 9c2469c1ddb34517de8dafd83d1940deada3fc22
Author: Roman Lebedev <lebedev.ri@gmail.com>
[PassManager] `buildModuleOptimizationPipeline()`: schedule `LoopDeletion` pass run before vectorization passes
the following benchmarks slowed down by more than 2%:
- 464.h264ref slowed down by 7% from 10836 to 11596 perf samples
  - 464.h264ref:[.] FastFullPelBlockMotionSearch slowed down by 46% from 1525 to 2231 perf samples
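The percentages above follow directly from the perf sample counts. A minimal sketch of that arithmetic, assuming the slowdown is the relative increase in samples between the last_good and first_bad runs (the helper name `slowdown_percent` is ours for illustration and is not part of the CI scripts):

```python
def slowdown_percent(before, after):
    """Relative increase in perf samples, in percent."""
    return (after - before) / before * 100.0


# Sample counts quoted in the report above.
benchmark = slowdown_percent(10836, 11596)
hot_function = slowdown_percent(1525, 2231)

print(f"464.h264ref: {benchmark:.1f}%")                    # 7.0%
print(f"FastFullPelBlockMotionSearch: {hot_function:.1f}%")  # 46.3%
```

Note that the 706 extra samples in FastFullPelBlockMotionSearch make up most of the 760-sample increase for the whole benchmark, so that function is the natural place to start looking.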
The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection.  Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org .  Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF.  BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
 - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Reproduce builds:
<cut>
mkdir investigate-llvm-9c2469c1ddb34517de8dafd83d1940deada3fc22
cd investigate-llvm-9c2469c1ddb34517de8dafd83d1940deada3fc22
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach 9c2469c1ddb34517de8dafd83d1940deada3fc22
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 4bef0304e153c757c9f42c2001d4c56e8f99929e
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 9c2469c1ddb34517de8dafd83d1940deada3fc22
Author: Roman Lebedev <lebedev.ri@gmail.com>
Date:   Wed Nov 3 19:23:25 2021 +0300
[PassManager] `buildModuleOptimizationPipeline()`: schedule `LoopDeletion` pass run before vectorization passes
Test thanks to Michael Kuklinski from `#llvm`: https://godbolt.org/z/bdrah5Goo
    originally inspired by Daniel Lemire's https://lemire.me/blog/2021/10/26/in-c-is-empty-faster-than-comparing-the-si...
We manage to deduce that the answer does not require looping,
    but we do that after the last `LoopDeletion` pass run,
    so we end up being stuck with a dead loop.
Now, as with all things SCEV, this has
    a very expected ~`+0.12%` compile time performance regression:
    https://llvm-compile-time-tracker.com/compare.php?from=0ae7bf124a9bca76dd9a9...
    (for comparison, doing that in function simplification pipeline
    would have been ~`+0.5` compile time performance regression, D112840)
Looking at the transformation stats over vanilla test-suite, i think it's rather expected:
```
| statistic name                                   |  baseline |  proposed |     Δ |      % |    |%| |
|--------------------------------------------------|----------:|----------:|------:|-------:|-------:|
| scalar-evolution.NumBruteForceTripCountsComputed |       789 |       888 |    99 | 12.55% | 12.55% |
| scalar-evolution.NumTripCountsNotComputed        |    105592 |    117900 | 12308 | 11.66% | 11.66% |
| loop-delete.NumBackedgesBroken                   |       542 |       559 |    17 |  3.14% |  3.14% |
| regalloc.numExtends                              |        81 |        79 |    -2 | -2.47% |  2.47% |
| indvars.NumFoldedUser                            |       408 |       400 |    -8 | -1.96% |  1.96% |
| indvars.NumElimCmp                               |      3831 |      3758 |   -73 | -1.91% |  1.91% |
| scalar-evolution.NumTripCountsComputed           |    299759 |    304278 |  4519 |  1.51% |  1.51% |
| loop-delete.NumDeleted                           |      8055 |      8128 |    73 |  0.91% |  0.91% |
| machine-cse.NumCommutes                          |       111 |       110 |    -1 | -0.90% |  0.90% |
| globaldce.NumFunctions                           |      1187 |      1192 |     5 |  0.42% |  0.42% |
| codegenprepare.NumSelectsExpanded                |       277 |       278 |     1 |  0.36% |  0.36% |
| loop-unroll.NumRuntimeUnrolled                   |     13841 |     13791 |   -50 | -0.36% |  0.36% |
| machinelicm.NumPostRAHoisted                     |      1168 |      1172 |     4 |  0.34% |  0.34% |
| phi-node-elimination.NumCriticalEdgesSplit       |     83054 |     82879 |  -175 | -0.21% |  0.21% |
| machine-cse.NumPREs                              |      3085 |      3079 |    -6 | -0.19% |  0.19% |
| branch-folder.NumBranchOpts                      |    108122 |    107942 |  -180 | -0.17% |  0.17% |
| loop-unroll.NumUnrolled                          |     40136 |     40067 |   -69 | -0.17% |  0.17% |
| branch-folder.NumDeadBlocks                      |    130818 |    130607 |  -211 | -0.16% |  0.16% |
| codegenprepare.NumBlocksElim                     |     92856 |     92714 |  -142 | -0.15% |  0.15% |
| instsimplify.NumSimplified                       |    103263 |    103129 |  -134 | -0.13% |  0.13% |
| instcombine.NumConstProp                         |     26070 |     26102 |    32 |  0.12% |  0.12% |
| instsimplify.NumExpand                           |      1716 |      1718 |     2 |  0.12% |  0.12% |
| loop-unroll.NumCompletelyUnrolled                |      9236 |      9225 |   -11 | -0.12% |  0.12% |
| branch-folder.NumHoist                           |      2773 |      2770 |    -3 | -0.11% |  0.11% |
| regalloc.NumReloadsRemoved                       |     10822 |     10834 |    12 |  0.11% |  0.11% |
| regalloc.NumSnippets                             |     11394 |     11406 |    12 |  0.11% |  0.11% |
| machine-cse.NumCrossBBCSEs                       |      1052 |      1053 |     1 |  0.10% |  0.10% |
| machinelicm.NumCSEed                             |     99887 |     99784 |  -103 | -0.10% |  0.10% |
| branch-folder.NumTailMerge                       |     72501 |     72435 |   -66 | -0.09% |  0.09% |
| codegenprepare.NumExtUses                        |     22007 |     21987 |   -20 | -0.09% |  0.09% |
| local.NumRemoved                                 |     68232 |     68294 |    62 |  0.09% |  0.09% |
| loop-vectorize.LoopsAnalyzed                     |     75483 |     75413 |   -70 | -0.09% |  0.09% |
```
Note that i'm only changing current PM, and not touching obsolete PM.
This is an alternative to the function simplification pipeline variant
    of the same change, D112840. It has both less compile time impact
    (since the additional number of SCEV trip count calculations
    is way less than with the D112840), and it is
    much more powerful/impactful (almost 2x more loops deleted).
I have checked, and doing this after loop rotation
    is favorable (more loops deleted).
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D112851
---
 llvm/lib/Passes/PassBuilderPipelines.cpp           |  9 +++-
 llvm/test/Other/new-pm-defaults.ll                 |  1 +
 llvm/test/Other/new-pm-thinlto-defaults.ll         |  1 +
 .../Other/new-pm-thinlto-postlink-pgo-defaults.ll  |  1 +
 .../new-pm-thinlto-postlink-samplepgo-defaults.ll  |  1 +
 ...letion-of-loops-that-became-side-effect-free.ll | 49 ++++------------------
 6 files changed, 18 insertions(+), 44 deletions(-)

diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index 2009a687ae7d..f0f7803ed3ae 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -1093,11 +1093,16 @@ PassBuilder::buildModuleOptimizationPipeline(OptimizationLevel Level,
   for (auto &C : VectorizerStartEPCallbacks)
     C(OptimizePM, Level);
+  LoopPassManager LPM;
   // First rotate loops that may have been un-rotated by prior passes.
   // Disable header duplication at -Oz.
+  LPM.addPass(LoopRotatePass(Level != OptimizationLevel::Oz, LTOPreLink));
+  // Some loops may have become dead by now. Try to delete them.
+  // FIXME: see disscussion in https://reviews.llvm.org/D112851
+  //        this may need to be revisited once GVN is more powerful.
+  LPM.addPass(LoopDeletionPass());
   OptimizePM.addPass(createFunctionToLoopPassAdaptor(
-      LoopRotatePass(Level != OptimizationLevel::Oz, LTOPreLink),
-      /*UseMemorySSA=*/false, /*UseBlockFrequencyInfo=*/false));
+      std::move(LPM), /*UseMemorySSA=*/false, /*UseBlockFrequencyInfo=*/false));
   // Distribute loops to allow partial vectorization.  I.e. isolate dependences
   // into separate loop that would otherwise inhibit vectorization.  This is
diff --git a/llvm/test/Other/new-pm-defaults.ll b/llvm/test/Other/new-pm-defaults.ll
index 5067b6fbdd18..b9f90dad8224 100644
--- a/llvm/test/Other/new-pm-defaults.ll
+++ b/llvm/test/Other/new-pm-defaults.ll
@@ -216,6 +216,7 @@
 ; CHECK-O-NEXT: Running pass: LoopSimplifyPass
 ; CHECK-O-NEXT: Running pass: LCSSAPass
 ; CHECK-O-NEXT: Running pass: LoopRotatePass
+; CHECK-O-NEXT: Running pass: LoopDeletionPass
 ; CHECK-O-NEXT: Running pass: LoopDistributePass
 ; CHECK-O-NEXT: Running pass: InjectTLIMappings
 ; CHECK-O-NEXT: Running pass: LoopVectorizePass
diff --git a/llvm/test/Other/new-pm-thinlto-defaults.ll b/llvm/test/Other/new-pm-thinlto-defaults.ll
index 1f52fe47ae73..7836de5c6cce 100644
--- a/llvm/test/Other/new-pm-thinlto-defaults.ll
+++ b/llvm/test/Other/new-pm-thinlto-defaults.ll
@@ -196,6 +196,7 @@
 ; CHECK-POSTLINK-O-NEXT: Running pass: LoopSimplifyPass
 ; CHECK-POSTLINK-O-NEXT: Running pass: LCSSAPass
 ; CHECK-POSTLINK-O-NEXT: Running pass: LoopRotatePass
+; CHECK-POSTLINK-O-NEXT: Running pass: LoopDeletionPass
 ; CHECK-POSTLINK-O-NEXT: Running pass: LoopDistributePass
 ; CHECK-POSTLINK-O-NEXT: Running pass: InjectTLIMappings
 ; CHECK-POSTLINK-O-NEXT: Running pass: LoopVectorizePass
diff --git a/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll b/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
index 3a80efba3c56..e66e8672358c 100644
--- a/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
+++ b/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
@@ -167,6 +167,7 @@
 ; CHECK-O-NEXT: Running pass: LoopSimplifyPass on foo
 ; CHECK-O-NEXT: Running pass: LCSSAPass on foo
 ; CHECK-O-NEXT: Running pass: LoopRotatePass
+; CHECK-O-NEXT: Running pass: LoopDeletionPass
 ; CHECK-O-NEXT: Running pass: LoopDistributePass
 ; CHECK-O-NEXT: Running pass: InjectTLIMappings
 ; CHECK-O-NEXT: Running pass: LoopVectorizePass
diff --git a/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll b/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
index 2e822b21f8a1..410841124c8e 100644
--- a/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
+++ b/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
@@ -179,6 +179,7 @@
 ; CHECK-O-NEXT: Running pass: LoopSimplifyPass
 ; CHECK-O-NEXT: Running pass: LCSSAPass
 ; CHECK-O-NEXT: Running pass: LoopRotatePass
+; CHECK-O-NEXT: Running pass: LoopDeletionPass
 ; CHECK-O-NEXT: Running pass: LoopDistributePass
 ; CHECK-O-NEXT: Running pass: InjectTLIMappings
 ; CHECK-O-NEXT: Running pass: LoopVectorizePass
diff --git a/llvm/test/Transforms/PhaseOrdering/deletion-of-loops-that-became-side-effect-free.ll b/llvm/test/Transforms/PhaseOrdering/deletion-of-loops-that-became-side-effect-free.ll
index ec8db3cceeb1..99a52acd3b2b 100644
--- a/llvm/test/Transforms/PhaseOrdering/deletion-of-loops-that-became-side-effect-free.ll
+++ b/llvm/test/Transforms/PhaseOrdering/deletion-of-loops-that-became-side-effect-free.ll
@@ -11,17 +11,8 @@
 define dso_local zeroext i1 @is_not_empty_variant1(%struct.node* %p) {
 ; ALL-LABEL: @is_not_empty_variant1(
 ; ALL-NEXT:  entry:
-; ALL-NEXT:    [[TOBOOL_NOT3_I:%.*]] = icmp eq %struct.node* [[P:%.*]], null
-; ALL-NEXT:    br i1 [[TOBOOL_NOT3_I]], label [[COUNT_NODES_VARIANT1_EXIT:%.*]], label [[WHILE_BODY_I:%.*]]
-; ALL:       while.body.i:
-; ALL-NEXT:    [[P_ADDR_04_I:%.*]] = phi %struct.node* [ [[TMP0:%.*]], [[WHILE_BODY_I]] ], [ [[P]], [[ENTRY:%.*]] ]
-; ALL-NEXT:    [[NEXT_I:%.*]] = getelementptr inbounds [[STRUCT_NODE:%.*]], %struct.node* [[P_ADDR_04_I]], i64 0, i32 0
-; ALL-NEXT:    [[TMP0]] = load %struct.node*, %struct.node** [[NEXT_I]], align 8
-; ALL-NEXT:    [[TOBOOL_NOT_I:%.*]] = icmp eq %struct.node* [[TMP0]], null
-; ALL-NEXT:    br i1 [[TOBOOL_NOT_I]], label [[COUNT_NODES_VARIANT1_EXIT]], label [[WHILE_BODY_I]], !llvm.loop [[LOOP0:![0-9]+]]
-; ALL:       count_nodes_variant1.exit:
-; ALL-NEXT:    [[TMP1:%.*]] = xor i1 [[TOBOOL_NOT3_I]], true
-; ALL-NEXT:    ret i1 [[TMP1]]
+; ALL-NEXT:    [[TOBOOL_NOT3_I:%.*]] = icmp ne %struct.node* [[P:%.*]], null
+; ALL-NEXT:    ret i1 [[TOBOOL_NOT3_I]]
 ;
 entry:
   %p.addr = alloca %struct.node*, align 8
@@ -113,39 +104,13 @@ while.end:
 define dso_local zeroext i1 @is_not_empty_variant3(%struct.node* %p) {
 ; O3-LABEL: @is_not_empty_variant3(
 ; O3-NEXT:  entry:
-; O3-NEXT:    [[TOBOOL_NOT4_I:%.*]] = icmp eq %struct.node* [[P:%.*]], null
-; O3-NEXT:    br i1 [[TOBOOL_NOT4_I]], label [[COUNT_NODES_VARIANT3_EXIT:%.*]], label [[WHILE_BODY_I:%.*]]
-; O3:       while.body.i:
-; O3-NEXT:    [[SIZE_06_I:%.*]] = phi i64 [ [[INC_I:%.*]], [[WHILE_BODY_I]] ], [ 0, [[ENTRY:%.*]] ]
-; O3-NEXT:    [[P_ADDR_05_I:%.*]] = phi %struct.node* [ [[TMP0:%.*]], [[WHILE_BODY_I]] ], [ [[P]], [[ENTRY]] ]
-; O3-NEXT:    [[CMP_I:%.*]] = icmp ne i64 [[SIZE_06_I]], -1
-; O3-NEXT:    tail call void @llvm.assume(i1 [[CMP_I]]) #[[ATTR3:[0-9]+]]
-; O3-NEXT:    [[NEXT_I:%.*]] = getelementptr inbounds [[STRUCT_NODE:%.*]], %struct.node* [[P_ADDR_05_I]], i64 0, i32 0
-; O3-NEXT:    [[TMP0]] = load %struct.node*, %struct.node** [[NEXT_I]], align 8
-; O3-NEXT:    [[INC_I]] = add nuw i64 [[SIZE_06_I]], 1
-; O3-NEXT:    [[TOBOOL_NOT_I:%.*]] = icmp eq %struct.node* [[TMP0]], null
-; O3-NEXT:    br i1 [[TOBOOL_NOT_I]], label [[COUNT_NODES_VARIANT3_EXIT]], label [[WHILE_BODY_I]], !llvm.loop [[LOOP2:![0-9]+]]
-; O3:       count_nodes_variant3.exit:
-; O3-NEXT:    [[TMP1:%.*]] = xor i1 [[TOBOOL_NOT4_I]], true
-; O3-NEXT:    ret i1 [[TMP1]]
+; O3-NEXT:    [[TOBOOL_NOT4_I:%.*]] = icmp ne %struct.node* [[P:%.*]], null
+; O3-NEXT:    ret i1 [[TOBOOL_NOT4_I]]
 ;
 ; O2-LABEL: @is_not_empty_variant3(
 ; O2-NEXT:  entry:
-; O2-NEXT:    [[TOBOOL_NOT4_I:%.*]] = icmp eq %struct.node* [[P:%.*]], null
-; O2-NEXT:    br i1 [[TOBOOL_NOT4_I]], label [[COUNT_NODES_VARIANT3_EXIT:%.*]], label [[WHILE_BODY_I:%.*]]
-; O2:       while.body.i:
-; O2-NEXT:    [[SIZE_06_I:%.*]] = phi i64 [ [[INC_I:%.*]], [[WHILE_BODY_I]] ], [ 0, [[ENTRY:%.*]] ]
-; O2-NEXT:    [[P_ADDR_05_I:%.*]] = phi %struct.node* [ [[TMP0:%.*]], [[WHILE_BODY_I]] ], [ [[P]], [[ENTRY]] ]
-; O2-NEXT:    [[CMP_I:%.*]] = icmp ne i64 [[SIZE_06_I]], -1
-; O2-NEXT:    tail call void @llvm.assume(i1 [[CMP_I]]) #[[ATTR3:[0-9]+]]
-; O2-NEXT:    [[NEXT_I:%.*]] = getelementptr inbounds [[STRUCT_NODE:%.*]], %struct.node* [[P_ADDR_05_I]], i64 0, i32 0
-; O2-NEXT:    [[TMP0]] = load %struct.node*, %struct.node** [[NEXT_I]], align 8
-; O2-NEXT:    [[INC_I]] = add nuw i64 [[SIZE_06_I]], 1
-; O2-NEXT:    [[TOBOOL_NOT_I:%.*]] = icmp eq %struct.node* [[TMP0]], null
-; O2-NEXT:    br i1 [[TOBOOL_NOT_I]], label [[COUNT_NODES_VARIANT3_EXIT]], label [[WHILE_BODY_I]], !llvm.loop [[LOOP2:![0-9]+]]
-; O2:       count_nodes_variant3.exit:
-; O2-NEXT:    [[TMP1:%.*]] = xor i1 [[TOBOOL_NOT4_I]], true
-; O2-NEXT:    ret i1 [[TMP1]]
+; O2-NEXT:    [[TOBOOL_NOT4_I:%.*]] = icmp ne %struct.node* [[P:%.*]], null
+; O2-NEXT:    ret i1 [[TOBOOL_NOT4_I]]
 ;
 ; O1-LABEL: @is_not_empty_variant3(
 ; O1-NEXT:  entry:
@@ -160,7 +125,7 @@ define dso_local zeroext i1 @is_not_empty_variant3(%struct.node* %p) {
 ; O1-NEXT:    [[TMP0]] = load %struct.node*, %struct.node** [[NEXT_I]], align 8
 ; O1-NEXT:    [[INC_I]] = add i64 [[SIZE_06_I]], 1
 ; O1-NEXT:    [[TOBOOL_NOT_I:%.*]] = icmp eq %struct.node* [[TMP0]], null
-; O1-NEXT:    br i1 [[TOBOOL_NOT_I]], label [[COUNT_NODES_VARIANT3_EXIT_LOOPEXIT:%.*]], label [[WHILE_BODY_I]], !llvm.loop [[LOOP2:![0-9]+]]
+; O1-NEXT:    br i1 [[TOBOOL_NOT_I]], label [[COUNT_NODES_VARIANT3_EXIT_LOOPEXIT:%.*]], label [[WHILE_BODY_I]], !llvm.loop [[LOOP0:![0-9]+]]
 ; O1:       count_nodes_variant3.exit.loopexit:
 ; O1-NEXT:    [[PHI_CMP:%.*]] = icmp ne i64 [[INC_I]], 0
 ; O1-NEXT:    br label [[COUNT_NODES_VARIANT3_EXIT]]
</cut>
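The pipeline-order expectation that the commit adds to the new-pm-defaults tests (LoopDeletionPass running right after LoopRotatePass and before LoopDistributePass) can also be checked informally against a pass-manager log, for example one captured with `opt -debug-pass-manager`. A minimal sketch, assuming log lines of the form `Running pass: <Name>` as in the CHECK lines above; the helper names and the sample log are hypothetical and are not part of the CI scripts:

```python
import re


def running_passes(log):
    """Return pass names from 'Running pass: <Name>' lines, in log order."""
    return re.findall(r"^Running pass: (\w+)", log, flags=re.MULTILINE)


def runs_right_before(log, first, second):
    """True if some run of `first` is immediately followed by `second`."""
    names = running_passes(log)
    return any(a == first and b == second for a, b in zip(names, names[1:]))


# Sample log modeled on the CHECK lines in new-pm-defaults.ll (assumed shape).
SAMPLE_LOG = """\
Running pass: LoopSimplifyPass
Running pass: LCSSAPass
Running pass: LoopRotatePass
Running pass: LoopDeletionPass
Running pass: LoopDistributePass
Running pass: InjectTLIMappings
Running pass: LoopVectorizePass
"""

print(runs_right_before(SAMPLE_LOG, "LoopRotatePass", "LoopDeletionPass"))
```

In a real log, loop passes are reported once per loop and lines may carry a trailing "on <name>" suffix, so this only checks adjacency of pass names and is not a substitute for the FileCheck tests.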

    
