[TCWG CI] 400.perlbench slowed down by 6% after llvm: [SimplifyCFG] Ignore free instructions when computing cost for folding branch to common dest - linaro-toolchain

24 Sep 2021

After llvm commit e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
Author: Arthur Eubanks aeubanks@google.com
[SimplifyCFG] Ignore free instructions when computing cost for folding branch to common dest
the following benchmarks slowed down by more than 2%:
- 400.perlbench slowed down by 6% from 9730 to 10312 perf samples
  - 400.perlbench:[.] S_regmatch slowed down by 14% from 3660 to 4188 perf samples
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection.  Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org .  In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF.  BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
 - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Reproduce builds:
<cut>
mkdir investigate-llvm-e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
cd investigate-llvm-e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 32a50078657dd8beead327a3478ede4e9d730432
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
Author: Arthur Eubanks aeubanks@google.com
Date:   Fri Aug 27 12:32:59 2021 -0700
[SimplifyCFG] Ignore free instructions when computing cost for folding branch to common dest
When determining whether to fold branches to a common destination by
    merging two blocks, SimplifyCFG will count the number of instructions to
    be moved into the first basic block. However, there's no reason to count
    free instructions like bitcasts and other similar instructions.
This resolves missed branch foldings with -fstrict-vtable-pointers in
    llvm-test-suite's lambda benchmark.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D108837
---
 llvm/lib/Transforms/Utils/SimplifyCFG.cpp          | 17 ++++++-----
 llvm/test/CodeGen/AArch64/csr-split.ll             | 34 +++++++++++-----------
 .../fold-branch-to-common-dest-free-cost.ll        |  5 ++--
 3 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
index 2ff98b238de0..a3bd89e72af9 100644
--- a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
+++ b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
@@ -3258,13 +3258,16 @@ bool llvm::FoldBranchToCommonDest(BranchInst *BI, DomTreeUpdater *DTU,
     SawVectorOp |= isVectorOp(I);
// Account for the cost of duplicating this instruction into each
-    // predecessor.
-    NumBonusInsts += PredCount;
-
-    // Early exits once we reach the limit.
-    if (NumBonusInsts >
-        BonusInstThreshold * BranchFoldToCommonDestVectorMultiplier)
-      return false;
+    // predecessor. Ignore free instructions.
+    if (!TTI ||
+        TTI->getUserCost(&I, CostKind) != TargetTransformInfo::TCC_Free) {
+      NumBonusInsts += PredCount;
+
+      // Early exits once we reach the limit.
+      if (NumBonusInsts >
+          BonusInstThreshold * BranchFoldToCommonDestVectorMultiplier)
+        return false;
+    }
auto IsBCSSAUse = [BB, &I](Use &U) {
       auto *UI = cast<Instruction>(U.getUser());
diff --git a/llvm/test/CodeGen/AArch64/csr-split.ll b/llvm/test/CodeGen/AArch64/csr-split.ll
index 1bee7f05acec..de85b4313433 100644
--- a/llvm/test/CodeGen/AArch64/csr-split.ll
+++ b/llvm/test/CodeGen/AArch64/csr-split.ll
@@ -82,22 +82,22 @@ define dso_local signext i32 @test2(i32* %p1) local_unnamed_addr  {
 ; CHECK-NEXT:    .cfi_def_cfa_offset 16
 ; CHECK-NEXT:    .cfi_offset w19, -8
 ; CHECK-NEXT:    .cfi_offset w30, -16
-; CHECK-NEXT:    cbz x0, .LBB1_2
-; CHECK-NEXT:  // %bb.1: // %if.end
+; CHECK-NEXT:    cbz x0, .LBB1_3
+; CHECK-NEXT:  // %bb.1: // %entry
 ; CHECK-NEXT:    adrp x8, a
 ; CHECK-NEXT:    ldrsw x8, [x8, :lo12:a]
 ; CHECK-NEXT:    mov x19, x0
 ; CHECK-NEXT:    cmp x8, x0
-; CHECK-NEXT:    b.eq .LBB1_3
-; CHECK-NEXT:  .LBB1_2: // %return
-; CHECK-NEXT:    mov w0, wzr
-; CHECK-NEXT:    ldp x30, x19, [sp], #16 // 16-byte Folded Reload
-; CHECK-NEXT:    ret
-; CHECK-NEXT:  .LBB1_3: // %if.then2
+; CHECK-NEXT:    b.ne .LBB1_3
+; CHECK-NEXT:  // %bb.2: // %if.then2
 ; CHECK-NEXT:    bl callVoid
 ; CHECK-NEXT:    mov x0, x19
 ; CHECK-NEXT:    ldp x30, x19, [sp], #16 // 16-byte Folded Reload
 ; CHECK-NEXT:    b callNonVoid
+; CHECK-NEXT:  .LBB1_3: // %return
+; CHECK-NEXT:    mov w0, wzr
+; CHECK-NEXT:    ldp x30, x19, [sp], #16 // 16-byte Folded Reload
+; CHECK-NEXT:    ret
 ;
 ; CHECK-APPLE-LABEL: test2:
 ; CHECK-APPLE:       ; %bb.0: ; %entry
@@ -108,26 +108,26 @@ define dso_local signext i32 @test2(i32* %p1) local_unnamed_addr  {
 ; CHECK-APPLE-NEXT:    .cfi_offset w29, -16
 ; CHECK-APPLE-NEXT:    .cfi_offset w19, -24
 ; CHECK-APPLE-NEXT:    .cfi_offset w20, -32
-; CHECK-APPLE-NEXT:    cbz x0, LBB1_2
-; CHECK-APPLE-NEXT:  ; %bb.1: ; %if.end
+; CHECK-APPLE-NEXT:    cbz x0, LBB1_3
+; CHECK-APPLE-NEXT:  ; %bb.1: ; %entry
 ; CHECK-APPLE-NEXT:  Lloh2:
 ; CHECK-APPLE-NEXT:    adrp x8, _a@PAGE
 ; CHECK-APPLE-NEXT:  Lloh3:
 ; CHECK-APPLE-NEXT:    ldrsw x8, [x8, _a@PAGEOFF]
 ; CHECK-APPLE-NEXT:    mov x19, x0
 ; CHECK-APPLE-NEXT:    cmp x8, x0
-; CHECK-APPLE-NEXT:    b.eq LBB1_3
-; CHECK-APPLE-NEXT:  LBB1_2: ; %return
-; CHECK-APPLE-NEXT:    ldp x29, x30, [sp, #16] ; 16-byte Folded Reload
-; CHECK-APPLE-NEXT:    mov w0, wzr
-; CHECK-APPLE-NEXT:    ldp x20, x19, [sp], #32 ; 16-byte Folded Reload
-; CHECK-APPLE-NEXT:    ret
-; CHECK-APPLE-NEXT:  LBB1_3: ; %if.then2
+; CHECK-APPLE-NEXT:    b.ne LBB1_3
+; CHECK-APPLE-NEXT:  ; %bb.2: ; %if.then2
 ; CHECK-APPLE-NEXT:    bl _callVoid
 ; CHECK-APPLE-NEXT:    ldp x29, x30, [sp, #16] ; 16-byte Folded Reload
 ; CHECK-APPLE-NEXT:    mov x0, x19
 ; CHECK-APPLE-NEXT:    ldp x20, x19, [sp], #32 ; 16-byte Folded Reload
 ; CHECK-APPLE-NEXT:    b _callNonVoid
+; CHECK-APPLE-NEXT:  LBB1_3: ; %return
+; CHECK-APPLE-NEXT:    ldp x29, x30, [sp, #16] ; 16-byte Folded Reload
+; CHECK-APPLE-NEXT:    mov w0, wzr
+; CHECK-APPLE-NEXT:    ldp x20, x19, [sp], #32 ; 16-byte Folded Reload
+; CHECK-APPLE-NEXT:    ret
 ; CHECK-APPLE-NEXT:    .loh AdrpLdr Lloh2, Lloh3
 entry:
   %tobool = icmp eq i32* %p1, null
diff --git a/llvm/test/Transforms/SimplifyCFG/fold-branch-to-common-dest-free-cost.ll b/llvm/test/Transforms/SimplifyCFG/fold-branch-to-common-dest-free-cost.ll
index ace2a5ed35ca..27df5ec44582 100644
--- a/llvm/test/Transforms/SimplifyCFG/fold-branch-to-common-dest-free-cost.ll
+++ b/llvm/test/Transforms/SimplifyCFG/fold-branch-to-common-dest-free-cost.ll
@@ -8,12 +8,11 @@ declare void @g2()
define void @f(i8* %a, i8* %b, i1 %c, i1 %d, i1 %e) {
 ; CHECK-LABEL: @f(
-; CHECK-NEXT:    br i1 [[C:%.*]], label [[L1:%.*]], label [[L3:%.*]]
-; CHECK:       l1:
 ; CHECK-NEXT:    [[A1:%.*]] = call i8* @llvm.strip.invariant.group.p0i8(i8* [[A:%.*]])
 ; CHECK-NEXT:    [[B1:%.*]] = call i8* @llvm.strip.invariant.group.p0i8(i8* [[B:%.*]])
 ; CHECK-NEXT:    [[I:%.*]] = icmp eq i8* [[A1]], [[B1]]
-; CHECK-NEXT:    br i1 [[I]], label [[L2:%.*]], label [[L3]]
+; CHECK-NEXT:    [[OR_COND:%.*]] = select i1 [[C:%.*]], i1 [[I]], i1 false
+; CHECK-NEXT:    br i1 [[OR_COND]], label [[L2:%.*]], label [[L3:%.*]]
 ; CHECK:       l2:
 ; CHECK-NEXT:    call void @g1()
 ; CHECK-NEXT:    br label [[RET:%.*]]
</cut>