After llvm commit fbc0c308d599fe3300ab6516650b65b41979446d
Author: Nikita Popov <nikita.ppv@gmail.com>
[BasicAA] Handle known bits as ranges
the following benchmarks slowed down by more than 2%:
- 464.h264ref slowed down by 7% from 10899 to 11610 perf samples
- 464.h264ref:libc.so.6 slowed down by 11% from 3538 to 3922 perf samples
The reproducer instructions below can be used to re-build both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2 -flto
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org . We plan to add support for SPEC CPU2017 benchmarks and to provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Reproduce builds:
<cut>
mkdir investigate-llvm-fbc0c308d599fe3300ab6516650b65b41979446d
cd investigate-llvm-fbc0c308d599fe3300ab6516650b65b41979446d
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach fbc0c308d599fe3300ab6516650b65b41979446d
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 30a3652b6ade43504087f6e3acd8dc879055f501
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit fbc0c308d599fe3300ab6516650b65b41979446d
Author: Nikita Popov <nikita.ppv@gmail.com>
Date:   Mon Oct 25 15:47:21 2021 +0200
[BasicAA] Handle known bits as ranges
BasicAA currently tries to determine that the offset is positive by checking whether all variable indices are positive based on known bits, multiplied by a positive scale. However, this is incorrect if the scale multiplication might overflow. In the modified test case the original value is positive, but may be negative after a left shift.
Fix this by converting known bits into a constant range and reusing the range-based logic, which handles overflow correctly.
Differential Revision: https://reviews.llvm.org/D112611
---
 llvm/lib/Analysis/BasicAliasAnalysis.cpp           | 51 +++++-----------------
 .../test/Analysis/BasicAA/assume-index-positive.ll |  4 +-
 2 files changed, 12 insertions(+), 43 deletions(-)
diff --git a/llvm/lib/Analysis/BasicAliasAnalysis.cpp b/llvm/lib/Analysis/BasicAliasAnalysis.cpp
index 0305732ca5d5..8cf947c43bf4 100644
--- a/llvm/lib/Analysis/BasicAliasAnalysis.cpp
+++ b/llvm/lib/Analysis/BasicAliasAnalysis.cpp
@@ -318,15 +318,6 @@ struct CastedValue {
     return N;
   }
-  KnownBits evaluateWith(KnownBits N) const {
-    assert(N.getBitWidth() == V->getType()->getPrimitiveSizeInBits() &&
-           "Incompatible bit width");
-    if (TruncBits) N = N.trunc(N.getBitWidth() - TruncBits);
-    if (SExtBits) N = N.sext(N.getBitWidth() + SExtBits);
-    if (ZExtBits) N = N.zext(N.getBitWidth() + ZExtBits);
-    return N;
-  }
-
   ConstantRange evaluateWith(ConstantRange N) const {
     assert(N.getBitWidth() == V->getType()->getPrimitiveSizeInBits() &&
            "Incompatible bit width");
@@ -1250,8 +1241,6 @@ AliasResult BasicAAResult::aliasGEP(
   if (!DecompGEP1.VarIndices.empty()) {
     APInt GCD;
-    bool AllNonNegative = DecompGEP1.Offset.isNonNegative();
-    bool AllNonPositive = DecompGEP1.Offset.isNonPositive();
     ConstantRange OffsetRange = ConstantRange(DecompGEP1.Offset);
     for (unsigned i = 0, e = DecompGEP1.VarIndices.size(); i != e; ++i) {
       const VariableGEPIndex &Index = DecompGEP1.VarIndices[i];
@@ -1266,24 +1255,19 @@ AliasResult BasicAAResult::aliasGEP(
       else
         GCD = APIntOps::GreatestCommonDivisor(GCD, ScaleForGCD.abs());
-      if (AllNonNegative || AllNonPositive) {
-        KnownBits Known = Index.Val.evaluateWith(
-            computeKnownBits(Index.Val.V, DL, 0, &AC, Index.CxtI, DT));
-        bool SignKnownZero = Known.isNonNegative();
-        bool SignKnownOne = Known.isNegative();
-        AllNonNegative &= (SignKnownZero && Scale.isNonNegative()) ||
-                          (SignKnownOne && Scale.isNonPositive());
-        AllNonPositive &= (SignKnownZero && Scale.isNonPositive()) ||
-                          (SignKnownOne && Scale.isNonNegative());
-      }
+      ConstantRange CR =
+          computeConstantRange(Index.Val.V, true, &AC, Index.CxtI);
+      KnownBits Known =
+          computeKnownBits(Index.Val.V, DL, 0, &AC, Index.CxtI, DT);
+      CR = CR.intersectWith(
+          ConstantRange::fromKnownBits(Known, /* Signed */ true),
+          ConstantRange::Signed);
       assert(OffsetRange.getBitWidth() == Scale.getBitWidth() &&
              "Bit widths are normalized to MaxPointerSize");
-      OffsetRange = OffsetRange.add(Index.Val
-                                        .evaluateWith(computeConstantRange(
-                                            Index.Val.V, true, &AC, Index.CxtI))
-                                        .sextOrTrunc(OffsetRange.getBitWidth())
-                                        .smul_fast(ConstantRange(Scale)));
+      OffsetRange = OffsetRange.add(
+          Index.Val.evaluateWith(CR).sextOrTrunc(OffsetRange.getBitWidth())
+              .smul_fast(ConstantRange(Scale)));
     }
     // We now have accesses at two offsets from the same base:
@@ -1300,21 +1284,6 @@ AliasResult BasicAAResult::aliasGEP(
         (GCD - ModOffset).uge(V1Size.getValue()))
       return AliasResult::NoAlias;
-    // If we know all the variables are non-negative, then the total offset is
-    // also non-negative and >= DecompGEP1.Offset. We have the following layout:
-    // [0, V2Size) ... [TotalOffset, TotalOffer+V1Size]
-    // If DecompGEP1.Offset >= V2Size, the accesses don't alias.
-    if (AllNonNegative && V2Size.hasValue() &&
-        DecompGEP1.Offset.uge(V2Size.getValue()))
-      return AliasResult::NoAlias;
-    // Similarly, if the variables are non-positive, then the total offset is
-    // also non-positive and <= DecompGEP1.Offset. We have the following layout:
-    // [TotalOffset, TotalOffset+V1Size) ... [0, V2Size)
-    // If -DecompGEP1.Offset >= V1Size, the accesses don't alias.
-    if (AllNonPositive && V1Size.hasValue() &&
-        (-DecompGEP1.Offset).uge(V1Size.getValue()))
-      return AliasResult::NoAlias;
-
     if (V1Size.hasValue() && V2Size.hasValue()) {
       // Compute ranges of potentially accessed bytes for both accesses. If the
       // interseciton is empty, there can be no overlap.
diff --git a/llvm/test/Analysis/BasicAA/assume-index-positive.ll b/llvm/test/Analysis/BasicAA/assume-index-positive.ll
index 451592067f4b..a53fff2c6009 100644
--- a/llvm/test/Analysis/BasicAA/assume-index-positive.ll
+++ b/llvm/test/Analysis/BasicAA/assume-index-positive.ll
@@ -130,12 +130,12 @@ define void @symmetry([0 x i8]* %ptr, i32 %a, i32 %b, i32 %c) {
   ret void
 }
-; TODO: %ptr.neg and %ptr.shl may alias, as the shl renders the previously
+; %ptr.neg and %ptr.shl may alias, as the shl renders the previously
 ; non-negative value potentially negative.
 define void @shl_of_non_negative(i8* %ptr, i64 %a) {
 ; CHECK-LABEL: Function: shl_of_non_negative
 ; CHECK: NoAlias: i8* %ptr.a, i8* %ptr.neg
-; CHECK: NoAlias: i8* %ptr.neg, i8* %ptr.shl
+; CHECK: MayAlias: i8* %ptr.neg, i8* %ptr.shl
   %a.cmp = icmp sge i64 %a, 0
   call void @llvm.assume(i1 %a.cmp)
   %ptr.neg = getelementptr i8, i8* %ptr, i64 -2
</cut>