Successfully identified regression in *llvm* in CI configuration tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO.

So far, this commit has regressed CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO
Culprit:
<cut>
commit 6998f8ae2d14e096aff33968f226587b5c1a193a
Author: David Sherwood <david.sherwood@arm.com>
Date:   Wed Mar 10 08:34:19 2021 +0000
[LoopVectorize] Simplify scalar cost calculation in getInstructionCost
This patch simplifies the calculation of certain costs in getInstructionCost when isScalarAfterVectorization() returns a true value. There are a few places where we multiply a cost by a number N, i.e.
  unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
  return N * TTI.getArithmeticInstrCost(...
After some investigation it seems that there are only these cases that occur in practice:
1. VF is a scalar, in which case N = 1.
2. VF is a vector. We can only get here if:
   a) the instruction is a GEP/bitcast/PHI with scalar uses, or
   b) this is an update to an induction variable that remains scalar.
I have changed the code so that N is assumed to always be 1. For GEPs the cost is always 0, since this is calculated later on as part of the load/store cost. PHI nodes are costed separately and were never previously multiplied by VF. For all other cases I have added an assert that none of the users needs scalarising, which didn't fire in any unit tests.
Only one test required fixing and I believe the original cost for the scalar add instruction to have been wrong, since only one copy remains after vectorisation.
I have also added a new test for the case when a pointer PHI feeds directly into a store that will be scalarised as we were previously never testing it.
Differential Revision: https://reviews.llvm.org/D99718
</cut>
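For context, below is a minimal standalone sketch (plain C++, not the actual
LLVM sources) of the costing change described in the commit message above.
The names costBefore/costAfter and the toy ElementCount are invented for
illustration; the real logic lives in
LoopVectorizationCostModel::getInstructionCost, quoted in full further down.

  #include <cassert>

  // Toy stand-in for llvm::ElementCount (illustration only).
  struct ElementCount {
    unsigned MinLanes;
    bool isScalar() const { return MinLanes == 1; }
    unsigned getKnownMinValue() const { return MinLanes; }
  };

  // Before the patch: an instruction that stays scalar after vectorization
  // was costed as VF copies of its scalar cost.
  unsigned costBefore(bool ScalarAfterVectorization, ElementCount VF,
                      unsigned ScalarCost) {
    unsigned N = ScalarAfterVectorization ? VF.getKnownMinValue() : 1;
    return N * ScalarCost;
  }

  // After the patch: exactly one scalar copy is assumed to remain, so the
  // VF multiplier is dropped (GEPs, PHIs and pointer bitcasts are handled
  // by separate rules in the real code).
  unsigned costAfter(bool ScalarAfterVectorization, ElementCount VF,
                     unsigned ScalarCost) {
    (void)VF; // VF no longer scales the cost of a scalar-after-vec instr.
    return ScalarCost;
  }

  int main() {
    ElementCount VF{2};
    // Matches the updated no_vector_instructions.ll test below: the scalar
    // induction update drops from cost 2 (VF * 1) to cost 1 at VF 2.
    assert(costBefore(/*ScalarAfterVectorization=*/true, VF, 1) == 2);
    assert(costAfter(/*ScalarAfterVectorization=*/true, VF, 1) == 1);
    return 0;
  }

A lower scalar-side estimate like this can tip the vectorizer's
profitability comparison for a loop, which is one plausible mechanism for
the 462.libquantum score change reported below.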
Results regressed to (for first_bad == 6998f8ae2d14e096aff33968f226587b5c1a193a)
# reset_artifacts: -10
# build_abe binutils: -9
# build_abe stage1 -- --set gcc_override_configure=--disable-libsanitizer: -8
# build_abe linux: -7
# build_abe glibc: -6
# build_abe stage2 -- --set gcc_override_configure=--disable-libsanitizer: -5
# build_llvm true: -3
# true: 0
# benchmark -O2_LTO -- artifacts/build-6998f8ae2d14e096aff33968f226587b5c1a193a/results_id: 1
# 462.libquantum,libquantum_base.default regressed by 113
from (for last_good == c835630c25a4f9925517949579f66a43b113fbc9)
# reset_artifacts: -10
# build_abe binutils: -9
# build_abe stage1 -- --set gcc_override_configure=--disable-libsanitizer: -8
# build_abe linux: -7
# build_abe glibc: -6
# build_abe stage2 -- --set gcc_override_configure=--disable-libsanitizer: -5
# build_llvm true: -3
# true: 0
# benchmark -O2_LTO -- artifacts/build-c835630c25a4f9925517949579f66a43b113fbc9/results_id: 1
Artifacts of last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Results ID of last_good: tx1_64/tcwg_bmk_llvm_tx1/bisect-llvm-master-aarch64-spec2k6-O2_LTO/1050
Artifacts of first_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Results ID of first_bad: tx1_64/tcwg_bmk_llvm_tx1/bisect-llvm-master-aarch64-spec2k6-O2_LTO/1048
Build top page/logs: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Configuration details:
Reproduce builds:
<cut>
mkdir investigate-llvm-6998f8ae2d14e096aff33968f226587b5c1a193a
cd investigate-llvm-6998f8ae2d14e096aff33968f226587b5c1a193a
git clone https://git.linaro.org/toolchain/jenkins-scripts
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
cd llvm
# Reproduce first_bad build
git checkout --detach 6998f8ae2d14e096aff33968f226587b5c1a193a
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach c835630c25a4f9925517949579f66a43b113fbc9
../artifacts/test.sh
cd ..
</cut>
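To re-verify the culprit locally, the same harness can also drive an
automated bisect between the two commits above. A sketch, assuming (this is
an assumption, not something documented here) that artifacts/test.sh exits
non-zero for a regressing build:

# Hypothetical bisect driver; run from the llvm checkout created above.
git bisect start 6998f8ae2d14e096aff33968f226587b5c1a193a \
    c835630c25a4f9925517949579f66a43b113fbc9
git bisect run ../artifacts/test.sh
git bisect reset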
History of pending regressions and results: https://git.linaro.org/toolchain/ci/base-artifacts.git/log/?h=linaro-local/c...
Artifacts: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Build log: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Full commit (up to 1000 lines):
<cut>
commit 6998f8ae2d14e096aff33968f226587b5c1a193a
Author: David Sherwood <david.sherwood@arm.com>
Date:   Wed Mar 10 08:34:19 2021 +0000
[LoopVectorize] Simplify scalar cost calculation in getInstructionCost
This patch simplifies the calculation of certain costs in getInstructionCost when isScalarAfterVectorization() returns a true value. There are a few places where we multiply a cost by a number N, i.e.
  unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
  return N * TTI.getArithmeticInstrCost(...
After some investigation it seems that there are only these cases that occur in practice:
1. VF is a scalar, in which case N = 1.
2. VF is a vector. We can only get here if:
   a) the instruction is a GEP/bitcast/PHI with scalar uses, or
   b) this is an update to an induction variable that remains scalar.
I have changed the code so that N is assumed to always be 1. For GEPs the cost is always 0, since this is calculated later on as part of the load/store cost. PHI nodes are costed separately and were never previously multiplied by VF. For all other cases I have added an assert that none of the users needs scalarising, which didn't fire in any unit tests.
Only one test required fixing and I believe the original cost for the scalar add instruction to have been wrong, since only one copy remains after vectorisation.
I have also added a new test for the case when a pointer PHI feeds directly into a store that will be scalarised as we were previously never testing it.
Differential Revision: https://reviews.llvm.org/D99718
---
 llvm/lib/Transforms/Vectorize/LoopVectorize.cpp    | 73 +++++++++++++---------
 .../AArch64/no_vector_instructions.ll              |  2 +-
 .../LoopVectorize/AArch64/predication_costs.ll     | 35 +++++++++++
 .../Transforms/LoopVectorize/scalarized-bitcast.ll | 40 ++++++++++++
 4 files changed, 121 insertions(+), 29 deletions(-)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 2b413fc49505..f25af23c86c2 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7383,10 +7383,39 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
   Type *RetTy = I->getType();
   if (canTruncateToMinimalBitwidth(I, VF))
     RetTy = IntegerType::get(RetTy->getContext(), MinBWs[I]);
-  VectorTy = isScalarAfterVectorization(I, VF) ? RetTy : ToVectorTy(RetTy, VF);
   auto SE = PSE.getSE();
   TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
+  auto hasSingleCopyAfterVectorization = [this](Instruction *I,
+                                                ElementCount VF) -> bool {
+    if (VF.isScalar())
+      return true;
+
+    auto Scalarized = InstsToScalarize.find(VF);
+    assert(Scalarized != InstsToScalarize.end() &&
+           "VF not yet analyzed for scalarization profitability");
+    return !Scalarized->second.count(I) &&
+           llvm::all_of(I->users(), [&](User *U) {
+             auto *UI = cast<Instruction>(U);
+             return !Scalarized->second.count(UI);
+           });
+  };
+
+  if (isScalarAfterVectorization(I, VF)) {
+    // With the exception of GEPs and PHIs, after scalarization there should
+    // only be one copy of the instruction generated in the loop. This is
+    // because the VF is either 1, or any instructions that need scalarizing
+    // have already been dealt with by the the time we get here. As a result,
+    // it means we don't have to multiply the instruction cost by VF.
+    assert(I->getOpcode() == Instruction::GetElementPtr ||
+           I->getOpcode() == Instruction::PHI ||
+           (I->getOpcode() == Instruction::BitCast &&
+            I->getType()->isPointerTy()) ||
+           hasSingleCopyAfterVectorization(I, VF));
+    VectorTy = RetTy;
+  } else
+    VectorTy = ToVectorTy(RetTy, VF);
+
   // TODO: We need to estimate the cost of intrinsic calls.
   switch (I->getOpcode()) {
   case Instruction::GetElementPtr:
@@ -7514,21 +7543,16 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
       Op2VK = TargetTransformInfo::OK_UniformValue;
     SmallVector<const Value *, 4> Operands(I->operand_values());
-    unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
-    return N * TTI.getArithmeticInstrCost(
-                   I->getOpcode(), VectorTy, CostKind,
-                   TargetTransformInfo::OK_AnyValue,
-                   Op2VK, TargetTransformInfo::OP_None, Op2VP, Operands, I);
+    return TTI.getArithmeticInstrCost(
+        I->getOpcode(), VectorTy, CostKind, TargetTransformInfo::OK_AnyValue,
+        Op2VK, TargetTransformInfo::OP_None, Op2VP, Operands, I);
   }
   case Instruction::FNeg: {
     assert(!VF.isScalable() && "VF is assumed to be non scalable.");
-    unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
-    return N * TTI.getArithmeticInstrCost(
-                   I->getOpcode(), VectorTy, CostKind,
-                   TargetTransformInfo::OK_AnyValue,
-                   TargetTransformInfo::OK_AnyValue,
-                   TargetTransformInfo::OP_None, TargetTransformInfo::OP_None,
-                   I->getOperand(0), I);
+    return TTI.getArithmeticInstrCost(
+        I->getOpcode(), VectorTy, CostKind, TargetTransformInfo::OK_AnyValue,
+        TargetTransformInfo::OK_AnyValue, TargetTransformInfo::OP_None,
+        TargetTransformInfo::OP_None, I->getOperand(0), I);
   }
   case Instruction::Select: {
     SelectInst *SI = cast<SelectInst>(I);
@@ -7583,6 +7607,10 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
       VectorTy = ToVectorTy(getMemInstValueType(I), Width);
     return getMemoryInstructionCost(I, VF);
   }
+  case Instruction::BitCast:
+    if (I->getType()->isPointerTy())
+      return 0;
+    LLVM_FALLTHROUGH;
   case Instruction::ZExt:
   case Instruction::SExt:
   case Instruction::FPToUI:
@@ -7593,8 +7621,7 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
   case Instruction::SIToFP:
   case Instruction::UIToFP:
   case Instruction::Trunc:
-  case Instruction::FPTrunc:
-  case Instruction::BitCast: {
+  case Instruction::FPTrunc: {
     // Computes the CastContextHint from a Load/Store instruction.
     auto ComputeCCH = [&](Instruction *I) -> TTI::CastContextHint {
       assert((isa<LoadInst>(I) || isa<StoreInst>(I)) &&
@@ -7672,14 +7699,7 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
     }
   }
-    unsigned N;
-    if (isScalarAfterVectorization(I, VF)) {
-      assert(!VF.isScalable() && "VF is assumed to be non scalable");
-      N = VF.getKnownMinValue();
-    } else
-      N = 1;
-    return N *
-           TTI.getCastInstrCost(Opcode, VectorTy, SrcVecTy, CCH, CostKind, I);
+    return TTI.getCastInstrCost(Opcode, VectorTy, SrcVecTy, CCH, CostKind, I);
   }
   case Instruction::Call: {
     bool NeedToScalarize;
@@ -7694,11 +7714,8 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
   case Instruction::ExtractValue:
     return TTI.getInstructionCost(I, TTI::TCK_RecipThroughput);
   default:
-    // The cost of executing VF copies of the scalar instruction. This opcode
-    // is unknown. Assume that it is the same as 'mul'.
-    return VF.getKnownMinValue() * TTI.getArithmeticInstrCost(
-                                       Instruction::Mul, VectorTy, CostKind) +
-           getScalarizationOverhead(I, VF);
+    // This opcode is unknown. Assume that it is the same as 'mul'.
+    return TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy, CostKind);
   } // end of switch.
 }
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/no_vector_instructions.ll b/llvm/test/Transforms/LoopVectorize/AArch64/no_vector_instructions.ll
index 247ea35ff5d0..3061998518ad 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/no_vector_instructions.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/no_vector_instructions.ll
@@ -6,7 +6,7 @@ target triple = "aarch64--linux-gnu"
 ; CHECK-LABEL: all_scalar
 ; CHECK: LV: Found scalar instruction: %i.next = add nuw nsw i64 %i, 2
-; CHECK: LV: Found an estimated cost of 2 for VF 2 For instruction: %i.next = add nuw nsw i64 %i, 2
+; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %i.next = add nuw nsw i64 %i, 2
 ; CHECK: LV: Not considering vector loop of width 2 because it will not generate any vector instructions
 ;
 define void @all_scalar(i64* %a, i64 %n) {
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll
index b0ebb4edf2ad..858b28ddd321 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll
@@ -86,6 +86,41 @@ for.end:
   ret void
 }
+; CHECK-LABEL: predicated_store_phi
+;
+; Same as predicate_store except we use a pointer PHI to maintain the address
+;
+; CHECK: Found new scalar instruction: %addr = phi i32* [ %a, %entry ], [ %addr.next, %for.inc ]
+; CHECK: Found new scalar instruction: %addr.next = getelementptr inbounds i32, i32* %addr, i64 1
+; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %addr, align 4
+; CHECK: Found an estimated cost of 0 for VF 2 For instruction: %addr = phi i32* [ %a, %entry ], [ %addr.next, %for.inc ]
+; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %addr, align 4
+;
+define void @predicated_store_phi(i32* %a, i1 %c, i32 %x, i64 %n) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
+  %addr = phi i32 * [ %a, %entry ], [ %addr.next, %for.inc ]
+  %tmp1 = load i32, i32* %addr, align 4
+  %tmp2 = add nsw i32 %tmp1, %x
+  br i1 %c, label %if.then, label %for.inc
+
+if.then:
+  store i32 %tmp2, i32* %addr, align 4
+  br label %for.inc
+
+for.inc:
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp slt i64 %i.next, %n
+  %addr.next = getelementptr inbounds i32, i32* %addr, i64 1
+  br i1 %cond, label %for.body, label %for.end
+
+for.end:
+  ret void
+}
+
 ; CHECK-LABEL: predicated_udiv_scalarized_operand
 ;
 ; This test checks that we correctly compute the cost of the predicated udiv
diff --git a/llvm/test/Transforms/LoopVectorize/scalarized-bitcast.ll b/llvm/test/Transforms/LoopVectorize/scalarized-bitcast.ll
new file mode 100644
index 000000000000..0c97e6ac475e
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/scalarized-bitcast.ll
@@ -0,0 +1,40 @@
+; REQUIRES: asserts
+; RUN: opt -loop-vectorize -force-vector-width=2 -debug-only=loop-vectorize -S -o - < %s 2>&1 | FileCheck %s
+
+%struct.foo = type { i32, i64 }
+
+; CHECK: LV: Found an estimated cost of 0 for VF 2 For instruction: %0 = bitcast i64* %b to i32*
+
+; The bitcast below will be scalarized due to the predication in the loop. Bitcasts
+; between pointer types should be treated as free, despite the scalarization.
+define void @foo(%struct.foo* noalias nocapture %in, i32* noalias nocapture readnone %out, i64 %n) {
+entry:
+  br label %for.body
+
+for.body:                                         ; preds = %entry, %if.end
+  %i.012 = phi i64 [ %inc, %if.end ], [ 0, %entry ]
+  %b = getelementptr inbounds %struct.foo, %struct.foo* %in, i64 %i.012, i32 1
+  %0 = bitcast i64* %b to i32*
+  %a = getelementptr inbounds %struct.foo, %struct.foo* %in, i64 %i.012, i32 0
+  %1 = load i32, i32* %a, align 8
+  %tobool.not = icmp eq i32 %1, 0
+  br i1 %tobool.not, label %if.end, label %land.lhs.true
+
+land.lhs.true:                                    ; preds = %for.body
+  %2 = load i32, i32* %0, align 4
+  %cmp2 = icmp sgt i32 %2, 0
+  br i1 %cmp2, label %if.then, label %if.end
+
+if.then:                                          ; preds = %land.lhs.true
+  %sub = add nsw i32 %2, -1
+  store i32 %sub, i32* %0, align 4
+  br label %if.end
+
+if.end:                                           ; preds = %if.then, %land.lhs.true, %for.body
+  %inc = add nuw nsw i64 %i.012, 1
+  %exitcond.not = icmp eq i64 %inc, %n
+  br i1 %exitcond.not, label %for.end, label %for.body
+
+for.end:                                          ; preds = %if.end
+  ret void
+}
</cut>
linaro-toolchain@lists.linaro.org