Successfully identified regression in *llvm* in CI configuration tcwg_bmk_llvm_tx1/llvm-release-aarch64-spec2k6-O3. So far, this commit has regressed CI configurations: - tcwg_bmk_llvm_tx1/llvm-release-aarch64-spec2k6-O3
Culprit: <cut> commit 6998f8ae2d14e096aff33968f226587b5c1a193a Author: David Sherwood david.sherwood@arm.com Date: Wed Mar 10 08:34:19 2021 +0000
[LoopVectorize] Simplify scalar cost calculation in getInstructionCost
This patch simplifies the calculation of certain costs in getInstructionCost when isScalarAfterVectorization() returns a true value. There are a few places where we multiply a cost by a number N, i.e.
unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1; return N * TTI.getArithmeticInstrCost(...
After some investigation it seems that there are only these cases that occur in practice:
1. VF is a scalar, in which case N = 1. 2. VF is a vector. We can only get here if: a) the instruction is a GEP/bitcast/PHI with scalar uses, or b) this is an update to an induction variable that remains scalar.
I have changed the code so that N is assumed to always be 1. For GEPs the cost is always 0, since this is calculated later on as part of the load/store cost. PHI nodes are costed separately and were never previously multiplied by VF. For all other cases I have added an assert that none of the users needs scalarising, which didn't fire in any unit tests.
Only one test required fixing and I believe the original cost for the scalar add instruction to have been wrong, since only one copy remains after vectorisation.
I have also added a new test for the case when a pointer PHI feeds directly into a store that will be scalarised as we were previously never testing it.
Differential Revision: https://reviews.llvm.org/D99718 </cut>
Results regressed to (for first_bad == 6998f8ae2d14e096aff33968f226587b5c1a193a) # reset_artifacts: -10 # build_abe binutils: -9 # build_abe stage1 -- --set gcc_override_configure=--disable-libsanitizer: -8 # build_abe linux: -7 # build_abe glibc: -6 # build_abe stage2 -- --set gcc_override_configure=--disable-libsanitizer: -5 # build_llvm true: -3 # true: 0 # benchmark -- -O3 artifacts/build-6998f8ae2d14e096aff33968f226587b5c1a193a/results_id: 1 # 462.libquantum,libquantum_base.default regressed by 114 # 462.libquantum,[.] quantum_toffoli regressed by 123 # 462.libquantum,[.] quantum_cnot regressed by 115
from (for last_good == c835630c25a4f9925517949579f66a43b113fbc9) # reset_artifacts: -10 # build_abe binutils: -9 # build_abe stage1 -- --set gcc_override_configure=--disable-libsanitizer: -8 # build_abe linux: -7 # build_abe glibc: -6 # build_abe stage2 -- --set gcc_override_configure=--disable-libsanitizer: -5 # build_llvm true: -3 # true: 0 # benchmark -- -O3 artifacts/build-c835630c25a4f9925517949579f66a43b113fbc9/results_id: 1
Artifacts of last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release-... Results ID of last_good: tx1_64/tcwg_bmk_llvm_tx1/bisect-llvm-release-aarch64-spec2k6-O3/3744 Artifacts of first_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release-... Results ID of first_bad: tx1_64/tcwg_bmk_llvm_tx1/bisect-llvm-release-aarch64-spec2k6-O3/3755 Build top page/logs: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release-...
Configuration details:
Reproduce builds: <cut> mkdir investigate-llvm-6998f8ae2d14e096aff33968f226587b5c1a193a cd investigate-llvm-6998f8ae2d14e096aff33968f226587b5c1a193a
git clone https://git.linaro.org/toolchain/jenkins-scripts
mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release-... --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release-... --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release-... --fail chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh) mkdir -p ./bisect rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build git checkout --detach 6998f8ae2d14e096aff33968f226587b5c1a193a ../artifacts/test.sh
# Reproduce last_good build git checkout --detach c835630c25a4f9925517949579f66a43b113fbc9 ../artifacts/test.sh
cd .. </cut>
History of pending regressions and results: https://git.linaro.org/toolchain/ci/base-artifacts.git/log/?h=linaro-local/c...
Artifacts: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release-... Build log: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release-...
Full commit (up to 1000 lines): <cut> commit 6998f8ae2d14e096aff33968f226587b5c1a193a Author: David Sherwood david.sherwood@arm.com Date: Wed Mar 10 08:34:19 2021 +0000
[LoopVectorize] Simplify scalar cost calculation in getInstructionCost
This patch simplifies the calculation of certain costs in getInstructionCost when isScalarAfterVectorization() returns a true value. There are a few places where we multiply a cost by a number N, i.e.
unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1; return N * TTI.getArithmeticInstrCost(...
After some investigation it seems that there are only these cases that occur in practice:
1. VF is a scalar, in which case N = 1. 2. VF is a vector. We can only get here if: a) the instruction is a GEP/bitcast/PHI with scalar uses, or b) this is an update to an induction variable that remains scalar.
I have changed the code so that N is assumed to always be 1. For GEPs the cost is always 0, since this is calculated later on as part of the load/store cost. PHI nodes are costed separately and were never previously multiplied by VF. For all other cases I have added an assert that none of the users needs scalarising, which didn't fire in any unit tests.
Only one test required fixing and I believe the original cost for the scalar add instruction to have been wrong, since only one copy remains after vectorisation.
I have also added a new test for the case when a pointer PHI feeds directly into a store that will be scalarised as we were previously never testing it.
Differential Revision: https://reviews.llvm.org/D99718 --- llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | 73 +++++++++++++--------- .../AArch64/no_vector_instructions.ll | 2 +- .../LoopVectorize/AArch64/predication_costs.ll | 35 +++++++++++ .../Transforms/LoopVectorize/scalarized-bitcast.ll | 40 ++++++++++++ 4 files changed, 121 insertions(+), 29 deletions(-)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp index 2b413fc49505..f25af23c86c2 100644 --- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp +++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp @@ -7383,10 +7383,39 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF, Type *RetTy = I->getType(); if (canTruncateToMinimalBitwidth(I, VF)) RetTy = IntegerType::get(RetTy->getContext(), MinBWs[I]); - VectorTy = isScalarAfterVectorization(I, VF) ? RetTy : ToVectorTy(RetTy, VF); auto SE = PSE.getSE(); TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
+ auto hasSingleCopyAfterVectorization = [this](Instruction *I, + ElementCount VF) -> bool { + if (VF.isScalar()) + return true; + + auto Scalarized = InstsToScalarize.find(VF); + assert(Scalarized != InstsToScalarize.end() && + "VF not yet analyzed for scalarization profitability"); + return !Scalarized->second.count(I) && + llvm::all_of(I->users(), [&](User *U) { + auto *UI = cast<Instruction>(U); + return !Scalarized->second.count(UI); + }); + }; + + if (isScalarAfterVectorization(I, VF)) { + // With the exception of GEPs and PHIs, after scalarization there should + // only be one copy of the instruction generated in the loop. This is + // because the VF is either 1, or any instructions that need scalarizing + // have already been dealt with by the the time we get here. As a result, + // it means we don't have to multiply the instruction cost by VF. + assert(I->getOpcode() == Instruction::GetElementPtr || + I->getOpcode() == Instruction::PHI || + (I->getOpcode() == Instruction::BitCast && + I->getType()->isPointerTy()) || + hasSingleCopyAfterVectorization(I, VF)); + VectorTy = RetTy; + } else + VectorTy = ToVectorTy(RetTy, VF); + // TODO: We need to estimate the cost of intrinsic calls. switch (I->getOpcode()) { case Instruction::GetElementPtr: @@ -7514,21 +7543,16 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF, Op2VK = TargetTransformInfo::OK_UniformValue;
SmallVector<const Value *, 4> Operands(I->operand_values()); - unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1; - return N * TTI.getArithmeticInstrCost( - I->getOpcode(), VectorTy, CostKind, - TargetTransformInfo::OK_AnyValue, - Op2VK, TargetTransformInfo::OP_None, Op2VP, Operands, I); + return TTI.getArithmeticInstrCost( + I->getOpcode(), VectorTy, CostKind, TargetTransformInfo::OK_AnyValue, + Op2VK, TargetTransformInfo::OP_None, Op2VP, Operands, I); } case Instruction::FNeg: { assert(!VF.isScalable() && "VF is assumed to be non scalable."); - unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1; - return N * TTI.getArithmeticInstrCost( - I->getOpcode(), VectorTy, CostKind, - TargetTransformInfo::OK_AnyValue, - TargetTransformInfo::OK_AnyValue, - TargetTransformInfo::OP_None, TargetTransformInfo::OP_None, - I->getOperand(0), I); + return TTI.getArithmeticInstrCost( + I->getOpcode(), VectorTy, CostKind, TargetTransformInfo::OK_AnyValue, + TargetTransformInfo::OK_AnyValue, TargetTransformInfo::OP_None, + TargetTransformInfo::OP_None, I->getOperand(0), I); } case Instruction::Select: { SelectInst *SI = cast<SelectInst>(I); @@ -7583,6 +7607,10 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF, VectorTy = ToVectorTy(getMemInstValueType(I), Width); return getMemoryInstructionCost(I, VF); } + case Instruction::BitCast: + if (I->getType()->isPointerTy()) + return 0; + LLVM_FALLTHROUGH; case Instruction::ZExt: case Instruction::SExt: case Instruction::FPToUI: @@ -7593,8 +7621,7 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF, case Instruction::SIToFP: case Instruction::UIToFP: case Instruction::Trunc: - case Instruction::FPTrunc: - case Instruction::BitCast: { + case Instruction::FPTrunc: { // Computes the CastContextHint from a Load/Store instruction. auto ComputeCCH = [&](Instruction *I) -> TTI::CastContextHint { assert((isa<LoadInst>(I) || isa<StoreInst>(I)) && @@ -7672,14 +7699,7 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF, } }
- unsigned N; - if (isScalarAfterVectorization(I, VF)) { - assert(!VF.isScalable() && "VF is assumed to be non scalable"); - N = VF.getKnownMinValue(); - } else - N = 1; - return N * - TTI.getCastInstrCost(Opcode, VectorTy, SrcVecTy, CCH, CostKind, I); + return TTI.getCastInstrCost(Opcode, VectorTy, SrcVecTy, CCH, CostKind, I); } case Instruction::Call: { bool NeedToScalarize; @@ -7694,11 +7714,8 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF, case Instruction::ExtractValue: return TTI.getInstructionCost(I, TTI::TCK_RecipThroughput); default: - // The cost of executing VF copies of the scalar instruction. This opcode - // is unknown. Assume that it is the same as 'mul'. - return VF.getKnownMinValue() * TTI.getArithmeticInstrCost( - Instruction::Mul, VectorTy, CostKind) + - getScalarizationOverhead(I, VF); + // This opcode is unknown. Assume that it is the same as 'mul'. + return TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy, CostKind); } // end of switch. }
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/no_vector_instructions.ll b/llvm/test/Transforms/LoopVectorize/AArch64/no_vector_instructions.ll index 247ea35ff5d0..3061998518ad 100644 --- a/llvm/test/Transforms/LoopVectorize/AArch64/no_vector_instructions.ll +++ b/llvm/test/Transforms/LoopVectorize/AArch64/no_vector_instructions.ll @@ -6,7 +6,7 @@ target triple = "aarch64--linux-gnu"
; CHECK-LABEL: all_scalar ; CHECK: LV: Found scalar instruction: %i.next = add nuw nsw i64 %i, 2 -; CHECK: LV: Found an estimated cost of 2 for VF 2 For instruction: %i.next = add nuw nsw i64 %i, 2 +; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %i.next = add nuw nsw i64 %i, 2 ; CHECK: LV: Not considering vector loop of width 2 because it will not generate any vector instructions ; define void @all_scalar(i64* %a, i64 %n) { diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll index b0ebb4edf2ad..858b28ddd321 100644 --- a/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll +++ b/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll @@ -86,6 +86,41 @@ for.end: ret void }
+; CHECK-LABEL: predicated_store_phi +; +; Same as predicate_store except we use a pointer PHI to maintain the address +; +; CHECK: Found new scalar instruction: %addr = phi i32* [ %a, %entry ], [ %addr.next, %for.inc ] +; CHECK: Found new scalar instruction: %addr.next = getelementptr inbounds i32, i32* %addr, i64 1 +; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %addr, align 4 +; CHECK: Found an estimated cost of 0 for VF 2 For instruction: %addr = phi i32* [ %a, %entry ], [ %addr.next, %for.inc ] +; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %addr, align 4 +; +define void @predicated_store_phi(i32* %a, i1 %c, i32 %x, i64 %n) { +entry: + br label %for.body + +for.body: + %i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ] + %addr = phi i32 * [ %a, %entry ], [ %addr.next, %for.inc ] + %tmp1 = load i32, i32* %addr, align 4 + %tmp2 = add nsw i32 %tmp1, %x + br i1 %c, label %if.then, label %for.inc + +if.then: + store i32 %tmp2, i32* %addr, align 4 + br label %for.inc + +for.inc: + %i.next = add nuw nsw i64 %i, 1 + %cond = icmp slt i64 %i.next, %n + %addr.next = getelementptr inbounds i32, i32* %addr, i64 1 + br i1 %cond, label %for.body, label %for.end + +for.end: + ret void +} + ; CHECK-LABEL: predicated_udiv_scalarized_operand ; ; This test checks that we correctly compute the cost of the predicated udiv diff --git a/llvm/test/Transforms/LoopVectorize/scalarized-bitcast.ll b/llvm/test/Transforms/LoopVectorize/scalarized-bitcast.ll new file mode 100644 index 000000000000..0c97e6ac475e --- /dev/null +++ b/llvm/test/Transforms/LoopVectorize/scalarized-bitcast.ll @@ -0,0 +1,40 @@ +; REQUIRES: asserts +; RUN: opt -loop-vectorize -force-vector-width=2 -debug-only=loop-vectorize -S -o - < %s 2>&1 | FileCheck %s + +%struct.foo = type { i32, i64 } + +; CHECK: LV: Found an estimated cost of 0 for VF 2 For instruction: %0 = bitcast i64* %b to i32* + +; The bitcast below will be scalarized due to the predication in the loop. Bitcasts +; between pointer types should be treated as free, despite the scalarization. +define void @foo(%struct.foo* noalias nocapture %in, i32* noalias nocapture readnone %out, i64 %n) { +entry: + br label %for.body + +for.body: ; preds = %entry, %if.end + %i.012 = phi i64 [ %inc, %if.end ], [ 0, %entry ] + %b = getelementptr inbounds %struct.foo, %struct.foo* %in, i64 %i.012, i32 1 + %0 = bitcast i64* %b to i32* + %a = getelementptr inbounds %struct.foo, %struct.foo* %in, i64 %i.012, i32 0 + %1 = load i32, i32* %a, align 8 + %tobool.not = icmp eq i32 %1, 0 + br i1 %tobool.not, label %if.end, label %land.lhs.true + +land.lhs.true: ; preds = %for.body + %2 = load i32, i32* %0, align 4 + %cmp2 = icmp sgt i32 %2, 0 + br i1 %cmp2, label %if.then, label %if.end + +if.then: ; preds = %land.lhs.true + %sub = add nsw i32 %2, -1 + store i32 %sub, i32* %0, align 4 + br label %if.end + +if.end: ; preds = %if.then, %land.lhs.true, %for.body + %inc = add nuw nsw i64 %i.012, 1 + %exitcond.not = icmp eq i64 %inc, %n + br i1 %exitcond.not, label %for.end, label %for.body + +for.end: ; preds = %if.end + ret void +} </cut>
linaro-toolchain@lists.linaro.org