After llvmorg-16-init-16383-g9b5f62685ab4 commit 9b5f62685ab447ba9d3ea8ac2616e0c76a44d21b
Author: Alexey Bataev a.bataev@outlook.com

    [SLP]Fix cost of the broadcast buildvector/gather.

the following benchmarks slowed down by more than 3%:
- 445.gobmk slowed down by 6% from 10321 to 10904 perf samples
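For reference, the reported percentage can be re-derived from the two sample counts. The snippet below is only a sketch: it assumes the slowdown is the relative increase in perf samples and that the 3% figure is a simple threshold on that value. The helper names `slowdownPercent` and `isRegression` are made up for illustration and are not part of the CI scripts.

```cpp
#include <cassert>

// Relative change in perf samples, in percent; positive means slower.
// Assumption: the report compares raw sample counts of the two builds.
double slowdownPercent(long before, long after) {
  assert(before > 0 && "baseline sample count must be positive");
  return 100.0 * static_cast<double>(after - before) /
         static_cast<double>(before);
}

// True if the slowdown exceeds the reporting threshold (3% in this report).
bool isRegression(long before, long after, double thresholdPercent = 3.0) {
  return slowdownPercent(before, after) > thresholdPercent;
}
```

With the numbers above, `slowdownPercent(10321, 10904)` is 583 / 10321, roughly 5.65%, which rounds to the reported 6% and is above the 3% threshold.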
The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: arm-linux-gnueabihf
- Compiler flags: -O3 -flto -marm
- Hardware:
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org . Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
For latest status see comments in https://linaro.atlassian.net/browse/GNU-692 .
Status of llvmorg-16-init-16383-g9b5f62685ab4 commit for tcwg_bmk-code_speed-spec2k6:
commit 9b5f62685ab447ba9d3ea8ac2616e0c76a44d21b
Author: Alexey Bataev a.bataev@outlook.com
Date: Wed Dec 21 13:38:38 2022 -0800
[SLP]Fix cost of the broadcast buildvector/gather.
Need to include the cost of the initial insertelement to the cost of the broadcasts. Also, need to adjust the cost of the gather/buildvector if the element is inserted into poison/undef vector.
Differential Revision: https://reviews.llvm.org/D140498
* llvm-arm-master-O3_LTO
** After llvmorg-16-init-16383-g9b5f62685ab4 commit 9b5f62685ab447ba9d3ea8ac2616e0c76a44d21b
** Author: Alexey Bataev a.bataev@outlook.com
**
** [SLP]Fix cost of the broadcast buildvector/gather.
**
** the following benchmarks slowed down by more than 3%:
** - 445.gobmk slowed down by 6% from 10321 to 10904 perf samples
** https://ci.linaro.org/job/tcwg_bmk-code_speed-spec2k6-llvm-arm-master-O3_LTO...
Bad build: https://ci.linaro.org/job/tcwg_bmk-code_speed-spec2k6-llvm-arm-master-O3_LTO...
Good build: https://ci.linaro.org/job/tcwg_bmk-code_speed-spec2k6-llvm-arm-master-O3_LTO...
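As a reading aid for the commit message quoted above, here is a toy model of the costing rule it describes. It is a simplified sketch, not LLVM code: the enums and both functions are hypothetical stand-ins, the branch structure loosely follows the new `Index == 0` handling in the X86 `getVectorInstrCost` hunk of the patch (integer elements only, ignoring the SLM cost table and the extract paths), and the way the broadcast cost is assembled is an assumption based on the commit message alone.

```cpp
#include <cassert>

// Hypothetical stand-ins for the two insertelement operands (Op0 and Op1 in
// the patch). These are NOT LLVM types.
enum class VectorOp { PoisonOrUndef, Other };
enum class ScalarOp { Load, IntConstant, Other };

// Toy cost of "insertelement Vec, Scalar, 0" for an integer element.
//   cheapInsert : stands in for IsCheapPInsrPExtrInsertPS() in the patch
//   regMove     : stands in for RegisterFileMoveCost
//   fallback    : stands in for the generic base-class cost
int insertAtIndexZeroCost(VectorOp Vec, ScalarOp Scalar, bool cheapInsert,
                          int regMove, int fallback) {
  assert(regMove >= 0 && fallback >= 0 && "costs must be non-negative");
  if (Vec == VectorOp::PoisonOrUndef) {
    // Inserting a loaded value into poison/undef is treated as cheap.
    if (Scalar == ScalarOp::Load)
      return regMove;
    if (!cheapInsert) {
      // Materialize the constant in a GPR, then move GPR -> vector register.
      if (Scalar == ScalarOp::IntConstant)
        return 2 + regMove;
      // A single GPR -> vector register move.
      return 1 + regMove;
    }
  }
  // A cheap pinsr-style insert is available.
  if (cheapInsert)
    return 1 + regMove;
  return fallback + regMove;
}

// Assumed reading of the commit message: a broadcast pays for the initial
// insertelement in addition to the shuffle that replicates lane 0.
int broadcastCost(int initialInsertCost, int shuffleCost) {
  return initialInsertCost + shuffleCost;
}
```

For example, with `regMove == 0` and no cheap insert available, inserting an integer constant into a poison vector costs 2 in this model, so a broadcast with a shuffle cost of 1 would cost 3 rather than 1. A higher estimate of that kind can make the SLP vectorizer judge some trees unprofitable that it previously vectorized, or the reverse, which is the sort of code-generation change a benchmark bisection would pick up.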
Reproduce current build:
<cut>
mkdir -p investigate-llvm-9b5f62685ab447ba9d3ea8ac2616e0c76a44d21b
cd investigate-llvm-9b5f62685ab447ba9d3ea8ac2616e0c76a44d21b

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests for bad and good builds
mkdir -p bad/artifacts good/artifacts
curl -o bad/artifacts/manifest.sh https://ci.linaro.org/job/tcwg_bmk-code_speed-spec2k6-llvm-arm-master-O3_LTO... --fail
curl -o good/artifacts/manifest.sh https://ci.linaro.org/job/tcwg_bmk-code_speed-spec2k6-llvm-arm-master-O3_LTO... --fail

# Reproduce bad build
(cd bad; ../jenkins-scripts/tcwg_bmk-build.sh ^^ true %%rr[top_artifacts] artifacts)
# Reproduce good build
(cd good; ../jenkins-scripts/tcwg_bmk-build.sh ^^ true %%rr[top_artifacts] artifacts)
</cut>
Full commit (up to 1000 lines):
<cut>
commit 9b5f62685ab447ba9d3ea8ac2616e0c76a44d21b
Author: Alexey Bataev a.bataev@outlook.com
Date: Wed Dec 21 13:38:38 2022 -0800
[SLP]Fix cost of the broadcast buildvector/gather.
Need to include the cost of the initial insertelement to the cost of the broadcasts. Also, need to adjust the cost of the gather/buildvector if the element is inserted into poison/undef vector.
Differential Revision: https://reviews.llvm.org/D140498
---
 llvm/include/llvm/Analysis/TargetTransformInfo.h | 12 +-
 .../llvm/Analysis/TargetTransformInfoImpl.h | 4 +-
 llvm/include/llvm/CodeGen/BasicTTIImpl.h | 54 +--
 llvm/lib/Analysis/TargetTransformInfo.cpp | 8 +-
 .../Target/AArch64/AArch64TargetTransformInfo.cpp | 7 +-
 .../Target/AArch64/AArch64TargetTransformInfo.h | 4 +-
 .../Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | 7 +-
 llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h | 2 +-
 llvm/lib/Target/AMDGPU/R600TargetTransformInfo.cpp | 7 +-
 llvm/lib/Target/AMDGPU/R600TargetTransformInfo.h | 2 +-
 llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp | 7 +-
 llvm/lib/Target/ARM/ARMTargetTransformInfo.h | 4 +-
 .../Target/Hexagon/HexagonTargetTransformInfo.cpp | 6 +-
 .../Target/Hexagon/HexagonTargetTransformInfo.h | 4 +-
 llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp | 9 +-
 llvm/lib/Target/PowerPC/PPCTargetTransformInfo.h | 4 +-
 llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp | 7 +-
 llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h | 4 +-
 .../Target/SystemZ/SystemZTargetTransformInfo.cpp | 5 +-
 .../Target/SystemZ/SystemZTargetTransformInfo.h | 4 +-
 .../WebAssembly/WebAssemblyTargetTransformInfo.cpp | 5 +-
 .../WebAssembly/WebAssemblyTargetTransformInfo.h | 4 +-
 llvm/lib/Target/X86/X86TargetTransformInfo.cpp | 42 ++-
 llvm/lib/Target/X86/X86TargetTransformInfo.h | 4 +-
 llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | 21 +-
 .../Analysis/CostModel/X86/loop_v2-inseltpoison.ll | 2 +-
 llvm/test/Analysis/CostModel/X86/loop_v2.ll | 2 +-
 .../X86/masked-intrinsic-cost-inseltpoison.ll | 4 +-
 .../CostModel/X86/masked-intrinsic-cost.ll | 4 +-
 .../CostModel/X86/vector-insert-inseltpoison.ll | 120 +++----
 llvm/test/Analysis/CostModel/X86/vector-insert.ll | 120 +++----
 .../Analysis/CostModel/X86/vshift-ashr-codesize.ll | 50 +--
 .../CostModel/X86/vshift-ashr-cost-inseltpoison.ll | 102 ++----
 .../Analysis/CostModel/X86/vshift-ashr-cost.ll | 102 ++----
 .../Analysis/CostModel/X86/vshift-ashr-latency.ll | 18 +-
 .../CostModel/X86/vshift-ashr-sizelatency.ll | 50 +--
 .../Analysis/CostModel/X86/vshift-lshr-codesize.ll | 82 +----
 .../CostModel/X86/vshift-lshr-cost-inseltpoison.ll | 102 ++----
 .../Analysis/CostModel/X86/vshift-lshr-cost.ll | 102 ++----
 .../Analysis/CostModel/X86/vshift-lshr-latency.ll | 102 ++----
 .../CostModel/X86/vshift-lshr-sizelatency.ll | 82 +----
 .../Analysis/CostModel/X86/vshift-shl-codesize.ll | 82 +----
 .../CostModel/X86/vshift-shl-cost-inseltpoison.ll | 138 ++-----
 .../test/Analysis/CostModel/X86/vshift-shl-cost.ll | 138 ++-----
 .../Analysis/CostModel/X86/vshift-shl-latency.ll | 102 ++----
 .../CostModel/X86/vshift-shl-sizelatency.ll | 174 ++-------
 llvm/test/Transforms/SLPVectorizer/X86/cse.ll | 7 +-
 .../Transforms/SLPVectorizer/X86/malformed_phis.ll | 140 ++++----
 .../X86/remark_gather-load-redux-cost.ll | 2 +-
 .../SLPVectorizer/X86/used-reduced-op.ll | 399 +++++++++++----------
 50 files changed, 941 insertions(+), 1522 deletions(-)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h index 6200af73842c..a9cb8717ffa8 100644 --- a/llvm/include/llvm/Analysis/TargetTransformInfo.h +++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h @@ -1193,7 +1193,8 @@ public: /// case is to provision the cost of vectorization/scalarization in /// vectorizer passes. InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index = -1) const; + unsigned Index = -1, Value *Op0 = nullptr, + Value *Op1 = nullptr) const;
/// \return The expected cost of vector Insert and Extract. /// This is used when instruction is available, and implementation @@ -1786,7 +1787,8 @@ public: TTI::TargetCostKind CostKind, const Instruction *I) = 0; virtual InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index) = 0; + unsigned Index, Value *Op0, + Value *Op1) = 0; virtual InstructionCost getVectorInstrCost(const Instruction &I, Type *Val, unsigned Index) = 0;
@@ -2358,9 +2360,9 @@ public: const Instruction *I) override { return Impl.getCmpSelInstrCost(Opcode, ValTy, CondTy, VecPred, CostKind, I); } - InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index) override { - return Impl.getVectorInstrCost(Opcode, Val, Index); + InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, unsigned Index, + Value *Op0, Value *Op1) override { + return Impl.getVectorInstrCost(Opcode, Val, Index, Op0, Op1); } InstructionCost getVectorInstrCost(const Instruction &I, Type *Val, unsigned Index) override { diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h index e81e430f6624..262b42a05d99 100644 --- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h +++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h @@ -585,8 +585,8 @@ public: return 1; }
- InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index) const { + InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, unsigned Index, + Value *Op0, Value *Op1) const { return 1; }
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h index aabb94d82c4b..f27c6899d757 100644 --- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h +++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h @@ -90,10 +90,12 @@ private: InstructionCost Cost = 0; // Broadcast cost is equal to the cost of extracting the zero'th element // plus the cost of inserting it into every element of the result vector. - Cost += thisT()->getVectorInstrCost(Instruction::ExtractElement, VTy, 0); + Cost += thisT()->getVectorInstrCost(Instruction::ExtractElement, VTy, 0, + nullptr, nullptr);
for (int i = 0, e = VTy->getNumElements(); i < e; ++i) { - Cost += thisT()->getVectorInstrCost(Instruction::InsertElement, VTy, i); + Cost += thisT()->getVectorInstrCost(Instruction::InsertElement, VTy, i, + nullptr, nullptr); } return Cost; } @@ -110,8 +112,10 @@ private: // vector and finally index 3 of second vector and insert them at index // <0,1,2,3> of result vector. for (int i = 0, e = VTy->getNumElements(); i < e; ++i) { - Cost += thisT()->getVectorInstrCost(Instruction::InsertElement, VTy, i); - Cost += thisT()->getVectorInstrCost(Instruction::ExtractElement, VTy, i); + Cost += thisT()->getVectorInstrCost(Instruction::InsertElement, VTy, i, + nullptr, nullptr); + Cost += thisT()->getVectorInstrCost(Instruction::ExtractElement, VTy, i, + nullptr, nullptr); } return Cost; } @@ -134,9 +138,9 @@ private: // type. for (int i = 0; i != NumSubElts; ++i) { Cost += thisT()->getVectorInstrCost(Instruction::ExtractElement, VTy, - i + Index); - Cost += - thisT()->getVectorInstrCost(Instruction::InsertElement, SubVTy, i); + i + Index, nullptr, nullptr); + Cost += thisT()->getVectorInstrCost(Instruction::InsertElement, SubVTy, i, + nullptr, nullptr); } return Cost; } @@ -158,10 +162,10 @@ private: // the source type plus the cost of inserting them into the result vector // type. 
for (int i = 0; i != NumSubElts; ++i) { - Cost += - thisT()->getVectorInstrCost(Instruction::ExtractElement, SubVTy, i); + Cost += thisT()->getVectorInstrCost(Instruction::ExtractElement, SubVTy, + i, nullptr, nullptr); Cost += thisT()->getVectorInstrCost(Instruction::InsertElement, VTy, - i + Index); + i + Index, nullptr, nullptr); } return Cost; } @@ -212,7 +216,7 @@ private: FixedVectorType::get( PointerType::get(VT->getElementType(), 0), VT->getNumElements()), - -1) + -1, nullptr, nullptr) : 0; InstructionCost LoadCost = VT->getNumElements() * @@ -237,7 +241,7 @@ private: Instruction::ExtractElement, FixedVectorType::get(Type::getInt1Ty(DataTy->getContext()), VT->getNumElements()), - -1) + + -1, nullptr, nullptr) + getCFInstrCost(Instruction::Br, CostKind) + getCFInstrCost(Instruction::PHI, CostKind)); } @@ -722,9 +726,11 @@ public: if (!DemandedElts[i]) continue; if (Insert) - Cost += thisT()->getVectorInstrCost(Instruction::InsertElement, Ty, i); + Cost += thisT()->getVectorInstrCost(Instruction::InsertElement, Ty, i, + nullptr, nullptr); if (Extract) - Cost += thisT()->getVectorInstrCost(Instruction::ExtractElement, Ty, i); + Cost += thisT()->getVectorInstrCost(Instruction::ExtractElement, Ty, i, + nullptr, nullptr); }
return Cost; @@ -1123,7 +1129,7 @@ public: InstructionCost getExtractWithExtendCost(unsigned Opcode, Type *Dst, VectorType *VecTy, unsigned Index) { return thisT()->getVectorInstrCost(Instruction::ExtractElement, VecTy, - Index) + + Index, nullptr, nullptr) + thisT()->getCastInstrCost(Opcode, Dst, VecTy->getElementType(), TTI::CastContextHint::None, TTI::TCK_RecipThroughput); @@ -1184,14 +1190,20 @@ public: return 1; }
- InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index) { + InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, unsigned Index, + Value *Op0, Value *Op1) { return getRegUsageForType(Val->getScalarType()); }
InstructionCost getVectorInstrCost(const Instruction &I, Type *Val, unsigned Index) { - return thisT()->getVectorInstrCost(I.getOpcode(), Val, Index); + Value *Op0 = nullptr; + Value *Op1 = nullptr; + if (auto *IE = dyn_cast<InsertElementInst>(&I)) { + Op0 = IE->getOperand(0); + Op1 = IE->getOperand(1); + } + return thisT()->getVectorInstrCost(I.getOpcode(), Val, Index, Op0, Op1); }
InstructionCost getReplicationShuffleCost(Type *EltTy, int ReplicationFactor, @@ -2246,7 +2258,8 @@ public: ArithCost += NumReduxLevels * thisT()->getArithmeticInstrCost(Opcode, Ty, CostKind); return ShuffleCost + ArithCost + - thisT()->getVectorInstrCost(Instruction::ExtractElement, Ty, 0); + thisT()->getVectorInstrCost(Instruction::ExtractElement, Ty, 0, + nullptr, nullptr); }
/// Try to calculate the cost of performing strict (in-order) reductions, @@ -2353,7 +2366,8 @@ public: // The last min/max should be in vector registers and we counted it above. // So just need a single extractelement. return ShuffleCost + MinMaxCost + - thisT()->getVectorInstrCost(Instruction::ExtractElement, Ty, 0); + thisT()->getVectorInstrCost(Instruction::ExtractElement, Ty, 0, + nullptr, nullptr); }
InstructionCost getExtendedReductionCost(unsigned Opcode, bool IsUnsigned, diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp index 7459ce18c3cf..d03a8cf14172 100644 --- a/llvm/lib/Analysis/TargetTransformInfo.cpp +++ b/llvm/lib/Analysis/TargetTransformInfo.cpp @@ -897,13 +897,13 @@ InstructionCost TargetTransformInfo::getCmpSelInstrCost( return Cost; }
-InstructionCost TargetTransformInfo::getVectorInstrCost(unsigned Opcode, - Type *Val, - unsigned Index) const { +InstructionCost TargetTransformInfo::getVectorInstrCost( + unsigned Opcode, Type *Val, unsigned Index, Value *Op0, Value *Op1) const { // FIXME: Assert that Opcode is either InsertElement or ExtractElement. // This is mentioned in the interface description and respected by all // callers, but never asserted upon. - InstructionCost Cost = TTIImpl->getVectorInstrCost(Opcode, Val, Index); + InstructionCost Cost = + TTIImpl->getVectorInstrCost(Opcode, Val, Index, Op0, Op1); assert(Cost >= 0 && "TTI should not produce negative costs!"); return Cost; } diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp index ae12ae951d75..f5f6c07f766a 100644 --- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp +++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp @@ -2034,8 +2034,8 @@ InstructionCost AArch64TTIImpl::getExtractWithExtendCost(unsigned Opcode,
// Get the cost for the extract. We compute the cost (if any) for the extend // below. - InstructionCost Cost = - getVectorInstrCost(Instruction::ExtractElement, VecTy, Index); + InstructionCost Cost = getVectorInstrCost(Instruction::ExtractElement, VecTy, + Index, nullptr, nullptr);
// Legalize the types. auto VecLT = getTypeLegalizationCost(VecTy); @@ -2128,7 +2128,8 @@ InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(Type *Val, }
InstructionCost AArch64TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index) { + unsigned Index, Value *Op0, + Value *Op1) { return getVectorInstrCostHelper(Val, Index, false /* HasRealUse */); }
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h index e309117a885b..6eaff9566b8c 100644 --- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h +++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h @@ -169,8 +169,8 @@ public: InstructionCost getCFInstrCost(unsigned Opcode, TTI::TargetCostKind CostKind, const Instruction *I = nullptr);
- InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index); + InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, unsigned Index, + Value *Op0, Value *Op1); InstructionCost getVectorInstrCost(const Instruction &I, Type *Val, unsigned Index);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp index af72ba2daa2d..00e6970291bf 100644 --- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp +++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp @@ -790,7 +790,8 @@ GCNTTIImpl::getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy, }
InstructionCost GCNTTIImpl::getVectorInstrCost(unsigned Opcode, Type *ValTy, - unsigned Index) { + unsigned Index, Value *Op0, + Value *Op1) { switch (Opcode) { case Instruction::ExtractElement: case Instruction::InsertElement: { @@ -799,7 +800,7 @@ InstructionCost GCNTTIImpl::getVectorInstrCost(unsigned Opcode, Type *ValTy, if (EltSize < 32) { if (EltSize == 16 && Index == 0 && ST->has16BitInsts()) return 0; - return BaseT::getVectorInstrCost(Opcode, ValTy, Index); + return BaseT::getVectorInstrCost(Opcode, ValTy, Index, Op0, Op1); }
// Extracts are just reads of a subregister, so are free. Inserts are @@ -810,7 +811,7 @@ InstructionCost GCNTTIImpl::getVectorInstrCost(unsigned Opcode, Type *ValTy, return Index == ~0u ? 2 : 0; } default: - return BaseT::getVectorInstrCost(Opcode, ValTy, Index); + return BaseT::getVectorInstrCost(Opcode, ValTy, Index, Op0, Op1); } }
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h index 347ce87acd26..4a1137dcf2e2 100644 --- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h +++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h @@ -162,7 +162,7 @@ public:
using BaseT::getVectorInstrCost; InstructionCost getVectorInstrCost(unsigned Opcode, Type *ValTy, - unsigned Index); + unsigned Index, Value *Op0, Value *Op1);
bool isReadRegisterSourceOfDivergence(const IntrinsicInst *ReadReg) const; bool isSourceOfDivergence(const Value *V) const; diff --git a/llvm/lib/Target/AMDGPU/R600TargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/R600TargetTransformInfo.cpp index 365c005b2503..c3dd321a7b9c 100644 --- a/llvm/lib/Target/AMDGPU/R600TargetTransformInfo.cpp +++ b/llvm/lib/Target/AMDGPU/R600TargetTransformInfo.cpp @@ -108,14 +108,15 @@ InstructionCost R600TTIImpl::getCFInstrCost(unsigned Opcode, }
InstructionCost R600TTIImpl::getVectorInstrCost(unsigned Opcode, Type *ValTy, - unsigned Index) { + unsigned Index, Value *Op0, + Value *Op1) { switch (Opcode) { case Instruction::ExtractElement: case Instruction::InsertElement: { unsigned EltSize = DL.getTypeSizeInBits(cast<VectorType>(ValTy)->getElementType()); if (EltSize < 32) { - return BaseT::getVectorInstrCost(Opcode, ValTy, Index); + return BaseT::getVectorInstrCost(Opcode, ValTy, Index, Op0, Op1); }
// Extracts are just reads of a subregister, so are free. Inserts are @@ -126,7 +127,7 @@ InstructionCost R600TTIImpl::getVectorInstrCost(unsigned Opcode, Type *ValTy, return Index == ~0u ? 2 : 0; } default: - return BaseT::getVectorInstrCost(Opcode, ValTy, Index); + return BaseT::getVectorInstrCost(Opcode, ValTy, Index, Op0, Op1); } }
diff --git a/llvm/lib/Target/AMDGPU/R600TargetTransformInfo.h b/llvm/lib/Target/AMDGPU/R600TargetTransformInfo.h index f1a198fd14e4..9045cc773189 100644 --- a/llvm/lib/Target/AMDGPU/R600TargetTransformInfo.h +++ b/llvm/lib/Target/AMDGPU/R600TargetTransformInfo.h @@ -62,7 +62,7 @@ public: const Instruction *I = nullptr); using BaseT::getVectorInstrCost; InstructionCost getVectorInstrCost(unsigned Opcode, Type *ValTy, - unsigned Index); + unsigned Index, Value *Op0, Value *Op1); };
} // end namespace llvm diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp index 8eec432a4a66..07786ea82738 100644 --- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp +++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp @@ -874,7 +874,8 @@ InstructionCost ARMTTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst, }
InstructionCost ARMTTIImpl::getVectorInstrCost(unsigned Opcode, Type *ValTy, - unsigned Index) { + unsigned Index, Value *Op0, + Value *Op1) { // Penalize inserting into an D-subregister. We end up with a three times // lower estimated throughput on swift. if (ST->hasSlowLoadDSubregister() && Opcode == Instruction::InsertElement && @@ -893,7 +894,7 @@ InstructionCost ARMTTIImpl::getVectorInstrCost(unsigned Opcode, Type *ValTy, if (ValTy->isVectorTy() && ValTy->getScalarSizeInBits() <= 32) return std::max<InstructionCost>( - BaseT::getVectorInstrCost(Opcode, ValTy, Index), 2U); + BaseT::getVectorInstrCost(Opcode, ValTy, Index, Op0, Op1), 2U); }
if (ST->hasMVEIntegerOps() && (Opcode == Instruction::InsertElement || @@ -906,7 +907,7 @@ InstructionCost ARMTTIImpl::getVectorInstrCost(unsigned Opcode, Type *ValTy, return LT.first * (ValTy->getScalarType()->isIntegerTy() ? 4 : 1); }
- return BaseT::getVectorInstrCost(Opcode, ValTy, Index); + return BaseT::getVectorInstrCost(Opcode, ValTy, Index, Op0, Op1); }
InstructionCost ARMTTIImpl::getCmpSelInstrCost(unsigned Opcode, Type *ValTy, diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h index db96c3da54cf..6b1e6444c516 100644 --- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h +++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h @@ -240,8 +240,8 @@ public: const Instruction *I = nullptr);
using BaseT::getVectorInstrCost; - InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index); + InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, unsigned Index, + Value *Op0, Value *Op1);
InstructionCost getAddressComputationCost(Type *Val, ScalarEvolution *SE, const SCEV *Ptr); diff --git a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp index 779577816fb9..6089c865cedf 100644 --- a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp +++ b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp @@ -329,7 +329,8 @@ InstructionCost HexagonTTIImpl::getCastInstrCost(unsigned Opcode, Type *DstTy, }
InstructionCost HexagonTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index) { + unsigned Index, Value *Op0, + Value *Op1) { Type *ElemTy = Val->isVectorTy() ? cast<VectorType>(Val)->getElementType() : Val; if (Opcode == Instruction::InsertElement) { @@ -338,7 +339,8 @@ InstructionCost HexagonTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val, if (ElemTy->isIntegerTy(32)) return Cost; // If it's not a 32-bit value, there will need to be an extract. - return Cost + getVectorInstrCost(Instruction::ExtractElement, Val, Index); + return Cost + getVectorInstrCost(Instruction::ExtractElement, Val, Index, + Op0, Op1); }
if (Opcode == Instruction::ExtractElement) diff --git a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h index 49d9520b8323..d41299ff6413 100644 --- a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h +++ b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h @@ -154,8 +154,8 @@ public: TTI::TargetCostKind CostKind, const Instruction *I = nullptr); using BaseT::getVectorInstrCost; - InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index); + InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, unsigned Index, + Value *Op0, Value *Op1);
InstructionCost getCFInstrCost(unsigned Opcode, TTI::TargetCostKind CostKind, const Instruction *I = nullptr) { diff --git a/llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp b/llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp index 3b952f11be34..328a70ec43f6 100644 --- a/llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp +++ b/llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp @@ -675,7 +675,8 @@ InstructionCost PPCTTIImpl::getCmpSelInstrCost(unsigned Opcode, Type *ValTy, }
InstructionCost PPCTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index) { + unsigned Index, Value *Op0, + Value *Op1) { assert(Val->isVectorTy() && "This must be a vector type");
int ISD = TLI->InstructionOpcodeToISD(Opcode); @@ -685,7 +686,8 @@ InstructionCost PPCTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val, if (!CostFactor.isValid()) return InstructionCost::getMax();
- InstructionCost Cost = BaseT::getVectorInstrCost(Opcode, Val, Index); + InstructionCost Cost = + BaseT::getVectorInstrCost(Opcode, Val, Index, Op0, Op1); Cost *= CostFactor;
if (ST->hasVSX() && Val->getScalarType()->isDoubleTy()) { @@ -827,7 +829,8 @@ InstructionCost PPCTTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src, if (Src->isVectorTy() && Opcode == Instruction::Store) for (int i = 0, e = cast<FixedVectorType>(Src)->getNumElements(); i < e; ++i) - Cost += getVectorInstrCost(Instruction::ExtractElement, Src, i); + Cost += getVectorInstrCost(Instruction::ExtractElement, Src, i, nullptr, + nullptr);
return Cost; } diff --git a/llvm/lib/Target/PowerPC/PPCTargetTransformInfo.h b/llvm/lib/Target/PowerPC/PPCTargetTransformInfo.h index 9db903baf407..810a7d0d62ef 100644 --- a/llvm/lib/Target/PowerPC/PPCTargetTransformInfo.h +++ b/llvm/lib/Target/PowerPC/PPCTargetTransformInfo.h @@ -126,8 +126,8 @@ public: TTI::TargetCostKind CostKind, const Instruction *I = nullptr); using BaseT::getVectorInstrCost; - InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index); + InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, unsigned Index, + Value *Op0, Value *Op1); InstructionCost getMemoryOpCost(unsigned Opcode, Type *Src, MaybeAlign Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind, diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp index 02ce1b135f7f..ed8af25998b0 100644 --- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp +++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp @@ -1216,12 +1216,13 @@ InstructionCost RISCVTTIImpl::getCmpSelInstrCost(unsigned Opcode, Type *ValTy, }
InstructionCost RISCVTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index) { + unsigned Index, Value *Op0, + Value *Op1) { assert(Val->isVectorTy() && "This must be a vector type");
if (Opcode != Instruction::ExtractElement && Opcode != Instruction::InsertElement) - return BaseT::getVectorInstrCost(Opcode, Val, Index); + return BaseT::getVectorInstrCost(Opcode, Val, Index, Op0, Op1);
// Legalize the type. std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Val); @@ -1235,7 +1236,7 @@ InstructionCost RISCVTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val, return LT.first;
if (!isTypeLegal(Val)) - return BaseT::getVectorInstrCost(Opcode, Val, Index); + return BaseT::getVectorInstrCost(Opcode, Val, Index, Op0, Op1);
// In RVV, we could use vslidedown + vmv.x.s to extract element from vector // and vslideup + vmv.s.x to insert element to vector. diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h index 80c7ca3564d7..5df266ba35b5 100644 --- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h +++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h @@ -157,8 +157,8 @@ public: const Instruction *I = nullptr);
using BaseT::getVectorInstrCost; - InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index); + InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, unsigned Index, + Value *Op0, Value *Op1);
InstructionCost getArithmeticInstrCost( unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind, diff --git a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp index 5d00e56ae347..d6736319a404 100644 --- a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp +++ b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp @@ -996,7 +996,8 @@ InstructionCost SystemZTTIImpl::getCmpSelInstrCost(unsigned Opcode, Type *ValTy, }
InstructionCost SystemZTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index) { + unsigned Index, Value *Op0, + Value *Op1) { // vlvgp will insert two grs into a vector register, so only count half the // number of instructions. if (Opcode == Instruction::InsertElement && Val->isIntOrIntVectorTy(64)) @@ -1012,7 +1013,7 @@ InstructionCost SystemZTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val, return Cost; }
- return BaseT::getVectorInstrCost(Opcode, Val, Index); + return BaseT::getVectorInstrCost(Opcode, Val, Index, Op0, Op1); }
// Check if a load may be folded as a memory operand in its user. diff --git a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h index 5ac3d8149a1d..33c3778d572c 100644 --- a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h +++ b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h @@ -107,8 +107,8 @@ public: TTI::TargetCostKind CostKind, const Instruction *I = nullptr); using BaseT::getVectorInstrCost; - InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index); + InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, unsigned Index, + Value *Op0, Value *Op1); bool isFoldableLoad(const LoadInst *Ld, const Instruction *&FoldedValue); InstructionCost getMemoryOpCost(unsigned Opcode, Type *Src, MaybeAlign Alignment, diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp b/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp index 38464627e742..b94dcd63ad8b 100644 --- a/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp +++ b/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp @@ -82,9 +82,10 @@ InstructionCost WebAssemblyTTIImpl::getArithmeticInstrCost(
InstructionCost WebAssemblyTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index) { + unsigned Index, + Value *Op0, Value *Op1) { InstructionCost Cost = - BasicTTIImplBase::getVectorInstrCost(Opcode, Val, Index); + BasicTTIImplBase::getVectorInstrCost(Opcode, Val, Index, Op0, Op1);
// SIMD128's insert/extract currently only take constant indices. if (Index == -1u) diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h b/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h index 7eed7ef44af7..4f54a762042f 100644 --- a/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h +++ b/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h @@ -66,8 +66,8 @@ public: ArrayRef<const Value *> Args = ArrayRef<const Value *>(), const Instruction *CxtI = nullptr); using BaseT::getVectorInstrCost; - InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, - unsigned Index); + InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, unsigned Index, + Value *Op0, Value *Op1);
/// @}
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 7d08a1654be7..5b6c7d86cebe 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -4257,7 +4257,8 @@ X86TTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
 }

 InstructionCost X86TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
-                                               unsigned Index) {
+                                               unsigned Index, Value *Op0,
+                                               Value *Op1) {
   static const CostTblEntry SLMCostTbl[] = {
      { ISD::EXTRACT_VECTOR_ELT, MVT::i8, 4 },
      { ISD::EXTRACT_VECTOR_ELT, MVT::i16, 4 },
@@ -4330,6 +4331,14 @@ InstructionCost X86TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
     }
   }

+  MVT MScalarTy = LT.second.getScalarType();
+  auto IsCheapPInsrPExtrInsertPS = [&]() {
+    return (MScalarTy == MVT::i16 && ST->hasSSE2()) ||
+           (MScalarTy.isInteger() && ST->hasSSE41()) ||
+           (MScalarTy == MVT::f32 && ST->hasSSE41() &&
+            Opcode == Instruction::InsertElement);
+  };
+
   if (Index == 0) {
     // Floating point scalars are already located in index #0.
     // Many insertions to #0 can fold away for scalar fp-ops, so let's assume
@@ -4337,6 +4346,20 @@ InstructionCost X86TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
     if (ScalarType->isFloatingPointTy())
       return RegisterFileMoveCost;

+    if (Opcode == Instruction::InsertElement &&
+        isa_and_nonnull<UndefValue>(Op0)) {
+      // Consider the gather cost to be cheap.
+      if (isa_and_nonnull<LoadInst>(Op1))
+        return RegisterFileMoveCost;
+      if (!IsCheapPInsrPExtrInsertPS()) {
+        // mov constant-to-GPR + movd/movq GPR -> XMM.
+        if (isa_and_nonnull<Constant>(Op1) && Op1->getType()->isIntegerTy())
+          return 2 + RegisterFileMoveCost;
+        // Assume movd/movq GPR -> XMM is relatively cheap on all targets.
+        return 1 + RegisterFileMoveCost;
+      }
+    }
+
     // Assume movd/movq XMM -> GPR is relatively cheap on all targets.
     if (ScalarType->isIntegerTy() && Opcode == Instruction::ExtractElement)
       return 1 + RegisterFileMoveCost;
@@ -4344,19 +4367,13 @@ InstructionCost X86TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,

int ISD = TLI->InstructionOpcodeToISD(Opcode); assert(ISD && "Unexpected vector opcode"); - MVT MScalarTy = LT.second.getScalarType(); if (ST->useSLMArithCosts()) if (auto *Entry = CostTableLookup(SLMCostTbl, ISD, MScalarTy)) return Entry->Cost + RegisterFileMoveCost;
// Assume pinsr/pextr XMM <-> GPR is relatively cheap on all targets. - if ((MScalarTy == MVT::i16 && ST->hasSSE2()) || - (MScalarTy.isInteger() && ST->hasSSE41())) - return 1 + RegisterFileMoveCost; - // Assume insertps is relatively cheap on all targets. - if (MScalarTy == MVT::f32 && ST->hasSSE41() && - Opcode == Instruction::InsertElement) + if (IsCheapPInsrPExtrInsertPS()) return 1 + RegisterFileMoveCost;
   // For extractions we just need to shuffle the element to index 0, which
@@ -4383,7 +4400,8 @@ InstructionCost X86TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
   if (Opcode == Instruction::ExtractElement && ScalarType->isPointerTy())
     RegisterFileMoveCost += 1;

-  return BaseT::getVectorInstrCost(Opcode, Val, Index) + RegisterFileMoveCost;
+  return BaseT::getVectorInstrCost(Opcode, Val, Index, Op0, Op1) +
+         RegisterFileMoveCost;
 }

 InstructionCost X86TTIImpl::getScalarizationOverhead(VectorType *Ty,
@@ -5155,7 +5173,8 @@ X86TTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,
   }

   // Add the final extract element to the cost.
-  return ReductionCost + getVectorInstrCost(Instruction::ExtractElement, Ty, 0);
+  return ReductionCost + getVectorInstrCost(Instruction::ExtractElement, Ty, 0,
+                                            nullptr, nullptr);
 }

 InstructionCost X86TTIImpl::getMinMaxCost(Type *Ty, Type *CondTy,
@@ -5455,7 +5474,8 @@ X86TTIImpl::getMinMaxReductionCost(VectorType *ValTy, VectorType *CondTy,
   }

   // Add the final extract element to the cost.
-  return MinMaxCost + getVectorInstrCost(Instruction::ExtractElement, Ty, 0);
+  return MinMaxCost + getVectorInstrCost(Instruction::ExtractElement, Ty, 0,
+                                         nullptr, nullptr);
 }

 /// Calculate the cost of materializing a 64-bit value. This helper
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.h b/llvm/lib/Target/X86/X86TargetTransformInfo.h
index 666789e160dc..c189e503f4e8 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.h
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.h
@@ -147,8 +147,8 @@ public:
                                      TTI::TargetCostKind CostKind,
                                      const Instruction *I = nullptr);
   using BaseT::getVectorInstrCost;
-  InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val,
-                                     unsigned Index);
+  InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val, unsigned Index,
+                                     Value *Op0, Value *Op1);
   InstructionCost getScalarizationOverhead(VectorType *Ty,
                                            const APInt &DemandedElts,
                                            bool Insert, bool Extract);
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index e1b52aa2f80e..8ca422cfab9f 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -6745,9 +6745,24 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E,
         // broadcast.
         assert(VecTy == FinalVecTy &&
                "No reused scalars expected for broadcast.");
-        return TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy,
-                                   /*Mask=*/std::nullopt, CostKind, /*Index=*/0,
-                                   /*SubTp=*/nullptr, /*Args=*/VL[0]);
+        const auto *It =
+            find_if(VL, [](Value *V) { return !isa<UndefValue>(V); });
+        // If all values are undefs - consider cost free.
+        if (It == VL.end())
+          return TTI::TCC_Free;
+        // Add broadcast for non-identity shuffle only.
+        bool NeedShuffle =
+            VL.front() != *It || !all_of(VL.drop_front(), UndefValue::classof);
+        InstructionCost InsertCost =
+            TTI->getVectorInstrCost(Instruction::InsertElement, VecTy,
+                                    /*Index=*/0, PoisonValue::get(VecTy), *It);
+        return InsertCost + (NeedShuffle
+                                 ? TTI->getShuffleCost(
+                                       TargetTransformInfo::SK_Broadcast, VecTy,
+                                       /*Mask=*/std::nullopt, CostKind,
+                                       /*Index=*/0,
+                                       /*SubTp=*/nullptr, /*Args=*/VL[0])
+                                 : TTI::TCC_Free);
       }
       InstructionCost ReuseShuffleCost = 0;
       if (NeedToShuffleReuses)
diff --git a/llvm/test/Analysis/CostModel/X86/loop_v2-inseltpoison.ll b/llvm/test/Analysis/CostModel/X86/loop_v2-inseltpoison.ll
index 3e0f4c11aadf..1e96f97f16e9 100644
--- a/llvm/test/Analysis/CostModel/X86/loop_v2-inseltpoison.ll
+++ b/llvm/test/Analysis/CostModel/X86/loop_v2-inseltpoison.ll
@@ -20,7 +20,7 @@ vector.body: ; preds = %vector.body, %vecto
   %5 = extractelement <2 x i64> %2, i32 1
   %6 = getelementptr inbounds i32, ptr %A, i64 %5
   %7 = load i32, ptr %4, align 4
-  ;CHECK: cost of 1 {{.*}} insert
+  ;CHECK: cost of 0 {{.*}} insert
   %8 = insertelement <2 x i32> poison, i32 %7, i32 0
   %9 = load i32, ptr %6, align 4
   ;CHECK: cost of 1 {{.*}} insert
diff --git a/llvm/test/Analysis/CostModel/X86/loop_v2.ll b/llvm/test/Analysis/CostModel/X86/loop_v2.ll
index a9cbaaf2fd63..8f67b365ca9b 100644
--- a/llvm/test/Analysis/CostModel/X86/loop_v2.ll
+++ b/llvm/test/Analysis/CostModel/X86/loop_v2.ll
@@ -20,7 +20,7 @@ vector.body: ; preds = %vector.body, %vecto
   %5 = extractelement <2 x i64> %2, i32 1
   %6 = getelementptr inbounds i32, ptr %A, i64 %5
   %7 = load i32, ptr %4, align 4
-  ;CHECK: cost of 1 {{.*}} insert
+  ;CHECK: cost of 0 {{.*}} insert
   %8 = insertelement <2 x i32> undef, i32 %7, i32 0
   %9 = load i32, ptr %6, align 4
   ;CHECK: cost of 1 {{.*}} insert
diff --git a/llvm/test/Analysis/CostModel/X86/masked-intrinsic-cost-inseltpoison.ll b/llvm/test/Analysis/CostModel/X86/masked-intrinsic-cost-inseltpoison.ll
index 381e5b630812..897344d622d0 100644
--- a/llvm/test/Analysis/CostModel/X86/masked-intrinsic-cost-inseltpoison.ll
+++ b/llvm/test/Analysis/CostModel/X86/masked-intrinsic-cost-inseltpoison.ll
@@ -1907,7 +1907,7 @@ define <16 x float> @test_gather_16f32_ra_var_mask(<16 x ptr> %ptrs, <16 x i32>

 define <16 x float> @test_gather_16f32_const_mask2(ptr %base, <16 x i32> %ind) {
 ; SSE2-LABEL: 'test_gather_16f32_const_mask2'
-; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %broadcast.splatinsert = insertelement <16 x ptr> poison, ptr %base, i32 0
+; SSE2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %broadcast.splatinsert = insertelement <16 x ptr> poison, ptr %base, i32 0
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %broadcast.splat = shufflevector <16 x ptr> %broadcast.splatinsert, <16 x ptr> poison, <16 x i32> zeroinitializer
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %sext_ind = sext <16 x i32> %ind to <16 x i64>
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %gep.random = getelementptr float, <16 x ptr> %broadcast.splat, <16 x i64> %sext_ind
@@ -1966,7 +1966,7 @@ define <16 x float> @test_gather_16f32_const_mask2(ptr %base, <16 x i32> %ind) {

 define void @test_scatter_16i32(ptr %base, <16 x i32> %ind, i16 %mask, <16 x i32>%val) {
 ; SSE2-LABEL: 'test_scatter_16i32'
-; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %broadcast.splatinsert = insertelement <16 x ptr> poison, ptr %base, i32 0
+; SSE2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %broadcast.splatinsert = insertelement <16 x ptr> poison, ptr %base, i32 0
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %broadcast.splat = shufflevector <16 x ptr> %broadcast.splatinsert, <16 x ptr> poison, <16 x i32> zeroinitializer
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %gep.random = getelementptr i32, <16 x ptr> %broadcast.splat, <16 x i32> %ind
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %imask = bitcast i16 %mask to <16 x i1>
diff --git a/llvm/test/Analysis/CostModel/X86/masked-intrinsic-cost.ll b/llvm/test/Analysis/CostModel/X86/masked-intrinsic-cost.ll
index 2fa41968e807..5f22b2e39f94 100644
--- a/llvm/test/Analysis/CostModel/X86/masked-intrinsic-cost.ll
+++ b/llvm/test/Analysis/CostModel/X86/masked-intrinsic-cost.ll
@@ -1907,7 +1907,7 @@ define <16 x float> @test_gather_16f32_ra_var_mask(<16 x ptr> %ptrs, <16 x i32>

 define <16 x float> @test_gather_16f32_const_mask2(ptr %base, <16 x i32> %ind) {
 ; SSE2-LABEL: 'test_gather_16f32_const_mask2'
-; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %broadcast.splatinsert = insertelement <16 x ptr> undef, ptr %base, i32 0
+; SSE2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %broadcast.splatinsert = insertelement <16 x ptr> undef, ptr %base, i32 0
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %broadcast.splat = shufflevector <16 x ptr> %broadcast.splatinsert, <16 x ptr> undef, <16 x i32> zeroinitializer
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %sext_ind = sext <16 x i32> %ind to <16 x i64>
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %gep.random = getelementptr float, <16 x ptr> %broadcast.splat, <16 x i64> %sext_ind
@@ -1966,7 +1966,7 @@ define <16 x float> @test_gather_16f32_const_mask2(ptr %base, <16 x i32> %ind) {

 define void @test_scatter_16i32(ptr %base, <16 x i32> %ind, i16 %mask, <16 x i32>%val) {
 ; SSE2-LABEL: 'test_scatter_16i32'
-; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %broadcast.splatinsert = insertelement <16 x ptr> undef, ptr %base, i32 0
+; SSE2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %broadcast.splatinsert = insertelement <16 x ptr> undef, ptr %base, i32 0
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %broadcast.splat = shufflevector <16 x ptr> %broadcast.splatinsert, <16 x ptr> undef, <16 x i32> zeroinitializer
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %gep.random = getelementptr i32, <16 x ptr> %broadcast.splat, <16 x i32> %ind
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %imask = bitcast i16 %mask to <16 x i1>
diff --git a/llvm/test/Analysis/CostModel/X86/vector-insert-inseltpoison.ll b/llvm/test/Analysis/CostModel/X86/vector-insert-inseltpoison.ll
index 2296b3d5b0c4..e6a4de688186 100644
--- a/llvm/test/Analysis/CostModel/X86/vector-insert-inseltpoison.ll
+++ b/llvm/test/Analysis/CostModel/X86/vector-insert-inseltpoison.ll
@@ -382,58 +382,58 @@ define i32 @insert_i64(i32 %arg) {
 define i32 @insert_i32(i32 %arg) {
 ; SSE2-LABEL: 'insert_i32'
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2i32_a = insertelement <2 x i32> poison, i32 undef, i32 %arg
-; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2i32_0 = insertelement <2 x i32> poison, i32 undef, i32 0
+; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v2i32_0 = insertelement <2 x i32> poison, i32 undef, i32 0
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2i32_1 = insertelement <2 x i32> poison, i32 undef, i32 1
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v4i32_a = insertelement <4 x i32> poison, i32 undef, i32 %arg
-; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v4i32_0 = insertelement <4 x i32> poison, i32 undef, i32 0
+; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v4i32_0 = insertelement <4 x i32> poison, i32 undef, i32 0
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v4i32_3 = insertelement <4 x i32> poison, i32 undef, i32 3
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %v8i32_a = insertelement <8 x i32> poison, i32 undef, i32 %arg
-; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v8i32_0 = insertelement <8 x i32> poison, i32 undef, i32 0
+; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v8i32_0 = insertelement <8 x i32> poison, i32 undef, i32 0
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v8i32_3 = insertelement <8 x i32> poison, i32 undef, i32 3
-; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v8i32_4 = insertelement <8 x i32> poison, i32 undef, i32 4
+; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v8i32_4 = insertelement <8 x i32> poison, i32 undef, i32 4
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v8i32_7 = insertelement <8 x i32> poison, i32 undef, i32 7
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %v16i32_a = insertelement <16 x i32> poison, i32 undef, i32 %arg
-; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v16i32_0 = insertelement <16 x i32> poison, i32 undef, i32 0
+; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v16i32_0 = insertelement <16 x i32> poison, i32 undef, i32 0
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v16i32_3 = insertelement <16 x i32> poison, i32 undef, i32 3
-; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v16i32_8 = insertelement <16 x i32> poison, i32 undef, i32 8
+; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v16i32_8 = insertelement <16 x i32> poison, i32 undef, i32 8
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v16i32_15 = insertelement <16 x i32> poison, i32 undef, i32 15
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
 ;
 ; SSE3-LABEL: 'insert_i32'
 ; SSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2i32_a = insertelement <2 x i32> poison, i32 undef, i32 %arg
-; SSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2i32_0 = insertelement <2 x i32> poison, i32 undef, i32 0
+; SSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v2i32_0 = insertelement <2 x i32> poison, i32 undef, i32 0
 ; SSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2i32_1 = insertelement <2 x i32> poison, i32 undef, i32 1
 ; SSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v4i32_a = insertelement <4 x i32> poison, i32 undef, i32 %arg
-; SSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v4i32_0 = insertelement <4 x i32> poison, i32 undef, i32 0
+; SSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v4i32_0 = insertelement <4 x i32> poison, i32 undef, i32 0
 ; SSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v4i32_3 = insertelement <4 x i32> poison, i32 undef, i32 3
 ; SSE3-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %v8i32_a = insertelement <8 x i32> poison, i32 undef, i32 %arg
-; SSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v8i32_0 = insertelement <8 x i32> poison, i32 undef, i32 0
+; SSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v8i32_0 = insertelement <8 x i32> poison, i32 undef, i32 0
 ; SSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v8i32_3 = insertelement <8 x i32> poison, i32 undef, i32 3
-; SSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v8i32_4 = insertelement <8 x i32> poison, i32 undef, i32 4
+; SSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v8i32_4 = insertelement <8 x i32> poison, i32 undef, i32 4
 ; SSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v8i32_7 = insertelement <8 x i32> poison, i32 undef, i32 7
 ; SSE3-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %v16i32_a = insertelement <16 x i32> poison, i32 undef, i32 %arg
-; SSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v16i32_0 = insertelement <16 x i32> poison, i32 undef, i32 0
+; SSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v16i32_0 = insertelement <16 x i32> poison, i32 undef, i32 0
 ; SSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v16i32_3 = insertelement <16 x i32> poison, i32 undef, i32 3
-; SSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v16i32_8 = insertelement <16 x i32> poison, i32 undef, i32 8
+; SSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v16i32_8 = insertelement <16 x i32> poison, i32 undef, i32 8
 ; SSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v16i32_15 = insertelement <16 x i32> poison, i32 undef, i32 15
 ; SSE3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
 ;
 ; SSSE3-LABEL: 'insert_i32'
 ; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2i32_a = insertelement <2 x i32> poison, i32 undef, i32 %arg
-; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2i32_0 = insertelement <2 x i32> poison, i32 undef, i32 0
+; SSSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v2i32_0 = insertelement <2 x i32> poison, i32 undef, i32 0
 ; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2i32_1 = insertelement <2 x i32> poison, i32 undef, i32 1
 ; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v4i32_a = insertelement <4 x i32> poison, i32 undef, i32 %arg
-; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v4i32_0 = insertelement <4 x i32> poison, i32 undef, i32 0
+; SSSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v4i32_0 = insertelement <4 x i32> poison, i32 undef, i32 0
 ; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v4i32_3 = insertelement <4 x i32> poison, i32 undef, i32 3
 ; SSSE3-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %v8i32_a = insertelement <8 x i32> poison, i32 undef, i32 %arg
-; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v8i32_0 = insertelement <8 x i32> poison, i32 undef, i32 0
+; SSSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v8i32_0 = insertelement <8 x i32> poison, i32 undef, i32 0
 ; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v8i32_3 = insertelement <8 x i32> poison, i32 undef, i32 3
-; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v8i32_4 = insertelement <8 x i32> poison, i32 undef, i32 4
+; SSSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v8i32_4 = insertelement <8 x i32> poison, i32 undef, i32 4
 ; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v8i32_7 = insertelement <8 x i32> poison, i32 undef, i32 7
 ; SSSE3-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %v16i32_a = insertelement <16 x i32> poison, i32 undef, i32 %arg
-; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v16i32_0 = insertelement <16 x i32> poison, i32 undef, i32 0
+; SSSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v16i32_0 = insertelement <16 x i32> poison, i32 undef, i32 0
 ; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v16i32_3 = insertelement <16 x i32> poison, i32 undef, i32 3
-; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v16i32_8 = insertelement <16 x i32> poison, i32 undef, i32 8
+; SSSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v16i32_8 = insertelement <16 x i32> poison, i32 undef, i32 8
 ; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v16i32_15 = insertelement <16 x i32> poison, i32 undef, i32 15
 ; SSSE3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
 ;
@@ -664,100 +664,100 @@ define i32 @insert_i16(i32 %arg) {
 define i32 @insert_i8(i32 %arg) {
 ; SSE2-LABEL: 'insert_i8'
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2i8_a = insertelement <2 x i8> poison, i8 undef, i32 %arg
-; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2i8_0 = insertelement <2 x i8> poison, i8 undef, i32 0
+; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v2i8_0 = insertelement <2 x i8> poison, i8 undef, i32 0
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2i8_3 = insertelement <2 x i8> poison, i8 undef, i32 1
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v4i8_a = insertelement <4 x i8> poison, i8 undef, i32 %arg
-; SSE2-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %v4i8_0 = insertelement <4 x i8> poison, i8 undef, i32 0
+; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v4i8_0 = insertelement <4 x i8> poison, i8 undef, i32 0
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %v4i8_3 = insertelement <4 x i8> poison, i8 undef, i32 3
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v8i8_a = insertelement <8 x i8> poison, i8 undef, i32 %arg
-; SSE2-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v8i8_0 = insertelement <8 x i8> poison, i8 undef, i32 0
+; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v8i8_0 = insertelement <8 x i8> poison, i8 undef, i32 0
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v8i8_7 = insertelement <8 x i8> poison, i8 undef, i32 7
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v16i8_a = insertelement <16 x i8> poison, i8 undef, i32 %arg
-; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v16i8_0 = insertelement <16 x i8> poison, i8 undef, i32 0
+; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v16i8_0 = insertelement <16 x i8> poison, i8 undef, i32 0
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v16i8_8 = insertelement <16 x i8> poison, i8 undef, i32 8
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v16i8_15 = insertelement <16 x i8> poison, i8 undef, i32 15
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %v32i8_a = insertelement <32 x i8> poison, i8 undef, i32 %arg
-; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v32i8_0 = insertelement <32 x i8> poison, i8 undef, i32 0
+; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v32i8_0 = insertelement <32 x i8> poison, i8 undef, i32 0
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v32i8_7 = insertelement <32 x i8> poison, i8 undef, i32 7
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v32i8_8 = insertelement <32 x i8> poison, i8 undef, i32 8
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v32i8_15 = insertelement <32 x i8> poison, i8 undef, i32 15
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v32i8_24 = insertelement <32 x i8> poison, i8 undef, i32 24
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v32i8_31 = insertelement <32 x i8> poison, i8 undef, i32 31
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %v64i8_a = insertelement <64 x i8> poison, i8 undef, i32 %arg
-; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v64i8_0 = insertelement <64 x i8> poison, i8 undef, i32 0
+; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v64i8_0 = insertelement <64 x i8> poison, i8 undef, i32 0
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v64i8_7 = insertelement <64 x i8> poison, i8 undef, i32 7
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v64i8_8 = insertelement <64 x i8> poison, i8 undef, i32 8
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v64i8_15 = insertelement <64 x i8> poison, i8 undef, i32 15
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v64i8_24 = insertelement <64 x i8> poison, i8 undef, i32 24
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v64i8_31 = insertelement <64 x i8> poison, i8 undef, i32 31
-; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v64i8_32 = insertelement <64 x i8> poison, i8 undef, i32 32
-; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v64i8_48 = insertelement <64 x i8> poison, i8 undef, i32 48
+; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v64i8_32 = insertelement <64 x i8> poison, i8 undef, i32 32
+; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v64i8_48 = insertelement <64 x i8> poison, i8 undef, i32 48
 ; SSE2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v64i8_63 = insertelement <64 x i8> poison, i8 undef, i32 63
</cut>