linaro-toolchain October 2021

linaro-toolchain@lists.linaro.org

16 participants
25 discussions

by Stefan Johansson A

Hello, We have been using Linaro GCC 7.5-2019.12 for the A53. As we move on to new tech there seems to be no support for "- mcpu=cortex-a55". Today, we use the aarch64-elf- toolchain. What GCC do you suggest we start using for A55 ? Thanks, Stefan

4 years, 4 months

[TCWG CI] 464.h264ref slowed down by 7% after llvm: [BasicAA] Handle known bits as ranges

by ci_notify＠linaro.org

After llvm commit fbc0c308d599fe3300ab6516650b65b41979446d Author: Nikita Popov <nikita.ppv(a)gmail.com> [BasicAA] Handle known bits as ranges the following benchmarks slowed down by more than 2%: - 464.h264ref slowed down by 7% from 10899 to 11610 perf samples - 464.h264ref:libc.so.6 slowed down by 11% from 3538 to 3922 perf samples Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI. For your convenience, we have uploaded tarballs with pre-processed source and assembly files at: - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Configuration: - Benchmark: SPEC CPU2006 - Toolchain: Clang + Glibc + LLVM Linker - Version: all components were built from their tip of trunk - Target: aarch64-linux-gnu - Compiler flags: -O2 -flto - Hardware: NVidia TX1 4x Cortex-A57 This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports. THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. This commit has regressed these CI configurations: - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Reproduce builds: <cut> mkdir investigate-llvm-fbc0c308d599fe3300ab6516650b65b41979446d cd investigate-llvm-fbc0c308d599fe3300ab6516650b65b41979446d # Fetch scripts git clone https://git.linaro.org/toolchain/jenkins-scripts # Fetch manifests and test.sh script mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail chmod +x artifacts/test.sh # Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh # Save baseline build state (which is then restored in artifacts/test.sh) mkdir -p ./bisect rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/ cd llvm # Reproduce first_bad build git checkout --detach fbc0c308d599fe3300ab6516650b65b41979446d ../artifacts/test.sh # Reproduce last_good build git checkout --detach 30a3652b6ade43504087f6e3acd8dc879055f501 ../artifacts/test.sh cd .. </cut> Full commit (up to 1000 lines): <cut> commit fbc0c308d599fe3300ab6516650b65b41979446d Author: Nikita Popov <nikita.ppv(a)gmail.com> Date: Mon Oct 25 15:47:21 2021 +0200 [BasicAA] Handle known bits as ranges BasicAA currently tries to determine that the offset is positive by checking whether all variable indices are positive based on known bits, multiplied by a positive scale. However, this is incorrect if the scale multiplication might overflow. In the modified test case the original value is positive, but may be negative after a left shift. Fix this by converting known bits into a constant range and reusing the range-based logic, which handles overflow correctly. Differential Revision: https://reviews.llvm.org/D112611 --- llvm/lib/Analysis/BasicAliasAnalysis.cpp | 51 +++++----------------- .../test/Analysis/BasicAA/assume-index-positive.ll | 4 +- 2 files changed, 12 insertions(+), 43 deletions(-) diff --git a/llvm/lib/Analysis/BasicAliasAnalysis.cpp b/llvm/lib/Analysis/BasicAliasAnalysis.cpp index 0305732ca5d5..8cf947c43bf4 100644 --- a/llvm/lib/Analysis/BasicAliasAnalysis.cpp +++ b/llvm/lib/Analysis/BasicAliasAnalysis.cpp @@ -318,15 +318,6 @@ struct CastedValue { return N; } - KnownBits evaluateWith(KnownBits N) const { - assert(N.getBitWidth() == V->getType()->getPrimitiveSizeInBits() && - "Incompatible bit width"); - if (TruncBits) N = N.trunc(N.getBitWidth() - TruncBits); - if (SExtBits) N = N.sext(N.getBitWidth() + SExtBits); - if (ZExtBits) N = N.zext(N.getBitWidth() + ZExtBits); - return N; - } - ConstantRange evaluateWith(ConstantRange N) const { assert(N.getBitWidth() == V->getType()->getPrimitiveSizeInBits() && "Incompatible bit width"); @@ -1250,8 +1241,6 @@ AliasResult BasicAAResult::aliasGEP( if (!DecompGEP1.VarIndices.empty()) { APInt GCD; - bool AllNonNegative = DecompGEP1.Offset.isNonNegative(); - bool AllNonPositive = DecompGEP1.Offset.isNonPositive(); ConstantRange OffsetRange = ConstantRange(DecompGEP1.Offset); for (unsigned i = 0, e = DecompGEP1.VarIndices.size(); i != e; ++i) { const VariableGEPIndex &Index = DecompGEP1.VarIndices[i]; @@ -1266,24 +1255,19 @@ AliasResult BasicAAResult::aliasGEP( else GCD = APIntOps::GreatestCommonDivisor(GCD, ScaleForGCD.abs()); - if (AllNonNegative || AllNonPositive) { - KnownBits Known = Index.Val.evaluateWith( - computeKnownBits(Index.Val.V, DL, 0, &AC, Index.CxtI, DT)); - bool SignKnownZero = Known.isNonNegative(); - bool SignKnownOne = Known.isNegative(); - AllNonNegative &= (SignKnownZero && Scale.isNonNegative()) || - (SignKnownOne && Scale.isNonPositive()); - AllNonPositive &= (SignKnownZero && Scale.isNonPositive()) || - (SignKnownOne && Scale.isNonNegative()); - } + ConstantRange CR = + computeConstantRange(Index.Val.V, true, &AC, Index.CxtI); + KnownBits Known = + computeKnownBits(Index.Val.V, DL, 0, &AC, Index.CxtI, DT); + CR = CR.intersectWith( + ConstantRange::fromKnownBits(Known, /* Signed */ true), + ConstantRange::Signed); assert(OffsetRange.getBitWidth() == Scale.getBitWidth() && "Bit widths are normalized to MaxPointerSize"); - OffsetRange = OffsetRange.add(Index.Val - .evaluateWith(computeConstantRange( - Index.Val.V, true, &AC, Index.CxtI)) - .sextOrTrunc(OffsetRange.getBitWidth()) - .smul_fast(ConstantRange(Scale))); + OffsetRange = OffsetRange.add( + Index.Val.evaluateWith(CR).sextOrTrunc(OffsetRange.getBitWidth()) + .smul_fast(ConstantRange(Scale))); } // We now have accesses at two offsets from the same base: @@ -1300,21 +1284,6 @@ AliasResult BasicAAResult::aliasGEP( (GCD - ModOffset).uge(V1Size.getValue())) return AliasResult::NoAlias; - // If we know all the variables are non-negative, then the total offset is - // also non-negative and >= DecompGEP1.Offset. We have the following layout: - // [0, V2Size) ... [TotalOffset, TotalOffer+V1Size] - // If DecompGEP1.Offset >= V2Size, the accesses don't alias. - if (AllNonNegative && V2Size.hasValue() && - DecompGEP1.Offset.uge(V2Size.getValue())) - return AliasResult::NoAlias; - // Similarly, if the variables are non-positive, then the total offset is - // also non-positive and <= DecompGEP1.Offset. We have the following layout: - // [TotalOffset, TotalOffset+V1Size) ... [0, V2Size) - // If -DecompGEP1.Offset >= V1Size, the accesses don't alias. - if (AllNonPositive && V1Size.hasValue() && - (-DecompGEP1.Offset).uge(V1Size.getValue())) - return AliasResult::NoAlias; - if (V1Size.hasValue() && V2Size.hasValue()) { // Compute ranges of potentially accessed bytes for both accesses. If the // interseciton is empty, there can be no overlap. diff --git a/llvm/test/Analysis/BasicAA/assume-index-positive.ll b/llvm/test/Analysis/BasicAA/assume-index-positive.ll index 451592067f4b..a53fff2c6009 100644 --- a/llvm/test/Analysis/BasicAA/assume-index-positive.ll +++ b/llvm/test/Analysis/BasicAA/assume-index-positive.ll @@ -130,12 +130,12 @@ define void @symmetry([0 x i8]* %ptr, i32 %a, i32 %b, i32 %c) { ret void } -; TODO: %ptr.neg and %ptr.shl may alias, as the shl renders the previously +; %ptr.neg and %ptr.shl may alias, as the shl renders the previously ; non-negative value potentially negative. define void @shl_of_non_negative(i8* %ptr, i64 %a) { ; CHECK-LABEL: Function: shl_of_non_negative ; CHECK: NoAlias: i8* %ptr.a, i8* %ptr.neg -; CHECK: NoAlias: i8* %ptr.neg, i8* %ptr.shl +; CHECK: MayAlias: i8* %ptr.neg, i8* %ptr.shl %a.cmp = icmp sge i64 %a, 0 call void @llvm.assume(i1 %a.cmp) %ptr.neg = getelementptr i8, i8* %ptr, i64 -2 </cut>

4 years, 4 months

[TCWG CI] 400.perlbench:[.] S_find_byclass slowed down by 12% after llvm: [ORC] Call ExecutorProcessControl::disconnect in unit tests that require it.

by ci_notify＠linaro.org

After llvm commit adf55ac6657693f7bfbe3087b599b4031a765a44 Author: Lang Hames <lhames(a)gmail.com> [ORC] Call ExecutorProcessControl::disconnect in unit tests that require it. the following hot functions slowed down by more than 10% (but their benchmarks slowed down by less than 2%): - 400.perlbench:[.] S_find_byclass slowed down by 12% from 644 to 721 perf samples Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI. For your convenience, we have uploaded tarballs with pre-processed source and assembly files at: - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Configuration: - Benchmark: SPEC CPU2006 - Toolchain: Clang + Glibc + LLVM Linker - Version: all components were built from their tip of trunk - Target: aarch64-linux-gnu - Compiler flags: -O2 - Hardware: NVidia TX1 4x Cortex-A57 This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports. THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. This commit has regressed these CI configurations: - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2 First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Reproduce builds: <cut> mkdir investigate-llvm-adf55ac6657693f7bfbe3087b599b4031a765a44 cd investigate-llvm-adf55ac6657693f7bfbe3087b599b4031a765a44 # Fetch scripts git clone https://git.linaro.org/toolchain/jenkins-scripts # Fetch manifests and test.sh script mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail chmod +x artifacts/test.sh # Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh # Save baseline build state (which is then restored in artifacts/test.sh) mkdir -p ./bisect rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/ cd llvm # Reproduce first_bad build git checkout --detach adf55ac6657693f7bfbe3087b599b4031a765a44 ../artifacts/test.sh # Reproduce last_good build git checkout --detach f526ee5b8517b60620cd03bb3e5945ed69d6bfaa ../artifacts/test.sh cd .. </cut> Full commit (up to 1000 lines): <cut> commit adf55ac6657693f7bfbe3087b599b4031a765a44 Author: Lang Hames <lhames(a)gmail.com> Date: Tue Oct 12 14:55:49 2021 -0700 [ORC] Call ExecutorProcessControl::disconnect in unit tests that require it. Another follow-up to 2815ed57e3c and 19b4e3cfc6a. For unit tests that don't use an ExecutionSession we need to call ExecutorProcessControl::disconnect directly to wait for the dispatcher to shut down. https://llvm.org/PR52153 --- .../ExecutionEngine/Orc/EPCGenericJITLinkMemoryManagerTest.cpp | 2 ++ llvm/unittests/ExecutionEngine/Orc/EPCGenericMemoryAccessTest.cpp | 2 ++ 2 files changed, 4 insertions(+) diff --git a/llvm/unittests/ExecutionEngine/Orc/EPCGenericJITLinkMemoryManagerTest.cpp b/llvm/unittests/ExecutionEngine/Orc/EPCGenericJITLinkMemoryManagerTest.cpp index f2b157e424b6..a95435aec2a3 100644 --- a/llvm/unittests/ExecutionEngine/Orc/EPCGenericJITLinkMemoryManagerTest.cpp +++ b/llvm/unittests/ExecutionEngine/Orc/EPCGenericJITLinkMemoryManagerTest.cpp @@ -134,6 +134,8 @@ TEST(EPCGenericJITLinkMemoryManagerTest, AllocFinalizeFree) { auto Err2 = MemMgr->deallocate(std::move(*FA)); EXPECT_THAT_ERROR(std::move(Err2), Succeeded()); + + cantFail(SelfEPC->disconnect()); } } // namespace diff --git a/llvm/unittests/ExecutionEngine/Orc/EPCGenericMemoryAccessTest.cpp b/llvm/unittests/ExecutionEngine/Orc/EPCGenericMemoryAccessTest.cpp index 78024644ca8b..beb0fefa094a 100644 --- a/llvm/unittests/ExecutionEngine/Orc/EPCGenericMemoryAccessTest.cpp +++ b/llvm/unittests/ExecutionEngine/Orc/EPCGenericMemoryAccessTest.cpp @@ -93,6 +93,8 @@ TEST(EPCGenericMemoryAccessTest, MemWrites) { {{pointerToJITTargetAddress(&Test_Buffer), TestMsg}}); EXPECT_THAT_ERROR(std::move(Err5), Succeeded()); EXPECT_EQ(StringRef(Test_Buffer, TestMsg.size()), TestMsg); + + cantFail(SelfEPC->disconnect()); } } // namespace </cut>

4 years, 5 months

[TCWG CI] 400.perlbench slowed down by 6% after llvm: [AArch64] Remove redundant ORRWrs which is generated by zero-extend

by ci_notify＠linaro.org

After llvm commit a502436259307f95e9c95437d8a1d2d07294341c Author: Jingu Kang <jingu.kang(a)arm.com> [AArch64] Remove redundant ORRWrs which is generated by zero-extend the following benchmarks slowed down by more than 2%: - 400.perlbench slowed down by 6% from 9792 to 10354 perf samples - 464.h264ref slowed down by 4% from 11023 to 11509 perf samples - 464.h264ref:[.] FastFullPelBlockMotionSearch slowed down by 33% from 1634 to 2180 perf samples Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI. For your convenience, we have uploaded tarballs with pre-processed source and assembly files at: - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Configuration: - Benchmark: SPEC CPU2006 - Toolchain: Clang + Glibc + LLVM Linker - Version: all components were built from their tip of trunk - Target: aarch64-linux-gnu - Compiler flags: -O3 - Hardware: NVidia TX1 4x Cortex-A57 This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports. THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. This commit has regressed these CI configurations: - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3 First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Reproduce builds: <cut> mkdir investigate-llvm-a502436259307f95e9c95437d8a1d2d07294341c cd investigate-llvm-a502436259307f95e9c95437d8a1d2d07294341c # Fetch scripts git clone https://git.linaro.org/toolchain/jenkins-scripts # Fetch manifests and test.sh script mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail chmod +x artifacts/test.sh # Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh # Save baseline build state (which is then restored in artifacts/test.sh) mkdir -p ./bisect rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/ cd llvm # Reproduce first_bad build git checkout --detach a502436259307f95e9c95437d8a1d2d07294341c ../artifacts/test.sh # Reproduce last_good build git checkout --detach 6fa1b4ff4b05b9b9a432f7310802255c160c8f4f ../artifacts/test.sh cd .. </cut> Full commit (up to 1000 lines): <cut> commit a502436259307f95e9c95437d8a1d2d07294341c Author: Jingu Kang <jingu.kang(a)arm.com> Date: Thu Sep 30 15:39:10 2021 +0100 [AArch64] Remove redundant ORRWrs which is generated by zero-extend %3:gpr32 = ORRWrs $wzr, %2, 0 %4:gpr64 = SUBREG_TO_REG 0, %3, %subreg.sub_32 If AArch64's 32-bit form of instruction defines the source operand of ORRWrs, we can remove the ORRWrs because the upper 32 bits of the source operand are set to zero. Differential Revision: https://reviews.llvm.org/D110841 --- llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp | 57 ++++++++++++++++ .../test/CodeGen/AArch64/arm64-assert-zext-sext.ll | 51 +++++++++++--- .../AArch64/redundant-mov-from-zero-extend.ll | 79 ++++++++++++++++++++++ .../AArch64/redundant-orrwrs-from-zero-extend.mir | 69 +++++++++++++++++++ 4 files changed, 248 insertions(+), 8 deletions(-) diff --git a/llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp b/llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp index 42f683613698..42db18332f1c 100644 --- a/llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp +++ b/llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp @@ -15,6 +15,16 @@ // later. In this case, we could try to split the constant operand of mov // instruction into two bitmask immediates. It makes two AND instructions // intead of multiple `mov` + `and` instructions. +// +// 2. Remove redundant ORRWrs which is generated by zero-extend. +// +// %3:gpr32 = ORRWrs $wzr, %2, 0 +// %4:gpr64 = SUBREG_TO_REG 0, %3, %subreg.sub_32 +// +// If AArch64's 32-bit form of instruction defines the source operand of +// ORRWrs, we can remove the ORRWrs because the upper 32 bits of the source +// operand are set to zero. +// //===----------------------------------------------------------------------===// #include "AArch64ExpandImm.h" @@ -44,6 +54,8 @@ struct AArch64MIPeepholeOpt : public MachineFunctionPass { template <typename T> bool visitAND(MachineInstr &MI, SmallSetVector<MachineInstr *, 8> &ToBeRemoved); + bool visitORR(MachineInstr &MI, + SmallSetVector<MachineInstr *, 8> &ToBeRemoved); bool runOnMachineFunction(MachineFunction &MF) override; StringRef getPassName() const override { @@ -196,6 +208,49 @@ bool AArch64MIPeepholeOpt::visitAND( return true; } +bool AArch64MIPeepholeOpt::visitORR( + MachineInstr &MI, SmallSetVector<MachineInstr *, 8> &ToBeRemoved) { + // Check this ORR comes from below zero-extend pattern. + // + // def : Pat<(i64 (zext GPR32:$src)), + // (SUBREG_TO_REG (i32 0), (ORRWrs WZR, GPR32:$src, 0), sub_32)>; + if (MI.getOperand(3).getImm() != 0) + return false; + + if (MI.getOperand(1).getReg() != AArch64::WZR) + return false; + + MachineInstr *SrcMI = MRI->getUniqueVRegDef(MI.getOperand(2).getReg()); + if (!SrcMI) + return false; + + // From https://developer.arm.com/documentation/dui0801/b/BABBGCAC + // + // When you use the 32-bit form of an instruction, the upper 32 bits of the + // source registers are ignored and the upper 32 bits of the destination + // register are set to zero. + // + // If AArch64's 32-bit form of instruction defines the source operand of + // zero-extend, we do not need the zero-extend. Let's check the MI's opcode is + // real AArch64 instruction and if it is not, do not process the opcode + // conservatively. + if (SrcMI->getOpcode() <= TargetOpcode::GENERIC_OP_END) + return false; + + Register DefReg = MI.getOperand(0).getReg(); + Register SrcReg = MI.getOperand(2).getReg(); + MRI->replaceRegWith(DefReg, SrcReg); + MRI->clearKillFlags(SrcReg); + // replaceRegWith changes MI's definition register. Keep it for SSA form until + // deleting MI. + MI.getOperand(0).setReg(DefReg); + ToBeRemoved.insert(&MI); + + LLVM_DEBUG({ dbgs() << "Removed: " << MI << "\n"; }); + + return true; +} + bool AArch64MIPeepholeOpt::runOnMachineFunction(MachineFunction &MF) { if (skipFunction(MF.getFunction())) return false; @@ -221,6 +276,8 @@ bool AArch64MIPeepholeOpt::runOnMachineFunction(MachineFunction &MF) { case AArch64::ANDXrr: Changed = visitAND<uint64_t>(MI, ToBeRemoved); break; + case AArch64::ORRWrs: + Changed = visitORR(MI, ToBeRemoved); } } } diff --git a/llvm/test/CodeGen/AArch64/arm64-assert-zext-sext.ll b/llvm/test/CodeGen/AArch64/arm64-assert-zext-sext.ll index 07dd0b4ec56b..df4a9010dfa9 100644 --- a/llvm/test/CodeGen/AArch64/arm64-assert-zext-sext.ll +++ b/llvm/test/CodeGen/AArch64/arm64-assert-zext-sext.ll @@ -1,9 +1,32 @@ +; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py ; RUN: llc -O2 -mtriple=aarch64-linux-gnu < %s | FileCheck %s declare i32 @test(i32) local_unnamed_addr declare i32 @test1(i64) local_unnamed_addr define i32 @assertzext(i32 %n, i1 %a, i32* %b) local_unnamed_addr { +; CHECK-LABEL: assertzext: +; CHECK: // %bb.0: // %entry +; CHECK-NEXT: stp x30, x19, [sp, #-16]! // 16-byte Folded Spill +; CHECK-NEXT: .cfi_def_cfa_offset 16 +; CHECK-NEXT: .cfi_offset w19, -8 +; CHECK-NEXT: .cfi_offset w30, -16 +; CHECK-NEXT: mov w8, #33066 +; CHECK-NEXT: tst w1, #0x1 +; CHECK-NEXT: movk w8, #28567, lsl #16 +; CHECK-NEXT: csel w19, wzr, w8, ne +; CHECK-NEXT: cbnz w0, .LBB0_2 +; CHECK-NEXT: // %bb.1: // %if.then +; CHECK-NEXT: mov w19, wzr +; CHECK-NEXT: str wzr, [x2] +; CHECK-NEXT: .LBB0_2: // %if.end +; CHECK-NEXT: mov w0, w19 +; CHECK-NEXT: bl test +; CHECK-NEXT: mov w0, w19 +; CHECK-NEXT: bl test1 +; CHECK-NEXT: mov w0, wzr +; CHECK-NEXT: ldp x30, x19, [sp], #16 // 16-byte Folded Reload +; CHECK-NEXT: ret entry: %i = select i1 %a, i64 0, i64 66296709418 %conv.i = trunc i64 %i to i32 @@ -20,14 +43,29 @@ if.end: ; preds = %if.then, %entry %i2 = sext i32 %i1 to i64 %call1.i = tail call i32 @test1(i64 %i2) ret i32 0 -; CHECK: // %if.end -; CHECK: mov w{{[0-9]+}}, w{{[0-9]+}} -; CHECK: bl test -; CHECK: mov w{{[0-9]+}}, w{{[0-9]+}} -; CHECK: bl test1 } define i32 @assertsext(i32 %n, i8 %a) local_unnamed_addr { +; CHECK-LABEL: assertsext: +; CHECK: // %bb.0: // %entry +; CHECK-NEXT: cbz w0, .LBB1_2 +; CHECK-NEXT: // %bb.1: +; CHECK-NEXT: mov x0, xzr +; CHECK-NEXT: b .LBB1_3 +; CHECK-NEXT: .LBB1_2: // %if.then +; CHECK-NEXT: mov x9, #24575 +; CHECK-NEXT: sxtb w8, w1 +; CHECK-NEXT: movk x9, #15873, lsl #16 +; CHECK-NEXT: movk x9, #474, lsl #32 +; CHECK-NEXT: udiv x0, x9, x8 +; CHECK-NEXT: .LBB1_3: // %if.end +; CHECK-NEXT: str x30, [sp, #-16]! // 8-byte Folded Spill +; CHECK-NEXT: .cfi_def_cfa_offset 16 +; CHECK-NEXT: .cfi_offset w30, -16 +; CHECK-NEXT: bl test1 +; CHECK-NEXT: mov w0, wzr +; CHECK-NEXT: ldr x30, [sp], #16 // 8-byte Folded Reload +; CHECK-NEXT: ret entry: %conv.i = sext i8 %a to i32 %cmp = icmp eq i32 %n, 0 @@ -37,9 +75,6 @@ if.then: ; preds = %entry %conv1 = zext i32 %conv.i to i64 %div = udiv i64 2036854775807, %conv1 br label %if.end -; CHECK: // %if.then -; CHECK: mov w{{[0-9]+}}, w{{[0-9]+}} -; CHECK: udiv x{{[0-9]+}}, x{{[0-9]+}}, x{{[0-9]+}} if.end: ; preds = %if.then, %entry %i1 = phi i64 [ %div, %if.then ], [ 0, %entry ] diff --git a/llvm/test/CodeGen/AArch64/redundant-mov-from-zero-extend.ll b/llvm/test/CodeGen/AArch64/redundant-mov-from-zero-extend.ll new file mode 100644 index 000000000000..42b9838acef2 --- /dev/null +++ b/llvm/test/CodeGen/AArch64/redundant-mov-from-zero-extend.ll @@ -0,0 +1,79 @@ +; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py +; RUN: llc -O3 -mtriple=aarch64-linux-gnu < %s | FileCheck %s + +define i32 @test(i32 %input, i32 %n, i32 %a) { +; CHECK-LABEL: test: +; CHECK: // %bb.0: // %entry +; CHECK-NEXT: cbz w1, .LBB0_2 +; CHECK-NEXT: // %bb.1: +; CHECK-NEXT: mov w0, wzr +; CHECK-NEXT: ret +; CHECK-NEXT: .LBB0_2: // %bb.0 +; CHECK-NEXT: add w8, w0, w1 +; CHECK-NEXT: mov w0, #100 +; CHECK-NEXT: cmp w8, #4 +; CHECK-NEXT: b.hi .LBB0_5 +; CHECK-NEXT: // %bb.3: // %bb.0 +; CHECK-NEXT: adrp x9, .LJTI0_0 +; CHECK-NEXT: add x9, x9, :lo12:.LJTI0_0 +; CHECK-NEXT: adr x10, .LBB0_4 +; CHECK-NEXT: ldrb w11, [x9, x8] +; CHECK-NEXT: add x10, x10, x11, lsl #2 +; CHECK-NEXT: br x10 +; CHECK-NEXT: .LBB0_4: // %sw.bb +; CHECK-NEXT: add w0, w2, #1 +; CHECK-NEXT: ret +; CHECK-NEXT: .LBB0_5: // %bb.0 +; CHECK-NEXT: cmp w8, #200 +; CHECK-NEXT: b.ne .LBB0_10 +; CHECK-NEXT: // %bb.6: // %sw.bb7 +; CHECK-NEXT: add w0, w2, #7 +; CHECK-NEXT: ret +; CHECK-NEXT: .LBB0_7: // %sw.bb1 +; CHECK-NEXT: add w0, w2, #3 +; CHECK-NEXT: ret +; CHECK-NEXT: .LBB0_8: // %sw.bb3 +; CHECK-NEXT: add w0, w2, #4 +; CHECK-NEXT: ret +; CHECK-NEXT: .LBB0_9: // %sw.bb5 +; CHECK-NEXT: add w0, w2, #5 +; CHECK-NEXT: .LBB0_10: // %return +; CHECK-NEXT: ret +entry: + %b = add nsw i32 %input, %n + %cmp = icmp eq i32 %n, 0 + br i1 %cmp, label %bb.0, label %return + +bb.0: + switch i32 %b, label %return [ + i32 0, label %sw.bb + i32 1, label %sw.bb1 + i32 2, label %sw.bb3 + i32 4, label %sw.bb5 + i32 200, label %sw.bb7 + ] + +sw.bb: + %add = add nsw i32 %a, 1 + br label %return + +sw.bb1: + %add2 = add nsw i32 %a, 3 + br label %return + +sw.bb3: + %add4 = add nsw i32 %a, 4 + br label %return + +sw.bb5: + %add6 = add nsw i32 %a, 5 + br label %return + +sw.bb7: + %add8 = add nsw i32 %a, 7 + br label %return + +return: + %retval.0 = phi i32 [ %add8, %sw.bb7 ], [ %add6, %sw.bb5 ], [ %add4, %sw.bb3 ], [ %add2, %sw.bb1 ], [ %add, %sw.bb ], [ 100, %bb.0 ], [ 0, %entry ] + ret i32 %retval.0 +} diff --git a/llvm/test/CodeGen/AArch64/redundant-orrwrs-from-zero-extend.mir b/llvm/test/CodeGen/AArch64/redundant-orrwrs-from-zero-extend.mir new file mode 100644 index 000000000000..37540dde048f --- /dev/null +++ b/llvm/test/CodeGen/AArch64/redundant-orrwrs-from-zero-extend.mir @@ -0,0 +1,69 @@ +# RUN: llc -mtriple=aarch64 -run-pass aarch64-mi-peephole-opt -verify-machineinstrs -o - %s | FileCheck %s +--- +name: test1 +# CHECK-LABEL: name: test1 +alignment: 4 +tracksRegLiveness: true +registers: + - { id: 0, class: gpr32 } + - { id: 1, class: gpr32 } + - { id: 2, class: gpr32 } + - { id: 3, class: gpr32 } + - { id: 4, class: gpr64 } +body: | + bb.0: + liveins: $w0, $w1 + + %0:gpr32 = COPY $w0 + %1:gpr32 = COPY $w1 + B %bb.1 + + bb.1: + %2:gpr32 = nsw ADDWrr %0, %1 + B %bb.2 + + bb.2: + ; CHECK-LABEL: bb.2: + ; CHECK-NOT: %3:gpr32 = ORRWrs $wzr, %2, 0 + ; The ORRWrs should be removed. + %3:gpr32 = ORRWrs $wzr, %2, 0 + %4:gpr64 = SUBREG_TO_REG 0, %3, %subreg.sub_32 + B %bb.3 + + bb.3: + $x0 = COPY %4 + RET_ReallyLR implicit $x0 +... +--- +name: test2 +# CHECK-LABEL: name: test2 +alignment: 4 +tracksRegLiveness: true +registers: + - { id: 0, class: gpr64 } + - { id: 1, class: gpr32 } + - { id: 2, class: gpr32 } + - { id: 3, class: gpr64 } +body: | + bb.0: + liveins: $x0 + + %0:gpr64 = COPY $x0 + B %bb.1 + + bb.1: + %1:gpr32 = EXTRACT_SUBREG %0, %subreg.sub_32 + B %bb.2 + + bb.2: + ; CHECK-LABEL: bb.2: + ; CHECK: %2:gpr32 = ORRWrs $wzr, %1, 0 + ; The ORRWrs should not be removed. + %2:gpr32 = ORRWrs $wzr, %1, 0 + %3:gpr64 = SUBREG_TO_REG 0, %2, %subreg.sub_32 + B %bb.3 + + bb.3: + $x0 = COPY %3 + RET_ReallyLR implicit $x0 +... </cut>

4 years, 5 months

[TCWG CI] 433.milc:[.] mult_su3_mat_vec slowed down by 11% after llvm: [AMDGPU] Enable load clustering in the post-RA scheduler

by ci_notify＠linaro.org

After llvm commit 66e13c7f439cf162d7ed1d25883e71a5755ac7ec Author: Jay Foad <jay.foad(a)amd.com> [AMDGPU] Enable load clustering in the post-RA scheduler the following hot functions slowed down by more than 10% (but their benchmarks slowed down by less than 2%): - 433.milc:[.] mult_su3_mat_vec slowed down by 11% from 2163 to 2391 perf samples Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI. For your convenience, we have uploaded tarballs with pre-processed source and assembly files at: - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Configuration: - Benchmark: SPEC CPU2006 - Toolchain: Clang + Glibc + LLVM Linker - Version: all components were built from their tip of trunk - Target: aarch64-linux-gnu - Compiler flags: -O2 - Hardware: NVidia TX1 4x Cortex-A57 This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports. THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. This commit has regressed these CI configurations: - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2 First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Reproduce builds: <cut> mkdir investigate-llvm-66e13c7f439cf162d7ed1d25883e71a5755ac7ec cd investigate-llvm-66e13c7f439cf162d7ed1d25883e71a5755ac7ec # Fetch scripts git clone https://git.linaro.org/toolchain/jenkins-scripts # Fetch manifests and test.sh script mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail chmod +x artifacts/test.sh # Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh # Save baseline build state (which is then restored in artifacts/test.sh) mkdir -p ./bisect rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/ cd llvm # Reproduce first_bad build git checkout --detach 66e13c7f439cf162d7ed1d25883e71a5755ac7ec ../artifacts/test.sh # Reproduce last_good build git checkout --detach 838b4a533e6853d44e0c6d1977bcf0b06557d4ab ../artifacts/test.sh cd .. </cut> Full commit (up to 1000 lines): <cut> commit 66e13c7f439cf162d7ed1d25883e71a5755ac7ec Author: Jay Foad <jay.foad(a)amd.com> Date: Tue Oct 12 15:39:43 2021 +0100 [AMDGPU] Enable load clustering in the post-RA scheduler This has a couple of benefits: 1. It can sometimes fix clusters that got broken apart when the register allocator inserted a copy. 2. Post-RA scheduling does not have to worry about increasing register pressure, which in some cases gives it more freedom to reorder instructions. Testing on a collection of 10,000 graphics shaders compiled for gfx1010 showed: - The average length of each run of one or more load instructions increased by about 1%. - The number of runs of two or more load instructions increased by about 4%. --- llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp | 1 + llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll | 5 ++--- llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll | 5 +++-- llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll | 4 ++-- llvm/test/CodeGen/AMDGPU/idiv-licm.ll | 2 +- llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll | 6 +++--- llvm/test/CodeGen/AMDGPU/sdiv64.ll | 2 +- llvm/test/CodeGen/AMDGPU/srem64.ll | 2 +- llvm/test/CodeGen/AMDGPU/udiv64.ll | 2 +- llvm/test/CodeGen/AMDGPU/urem64.ll | 2 +- 10 files changed, 16 insertions(+), 15 deletions(-) diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp index b0902465c592..7b2d56e88b5f 100644 --- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp +++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp @@ -825,6 +825,7 @@ public: createPostMachineScheduler(MachineSchedContext *C) const override { ScheduleDAGMI *DAG = createGenericSchedPostRA(C); const GCNSubtarget &ST = C->MF->getSubtarget<GCNSubtarget>(); + DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI)); DAG->addMutation(ST.createFillMFMAShadowMutation(DAG->TII)); return DAG; } diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll index fa500054e058..804dea705011 100644 --- a/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll +++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll @@ -185,21 +185,20 @@ define i128 @extractelement_vgpr_v4i128_vgpr_idx(<4 x i128> addrspace(1)* %ptr, ; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) ; GFX8-NEXT: v_add_u32_e32 v3, vcc, 16, v0 ; GFX8-NEXT: v_addc_u32_e32 v4, vcc, 0, v1, vcc -; GFX8-NEXT: flat_load_dwordx4 v[8:11], v[0:1] ; GFX8-NEXT: flat_load_dwordx4 v[4:7], v[3:4] +; GFX8-NEXT: flat_load_dwordx4 v[8:11], v[0:1] ; GFX8-NEXT: v_lshlrev_b32_e32 v16, 1, v2 ; GFX8-NEXT: v_add_u32_e32 v17, vcc, 1, v16 ; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v17 ; GFX8-NEXT: v_cmp_eq_u32_e64 s[4:5], 1, v16 ; GFX8-NEXT: v_cmp_eq_u32_e64 s[6:7], 6, v16 ; GFX8-NEXT: v_cmp_eq_u32_e64 s[8:9], 7, v16 -; GFX8-NEXT: s_waitcnt vmcnt(1) +; GFX8-NEXT: s_waitcnt vmcnt(0) ; GFX8-NEXT: v_cndmask_b32_e64 v2, v8, v10, s[4:5] ; GFX8-NEXT: v_cndmask_b32_e64 v3, v9, v11, s[4:5] ; GFX8-NEXT: v_cndmask_b32_e32 v8, v8, v10, vcc ; GFX8-NEXT: v_cndmask_b32_e32 v9, v9, v11, vcc ; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 2, v16 -; GFX8-NEXT: s_waitcnt vmcnt(0) ; GFX8-NEXT: v_cndmask_b32_e32 v2, v2, v4, vcc ; GFX8-NEXT: v_cndmask_b32_e32 v3, v3, v5, vcc ; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 2, v17 diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll index 133a224b7437..bd4ecd3a17e5 100644 --- a/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll +++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll @@ -830,8 +830,8 @@ define amdgpu_kernel void @udivrem_v4i32(<4 x i32> addrspace(1)* %out0, <4 x i32 ; GFX9-LABEL: udivrem_v4i32: ; GFX9: ; %bb.0: ; GFX9-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x20 -; GFX9-NEXT: v_mov_b32_e32 v2, 0x4f7ffffe ; GFX9-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x10 +; GFX9-NEXT: v_mov_b32_e32 v2, 0x4f7ffffe ; GFX9-NEXT: s_waitcnt lgkmcnt(0) ; GFX9-NEXT: v_cvt_f32_u32_e32 v0, s0 ; GFX9-NEXT: v_cvt_f32_u32_e32 v1, s1 @@ -926,9 +926,10 @@ define amdgpu_kernel void @udivrem_v4i32(<4 x i32> addrspace(1)* %out0, <4 x i32 ; ; GFX10-LABEL: udivrem_v4i32: ; GFX10: ; %bb.0: +; GFX10-NEXT: s_clause 0x1 ; GFX10-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x20 -; GFX10-NEXT: v_mov_b32_e32 v4, 0x4f7ffffe ; GFX10-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x10 +; GFX10-NEXT: v_mov_b32_e32 v4, 0x4f7ffffe ; GFX10-NEXT: v_mov_b32_e32 v8, 0 ; GFX10-NEXT: s_waitcnt lgkmcnt(0) ; GFX10-NEXT: v_cvt_f32_u32_e32 v0, s8 diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll index b033497d3aed..81b055166dd2 100644 --- a/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll +++ b/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll @@ -11236,8 +11236,8 @@ define amdgpu_kernel void @sdiv_i64_pow2_shl_denom(i64 addrspace(1)* %out, i64 % ; GFX6-LABEL: sdiv_i64_pow2_shl_denom: ; GFX6: ; %bb.0: ; GFX6-NEXT: s_load_dword s4, s[0:1], 0xd -; GFX6-NEXT: s_mov_b64 s[2:3], 0x1000 ; GFX6-NEXT: s_load_dwordx4 s[8:11], s[0:1], 0x9 +; GFX6-NEXT: s_mov_b64 s[2:3], 0x1000 ; GFX6-NEXT: s_mov_b32 s7, 0xf000 ; GFX6-NEXT: s_mov_b32 s6, -1 ; GFX6-NEXT: s_waitcnt lgkmcnt(0) @@ -13358,8 +13358,8 @@ define amdgpu_kernel void @srem_i64_pow2_shl_denom(i64 addrspace(1)* %out, i64 % ; GFX6-LABEL: srem_i64_pow2_shl_denom: ; GFX6: ; %bb.0: ; GFX6-NEXT: s_load_dword s4, s[0:1], 0xd -; GFX6-NEXT: s_mov_b64 s[2:3], 0x1000 ; GFX6-NEXT: s_load_dwordx4 s[8:11], s[0:1], 0x9 +; GFX6-NEXT: s_mov_b64 s[2:3], 0x1000 ; GFX6-NEXT: s_mov_b32 s7, 0xf000 ; GFX6-NEXT: s_mov_b32 s6, -1 ; GFX6-NEXT: s_waitcnt lgkmcnt(0) diff --git a/llvm/test/CodeGen/AMDGPU/idiv-licm.ll b/llvm/test/CodeGen/AMDGPU/idiv-licm.ll index fb9348bae000..9ea8f101b5e9 100644 --- a/llvm/test/CodeGen/AMDGPU/idiv-licm.ll +++ b/llvm/test/CodeGen/AMDGPU/idiv-licm.ll @@ -491,8 +491,8 @@ define amdgpu_kernel void @urem16_invariant_denom(i16 addrspace(1)* nocapture %a ; GFX9-LABEL: urem16_invariant_denom: ; GFX9: ; %bb.0: ; %bb ; GFX9-NEXT: s_load_dword s2, s[0:1], 0x2c -; GFX9-NEXT: s_mov_b32 s6, 0xffff ; GFX9-NEXT: s_load_dwordx2 s[4:5], s[0:1], 0x24 +; GFX9-NEXT: s_mov_b32 s6, 0xffff ; GFX9-NEXT: v_mov_b32_e32 v1, 0 ; GFX9-NEXT: s_movk_i32 s8, 0x400 ; GFX9-NEXT: s_waitcnt lgkmcnt(0) diff --git a/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll b/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll index e2fbc0bc4af9..ba093ad3771d 100644 --- a/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll +++ b/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll @@ -100,14 +100,14 @@ define hidden amdgpu_kernel void @clmem_read(i8 addrspace(1)* %buffer) { ; GFX900: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}} ; ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048 -; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}} ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048 -; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}} ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048 -; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}} ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048 ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}} ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}} +; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}} +; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}} +; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}} ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048 ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}} diff --git a/llvm/test/CodeGen/AMDGPU/sdiv64.ll b/llvm/test/CodeGen/AMDGPU/sdiv64.ll index 0b80b4170316..dbb6d4805495 100644 --- a/llvm/test/CodeGen/AMDGPU/sdiv64.ll +++ b/llvm/test/CodeGen/AMDGPU/sdiv64.ll @@ -6,8 +6,8 @@ define amdgpu_kernel void @s_test_sdiv(i64 addrspace(1)* %out, i64 %x, i64 %y) { ; GCN-LABEL: s_test_sdiv: ; GCN: ; %bb.0: ; GCN-NEXT: s_load_dwordx2 s[4:5], s[0:1], 0xd -; GCN-NEXT: v_mov_b32_e32 v7, 0 ; GCN-NEXT: s_load_dwordx4 s[8:11], s[0:1], 0x9 +; GCN-NEXT: v_mov_b32_e32 v7, 0 ; GCN-NEXT: s_mov_b32 s7, 0xf000 ; GCN-NEXT: s_mov_b32 s6, -1 ; GCN-NEXT: s_waitcnt lgkmcnt(0) diff --git a/llvm/test/CodeGen/AMDGPU/srem64.ll b/llvm/test/CodeGen/AMDGPU/srem64.ll index fac510e8dbda..04f8ea10545e 100644 --- a/llvm/test/CodeGen/AMDGPU/srem64.ll +++ b/llvm/test/CodeGen/AMDGPU/srem64.ll @@ -6,8 +6,8 @@ define amdgpu_kernel void @s_test_srem(i64 addrspace(1)* %out, i64 %x, i64 %y) { ; GCN-LABEL: s_test_srem: ; GCN: ; %bb.0: ; GCN-NEXT: s_load_dwordx2 s[12:13], s[0:1], 0xd -; GCN-NEXT: v_mov_b32_e32 v2, 0 ; GCN-NEXT: s_load_dwordx4 s[8:11], s[0:1], 0x9 +; GCN-NEXT: v_mov_b32_e32 v2, 0 ; GCN-NEXT: s_mov_b32 s7, 0xf000 ; GCN-NEXT: s_mov_b32 s6, -1 ; GCN-NEXT: s_waitcnt lgkmcnt(0) diff --git a/llvm/test/CodeGen/AMDGPU/udiv64.ll b/llvm/test/CodeGen/AMDGPU/udiv64.ll index cc829b8e7eb3..48a86eec9832 100644 --- a/llvm/test/CodeGen/AMDGPU/udiv64.ll +++ b/llvm/test/CodeGen/AMDGPU/udiv64.ll @@ -6,8 +6,8 @@ define amdgpu_kernel void @s_test_udiv_i64(i64 addrspace(1)* %out, i64 %x, i64 % ; GCN-LABEL: s_test_udiv_i64: ; GCN: ; %bb.0: ; GCN-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0xd -; GCN-NEXT: v_mov_b32_e32 v2, 0 ; GCN-NEXT: s_load_dwordx4 s[8:11], s[0:1], 0x9 +; GCN-NEXT: v_mov_b32_e32 v2, 0 ; GCN-NEXT: s_mov_b32 s7, 0xf000 ; GCN-NEXT: s_mov_b32 s6, -1 ; GCN-NEXT: s_waitcnt lgkmcnt(0) diff --git a/llvm/test/CodeGen/AMDGPU/urem64.ll b/llvm/test/CodeGen/AMDGPU/urem64.ll index a0a4b73262a7..296aaf2ed1c6 100644 --- a/llvm/test/CodeGen/AMDGPU/urem64.ll +++ b/llvm/test/CodeGen/AMDGPU/urem64.ll @@ -6,8 +6,8 @@ define amdgpu_kernel void @s_test_urem_i64(i64 addrspace(1)* %out, i64 %x, i64 % ; GCN-LABEL: s_test_urem_i64: ; GCN: ; %bb.0: ; GCN-NEXT: s_load_dwordx2 s[12:13], s[0:1], 0xd -; GCN-NEXT: v_mov_b32_e32 v2, 0 ; GCN-NEXT: s_load_dwordx4 s[8:11], s[0:1], 0x9 +; GCN-NEXT: v_mov_b32_e32 v2, 0 ; GCN-NEXT: s_mov_b32 s7, 0xf000 ; GCN-NEXT: s_mov_b32 s6, -1 ; GCN-NEXT: s_waitcnt lgkmcnt(0) </cut>

4 years, 5 months

[ACTIVITY] 18 - 22 Oct 2021

by Prathamesh Kulkarni

== This Week == * GCC - Committed a clean up patch to gimple-isel - PR93183: Committed fix - PR102376: Patch approved upstream - PR83750: Patch approved upstream but it regresses one test-case. == Next Week == - Continue with ongoing tasks

4 years, 5 months

[TCWG CI] 444.namd grew in size by 2% after llvm: [SLP]Improve graph reordering.

by ci_notify＠linaro.org

After llvm commit bc69dd62c04a70d29943c1c06c7effed150b70e1 Author: Alexey Bataev <a.bataev(a)outlook.com> [SLP]Improve graph reordering. the following benchmarks grew in size by more than 1%: - 444.namd grew in size by 2% from 192302 to 195218 bytes Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI. For your convenience, we have uploaded tarballs with pre-processed source and assembly files at: - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… Configuration: - Benchmark: SPEC CPU2006 - Toolchain: Clang + Glibc + LLVM Linker - Version: all components were built from their tip of trunk - Target: aarch64-linux-gnu - Compiler flags: -Os - Hardware: APM Mustang 8x X-Gene1 This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports. THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. This commit has regressed these CI configurations: - tcwg_bmk_llvm_apm/llvm-master-aarch64-spec2k6-Os First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… Reproduce builds: <cut> mkdir investigate-llvm-bc69dd62c04a70d29943c1c06c7effed150b70e1 cd investigate-llvm-bc69dd62c04a70d29943c1c06c7effed150b70e1 # Fetch scripts git clone https://git.linaro.org/toolchain/jenkins-scripts # Fetch manifests and test.sh script mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… --fail chmod +x artifacts/test.sh # Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh # Save baseline build state (which is then restored in artifacts/test.sh) mkdir -p ./bisect rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/ cd llvm # Reproduce first_bad build git checkout --detach bc69dd62c04a70d29943c1c06c7effed150b70e1 ../artifacts/test.sh # Reproduce last_good build git checkout --detach 5661317f864abf750cf893c6a4cc7a977be0995a ../artifacts/test.sh cd .. </cut> Full commit (up to 1000 lines): <cut> commit bc69dd62c04a70d29943c1c06c7effed150b70e1 Author: Alexey Bataev <a.bataev(a)outlook.com> Date: Tue Aug 3 13:20:32 2021 -0700 [SLP]Improve graph reordering. Reworked reordering algorithm. Originally, the compiler just tried to detect the most common order in the reordarable nodes (loads, stores, extractelements,extractvalues) and then fully rebuilding the graph in the best order. This was not effecient, since it required an extra memory and time for building/rebuilding tree, double the use of the scheduling budget, which could lead to missing vectorization due to exausted scheduling resources. Patch provide 2-way approach for graph reodering problem. At first, all reordering is done in-place, it doe not required tree deleting/rebuilding, it just rotates the scalars/orders/reuses masks in the graph node. The first step (top-to bottom) rotates the whole graph, similarly to the previous implementation. Compiler counts the number of the most used orders of the graph nodes with the same vectorization factor and then rotates the subgraph with the given vectorization factor to the most used order, if it is not empty. Then repeats the same procedure for the subgraphs with the smaller vectorization factor. We can do this because we still need to reshuffle smaller subgraph when buildiong operands for the graph nodes with lasrger vectorization factor, we can rotate just subgraph, not the whole graph. The second step (bottom-to-top) scans through the leaves and tries to detect the users of the leaves which can be reordered. If the leaves can be reorder in the best fashion, they are reordered and their user too. It allows to remove double shuffles to the same ordering of the operands in many cases and just reorder the user operations instead. Plus, it moves the final shuffles closer to the top of the graph and in many cases allows to remove extra shuffle because the same procedure is repeated again and we can again merge some reordering masks and reorder user nodes instead of the operands. Also, patch improves cost model for gathering of loads, which improves x264 benchmark in some cases. Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264, +3% for 508.namd, improves most of other benchmarks. The compile and link time are almost the same, though in some cases it should be better (we're not doing an extra instruction scheduling anymore) + we may vectorize more code for the large basic blocks again because of saving scheduling budget. Differential Revision: https://reviews.llvm.org/D105020 --- .../llvm/Transforms/Vectorize/SLPVectorizer.h | 3 +- llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | 1364 ++++++++++++++------ .../AArch64/transpose-inseltpoison.ll | 84 +- .../Transforms/SLPVectorizer/AArch64/transpose.ll | 84 +- llvm/test/Transforms/SLPVectorizer/X86/addsub.ll | 42 +- .../Transforms/SLPVectorizer/X86/crash_cmpop.ll | 6 +- llvm/test/Transforms/SLPVectorizer/X86/extract.ll | 6 +- .../SLPVectorizer/X86/jumbled-load-multiuse.ll | 12 +- .../Transforms/SLPVectorizer/X86/jumbled-load.ll | 22 +- .../SLPVectorizer/X86/jumbled_store_crash.ll | 29 +- .../SLPVectorizer/X86/reorder_repeated_ops.ll | 4 +- .../SLPVectorizer/X86/split-load8_2-unord.ll | 4 +- .../X86/vectorize-reorder-alt-shuffle.ll | 9 +- .../SLPVectorizer/X86/vectorize-reorder-reuse.ll | 52 +- 14 files changed, 1119 insertions(+), 602 deletions(-) diff --git a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h index f416a592d683..5e8c29913cad 100644 --- a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h +++ b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h @@ -95,8 +95,7 @@ private: /// Try to vectorize a list of operands. /// \returns true if a value was vectorized. - bool tryToVectorizeList(ArrayRef<Value *> VL, slpvectorizer::BoUpSLP &R, - bool AllowReorder = false); + bool tryToVectorizeList(ArrayRef<Value *> VL, slpvectorizer::BoUpSLP &R); /// Try to vectorize a chain that may start at the operands of \p I. bool tryToVectorize(Instruction *I, slpvectorizer::BoUpSLP &R); diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp index 9c0029484964..7400b3d8a503 100644 --- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp +++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp @@ -21,6 +21,7 @@ #include "llvm/ADT/DenseSet.h" #include "llvm/ADT/Optional.h" #include "llvm/ADT/PostOrderIterator.h" +#include "llvm/ADT/PriorityQueue.h" #include "llvm/ADT/STLExtras.h" #include "llvm/ADT/SetOperations.h" #include "llvm/ADT/SetVector.h" @@ -535,13 +536,68 @@ static bool isSimple(Instruction *I) { return true; } +/// Shuffles \p Mask in accordance with the given \p SubMask. +static void addMask(SmallVectorImpl<int> &Mask, ArrayRef<int> SubMask) { + if (SubMask.empty()) + return; + if (Mask.empty()) { + Mask.append(SubMask.begin(), SubMask.end()); + return; + } + SmallVector<int> NewMask(SubMask.size(), UndefMaskElem); + int TermValue = std::min(Mask.size(), SubMask.size()); + for (int I = 0, E = SubMask.size(); I < E; ++I) { + if (SubMask[I] >= TermValue || SubMask[I] == UndefMaskElem || + Mask[SubMask[I]] >= TermValue) + continue; + NewMask[I] = Mask[SubMask[I]]; + } + Mask.swap(NewMask); +} + +/// Order may have elements assigned special value (size) which is out of +/// bounds. Such indices only appear on places which correspond to undef values +/// (see canReuseExtract for details) and used in order to avoid undef values +/// have effect on operands ordering. +/// The first loop below simply finds all unused indices and then the next loop +/// nest assigns these indices for undef values positions. +/// As an example below Order has two undef positions and they have assigned +/// values 3 and 7 respectively: +/// before: 6 9 5 4 9 2 1 0 +/// after: 6 3 5 4 7 2 1 0 +/// \returns Fixed ordering. +static void fixupOrderingIndices(SmallVectorImpl<unsigned> &Order) { + const unsigned Sz = Order.size(); + SmallBitVector UsedIndices(Sz); + SmallVector<int> MaskedIndices; + for (unsigned I = 0; I < Sz; ++I) { + if (Order[I] < Sz) + UsedIndices.set(Order[I]); + else + MaskedIndices.push_back(I); + } + if (MaskedIndices.empty()) + return; + SmallVector<int> AvailableIndices(MaskedIndices.size()); + unsigned Cnt = 0; + int Idx = UsedIndices.find_first(); + do { + AvailableIndices[Cnt] = Idx; + Idx = UsedIndices.find_next(Idx); + ++Cnt; + } while (Idx > 0); + assert(Cnt == MaskedIndices.size() && "Non-synced masked/available indices."); + for (int I = 0, E = MaskedIndices.size(); I < E; ++I) + Order[MaskedIndices[I]] = AvailableIndices[I]; +} + namespace llvm { static void inversePermutation(ArrayRef<unsigned> Indices, SmallVectorImpl<int> &Mask) { Mask.clear(); const unsigned E = Indices.size(); - Mask.resize(E, E + 1); + Mask.resize(E, UndefMaskElem); for (unsigned I = 0; I < E; ++I) Mask[Indices[I]] = I; } @@ -581,6 +637,22 @@ static Optional<int> getInsertIndex(Value *InsertInst, unsigned Offset) { return Index; } +/// Reorders the list of scalars in accordance with the given \p Order and then +/// the \p Mask. \p Order - is the original order of the scalars, need to +/// reorder scalars into an unordered state at first according to the given +/// order. Then the ordered scalars are shuffled once again in accordance with +/// the provided mask. +static void reorderScalars(SmallVectorImpl<Value *> &Scalars, + ArrayRef<int> Mask) { + assert(!Mask.empty() && "Expected non-empty mask."); + SmallVector<Value *> Prev(Scalars.size(), + UndefValue::get(Scalars.front()->getType())); + Prev.swap(Scalars); + for (unsigned I = 0, E = Prev.size(); I < E; ++I) + if (Mask[I] != UndefMaskElem) + Scalars[Mask[I]] = Prev[I]; +} + namespace slpvectorizer { /// Bottom Up SLP Vectorizer. @@ -645,13 +717,12 @@ public: void buildTree(ArrayRef<Value *> Roots, ArrayRef<Value *> UserIgnoreLst = None); - /// Construct a vectorizable tree that starts at \p Roots, ignoring users for - /// the purpose of scheduling and extraction in the \p UserIgnoreLst taking - /// into account (and updating it, if required) list of externally used - /// values stored in \p ExternallyUsedValues. - void buildTree(ArrayRef<Value *> Roots, - ExtraValueToDebugLocsMap &ExternallyUsedValues, - ArrayRef<Value *> UserIgnoreLst = None); + /// Builds external uses of the vectorized scalars, i.e. the list of + /// vectorized scalars to be extracted, their lanes and their scalar users. \p + /// ExternallyUsedValues contains additional list of external uses to handle + /// vectorization of reductions. + void + buildExternalUses(const ExtraValueToDebugLocsMap &ExternallyUsedValues = {}); /// Clear the internal data structures that are created by 'buildTree'. void deleteTree() { @@ -659,8 +730,6 @@ public: ScalarToTreeEntry.clear(); MustGather.clear(); ExternalUses.clear(); - NumOpsWantToKeepOrder.clear(); - NumOpsWantToKeepOriginalOrder = 0; for (auto &Iter : BlocksSchedules) { BlockScheduling *BS = Iter.second.get(); BS->clear(); @@ -674,103 +743,22 @@ public: /// Perform LICM and CSE on the newly generated gather sequences. void optimizeGatherSequence(); - /// \returns The best order of instructions for vectorization. - Optional<ArrayRef<unsigned>> bestOrder() const { - assert(llvm::all_of( - NumOpsWantToKeepOrder, - [this](const decltype(NumOpsWantToKeepOrder)::value_type &D) { - return D.getFirst().size() == - VectorizableTree[0]->Scalars.size(); - }) && - "All orders must have the same size as number of instructions in " - "tree node."); - auto I = std::max_element( - NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(), - [](const decltype(NumOpsWantToKeepOrder)::value_type &D1, - const decltype(NumOpsWantToKeepOrder)::value_type &D2) { - return D1.second < D2.second; - }); - if (I == NumOpsWantToKeepOrder.end() || - I->getSecond() <= NumOpsWantToKeepOriginalOrder) - return None; - - return makeArrayRef(I->getFirst()); - } - - /// Builds the correct order for root instructions. - /// If some leaves have the same instructions to be vectorized, we may - /// incorrectly evaluate the best order for the root node (it is built for the - /// vector of instructions without repeated instructions and, thus, has less - /// elements than the root node). This function builds the correct order for - /// the root node. - /// For example, if the root node is \<a+b, a+c, a+d, f+e\>, then the leaves - /// are \<a, a, a, f\> and \<b, c, d, e\>. When we try to vectorize the first - /// leaf, it will be shrink to \<a, b\>. If instructions in this leaf should - /// be reordered, the best order will be \<1, 0\>. We need to extend this - /// order for the root node. For the root node this order should look like - /// \<3, 0, 1, 2\>. This function extends the order for the reused - /// instructions. - void findRootOrder(OrdersType &Order) { - // If the leaf has the same number of instructions to vectorize as the root - // - order must be set already. - unsigned RootSize = VectorizableTree[0]->Scalars.size(); - if (Order.size() == RootSize) - return; - SmallVector<unsigned, 4> RealOrder(Order.size()); - std::swap(Order, RealOrder); - SmallVector<int, 4> Mask; - inversePermutation(RealOrder, Mask); - Order.assign(Mask.begin(), Mask.end()); - // The leaf has less number of instructions - need to find the true order of - // the root. - // Scan the nodes starting from the leaf back to the root. - const TreeEntry *PNode = VectorizableTree.back().get(); - SmallVector<const TreeEntry *, 4> Nodes(1, PNode); - SmallPtrSet<const TreeEntry *, 4> Visited; - while (!Nodes.empty() && Order.size() != RootSize) { - const TreeEntry *PNode = Nodes.pop_back_val(); - if (!Visited.insert(PNode).second) - continue; - const TreeEntry &Node = *PNode; - for (const EdgeInfo &EI : Node.UserTreeIndices) - if (EI.UserTE) - Nodes.push_back(EI.UserTE); - if (Node.ReuseShuffleIndices.empty()) - continue; - // Build the order for the parent node. - OrdersType NewOrder(Node.ReuseShuffleIndices.size(), RootSize); - SmallVector<unsigned, 4> OrderCounter(Order.size(), 0); - // The algorithm of the order extension is: - // 1. Calculate the number of the same instructions for the order. - // 2. Calculate the index of the new order: total number of instructions - // with order less than the order of the current instruction + reuse - // number of the current instruction. - // 3. The new order is just the index of the instruction in the original - // vector of the instructions. - for (unsigned I : Node.ReuseShuffleIndices) - ++OrderCounter[Order[I]]; - SmallVector<unsigned, 4> CurrentCounter(Order.size(), 0); - for (unsigned I = 0, E = Node.ReuseShuffleIndices.size(); I < E; ++I) { - unsigned ReusedIdx = Node.ReuseShuffleIndices[I]; - unsigned OrderIdx = Order[ReusedIdx]; - unsigned NewIdx = 0; - for (unsigned J = 0; J < OrderIdx; ++J) - NewIdx += OrderCounter[J]; - NewIdx += CurrentCounter[OrderIdx]; - ++CurrentCounter[OrderIdx]; - assert(NewOrder[NewIdx] == RootSize && - "The order index should not be written already."); - NewOrder[NewIdx] = I; - } - std::swap(Order, NewOrder); - } - assert(Order.size() == RootSize && - "Root node is expected or the size of the order must be the same as " - "the number of elements in the root node."); - assert(llvm::all_of(Order, - [RootSize](unsigned Val) { return Val != RootSize; }) && - "All indices must be initialized"); - } + /// Reorders the current graph to the most profitable order starting from the + /// root node to the leaf nodes. The best order is chosen only from the nodes + /// of the same size (vectorization factor). Smaller nodes are considered + /// parts of subgraph with smaller VF and they are reordered independently. We + /// can make it because we still need to extend smaller nodes to the wider VF + /// and we can merge reordering shuffles with the widening shuffles. + void reorderTopToBottom(); + + /// Reorders the current graph to the most profitable order starting from + /// leaves to the root. It allows to rotate small subgraphs and reduce the + /// number of reshuffles if the leaf nodes use the same order. In this case we + /// can merge the orders and just shuffle user node instead of shuffling its + /// operands. Plus, even the leaf nodes have different orders, it allows to + /// sink reordering in the graph closer to the root node and merge it later + /// during analysis. + void reorderBottomToTop(); /// \return The vector element size in bits to use when vectorizing the /// expression tree ending at \p V. If V is a store, the size is the width of @@ -793,6 +781,10 @@ public: return MinVecRegSize; } + unsigned getMinVF(unsigned Sz) const { + return std::max(2U, getMinVecRegSize() / Sz); + } + unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const { unsigned MaxVF = MaxVFOption.getNumOccurrences() ? MaxVFOption : TTI->getMaximumVF(ElemWidth, Opcode); @@ -1621,12 +1613,29 @@ private: /// \returns true if the scalars in VL are equal to this entry. bool isSame(ArrayRef<Value *> VL) const { - if (VL.size() == Scalars.size()) - return std::equal(VL.begin(), VL.end(), Scalars.begin()); - return VL.size() == ReuseShuffleIndices.size() && - std::equal( - VL.begin(), VL.end(), ReuseShuffleIndices.begin(), - [this](Value *V, int Idx) { return V == Scalars[Idx]; }); + auto &&IsSame = [VL](ArrayRef<Value *> Scalars, ArrayRef<int> Mask) { + if (Mask.size() != VL.size() && VL.size() == Scalars.size()) + return std::equal(VL.begin(), VL.end(), Scalars.begin()); + return VL.size() == Mask.size() && + std::equal( + VL.begin(), VL.end(), Mask.begin(), + [Scalars](Value *V, int Idx) { return V == Scalars[Idx]; }); + }; + if (!ReorderIndices.empty()) { + // TODO: implement matching if the nodes are just reordered, still can + // treat the vector as the same if the list of scalars matches VL + // directly, without reordering. + SmallVector<int> Mask; + inversePermutation(ReorderIndices, Mask); + if (VL.size() == Scalars.size()) + return IsSame(Scalars, Mask); + if (VL.size() == ReuseShuffleIndices.size()) { + ::addMask(Mask, ReuseShuffleIndices); + return IsSame(Scalars, Mask); + } + return false; + } + return IsSame(Scalars, ReuseShuffleIndices); } /// A vector of scalars. @@ -1701,6 +1710,12 @@ private: } } + /// Reorders operands of the node to the given mask \p Mask. + void reorderOperands(ArrayRef<int> Mask) { + for (ValueList &Operand : Operands) + reorderScalars(Operand, Mask); + } + /// \returns the \p OpIdx operand of this TreeEntry. ValueList &getOperand(unsigned OpIdx) { assert(OpIdx < Operands.size() && "Off bounds"); @@ -1760,19 +1775,14 @@ private: return AltOp ? AltOp->getOpcode() : 0; } - /// Update operations state of this entry if reorder occurred. - bool updateStateIfReorder() { - if (ReorderIndices.empty()) - return false; - InstructionsState S = getSameOpcode(Scalars, ReorderIndices.front()); - setOperations(S); - return true; - } - /// When ReuseShuffleIndices is empty it just returns position of \p V - /// within vector of Scalars. Otherwise, try to remap on its reuse index. + /// When ReuseReorderShuffleIndices is empty it just returns position of \p + /// V within vector of Scalars. Otherwise, try to remap on its reuse index. int findLaneForValue(Value *V) const { unsigned FoundLane = std::distance(Scalars.begin(), find(Scalars, V)); assert(FoundLane < Scalars.size() && "Couldn't find extract lane"); + if (!ReorderIndices.empty()) + FoundLane = ReorderIndices[FoundLane]; + assert(FoundLane < Scalars.size() && "Couldn't find extract lane"); if (!ReuseShuffleIndices.empty()) { FoundLane = std::distance(ReuseShuffleIndices.begin(), find(ReuseShuffleIndices, FoundLane)); @@ -1856,7 +1866,7 @@ private: TreeEntry *newTreeEntry(ArrayRef<Value *> VL, Optional<ScheduleData *> Bundle, const InstructionsState &S, const EdgeInfo &UserTreeIdx, - ArrayRef<unsigned> ReuseShuffleIndices = None, + ArrayRef<int> ReuseShuffleIndices = None, ArrayRef<unsigned> ReorderIndices = None) { TreeEntry::EntryState EntryState = Bundle ? TreeEntry::Vectorize : TreeEntry::NeedToGather; @@ -1869,7 +1879,7 @@ private: Optional<ScheduleData *> Bundle, const InstructionsState &S, const EdgeInfo &UserTreeIdx, - ArrayRef<unsigned> ReuseShuffleIndices = None, + ArrayRef<int> ReuseShuffleIndices = None, ArrayRef<unsigned> ReorderIndices = None) { assert(((!Bundle && EntryState == TreeEntry::NeedToGather) || (Bundle && EntryState != TreeEntry::NeedToGather)) && @@ -1877,12 +1887,25 @@ private: VectorizableTree.push_back(std::make_unique<TreeEntry>(VectorizableTree)); TreeEntry *Last = VectorizableTree.back().get(); Last->Idx = VectorizableTree.size() - 1; - Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end()); Last->State = EntryState; Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(), ReuseShuffleIndices.end()); - Last->ReorderIndices.append(ReorderIndices.begin(), ReorderIndices.end()); - Last->setOperations(S); + if (ReorderIndices.empty()) { + Last->Scalars.assign(VL.begin(), VL.end()); + Last->setOperations(S); + } else { + // Reorder scalars and build final mask. + Last->Scalars.assign(VL.size(), nullptr); + transform(ReorderIndices, Last->Scalars.begin(), + [VL](unsigned Idx) -> Value * { + if (Idx >= VL.size()) + return UndefValue::get(VL.front()->getType()); + return VL[Idx]; + }); + InstructionsState S = getSameOpcode(Last->Scalars); + Last->setOperations(S); + Last->ReorderIndices.append(ReorderIndices.begin(), ReorderIndices.end()); + } if (Last->State != TreeEntry::NeedToGather) { for (Value *V : VL) { assert(!getTreeEntry(V) && "Scalar already in tree!"); @@ -2431,14 +2454,6 @@ private: } }; - /// Contains orders of operations along with the number of bundles that have - /// operations in this order. It stores only those orders that require - /// reordering, if reordering is not required it is counted using \a - /// NumOpsWantToKeepOriginalOrder. - DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> NumOpsWantToKeepOrder; - /// Number of bundles that do not require reordering. - unsigned NumOpsWantToKeepOriginalOrder = 0; - // Analysis and block reference. Function *F; ScalarEvolution *SE; @@ -2591,21 +2606,439 @@ void BoUpSLP::eraseInstructions(ArrayRef<Value *> AV) { }; } -void BoUpSLP::buildTree(ArrayRef<Value *> Roots, - ArrayRef<Value *> UserIgnoreLst) { - ExtraValueToDebugLocsMap ExternallyUsedValues; - buildTree(Roots, ExternallyUsedValues, UserIgnoreLst); +/// Reorders the given \p Reuses mask according to the given \p Mask. \p Reuses +/// contains original mask for the scalars reused in the node. Procedure +/// transform this mask in accordance with the given \p Mask. +static void reorderReuses(SmallVectorImpl<int> &Reuses, ArrayRef<int> Mask) { + assert(!Mask.empty() && Reuses.size() == Mask.size() && + "Expected non-empty mask."); + SmallVector<int> Prev(Reuses.begin(), Reuses.end()); + Prev.swap(Reuses); + for (unsigned I = 0, E = Prev.size(); I < E; ++I) + if (Mask[I] != UndefMaskElem) + Reuses[Mask[I]] = Prev[I]; } -void BoUpSLP::buildTree(ArrayRef<Value *> Roots, - ExtraValueToDebugLocsMap &ExternallyUsedValues, - ArrayRef<Value *> UserIgnoreLst) { - deleteTree(); - UserIgnoreList = UserIgnoreLst; - if (!allSameType(Roots)) +/// Reorders the given \p Order according to the given \p Mask. \p Order - is +/// the original order of the scalars. Procedure transforms the provided order +/// in accordance with the given \p Mask. If the resulting \p Order is just an +/// identity order, \p Order is cleared. +static void reorderOrder(SmallVectorImpl<unsigned> &Order, ArrayRef<int> Mask) { + assert(!Mask.empty() && "Expected non-empty mask."); + SmallVector<int> MaskOrder; + if (Order.empty()) { + MaskOrder.resize(Mask.size()); + std::iota(MaskOrder.begin(), MaskOrder.end(), 0); + } else { + inversePermutation(Order, MaskOrder); + } + reorderReuses(MaskOrder, Mask); + if (ShuffleVectorInst::isIdentityMask(MaskOrder)) { + Order.clear(); return; - buildTree_rec(Roots, 0, EdgeInfo()); + } + Order.assign(Mask.size(), Mask.size()); + for (unsigned I = 0, E = Mask.size(); I < E; ++I) + if (MaskOrder[I] != UndefMaskElem) + Order[MaskOrder[I]] = I; + fixupOrderingIndices(Order); +} + +void BoUpSLP::reorderTopToBottom() { + // Maps VF to the graph nodes. + DenseMap<unsigned, SmallPtrSet<TreeEntry *, 4>> VFToOrderedEntries; + // ExtractElement gather nodes which can be vectorized and need to handle + // their ordering. + DenseMap<const TreeEntry *, OrdersType> GathersToOrders; + // Find all reorderable nodes with the given VF. + // Currently the are vectorized loads,extracts + some gathering of extracts. + for_each(VectorizableTree, [this, &VFToOrderedEntries, &GathersToOrders]( + const std::unique_ptr<TreeEntry> &TE) { + // No need to reorder if need to shuffle reuses, still need to shuffle the + // node. + if (!TE->ReuseShuffleIndices.empty()) + return; + if (TE->State == TreeEntry::Vectorize && + isa<LoadInst, ExtractElementInst, ExtractValueInst, StoreInst, + InsertElementInst>(TE->getMainOp()) && + !TE->isAltShuffle()) { + VFToOrderedEntries[TE->Scalars.size()].insert(TE.get()); + } else if (TE->State == TreeEntry::NeedToGather && + TE->getOpcode() == Instruction::ExtractElement && + !TE->isAltShuffle() && + isa<FixedVectorType>(cast<ExtractElementInst>(TE->getMainOp()) + ->getVectorOperandType()) && + allSameType(TE->Scalars) && allSameBlock(TE->Scalars)) { + // Check that gather of extractelements can be represented as + // just a shuffle of a single vector. + OrdersType CurrentOrder; + bool Reuse = canReuseExtract(TE->Scalars, TE->getMainOp(), CurrentOrder); + if (Reuse || !CurrentOrder.empty()) { + VFToOrderedEntries[TE->Scalars.size()].insert(TE.get()); + GathersToOrders.try_emplace(TE.get(), CurrentOrder); + } + } + }); + + // Reorder the graph nodes according to their vectorization factor. + for (unsigned VF = VectorizableTree.front()->Scalars.size(); VF > 1; + VF /= 2) { + auto It = VFToOrderedEntries.find(VF); + if (It == VFToOrderedEntries.end()) + continue; + // Try to find the most profitable order. We just are looking for the most + // used order and reorder scalar elements in the nodes according to this + // mostly used order. + const SmallPtrSetImpl<TreeEntry *> &OrderedEntries = It->getSecond(); + // All operands are reordered and used only in this node - propagate the + // most used order to the user node. + DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> OrdersUses; + SmallPtrSet<const TreeEntry *, 4> VisitedOps; + for (const TreeEntry *OpTE : OrderedEntries) { + // No need to reorder this nodes, still need to extend and to use shuffle, + // just need to merge reordering shuffle and the reuse shuffle. + if (!OpTE->ReuseShuffleIndices.empty()) + continue; + // Count number of orders uses. + const auto &Order = [OpTE, &GathersToOrders]() -> const OrdersType & { + if (OpTE->State == TreeEntry::NeedToGather) + return GathersToOrders.find(OpTE)->second; + return OpTE->ReorderIndices; + }(); + // Stores actually store the mask, not the order, need to invert. + if (OpTE->State == TreeEntry::Vectorize && !OpTE->isAltShuffle() && + OpTE->getOpcode() == Instruction::Store && !Order.empty()) { + SmallVector<int> Mask; + inversePermutation(Order, Mask); + unsigned E = Order.size(); + OrdersType CurrentOrder(E, E); + transform(Mask, CurrentOrder.begin(), [E](int Idx) { + return Idx == UndefMaskElem ? E : static_cast<unsigned>(Idx); + }); + fixupOrderingIndices(CurrentOrder); + ++OrdersUses.try_emplace(CurrentOrder).first->getSecond(); + } else { + ++OrdersUses.try_emplace(Order).first->getSecond(); + } + } + // Set order of the user node. + if (OrdersUses.empty()) + continue; + // Choose the most used order. + ArrayRef<unsigned> BestOrder = OrdersUses.begin()->first; + unsigned Cnt = OrdersUses.begin()->second; + for (const auto &Pair : llvm::drop_begin(OrdersUses)) { + if (Cnt < Pair.second || (Cnt == Pair.second && Pair.first.empty())) { + BestOrder = Pair.first; + Cnt = Pair.second; + } + } + // Set order of the user node. + if (BestOrder.empty()) + continue; + SmallVector<int> Mask; + inversePermutation(BestOrder, Mask); + SmallVector<int> MaskOrder(BestOrder.size(), UndefMaskElem); + unsigned E = BestOrder.size(); + transform(BestOrder, MaskOrder.begin(), [E](unsigned I) { + return I < E ? static_cast<int>(I) : UndefMaskElem; + }); + // Do an actual reordering, if profitable. + for (std::unique_ptr<TreeEntry> &TE : VectorizableTree) { + // Just do the reordering for the nodes with the given VF. + if (TE->Scalars.size() != VF) { + if (TE->ReuseShuffleIndices.size() == VF) { + // Need to reorder the reuses masks of the operands with smaller VF to + // be able to find the match between the graph nodes and scalar + // operands of the given node during vectorization/cost estimation. + assert(all_of(TE->UserTreeIndices, + [VF, &TE](const EdgeInfo &EI) { + return EI.UserTE->Scalars.size() == VF || + EI.UserTE->Scalars.size() == + TE->Scalars.size(); + }) && + "All users must be of VF size."); + // Update ordering of the operands with the smaller VF than the given + // one. + reorderReuses(TE->ReuseShuffleIndices, Mask); + } + continue; + } + if (TE->State == TreeEntry::Vectorize && + isa<ExtractElementInst, ExtractValueInst, LoadInst, StoreInst, + InsertElementInst>(TE->getMainOp()) && + !TE->isAltShuffle()) { + // Build correct orders for extract{element,value}, loads and + // stores. + reorderOrder(TE->ReorderIndices, Mask); + if (isa<InsertElementInst, StoreInst>(TE->getMainOp())) + TE->reorderOperands(Mask); + } else { + // Reorder the node and its operands. + TE->reorderOperands(Mask); + assert(TE->ReorderIndices.empty() && + "Expected empty reorder sequence."); + reorderScalars(TE->Scalars, Mask); + } + if (!TE->ReuseShuffleIndices.empty()) { + // Apply reversed order to keep the original ordering of the reused + // elements to avoid extra reorder indices shuffling. + OrdersType CurrentOrder; + reorderOrder(CurrentOrder, MaskOrder); + SmallVector<int> NewReuses; + inversePermutation(CurrentOrder, NewReuses); + addMask(NewReuses, TE->ReuseShuffleIndices); + TE->ReuseShuffleIndices.swap(NewReuses); + } + } + } +} + +void BoUpSLP::reorderBottomToTop() { + SetVector<TreeEntry *> OrderedEntries; + DenseMap<const TreeEntry *, OrdersType> GathersToOrders; + // Find all reorderable leaf nodes with the given VF. + // Currently the are vectorized loads,extracts without alternate operands + + // some gathering of extracts. + SmallVector<TreeEntry *> NonVectorized; + for_each(VectorizableTree, [this, &OrderedEntries, &GathersToOrders, + &NonVectorized]( + const std::unique_ptr<TreeEntry> &TE) { + // No need to reorder if need to shuffle reuses, still need to shuffle the + // node. + if (!TE->ReuseShuffleIndices.empty()) + return; + if (TE->State == TreeEntry::Vectorize && + isa<LoadInst, ExtractElementInst, ExtractValueInst>(TE->getMainOp()) && + !TE->isAltShuffle()) { + OrderedEntries.insert(TE.get()); + } else if (TE->State == TreeEntry::NeedToGather && + TE->getOpcode() == Instruction::ExtractElement && + !TE->isAltShuffle() && + isa<FixedVectorType>(cast<ExtractElementInst>(TE->getMainOp()) + ->getVectorOperandType()) && + allSameType(TE->Scalars) && allSameBlock(TE->Scalars)) { + // Check that gather of extractelements can be represented as + // just a shuffle of a single vector with a single user only. + OrdersType CurrentOrder; + bool Reuse = canReuseExtract(TE->Scalars, TE->getMainOp(), CurrentOrder); + if ((Reuse || !CurrentOrder.empty()) && + !any_of( + VectorizableTree, [&TE](const std::unique_ptr<TreeEntry> &Entry) { + return Entry->State == TreeEntry::NeedToGather && + Entry.get() != TE.get() && Entry->isSame(TE->Scalars); + })) { + OrderedEntries.insert(TE.get()); + GathersToOrders.try_emplace(TE.get(), CurrentOrder); + } + } + if (TE->State != TreeEntry::Vectorize) + NonVectorized.push_back(TE.get()); + }); + + // Checks if the operands of the users are reordarable and have only single + // use. + auto &&CheckOperands = + [this, &NonVectorized](const auto &Data, + SmallVectorImpl<TreeEntry *> &GatherOps) { + for (unsigned I = 0, E = Data.first->getNumOperands(); I < E; ++I) { + if (any_of(Data.second, + [I](const std::pair<unsigned, TreeEntry *> &OpData) { + return OpData.first == I && + OpData.second->State == TreeEntry::Vectorize; + })) + continue; + ArrayRef<Value *> VL = Data.first->getOperand(I); + const TreeEntry *TE = nullptr; + const auto *It = find_if(VL, [this, &TE](Value *V) { + TE = getTreeEntry(V); + return TE; + }); + if (It != VL.end() && TE->isSame(VL)) + return false; + TreeEntry *Gather = nullptr; + if (count_if(NonVectorized, [VL, &Gather](TreeEntry *TE) { + assert(TE->State != TreeEntry::Vectorize && + "Only non-vectorized nodes are expected."); + if (TE->isSame(VL)) { + Gather = TE; + return true; + } + return false; + }) > 1) + return false; + if (Gather) + GatherOps.push_back(Gather); + } + return true; + }; + // 1. Propagate order to the graph nodes, which use only reordered nodes. + // I.e., if the node has operands, that are reordered, try to make at least + // one operand order in the natural order and reorder others + reorder the + // user node itself. + SmallPtrSet<const TreeEntry *, 4> Visited; + while (!OrderedEntries.empty()) { + // 1. Filter out only reordered nodes. + // 2. If the entry has multiple uses - skip it and jump to the next node. + MapVector<TreeEntry *, SmallVector<std::pair<unsigned, TreeEntry *>>> Users; + SmallVector<TreeEntry *> Filtered; + for (TreeEntry *TE : OrderedEntries) { + if (!(TE->State == TreeEntry::Vectorize || + (TE->State == TreeEntry::NeedToGather && + TE->getOpcode() == Instruction::ExtractElement)) || + TE->UserTreeIndices.empty() || !TE->ReuseShuffleIndices.empty() || + !all_of(drop_begin(TE->UserTreeIndices), + [TE](const EdgeInfo &EI) { + return EI.UserTE == TE->UserTreeIndices.front().UserTE; + }) || + !Visited.insert(TE).second) { + Filtered.push_back(TE); + continue; + } + // Build a map between user nodes and their operands order to speedup + // search. The graph currently does not provide this dependency directly. + for (EdgeInfo &EI : TE->UserTreeIndices) { + TreeEntry *UserTE = EI.UserTE; + auto It = Users.find(UserTE); + if (It == Users.end()) + It = Users.insert({UserTE, {}}).first; + It->second.emplace_back(EI.EdgeIdx, TE); + } + } + // Erase filtered entries. + for_each(Filtered, + [&OrderedEntries](TreeEntry *TE) { OrderedEntries.remove(TE); }); + for (const auto &Data : Users) { + // Check that operands are used only in the User node. + SmallVector<TreeEntry *> GatherOps; + if (!CheckOperands(Data, GatherOps)) { + for_each(Data.second, + [&OrderedEntries](const std::pair<unsigned, TreeEntry *> &Op) { + OrderedEntries.remove(Op.second); + }); + continue; + } + // All operands are reordered and used only in this node - propagate the + // most used order to the user node. + DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> OrdersUses; + SmallPtrSet<const TreeEntry *, 4> VisitedOps; + for (const auto &Op : Data.second) { + TreeEntry *OpTE = Op.second; + if (!OpTE->ReuseShuffleIndices.empty()) + continue; + const auto &Order = [OpTE, &GathersToOrders]() -> const OrdersType & { + if (OpTE->State == TreeEntry::NeedToGather) + return GathersToOrders.find(OpTE)->second; + return OpTE->ReorderIndices; + }(); + // Stores actually store the mask, not the order, need to invert. + if (OpTE->State == TreeEntry::Vectorize && !OpTE->isAltShuffle() && + OpTE->getOpcode() == Instruction::Store && !Order.empty()) { + SmallVector<int> Mask; + inversePermutation(Order, Mask); + unsigned E = Order.size(); + OrdersType CurrentOrder(E, E); + transform(Mask, CurrentOrder.begin(), [E](int Idx) { + return Idx == UndefMaskElem ? E : static_cast<unsigned>(Idx); + }); + fixupOrderingIndices(CurrentOrder); + ++OrdersUses.try_emplace(CurrentOrder).first->getSecond(); + } else { + ++OrdersUses.try_emplace(Order).first->getSecond(); + } + if (VisitedOps.insert(OpTE).second) + OrdersUses.try_emplace({}, 0).first->getSecond() += + OpTE->UserTreeIndices.size(); + --OrdersUses[{}]; + } + // If no orders - skip current nodes and jump to the next one, if any. + if (OrdersUses.empty()) { + for_each(Data.second, + [&OrderedEntries](const std::pair<unsigned, TreeEntry *> &Op) { + OrderedEntries.remove(Op.second); + }); + continue; + } + // Choose the best order. + ArrayRef<unsigned> BestOrder = OrdersUses.begin()->first; + unsigned Cnt = OrdersUses.begin()->second; + for (const auto &Pair : llvm::drop_begin(OrdersUses)) { + if (Cnt < Pair.second || (Cnt == Pair.second && Pair.first.empty())) { + BestOrder = Pair.first; + Cnt = Pair.second; + } + } + // Set order of the user node (reordering of operands and user nodes). + if (BestOrder.empty()) { + for_each(Data.second, + [&OrderedEntries](const std::pair<unsigned, TreeEntry *> &Op) { + OrderedEntries.remove(Op.second); + }); + continue; + } + // Erase operands from OrderedEntries list and adjust their orders. + VisitedOps.clear(); + SmallVector<int> Mask; + inversePermutation(BestOrder, Mask); + SmallVector<int> MaskOrder(BestOrder.size(), UndefMaskElem); + unsigned E = BestOrder.size(); + transform(BestOrder, MaskOrder.begin(), [E](unsigned I) { + return I < E ? static_cast<int>(I) : UndefMaskElem; + }); + for (const std::pair<unsigned, TreeEntry *> &Op : Data.second) { + TreeEntry *TE = Op.second; + OrderedEntries.remove(TE); + if (!VisitedOps.insert(TE).second) + continue; + if (!TE->ReuseShuffleIndices.empty() && TE->ReorderIndices.empty()) { + // Just reorder reuses indices. + reorderReuses(TE->ReuseShuffleIndices, Mask); + continue; + } + // Gathers are processed separately. + if (TE->State != TreeEntry::Vectorize) + continue; + assert((BestOrder.size() == TE->ReorderIndices.size() || + TE->ReorderIndices.empty()) && + "Non-matching sizes of user/operand entries."); + reorderOrder(TE->ReorderIndices, Mask); + } + // For gathers just need to reorder its scalars. + for (TreeEntry *Gather : GatherOps) { + if (!Gather->ReuseShuffleIndices.empty()) + continue; + assert(Gather->ReorderIndices.empty() && + "Unexpected reordering of gathers."); + reorderScalars(Gather->Scalars, Mask); + OrderedEntries.remove(Gather); + } + // Reorder operands of the user node and set the ordering for the user + // node itself. + if (Data.first->State != TreeEntry::Vectorize || + !isa<ExtractElementInst, ExtractValueInst, LoadInst>( + Data.first->getMainOp()) || + Data.first->isAltShuffle()) + Data.first->reorderOperands(Mask); + if (!isa<InsertElementInst, StoreInst>(Data.first->getMainOp()) || + Data.first->isAltShuffle()) { + reorderScalars(Data.first->Scalars, Mask); + reorderOrder(Data.first->ReorderIndices, MaskOrder); + if (Data.first->ReuseShuffleIndices.empty() && + !Data.first->ReorderIndices.empty() && + !Data.first->isAltShuffle()) { + // Insert user node to the list to try to sink reordering deeper in + // the graph. + OrderedEntries.insert(Data.first); + } + } else { + reorderOrder(Data.first->ReorderIndices, Mask); + } + } + } +} +void BoUpSLP::buildExternalUses( + const ExtraValueToDebugLocsMap &ExternallyUsedValues) { // Collect the values that we need to extract from the tree. for (auto &TEPtr : VectorizableTree) { TreeEntry *Entry = TEPtr.get(); @@ -2664,6 +3097,80 @@ void BoUpSLP::buildTree(ArrayRef<Value *> Roots, } } +void BoUpSLP::buildTree(ArrayRef<Value *> Roots, + ArrayRef<Value *> UserIgnoreLst) { + deleteTree(); + UserIgnoreList = UserIgnoreLst; + if (!allSameType(Roots)) + return; + buildTree_rec(Roots, 0, EdgeInfo()); +} + +namespace { +/// Tracks the state we can represent the loads in the given sequence. +enum class LoadsState { Gather, Vectorize, ScatterVectorize }; +} // anonymous namespace + +/// Checks if the given array of loads can be represented as a vectorized, +/// scatter or just simple gather. +static LoadsState canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0, + const TargetTransformInfo &TTI, + const DataLayout &DL, ScalarEvolution &SE, + SmallVectorImpl<unsigned> &Order, + SmallVectorImpl<Value *> &PointerOps) { + // Check that a vectorized load would load the same memory as a scalar + // load. For example, we don't want to vectorize loads that are smaller + // than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM + // treats loading/storing it as an i8 struct. If we vectorize loads/stores + // from such a struct, we read/write packed bits disagreeing with the + // unvectorized version. + Type *ScalarTy = VL0->getType(); + + if (DL.getTypeSizeInBits(ScalarTy) != DL.getTypeAllocSizeInBits(ScalarTy)) + return LoadsState::Gather; + + // Make sure all loads in the bundle are simple - we can't vectorize + // atomic or volatile loads. + PointerOps.clear(); + PointerOps.resize(VL.size()); + auto *POIter = PointerOps.begin(); + for (Value *V : VL) { + auto *L = cast<LoadInst>(V); + if (!L->isSimple()) + return LoadsState::Gather; + *POIter = L->getPointerOperand(); + ++POIter; + } + + Order.clear(); + // Check the order of pointer operands. + if (llvm::sortPtrAccesses(PointerOps, ScalarTy, DL, SE, Order)) { + Value *Ptr0; + Value *PtrN; + if (Order.empty()) { + Ptr0 = PointerOps.front(); + PtrN = PointerOps.back(); + } else { + Ptr0 = PointerOps[Order.front()]; + PtrN = PointerOps[Order.back()]; + } + Optional<int> Diff = + getPointersDiff(ScalarTy, Ptr0, ScalarTy, PtrN, DL, SE); + // Check that the sorted loads are consecutive. + if (static_cast<unsigned>(*Diff) == VL.size() - 1) + return LoadsState::Vectorize; + Align CommonAlignment = cast<LoadInst>(VL0)->getAlign(); </cut>

4 years, 5 months

[CI-NOTIFY]: TCWG Bisect tcwg_bmk_tk1/llvm-master-arm-spec2k6-Os - Build # 9 - Successful!

by ci_notify＠linaro.org

Successfully identified regression in *llvm* in CI configuration tcwg_bmk_llvm_tk1/llvm-master-arm-spec2k6-Os. So far, this commit has regressed CI configurations: - tcwg_bmk_llvm_tk1/llvm-master-arm-spec2k6-Os Culprit: <cut> commit 40b752d28d95158e52dba7cfeea92e41b7ccff9a Author: Sanjay Patel <spatel(a)rotateright.com> Date: Mon Jul 5 09:57:39 2021 -0400 [InstCombine] fold icmp slt/sgt of offset value with constant This follows up patches for the unsigned siblings: 0c400e895306 c7b658aeb526 We are translating an offset signed compare to its unsigned equivalent when one end of the range is at the limit (zero or unsigned max). (X + C2) >s C --> X <u (SMAX - C) (if C == C2 - 1) (X + C2) <s C --> X >u (C ^ SMAX) (if C == C2) This probably does not show up much in IR derived from C/C++ source because that would likely have 'nsw', and we have folds for that already. As with the previous unsigned transforms, the folds could be generalized to handle non-constant patterns: https://alive2.llvm.org/ce/z/Y8Xrrm ; sgt define i1 @src(i8 %a, i8 %c) { %c2 = add i8 %c, 1 %t = add i8 %a, %c2 %ov = icmp sgt i8 %t, %c ret i1 %ov } define i1 @tgt(i8 %a, i8 %c) { %c_off = sub i8 127, %c ; SMAX %ov = icmp ult i8 %a, %c_off ret i1 %ov } https://alive2.llvm.org/ce/z/c8uhnk ; slt define i1 @src(i8 %a, i8 %c) { %t = add i8 %a, %c %ov = icmp slt i8 %t, %c ret i1 %ov } define i1 @tgt(i8 %a, i8 %c) { %c_offnot = xor i8 %c, 127 ; SMAX %ov = icmp ugt i8 %a, %c_offnot ret i1 %ov } </cut> Results regressed to (for first_bad == 40b752d28d95158e52dba7cfeea92e41b7ccff9a) # reset_artifacts: -10 # build_abe binutils: -9 # build_abe stage1 -- --set gcc_override_configure=--with-mode=thumb --set gcc_override_configure=--disable-libsanitizer: -8 # build_abe linux: -7 # build_abe glibc: -6 # build_abe stage2 -- --set gcc_override_configure=--with-mode=thumb --set gcc_override_configure=--disable-libsanitizer: -5 # build_llvm true: -3 # true: 0 # benchmark -Os_mthumb -- artifacts/build-40b752d28d95158e52dba7cfeea92e41b7ccff9a/results_id: 1 # 401.bzip2,bzip2_base.default regressed by 110 # 401.bzip2,[.] BZ2_decompress regressed by 149 from (for last_good == 32dd914f7182875730eb3453f39dcc584b7219b2) # reset_artifacts: -10 # build_abe binutils: -9 # build_abe stage1 -- --set gcc_override_configure=--with-mode=thumb --set gcc_override_configure=--disable-libsanitizer: -8 # build_abe linux: -7 # build_abe glibc: -6 # build_abe stage2 -- --set gcc_override_configure=--with-mode=thumb --set gcc_override_configure=--disable-libsanitizer: -5 # build_llvm true: -3 # true: 0 # benchmark -Os_mthumb -- artifacts/build-32dd914f7182875730eb3453f39dcc584b7219b2/results_id: 1 Artifacts of last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-… Results ID of last_good: tk1_32/tcwg_bmk_llvm_tk1/bisect-llvm-master-arm-spec2k6-Os/1631 Artifacts of first_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-… Results ID of first_bad: tk1_32/tcwg_bmk_llvm_tk1/bisect-llvm-master-arm-spec2k6-Os/1622 Build top page/logs: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-… Configuration details: Reproduce builds: <cut> mkdir investigate-llvm-40b752d28d95158e52dba7cfeea92e41b7ccff9a cd investigate-llvm-40b752d28d95158e52dba7cfeea92e41b7ccff9a git clone https://git.linaro.org/toolchain/jenkins-scripts mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-… --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-… --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-… --fail chmod +x artifacts/test.sh # Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh # Save baseline build state (which is then restored in artifacts/test.sh) rsync -a --del --delete-excluded --exclude bisect/ --exclude artifacts/ --exclude llvm/ ./ ./bisect/baseline/ cd llvm # Reproduce first_bad build git checkout --detach 40b752d28d95158e52dba7cfeea92e41b7ccff9a ../artifacts/test.sh # Reproduce last_good build git checkout --detach 32dd914f7182875730eb3453f39dcc584b7219b2 ../artifacts/test.sh cd .. </cut> History of pending regressions and results: https://git.linaro.org/toolchain/ci/base-artifacts.git/log/?h=linaro-local/… Artifacts: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-… Build log: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-… Full commit (up to 1000 lines): <cut> commit 40b752d28d95158e52dba7cfeea92e41b7ccff9a Author: Sanjay Patel <spatel(a)rotateright.com> Date: Mon Jul 5 09:57:39 2021 -0400 [InstCombine] fold icmp slt/sgt of offset value with constant This follows up patches for the unsigned siblings: 0c400e895306 c7b658aeb526 We are translating an offset signed compare to its unsigned equivalent when one end of the range is at the limit (zero or unsigned max). (X + C2) >s C --> X <u (SMAX - C) (if C == C2 - 1) (X + C2) <s C --> X >u (C ^ SMAX) (if C == C2) This probably does not show up much in IR derived from C/C++ source because that would likely have 'nsw', and we have folds for that already. As with the previous unsigned transforms, the folds could be generalized to handle non-constant patterns: https://alive2.llvm.org/ce/z/Y8Xrrm ; sgt define i1 @src(i8 %a, i8 %c) { %c2 = add i8 %c, 1 %t = add i8 %a, %c2 %ov = icmp sgt i8 %t, %c ret i1 %ov } define i1 @tgt(i8 %a, i8 %c) { %c_off = sub i8 127, %c ; SMAX %ov = icmp ult i8 %a, %c_off ret i1 %ov } https://alive2.llvm.org/ce/z/c8uhnk ; slt define i1 @src(i8 %a, i8 %c) { %t = add i8 %a, %c %ov = icmp slt i8 %t, %c ret i1 %ov } define i1 @tgt(i8 %a, i8 %c) { %c_offnot = xor i8 %c, 127 ; SMAX %ov = icmp ugt i8 %a, %c_offnot ret i1 %ov } --- .../Transforms/InstCombine/InstCombineCompares.cpp | 21 ++++++++++++++------- llvm/test/Transforms/InstCombine/icmp-add.ll | 20 ++++++++++---------- 2 files changed, 24 insertions(+), 17 deletions(-) diff --git a/llvm/lib/Transforms/InstCombine/InstCombineCompares.cpp b/llvm/lib/Transforms/InstCombine/InstCombineCompares.cpp index 6bd479def210..6e66c61f5e46 100644 --- a/llvm/lib/Transforms/InstCombine/InstCombineCompares.cpp +++ b/llvm/lib/Transforms/InstCombine/InstCombineCompares.cpp @@ -2636,20 +2636,27 @@ Instruction *InstCombinerImpl::foldICmpAddConstant(ICmpInst &Cmp, // Fold icmp pred (add X, C2), C. Value *X = Add->getOperand(0); Type *Ty = Add->getType(); - CmpInst::Predicate Pred = Cmp.getPredicate(); + const CmpInst::Predicate Pred = Cmp.getPredicate(); + const APInt SMax = APInt::getSignedMaxValue(Ty->getScalarSizeInBits()); + const APInt SMin = APInt::getSignedMinValue(Ty->getScalarSizeInBits()); - // Fold an unsigned compare with offset to signed compare: + // Fold compare with offset to opposite sign compare if it eliminates offset: // (X + C2) >u C --> X <s -C2 (if C == C2 + SMAX) - // TODO: Find the signed predicate siblings. - if (Pred == CmpInst::ICMP_UGT && - C == *C2 + APInt::getSignedMaxValue(Ty->getScalarSizeInBits())) + if (Pred == CmpInst::ICMP_UGT && C == *C2 + SMax) return new ICmpInst(ICmpInst::ICMP_SLT, X, ConstantInt::get(Ty, -(*C2))); // (X + C2) <u C --> X >s ~C2 (if C == C2 + SMIN) - if (Pred == CmpInst::ICMP_ULT && - C == *C2 + APInt::getSignedMinValue(Ty->getScalarSizeInBits())) + if (Pred == CmpInst::ICMP_ULT && C == *C2 + SMin) return new ICmpInst(ICmpInst::ICMP_SGT, X, ConstantInt::get(Ty, ~(*C2))); + // (X + C2) >s C --> X <u (SMAX - C) (if C == C2 - 1) + if (Pred == CmpInst::ICMP_SGT && C == *C2 - 1) + return new ICmpInst(ICmpInst::ICMP_ULT, X, ConstantInt::get(Ty, SMax - C)); + + // (X + C2) <s C --> X >u (C ^ SMAX) (if C == C2) + if (Pred == CmpInst::ICMP_SLT && C == *C2) + return new ICmpInst(ICmpInst::ICMP_UGT, X, ConstantInt::get(Ty, C ^ SMax)); + // If the add does not wrap, we can always adjust the compare by subtracting // the constants. Equality comparisons are handled elsewhere. SGE/SLE/UGE/ULE // are canonicalized to SGT/SLT/UGT/ULT. diff --git a/llvm/test/Transforms/InstCombine/icmp-add.ll b/llvm/test/Transforms/InstCombine/icmp-add.ll index 1f00dc6e2992..aa69325d716c 100644 --- a/llvm/test/Transforms/InstCombine/icmp-add.ll +++ b/llvm/test/Transforms/InstCombine/icmp-add.ll @@ -842,8 +842,7 @@ define i1 @ult_wrong_offset(i8 %a) { define i1 @sgt_offset(i8 %a) { ; CHECK-LABEL: @sgt_offset( -; CHECK-NEXT: [[T:%.*]] = add i8 [[A:%.*]], -6 -; CHECK-NEXT: [[OV:%.*]] = icmp sgt i8 [[T]], -7 +; CHECK-NEXT: [[OV:%.*]] = icmp ult i8 [[A:%.*]], -122 ; CHECK-NEXT: ret i1 [[OV]] ; %t = add i8 %a, -6 @@ -855,7 +854,7 @@ define i1 @sgt_offset_use(i32 %a) { ; CHECK-LABEL: @sgt_offset_use( ; CHECK-NEXT: [[T:%.*]] = add i32 [[A:%.*]], 42 ; CHECK-NEXT: call void @use(i32 [[T]]) -; CHECK-NEXT: [[OV:%.*]] = icmp sgt i32 [[T]], 41 +; CHECK-NEXT: [[OV:%.*]] = icmp ult i32 [[A]], 2147483606 ; CHECK-NEXT: ret i1 [[OV]] ; %t = add i32 %a, 42 @@ -866,8 +865,7 @@ define i1 @sgt_offset_use(i32 %a) { define <2 x i1> @sgt_offset_splat(<2 x i5> %a) { ; CHECK-LABEL: @sgt_offset_splat( -; CHECK-NEXT: [[T:%.*]] = add <2 x i5> [[A:%.*]], <i5 9, i5 9> -; CHECK-NEXT: [[OV:%.*]] = icmp sgt <2 x i5> [[T]], <i5 8, i5 8> +; CHECK-NEXT: [[OV:%.*]] = icmp ult <2 x i5> [[A:%.*]], <i5 7, i5 7> ; CHECK-NEXT: ret <2 x i1> [[OV]] ; %t = add <2 x i5> %a, <i5 9, i5 9> @@ -875,6 +873,8 @@ define <2 x i1> @sgt_offset_splat(<2 x i5> %a) { ret <2 x i1> %ov } +; negative test - constants must differ by 1 + define i1 @sgt_wrong_offset(i8 %a) { ; CHECK-LABEL: @sgt_wrong_offset( ; CHECK-NEXT: [[T:%.*]] = add i8 [[A:%.*]], -7 @@ -888,8 +888,7 @@ define i1 @sgt_wrong_offset(i8 %a) { define i1 @slt_offset(i8 %a) { ; CHECK-LABEL: @slt_offset( -; CHECK-NEXT: [[T:%.*]] = add i8 [[A:%.*]], -6 -; CHECK-NEXT: [[OV:%.*]] = icmp slt i8 [[T]], -6 +; CHECK-NEXT: [[OV:%.*]] = icmp ugt i8 [[A:%.*]], -123 ; CHECK-NEXT: ret i1 [[OV]] ; %t = add i8 %a, -6 @@ -901,7 +900,7 @@ define i1 @slt_offset_use(i32 %a) { ; CHECK-LABEL: @slt_offset_use( ; CHECK-NEXT: [[T:%.*]] = add i32 [[A:%.*]], 42 ; CHECK-NEXT: call void @use(i32 [[T]]) -; CHECK-NEXT: [[OV:%.*]] = icmp slt i32 [[T]], 42 +; CHECK-NEXT: [[OV:%.*]] = icmp ugt i32 [[A]], 2147483605 ; CHECK-NEXT: ret i1 [[OV]] ; %t = add i32 %a, 42 @@ -912,8 +911,7 @@ define i1 @slt_offset_use(i32 %a) { define <2 x i1> @slt_offset_splat(<2 x i5> %a) { ; CHECK-LABEL: @slt_offset_splat( -; CHECK-NEXT: [[T:%.*]] = add <2 x i5> [[A:%.*]], <i5 9, i5 9> -; CHECK-NEXT: [[OV:%.*]] = icmp slt <2 x i5> [[T]], <i5 9, i5 9> +; CHECK-NEXT: [[OV:%.*]] = icmp ugt <2 x i5> [[A:%.*]], <i5 6, i5 6> ; CHECK-NEXT: ret <2 x i1> [[OV]] ; %t = add <2 x i5> %a, <i5 9, i5 9> @@ -921,6 +919,8 @@ define <2 x i1> @slt_offset_splat(<2 x i5> %a) { ret <2 x i1> %ov } +; negative test - constants must be equal + define i1 @slt_wrong_offset(i8 %a) { ; CHECK-LABEL: @slt_wrong_offset( ; CHECK-NEXT: [[T:%.*]] = add i8 [[A:%.*]], -6 </cut>

4 years, 5 months

[ACTIVITY] week ending Oct. 24 2021

by Alex Bennée

And the rest of the week I flushed my maintainer queues ;-) Other ===== [update-ticket] <file:~/org/team.org::update-ticket> Update [update-ticket] to work with cloud JIRA Completed Reviews [8/8] ======================= [PATCH 0/7] tests: docker images for hexagon, nios2, microblaze Message-Id: <20211014224435.2539547-1-richard.henderson(a)linaro.org> [PATCH] gdbstub: Switch to the thread receiving a signal Message-Id: <20210930095111.23205-1-pavel(a)labath.sk> [PATCH] replay: improve determinism of virtio-net Message-Id: <162125666020.1252655.9997723318921206001.stgit@pasha-ThinkPad-X280> [PATCH RESEND v3 0/2] add APIs to handle alternative sNaN propagation for fmax/fmin Message-Id: <20211015065500.3850513-1-frank.chang(a)sifive.com> [PATCH v3 0/5] plugins/cache: multicore cache modelling and minor tweaks Message-Id: <20210722065428.134608-1-ma.mandourr(a)gmail.com> [PATCH v2 0/2] plugins: add a drcov plugin Message-Id: <163429165642.439576.16356288759891202632.stgit@pc-System-Product-Name> [PATCH v2 0/2] plugins: add a drcov plugin Message-Id: <163429165642.439576.16356288759891202632.stgit@pc-System-Product-Name> [PATCH 0/3] KVM: qemu patches for few KVM features I developed Message-Id: <20210914155214.105415-1-mlevitsk(a)redhat.com> Absences ======== - Off Friday next week Current Review Queue ==================== TODO [PATCH v2 00/48] tcg: optimize redundant sign extensions Message-Id: <20211007195456.1168070-1-richard.henderson(a)linaro.org> ================================================================================================================================ TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com> =================================================================================================================== TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option Message-Id: <20211013010120.96851-1-sjg(a)chromium.org> =========================================================================================================== TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org> ================================================================================================================================== -- Alex Bennée

4 years, 5 months

[TCWG CI] Regression caused by gcc: Disallow loop rotation and loop header crossing in jump threaders.

by ci_notify＠linaro.org

[TCWG CI] Regression caused by gcc: Disallow loop rotation and loop header crossing in jump threaders.: commit d8edfadfc7a9795b65177a50ce44fd348858e844 Author: Aldy Hernandez <aldyh(a)redhat.com> Disallow loop rotation and loop header crossing in jump threaders. Results regressed to # reset_artifacts: -10 # build_abe binutils: -9 # build_abe stage1: -5 # build_abe qemu: -2 # linux_n_obj: 21314 # First few build errors in logs: # 00:03:05 ./include/linux/spi/spi.h:1248:28: error: ‘msg’ is used uninitialized [-Werror=uninitialized] # 00:03:07 make[2]: *** [scripts/Makefile.build:277: drivers/bus/moxtet.o] Error 1 # 00:03:11 make[1]: *** [scripts/Makefile.build:540: drivers/bus] Error 2 # 00:03:23 sound/core/oss/mixer_oss.c:1035:21: error: ‘slot’ is used uninitialized [-Werror=uninitialized] # 00:03:26 sound/core/seq/oss/seq_oss_init.c:350:35: error: ‘qinfo’ is used uninitialized [-Werror=uninitialized] # 00:03:26 sound/core/seq/oss/seq_oss_init.c:370:35: error: ‘qinfo’ is used uninitialized [-Werror=uninitialized] # 00:03:26 make[4]: *** [scripts/Makefile.build:277: sound/core/seq/oss/seq_oss_init.o] Error 1 # 00:03:27 make[3]: *** [scripts/Makefile.build:277: sound/core/oss/mixer_oss.o] Error 1 # 00:03:29 sound/core/oss/pcm_oss.c:2475:34: error: ‘setup’ is used uninitialized [-Werror=uninitialized] # 00:03:29 sound/core/oss/pcm_oss.c:108:29: error: ‘t’ is used uninitialized [-Werror=uninitialized] from # reset_artifacts: -10 # build_abe binutils: -9 # build_abe stage1: -5 # build_abe qemu: -2 # linux_n_obj: 21326 THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. This commit has regressed these CI configurations: - tcwg_kernel/gnu-master-aarch64-mainline-allmodconfig First_bad build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-mainlin… Last_good build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-mainlin… Baseline build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-mainlin… Even more details: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-mainlin… Reproduce builds: <cut> mkdir investigate-gcc-d8edfadfc7a9795b65177a50ce44fd348858e844 cd investigate-gcc-d8edfadfc7a9795b65177a50ce44fd348858e844 # Fetch scripts git clone https://git.linaro.org/toolchain/jenkins-scripts # Fetch manifests and test.sh script mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-mainlin… --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-mainlin… --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-mainlin… --fail chmod +x artifacts/test.sh # Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_kernel-build.sh @@ artifacts/manifests/build-baseline.sh # Save baseline build state (which is then restored in artifacts/test.sh) mkdir -p ./bisect rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/ cd gcc # Reproduce first_bad build git checkout --detach d8edfadfc7a9795b65177a50ce44fd348858e844 ../artifacts/test.sh # Reproduce last_good build git checkout --detach f36240f8c835d792f788b6724e272fc0a4a4f26f ../artifacts/test.sh cd .. </cut> Full commit (up to 1000 lines): <cut> commit d8edfadfc7a9795b65177a50ce44fd348858e844 Author: Aldy Hernandez <aldyh(a)redhat.com> Date: Mon Oct 4 09:47:02 2021 +0200 Disallow loop rotation and loop header crossing in jump threaders. There is a lot of fall-out from this patch, as there were many threading tests that assumed the restrictions introduced by this patch were valid. Some tests have merely shifted the threading to after loop optimizations, but others ended up with no threading opportunities at all. Surprisingly some tests ended up with more total threads. It was a crapshoot all around. On a postive note, there are 6 tests that no longer XFAIL, and one guality test which now passes. I felt a bit queasy about such a fundamental change wrt threading, so I ran it through my callgrind test harness (.ii files from a bootstrap). There was no change in overall compilation, DOM, or the VRP threaders. However, there was a slight increase of 1.63% in the backward threader. I'm pretty sure we could reduce this if we incorporated the restrictions into their profitability code. This way we could stop the search when we ran into one of these restrictions. Not sure it's worth it at this point. Tested on x86-64 Linux. Co-authored-by: Richard Biener <rguenther(a)suse.de> gcc/ChangeLog: * tree-ssa-threadupdate.c (cancel_thread): Dump threading reason on the same line as the threading cancellation. (jt_path_registry::cancel_invalid_paths): Avoid rotating loops. Avoid threading through loop headers where the path remains in the loop. libgomp/ChangeLog: * testsuite/libgomp.graphite/force-parallel-5.c: Remove xfail. gcc/testsuite/ChangeLog: * gcc.dg/Warray-bounds-87.c: Remove xfail. * gcc.dg/analyzer/pr94851-2.c: Remove xfail. * gcc.dg/graphite/pr69728.c: Remove xfail. * gcc.dg/graphite/scop-dsyr2k.c: Remove xfail. * gcc.dg/graphite/scop-dsyrk.c: Remove xfail. * gcc.dg/shrink-wrap-loop.c: Remove xfail. * gcc.dg/loop-8.c: Adjust for new threading restrictions. * gcc.dg/tree-ssa/ifc-20040816-1.c: Same. * gcc.dg/tree-ssa/pr21559.c: Same. * gcc.dg/tree-ssa/pr59597.c: Same. * gcc.dg/tree-ssa/pr71437.c: Same. * gcc.dg/tree-ssa/pr77445-2.c: Same. * gcc.dg/tree-ssa/ssa-dom-thread-4.c: Same. * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Same. * gcc.dg/vect/bb-slp-16.c: Same. * gcc.dg/tree-ssa/ssa-dom-thread-6.c: Remove. * gcc.dg/tree-ssa/ssa-dom-thread-18.c: Remove. * gcc.dg/tree-ssa/ssa-dom-thread-2a.c: Remove. * gcc.dg/tree-ssa/ssa-thread-invalid.c: New test. --- gcc/testsuite/gcc.dg/Warray-bounds-87.c | 2 +- gcc/testsuite/gcc.dg/analyzer/pr94851-2.c | 2 +- gcc/testsuite/gcc.dg/graphite/pr69728.c | 4 +- gcc/testsuite/gcc.dg/graphite/scop-dsyr2k.c | 2 +- gcc/testsuite/gcc.dg/graphite/scop-dsyrk.c | 2 +- gcc/testsuite/gcc.dg/loop-8.c | 19 ++-- gcc/testsuite/gcc.dg/shrink-wrap-loop.c | 54 +---------- gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/pr21559.c | 7 +- gcc/testsuite/gcc.dg/tree-ssa/pr59597.c | 10 +- gcc/testsuite/gcc.dg/tree-ssa/pr71437.c | 8 +- gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c | 3 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c | 27 ------ gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c | 21 ----- gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-4.c | 14 ++- gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c | 44 --------- gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c | 5 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c | 102 +++++++++++++++++++++ gcc/testsuite/gcc.dg/vect/bb-slp-16.c | 70 ++++++++------ gcc/tree-ssa-threadupdate.c | 26 +++++- .../testsuite/libgomp.graphite/force-parallel-5.c | 2 +- 21 files changed, 207 insertions(+), 219 deletions(-) diff --git a/gcc/testsuite/gcc.dg/Warray-bounds-87.c b/gcc/testsuite/gcc.dg/Warray-bounds-87.c index a49874df5da..a5457807c3a 100644 --- a/gcc/testsuite/gcc.dg/Warray-bounds-87.c +++ b/gcc/testsuite/gcc.dg/Warray-bounds-87.c @@ -33,7 +33,7 @@ static unsigned int h (int i, int j) case 9: return j; case 10: - return a[i]; // { dg-bogus "-Warray-bounds" "pr101671" { xfail *-*-* } } + return a[i]; // { dg-bogus "-Warray-bounds" "pr101671" } } return 0; } diff --git a/gcc/testsuite/gcc.dg/analyzer/pr94851-2.c b/gcc/testsuite/gcc.dg/analyzer/pr94851-2.c index 0acf48810c1..62176bdaee8 100644 --- a/gcc/testsuite/gcc.dg/analyzer/pr94851-2.c +++ b/gcc/testsuite/gcc.dg/analyzer/pr94851-2.c @@ -45,7 +45,7 @@ int pamark(void) { if (curbp->b_amark == (AMARK *)NULL) curbp->b_amark = p; else - last->m_next = p; /* { dg-warning "dereference of NULL 'last'" "deref" { xfail *-*-* } } */ + last->m_next = p; /* { dg-warning "dereference of NULL 'last'" "deref" } */ } p->m_name = (char)c; /* { dg-bogus "leak of 'p'" "bogus leak" } */ diff --git a/gcc/testsuite/gcc.dg/graphite/pr69728.c b/gcc/testsuite/gcc.dg/graphite/pr69728.c index 69e28318aaf..a6f385749c2 100644 --- a/gcc/testsuite/gcc.dg/graphite/pr69728.c +++ b/gcc/testsuite/gcc.dg/graphite/pr69728.c @@ -24,6 +24,4 @@ fn1 () run into scheduling issues before here, not being able to handle empty domains. */ -/* XFAILed by fix for PR86865. */ - -/* { dg-final { scan-tree-dump "loop nest optimized" "graphite" { xfail *-*-* } } } */ +/* { dg-final { scan-tree-dump "loop nest optimized" "graphite" } } */ diff --git a/gcc/testsuite/gcc.dg/graphite/scop-dsyr2k.c b/gcc/testsuite/gcc.dg/graphite/scop-dsyr2k.c index 925ae306903..41c91b97b57 100644 --- a/gcc/testsuite/gcc.dg/graphite/scop-dsyr2k.c +++ b/gcc/testsuite/gcc.dg/graphite/scop-dsyr2k.c @@ -17,4 +17,4 @@ void dsyr2k(int N) { #pragma endscop } -/* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite" { xfail *-*-* } } } */ +/* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite" } } */ diff --git a/gcc/testsuite/gcc.dg/graphite/scop-dsyrk.c b/gcc/testsuite/gcc.dg/graphite/scop-dsyrk.c index b748946fabb..e01a517be11 100644 --- a/gcc/testsuite/gcc.dg/graphite/scop-dsyrk.c +++ b/gcc/testsuite/gcc.dg/graphite/scop-dsyrk.c @@ -19,4 +19,4 @@ void dsyrk(int N) #pragma endscop } -/* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite" { xfail *-*-* } } } */ +/* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite" } } */ diff --git a/gcc/testsuite/gcc.dg/loop-8.c b/gcc/testsuite/gcc.dg/loop-8.c index 90ea1c45524..a685fc25056 100644 --- a/gcc/testsuite/gcc.dg/loop-8.c +++ b/gcc/testsuite/gcc.dg/loop-8.c @@ -11,18 +11,23 @@ f (int *a, int *b) { int i; - for (i = 0; i < 100; i++) + i = 100; + if (i > 0) { - int d = 42; + do + { + int d = 42; - a[i] = d; - if (i % 2) - d = i; - b[i] = d; + a[i] = d; + if (i % 2) + d = i; + b[i] = d; + ++i; + } + while (i < 100); } } /* Load of 42 is moved out of the loop, introducing a new pseudo register. */ -/* { dg-final { scan-rtl-dump-times "Decided" 1 "loop2_invariant" } } */ /* { dg-final { scan-rtl-dump-not "without introducing a new temporary register" "loop2_invariant" } } */ diff --git a/gcc/testsuite/gcc.dg/shrink-wrap-loop.c b/gcc/testsuite/gcc.dg/shrink-wrap-loop.c index 6e1be8937fe..ddc99e6b75a 100644 --- a/gcc/testsuite/gcc.dg/shrink-wrap-loop.c +++ b/gcc/testsuite/gcc.dg/shrink-wrap-loop.c @@ -1,58 +1,6 @@ /* { dg-do compile { target { { { i?86-*-* x86_64-*-* } && lp64 } || { arm_thumb2 } } } } */ /* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */ -/* -Our new threader is threading things a bit too early, and causing the -testcase in gcc.dg/shrink-wrap-loop.c to fail. - - The gist is this BB inside a loop: - - <bb 6> : - # p_2 = PHI <p2_6(D)(2), p_12(5)> - if (p_2 != 0B) - goto <bb 3>; [INV] - else - goto <bb 7>; [INV] - -Our threader can move this check outside of the loop (good). This is -done before branch probabilities are calculated and causes the probs -to be calculated as: - -<bb 2> [local count: 216361238]: - if (p2_6(D) != 0B) - goto <bb 7>; [54.59%] - else - goto <bb 6>; [45.41%] - -Logically this seems correct to me. A simple check outside of a loop -should slightly but not overwhelmingly favor a non-zero value. - -Interestingly however, the old threader couldn't get this, but the IL -ended up identical, albeit with different probabilities. What happens -is that, because the old code could not thread this, the p2 != 0 check -would remain inside the loop and probs would be calculated thusly: - - <bb 6> [local count: 1073741824]: - # p_2 = PHI <p2_6(D)(2), p_12(5)> - if (p_2 != 0B) - goto <bb 3>; [94.50%] - else - goto <bb 7>; [5.50%] - -Then when the loop header copying pass ("ch") shuffled things around, -the IL would end up identical to my early threader code, but with the -probabilities would remain as 94.5/5.5. - -The above discrepancy causes the RTL ifcvt pass to generate different -code, and by the time we get to the shrink wrapping pass, things look -sufficiently different such that the legacy code can actually shrink -wrap, whereas our new code does not. - -IMO, if the loop-ch pass moves conditionals outside of a loop, the -probabilities should be adjusted, but that does mean the shrink wrap -won't happen for this contrived testcase. - */ - int foo (int *p1, int *p2); int @@ -68,4 +16,4 @@ test (int *p1, int *p2) return 1; } -/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail *-*-* } } } */ +/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c index b55a533e374..f8a6495cbaa 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c @@ -39,4 +39,4 @@ int main1 () which is folded by vectorizer. Both outgoing edges must have probability 100% so the resulting profile match after folding. */ /* { dg-final { scan-tree-dump-times "Invalid sum of outgoing probabilities 200.0" 1 "ifcvt" } } */ -/* { dg-final { scan-tree-dump-times "Invalid sum of incoming counts" 1 "ifcvt" } } */ +/* { dg-final { scan-tree-dump-times "Invalid sum of incoming counts" 2 "ifcvt" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr21559.c b/gcc/testsuite/gcc.dg/tree-ssa/pr21559.c index 51b3b7ac755..43f046edabe 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/pr21559.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr21559.c @@ -35,10 +35,7 @@ void foo (void) /* First, we should simplify the bits < 0 test within the loop. */ /* { dg-final { scan-tree-dump-times "Simplified relational" 1 "evrp" } } */ -/* Second, we should thread the edge out of the loop via the break - statement. We also realize that the final bytes == 0 test is useless, - and thread over it. We also know that toread != 0 is useless when - entering while loop and thread over it. */ -/* { dg-final { scan-tree-dump-times "Threaded jump" 3 "vrp-thread1" } } */ +/* We used to check for 3 threaded jumps here, but they all would + rotate the loop. */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c b/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c index 2caa1f532ea..764b3fe2e80 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c @@ -56,11 +56,7 @@ main (int argc, char argv[]) return crc; } -/* Previously we had 3 jump threads, but one of them crossed loops. - The reason the old threader was allowing it, was because there was - an ASSERT_EXPR getting in the way. Without the ASSERT_EXPR, we - have an empty pre-header block as the final block in the thread, - which the threader will simply join with the next block which *is* - in a different loop. */ -/* { dg-final { scan-tree-dump-times "Registering jump thread" 2 "vrp-thread1" } } */ +/* None of the threads we can get in vrp-thread1 are valid. They all + cross or rotate loops. */ +/* { dg-final { scan-tree-dump-not "Registering jump thread" "vrp-thread1" } } */ /* { dg-final { scan-tree-dump-not "joiner" "vrp-thread1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr71437.c b/gcc/testsuite/gcc.dg/tree-ssa/pr71437.c index a2386ba19f0..eab3a25928e 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/pr71437.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr71437.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-ffast-math -O3 -fdump-tree-vrp-thread1-details" } */ +/* { dg-options "-ffast-math -O3 -fdump-tree-dom3-details" } */ int I = 50, J = 50; int S, L; @@ -39,4 +39,8 @@ void foo (int K) bar (LD, SD); } } -/* { dg-final { scan-tree-dump-times "Threaded jump " 2 "vrp-thread1" } } */ + +/* We used to get 1 vrp-thread1 candidates here, but they now get + deferred until after loop opts are done, because they were rotating + loops. */ +/* { dg-final { scan-tree-dump-times "Threaded jump " 2 "dom3" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c index 18f7aab2be7..f2a5e78e6be 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c @@ -123,8 +123,7 @@ enum STATES FMS( u8 **in , u32 *transitions) { aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. It's high enough to change decisions in switch expansion which in turn can expose new jump threading opportunities. Skip the later tests on aarch64. */ -/* { dg-final { scan-tree-dump "Jumps threaded: \[7-9\]" "thread1" } } */ -/* { dg-final { scan-tree-dump-times "Invalid sum" 1 "thread1" } } */ +/* { dg-final { scan-tree-dump "Jumps threaded: \[7-9\]" "thread2" } } */ /* { dg-final { scan-tree-dump-not "optimizing for size" "thread1" } } */ /* { dg-final { scan-tree-dump-not "optimizing for size" "thread2" } } */ /* { dg-final { scan-tree-dump-not "optimizing for size" "thread3" { target { ! aarch64*-*-* } } } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c deleted file mode 100644 index 0246ebf3c63..00000000000 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c +++ /dev/null @@ -1,27 +0,0 @@ -/* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-vrp-thread1-details -std=gnu89 --param logical-op-non-short-circuit=0" } */ - -#include "ssa-dom-thread-4.c" - -/* On targets that define LOGICAL_OP_NON_SHORT_CIRCUIT to 0, we split both - "a_elt || b_elt" and "b_elt && kill_elt" into two conditions each, - rather than using "(var1 != 0) op (var2 != 0)". Also, as on other targets, - we duplicate the header of the inner "while" loop. There are then - 4 threading opportunities: - - 1x "!a_elt && b_elt" in the outer "while" loop - -> the start of the inner "while" loop, - skipping the known-true "b_elt" in the first condition. - 1x "!b_elt" in the first condition - -> the outer "while" loop's continuation point, - skipping the known-false "b_elt" in the second condition. - 2x "kill_elt->indx >= b_elt->indx" in the first "while" loop - -> "kill_elt->indx == b_elt->indx" in the second condition, - skipping the known-true "b_elt && kill_elt" in the second - condition. - - All the cases are picked up by VRP1 as jump threads. */ - -/* There used to be 6 jump threads found by thread1, but they all - depended on threading through distinct loops in ethread. */ -/* { dg-final { scan-tree-dump-times "Threaded" 2 "vrp-thread1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c deleted file mode 100644 index 8f0a12c12ee..00000000000 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c +++ /dev/null @@ -1,21 +0,0 @@ -/* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-vrp-thread1-stats -fdump-tree-dom2-stats" } */ - -void bla(); - -/* In the following case, we should be able to thread edge through - the loop header. */ - -void thread_entry_through_header (void) -{ - int i; - - for (i = 0; i < 170; i++) - bla (); -} - -/* There's a single jump thread that should be handled by the VRP - jump threading pass. */ -/* { dg-final { scan-tree-dump-times "Jumps threaded: 1" 1 "vrp-thread1"} } */ -/* { dg-final { scan-tree-dump-times "Jumps threaded: 2" 0 "vrp-thread1"} } */ -/* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2"} } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-4.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-4.c index 46e464ff26a..9cd463571c4 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-4.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-4.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-vrp-thread1-details -fdump-tree-dom2-details -std=gnu89 --param logical-op-non-short-circuit=1" } */ +/* { dg-options "-O2 -fdump-tree-vrp-thread2-details -fdump-tree-dom2-details -std=gnu89 --param logical-op-non-short-circuit=1" } */ struct bitmap_head_def; typedef struct bitmap_head_def *bitmap; typedef const struct bitmap_head_def *const_bitmap; @@ -53,10 +53,8 @@ bitmap_ior_and_compl (bitmap dst, const_bitmap a, const_bitmap b, return changed; } -/* The block starting the second conditional has 3 incoming edges, - we should thread all three, but due to a bug in the threading - code we missed the edge when the first conditional is false - (b_elt is zero, which means the second conditional is always - zero. VRP1 catches all three. */ -/* { dg-final { scan-tree-dump-times "Registering jump thread" 2 "vrp-thread1" } } */ -/* { dg-final { scan-tree-dump-times "Path crosses loops" 1 "vrp-thread1" } } */ +/* We used to catch 3 jump threads in vrp-thread1, but they all + rotated the loop, so they were disallowed. This in turn created + other opportunities for the other threaders which result in the the + post-loop threader (vrp-thread2) catching more. */ +/* { dg-final { scan-tree-dump-times "Registering jump thread" 5 "vrp-thread2" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c deleted file mode 100644 index b0a7d423475..00000000000 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c +++ /dev/null @@ -1,44 +0,0 @@ -/* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3-details" } */ - -/* { dg-final { scan-tree-dump-times "Registering jump" 6 "thread1" } } */ -/* { dg-final { scan-tree-dump-times "Registering jump" 1 "thread3" } } */ - -int sum0, sum1, sum2, sum3; -int foo (char *s, char **ret) -{ - int state=0; - char c; - - for (; *s && state != 4; s++) - { - c = *s; - if (c == '*') - { - s++; - break; - } - switch (state) - { - case 0: - if (c == '+') - state = 1; - else if (c != '-') - sum0+=c; - break; - case 1: - if (c == '+') - state = 2; - else if (c == '-') - state = 0; - else - sum1+=c; - break; - default: - break; - } - - } - *ret = s; - return state; -} diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c index 16abcde5053..1da00a691c8 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c @@ -1,15 +1,14 @@ /* { dg-do compile } */ /* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */ -/* { dg-final { scan-tree-dump "Jumps threaded: 12" "thread1" } } */ -/* { dg-final { scan-tree-dump "Jumps threaded: 5" "thread3" { target { ! aarch64*-*-* } } } } */ +/* { dg-final { scan-tree-dump "Jumps threaded: 12" "thread3" } } */ /* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2" } } */ /* aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. It's high enough to change decisions in switch expansion which in turn can expose new jump threading opportunities. Skip the later tests on aarch64. */ /* { dg-final { scan-tree-dump-not "Jumps threaded" "dom3" { target { ! aarch64*-*-* } } } } */ -/* { dg-final { scan-tree-dump-not "Jumps threaded" "vrp2" { target { ! aarch64*-*-* } } } } */ +/* { dg-final { scan-tree-dump-not "Jumps threaded" "vrp-thread2" { target { ! aarch64*-*-* } } } } */ enum STATE { S0=0, diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c new file mode 100644 index 00000000000..bd56a62a4b4 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c @@ -0,0 +1,102 @@ +// { dg-do compile } +// { dg-options "-O2 -fgimple -fdump-statistics" } +// +// This is a collection of seemingly threadble paths that should not be allowed. + +void foobar (int); + +// Possible thread from 2->4->3, but it would rotate the loop. +void __GIMPLE (ssa) +f1 () +{ + int i; + + // Pre-header. + __BB(2): + goto __BB4; + + // Latch. + __BB(3): + foobar (i_1); + i_5 = i_1 + 1; + goto __BB4; + + __BB(4,loop_header(1)): + i_1 = __PHI (__BB2: 0, __BB3: i_5); + if (i_1 != 101) + goto __BB3; + else + goto __BB5; + + __BB(5): + return; + +} + +// Possible thread from 2->3->5 but threading through the empty latch +// would create a non-empty latch. +void __GIMPLE (ssa) +f2 () +{ + int i; + + // Pre-header. + __BB(2): + goto __BB3; + + __BB(3,loop_header(1)): + i_8 = __PHI (__BB5: i_5, __BB2: 0); + foobar (i_8); + i_5 = i_8 + 1; + if (i_5 != 256) + goto __BB5; + else + goto __BB4; + + // Latch. + __BB(5): + goto __BB3; + + __BB(4): + return; + +} + +// Possible thread from 3->5->6->3 but this would thread through the +// header but not exit the loop. +int __GIMPLE (ssa) +f3 (int a) +{ + int i; + + __BB(2): + goto __BB6; + + __BB(3): + if (i_1 != 0) + goto __BB4; + else + goto __BB5; + + __BB(4): + foobar (5); + goto __BB5; + + // Latch. + __BB(5): + i_7 = i_1 + 1; + goto __BB6; + + __BB(6,loop_header(1)): + i_1 = __PHI (__BB2: 1, __BB5: i_7); + if (i_1 <= 99) + goto __BB3; + else + goto __BB7; + + __BB(7): + return; + +} + +// { dg-final { scan-tree-dump-not "Jumps threaded" "statistics" } } diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c index e68a9b62535..4fc176dde84 100644 --- a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c @@ -16,41 +16,52 @@ main1 (int dummy) unsigned int *pin = &in[0]; unsigned int *pout = &out[0]; unsigned int a = 0; - - for (i = 0; i < N; i++) + + i = N; + if (i > 0) { - *pout++ = *pin++ + a; - *pout++ = *pin++ + a; - *pout++ = *pin++ + a; - *pout++ = *pin++ + a; - *pout++ = *pin++ + a; - *pout++ = *pin++ + a; - *pout++ = *pin++ + a; - *pout++ = *pin++ + a; - if (arr[i] = i) - a = i; - else - a = 2; + do + { + *pout++ = *pin++ + a; + *pout++ = *pin++ + a; + *pout++ = *pin++ + a; + *pout++ = *pin++ + a; + *pout++ = *pin++ + a; + *pout++ = *pin++ + a; + *pout++ = *pin++ + a; + *pout++ = *pin++ + a; + if (arr[i] = i) + a = i; + else + a = 2; + } + while (i < N); } a = 0; - /* check results: */ - for (i = 0; i < N; i++) + /* check results: */ + i = N; + if (i > 0) { - if (out[i*8] != in[i*8] + a - || out[i*8 + 1] != in[i*8 + 1] + a - || out[i*8 + 2] != in[i*8 + 2] + a - || out[i*8 + 3] != in[i*8 + 3] + a - || out[i*8 + 4] != in[i*8 + 4] + a - || out[i*8 + 5] != in[i*8 + 5] + a - || out[i*8 + 6] != in[i*8 + 6] + a - || out[i*8 + 7] != in[i*8 + 7] + a) - abort (); + do + { + if (out[i*8] != in[i*8] + a + || out[i*8 + 1] != in[i*8 + 1] + a + || out[i*8 + 2] != in[i*8 + 2] + a + || out[i*8 + 3] != in[i*8 + 3] + a + || out[i*8 + 4] != in[i*8 + 4] + a + || out[i*8 + 5] != in[i*8 + 5] + a + || out[i*8 + 6] != in[i*8 + 6] + a + || out[i*8 + 7] != in[i*8 + 7] + a) + abort (); - if (arr[i] = i) - a = i; - else - a = 2; + if (arr[i] = i) + a = i; + else + a = 2; + i++; + } + while (i < N); } return 0; @@ -66,4 +77,3 @@ int main (void) } /* { dg-final { scan-tree-dump-times "optimized: basic block" 1 "slp1" } } */ - diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c index 32ce1e3af40..293836cdc53 100644 --- a/gcc/tree-ssa-threadupdate.c +++ b/gcc/tree-ssa-threadupdate.c @@ -278,7 +278,7 @@ cancel_thread (vec<jump_thread_edge *> *path, const char *reason = NULL) if (dump_file && (dump_flags & TDF_DETAILS)) { if (reason) - fprintf (dump_file, "%s:\n", reason); + fprintf (dump_file, "%s: ", reason); dump_jump_thread_path (dump_file, *path, false); fprintf (dump_file, "\n"); @@ -2771,6 +2771,7 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path) bool seen_latch = false; int loops_crossed = 0; bool crossed_latch = false; + bool crossed_loop_header = false; // Use ->dest here instead of ->src to ignore the first block. The // first block is allowed to be in a different loop, since it'll be // redirected. See similar comment in profitable_path_p: "we don't @@ -2804,6 +2805,14 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path) ++loops_crossed; } + // ?? Avoid threading through loop headers that remain in the + // loop, as such threadings tend to create sub-loops which + // _might_ be OK ??. + if (e->dest->loop_father->header == e->dest + && !flow_loop_nested_p (exit->dest->loop_father, + e->dest->loop_father)) + crossed_loop_header = true; + if (flag_checking && !m_backedge_threads) gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0); } @@ -2829,6 +2838,21 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path) cancel_thread (&path, "Path crosses loops"); return true; } + // The path should either start and end in the same loop or exit the + // loop it starts in but never enter a loop. This also catches + // creating irreducible loops, not only rotation. + if (entry->src->loop_father != exit->dest->loop_father + && !flow_loop_nested_p (exit->src->loop_father, + entry->dest->loop_father)) + { + cancel_thread (&path, "Path rotates loop"); + return true; + } + if (crossed_loop_header) + { + cancel_thread (&path, "Path crosses loop header but does not exit it"); + return true; + } return false; } diff --git a/libgomp/testsuite/libgomp.graphite/force-parallel-5.c b/libgomp/testsuite/libgomp.graphite/force-parallel-5.c index b83ca79dfae..de31d6436f5 100644 --- a/libgomp/testsuite/libgomp.graphite/force-parallel-5.c +++ b/libgomp/testsuite/libgomp.graphite/force-parallel-5.c @@ -31,6 +31,6 @@ int main(void) } /* Check that parallel code generation part make the right answer. */ -/* { dg-final { scan-tree-dump-times "2 loops carried no dependency" 1 "graphite" { xfail *-*-* } } } */ +/* { dg-final { scan-tree-dump-times "2 loops carried no dependency" 1 "graphite" } } */ /* { dg-final { scan-tree-dump-times "loopfn.0" 4 "optimized" } } */ /* { dg-final { scan-tree-dump-times "loopfn.1" 4 "optimized" } } */ </cut>

4 years, 5 months

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

linaro-toolchain October 2021