Successfully identified regression in *llvm* in CI configuration tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO. So far, this commit has regressed CI configurations: - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO
Culprit: <cut> commit 8fe62b7af1127691d6732438b322e66ae6d39a03 Author: Max Kazantsev mkazantsev@azul.com Date: Thu Apr 22 12:50:38 2021 +0700
[GVN] Introduce loop load PRE
This patch allows PRE of the following type of loads:
``` preheader: br label %loop
loop: br i1 ..., label %merge, label %clobber
clobber: call foo() // Clobbers %p br label %merge
merge: ... br i1 ..., label %loop, label %exit
```
Into ``` preheader: %x0 = load %p br label %loop
loop: %x.pre = phi(x0, x2) br i1 ..., label %merge, label %clobber
clobber: call foo() // Clobbers %p %x1 = load %p br label %merge
merge: x2 = phi(x.pre, x1) ... br i1 ..., label %loop, label %exit
```
So instead of loading from %p on every iteration, we load only when the actual clobber happens. The typical pattern which it is trying to address is: hot loop, with all code inlined and provably having no side effects, and some side-effecting calls on cold path.
The worst overhead from it is, if we always take clobber block, we make 1 more load overall (in preheader). It only matters if loop has very few iteration. If clobber block is not taken at least once, the transform is neutral or profitable.
There are several improvements prospect open up: - We can sometimes be smarter in loop-exiting blocks via split of critical edges; - If we have block frequency info, we can handle multiple clobbers. The only obstacle now is that we don't know if their sum is colder than the header.
Differential Revision: https://reviews.llvm.org/D99926 Reviewed By: reames </cut>
Results regressed to (for first_bad == 8fe62b7af1127691d6732438b322e66ae6d39a03) # reset_artifacts: -10 # build_abe binutils: -9 # build_abe stage1 -- --set gcc_override_configure=--disable-libsanitizer: -8 # build_abe linux: -7 # build_abe glibc: -6 # build_abe stage2 -- --set gcc_override_configure=--disable-libsanitizer: -5 # build_llvm true: -3 # true: 0 # benchmark -O2_LTO -- artifacts/build-8fe62b7af1127691d6732438b322e66ae6d39a03/results_id: 1 # 464.h264ref,h264ref_base.default regressed by 104
from (for last_good == 722d4d8e7585457d407d0639a4ae2610157e06a8) # reset_artifacts: -10 # build_abe binutils: -9 # build_abe stage1 -- --set gcc_override_configure=--disable-libsanitizer: -8 # build_abe linux: -7 # build_abe glibc: -6 # build_abe stage2 -- --set gcc_override_configure=--disable-libsanitizer: -5 # build_llvm true: -3 # true: 0 # benchmark -O2_LTO -- artifacts/build-722d4d8e7585457d407d0639a4ae2610157e06a8/results_id: 1
Artifacts of last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... Results ID of last_good: tx1_64/tcwg_bmk_llvm_tx1/bisect-llvm-master-aarch64-spec2k6-O2_LTO/641 Artifacts of first_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... Results ID of first_bad: tx1_64/tcwg_bmk_llvm_tx1/bisect-llvm-master-aarch64-spec2k6-O2_LTO/642 Build top page/logs: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Configuration details:
Reproduce builds: <cut> mkdir investigate-llvm-8fe62b7af1127691d6732438b322e66ae6d39a03 cd investigate-llvm-8fe62b7af1127691d6732438b322e66ae6d39a03
git clone https://git.linaro.org/toolchain/jenkins-scripts
mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
cd llvm
# Reproduce first_bad build git checkout --detach 8fe62b7af1127691d6732438b322e66ae6d39a03 ../artifacts/test.sh
# Reproduce last_good build git checkout --detach 722d4d8e7585457d407d0639a4ae2610157e06a8 ../artifacts/test.sh
cd .. </cut>
History of pending regressions and results: https://git.linaro.org/toolchain/ci/base-artifacts.git/log/?h=linaro-local/c...
Artifacts: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... Build log: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Full commit (up to 1000 lines): <cut> commit 8fe62b7af1127691d6732438b322e66ae6d39a03 Author: Max Kazantsev mkazantsev@azul.com Date: Thu Apr 22 12:50:38 2021 +0700
[GVN] Introduce loop load PRE
This patch allows PRE of the following type of loads:
``` preheader: br label %loop
loop: br i1 ..., label %merge, label %clobber
clobber: call foo() // Clobbers %p br label %merge
merge: ... br i1 ..., label %loop, label %exit
```
Into ``` preheader: %x0 = load %p br label %loop
loop: %x.pre = phi(x0, x2) br i1 ..., label %merge, label %clobber
clobber: call foo() // Clobbers %p %x1 = load %p br label %merge
merge: x2 = phi(x.pre, x1) ... br i1 ..., label %loop, label %exit
```
So instead of loading from %p on every iteration, we load only when the actual clobber happens. The typical pattern which it is trying to address is: hot loop, with all code inlined and provably having no side effects, and some side-effecting calls on cold path.
The worst overhead from it is, if we always take clobber block, we make 1 more load overall (in preheader). It only matters if loop has very few iteration. If clobber block is not taken at least once, the transform is neutral or profitable.
There are several improvements prospect open up: - We can sometimes be smarter in loop-exiting blocks via split of critical edges; - If we have block frequency info, we can handle multiple clobbers. The only obstacle now is that we don't know if their sum is colder than the header.
Differential Revision: https://reviews.llvm.org/D99926 Reviewed By: reames --- llvm/include/llvm/Transforms/Scalar/GVN.h | 6 ++ llvm/lib/Transforms/Scalar/GVN.cpp | 92 +++++++++++++++++++++++++-- llvm/test/Transforms/GVN/PRE/pre-loop-load.ll | 9 ++- 3 files changed, 98 insertions(+), 9 deletions(-)
diff --git a/llvm/include/llvm/Transforms/Scalar/GVN.h b/llvm/include/llvm/Transforms/Scalar/GVN.h index 13f55ddcf2d6..70662ca213c3 100644 --- a/llvm/include/llvm/Transforms/Scalar/GVN.h +++ b/llvm/include/llvm/Transforms/Scalar/GVN.h @@ -328,6 +328,12 @@ private: bool PerformLoadPRE(LoadInst *Load, AvailValInBlkVect &ValuesPerBlock, UnavailBlkVect &UnavailableBlocks);
+ /// Try to replace a load which executes on each loop iteraiton with Phi + /// translation of load in preheader and load(s) in conditionally executed + /// paths. + bool performLoopLoadPRE(LoadInst *Load, AvailValInBlkVect &ValuesPerBlock, + UnavailBlkVect &UnavailableBlocks); + /// Eliminates partially redundant \p Load, replacing it with \p /// AvailableLoads (connected by Phis if needed). void eliminatePartiallyRedundantLoad( diff --git a/llvm/lib/Transforms/Scalar/GVN.cpp b/llvm/lib/Transforms/Scalar/GVN.cpp index 846a5cdb33d1..29da739fa16e 100644 --- a/llvm/lib/Transforms/Scalar/GVN.cpp +++ b/llvm/lib/Transforms/Scalar/GVN.cpp @@ -91,13 +91,14 @@ using namespace PatternMatch;
#define DEBUG_TYPE "gvn"
-STATISTIC(NumGVNInstr, "Number of instructions deleted"); -STATISTIC(NumGVNLoad, "Number of loads deleted"); -STATISTIC(NumGVNPRE, "Number of instructions PRE'd"); +STATISTIC(NumGVNInstr, "Number of instructions deleted"); +STATISTIC(NumGVNLoad, "Number of loads deleted"); +STATISTIC(NumGVNPRE, "Number of instructions PRE'd"); STATISTIC(NumGVNBlocks, "Number of blocks merged"); -STATISTIC(NumGVNSimpl, "Number of instructions simplified"); +STATISTIC(NumGVNSimpl, "Number of instructions simplified"); STATISTIC(NumGVNEqProp, "Number of equalities propagated"); -STATISTIC(NumPRELoad, "Number of loads PRE'd"); +STATISTIC(NumPRELoad, "Number of loads PRE'd"); +STATISTIC(NumPRELoopLoad, "Number of loop loads PRE'd");
STATISTIC(IsValueFullyAvailableInBlockNumSpeculationsMax, "Number of blocks speculated as available in " @@ -1447,6 +1448,84 @@ bool GVN::PerformLoadPRE(LoadInst *Load, AvailValInBlkVect &ValuesPerBlock, return true; }
+bool GVN::performLoopLoadPRE(LoadInst *Load, AvailValInBlkVect &ValuesPerBlock, + UnavailBlkVect &UnavailableBlocks) { + if (!LI) + return false; + + const Loop *L = LI->getLoopFor(Load->getParent()); + // TODO: Generalize to other loop blocks that dominate the latch. + if (!L || L->getHeader() != Load->getParent()) + return false; + + BasicBlock *Preheader = L->getLoopPreheader(); + BasicBlock *Latch = L->getLoopLatch(); + if (!Preheader || !Latch) + return false; + + Value *LoadPtr = Load->getPointerOperand(); + // Must be available in preheader. + if (!L->isLoopInvariant(LoadPtr)) + return false; + + // We plan to hoist the load to preheader without introducing a new fault. + // In order to do it, we need to prove that we cannot side-exit the loop + // once loop header is first entered before execution of the load. + if (ICF->isDominatedByICFIFromSameBlock(Load)) + return false; + + BasicBlock *LoopBlock = nullptr; + for (auto *Blocker : UnavailableBlocks) { + // Blockers from outside the loop are handled in preheader. + if (!L->contains(Blocker)) + continue; + + // Only allow one loop block. Loop header is not less frequently executed + // than each loop block, and likely it is much more frequently executed. But + // in case of multiple loop blocks, we need extra information (such as block + // frequency info) to understand whether it is profitable to PRE into + // multiple loop blocks. + if (LoopBlock) + return false; + + // Do not sink into inner loops. This may be non-profitable. + if (L != LI->getLoopFor(Blocker)) + return false; + + // Blocks that dominate the latch execute on every single iteration, maybe + // except the last one. So PREing into these blocks doesn't make much sense + // in most cases. But the blocks that do not necessarily execute on each + // iteration are sometimes much colder than the header, and this is when + // PRE is potentially profitable. + if (DT->dominates(Blocker, Latch)) + return false; + + // Make sure that the terminator itself doesn't clobber. + if (Blocker->getTerminator()->mayWriteToMemory()) + return false; + + LoopBlock = Blocker; + } + + if (!LoopBlock) + return false; + + // Make sure the memory at this pointer cannot be freed, therefore we can + // safely reload from it after clobber. + if (LoadPtr->canBeFreed()) + return false; + + // TODO: Support critical edge splitting if blocker has more than 1 successor. + MapVector<BasicBlock *, Value *> AvailableLoads; + AvailableLoads[LoopBlock] = LoadPtr; + AvailableLoads[Preheader] = LoadPtr; + + LLVM_DEBUG(dbgs() << "GVN REMOVING PRE LOOP LOAD: " << *Load << '\n'); + eliminatePartiallyRedundantLoad(Load, ValuesPerBlock, AvailableLoads); + ++NumPRELoopLoad; + return true; +} + static void reportLoadElim(LoadInst *Load, Value *AvailableValue, OptimizationRemarkEmitter *ORE) { using namespace ore; @@ -1544,7 +1623,8 @@ bool GVN::processNonLocalLoad(LoadInst *Load) { if (!isLoadInLoopPREEnabled() && LI && LI->getLoopFor(Load->getParent())) return Changed;
- return Changed || PerformLoadPRE(Load, ValuesPerBlock, UnavailableBlocks); + return Changed || PerformLoadPRE(Load, ValuesPerBlock, UnavailableBlocks) || + performLoopLoadPRE(Load, ValuesPerBlock, UnavailableBlocks); }
static bool impliesEquivalanceIfTrue(CmpInst* Cmp) { diff --git a/llvm/test/Transforms/GVN/PRE/pre-loop-load.ll b/llvm/test/Transforms/GVN/PRE/pre-loop-load.ll index 15bc49e7ab9a..8ca12284d5c4 100644 --- a/llvm/test/Transforms/GVN/PRE/pre-loop-load.ll +++ b/llvm/test/Transforms/GVN/PRE/pre-loop-load.ll @@ -7,22 +7,25 @@ declare void @may_free_memory()
declare i32 @personality_function()
-; TODO: We can PRE the load from gc-managed memory away from the hot path. +; We can PRE the load from gc-managed memory away from the hot path. define i32 @test_load_on_cold_path_gc(i32 addrspace(1)* %p) gc "statepoint-example" personality i32 ()* @"personality_function" { ; CHECK-LABEL: @test_load_on_cold_path_gc( ; CHECK-NEXT: entry: +; CHECK-NEXT: [[X_PRE1:%.*]] = load i32, i32 addrspace(1)* [[P:%.*]], align 4 ; CHECK-NEXT: br label [[LOOP:%.*]] ; CHECK: loop: -; CHECK-NEXT: [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ] -; CHECK-NEXT: [[X:%.*]] = load i32, i32 addrspace(1)* [[P:%.*]], align 4 +; CHECK-NEXT: [[X:%.*]] = phi i32 [ [[X_PRE1]], [[ENTRY:%.*]] ], [ [[X2:%.*]], [[BACKEDGE:%.*]] ] +; CHECK-NEXT: [[IV:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE]] ] ; CHECK-NEXT: [[COND:%.*]] = icmp ne i32 [[X]], 0 ; CHECK-NEXT: br i1 [[COND]], label [[HOT_PATH:%.*]], label [[COLD_PATH:%.*]] ; CHECK: hot_path: ; CHECK-NEXT: br label [[BACKEDGE]] ; CHECK: cold_path: ; CHECK-NEXT: call void @may_free_memory() +; CHECK-NEXT: [[X_PRE:%.*]] = load i32, i32 addrspace(1)* [[P]], align 4 ; CHECK-NEXT: br label [[BACKEDGE]] ; CHECK: backedge: +; CHECK-NEXT: [[X2]] = phi i32 [ [[X_PRE]], [[COLD_PATH]] ], [ [[X]], [[HOT_PATH]] ] ; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], [[X]] ; CHECK-NEXT: [[LOOP_COND:%.*]] = icmp ult i32 [[IV_NEXT]], 1000 ; CHECK-NEXT: br i1 [[LOOP_COND]], label [[LOOP]], label [[EXIT:%.*]] </cut>