Re: [TCWG CI] 453.povray failed to build after llvm: [SLP]Fix reused extracts cost.

7 Dec 2021

I committed a fix yesterday, should be fixed. Another one planning to commit later today or tomorrow.
Best regards,
Alexey Bataev
...
7 дек. 2021 г., в 07:08, Maxim Kuvyrkov maxim.kuvyrkov@linaro.org написал(а):
Hi Alexey,
After your patch Clang crashes while building 453.povray for aarch64-linux-gnu.  Apparently, this happens only with LTO enabled at -O2 and -O3.
Did you get any bug reports against this patch already?
Thanks,
--
Maxim Kuvyrkov
https://www.linaro.org
...
On 5 Dec 2021, at 02:55, ci_notify@linaro.org wrote:
After llvm commit ba74bb3a226e1b4660537f274627285b1bf41ee1
Author: Alexey Bataev a.bataev@outlook.com
[SLP]Fix reused extracts cost.
the following benchmarks slowed down by more than 2%:

453.povray failed to build

Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection.  Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:

First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...

Configuration:

Benchmark: SPEC CPU2006
Toolchain: Clang + Glibc + LLVM Linker
Version: all components were built from their tip of trunk
Target: aarch64-linux-gnu
Compiler flags: -O3 -flto
Hardware: NVidia TX1 4x Cortex-A57

This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org .  In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF.  BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:

tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3_LTO

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Reproduce builds:
<cut>
mkdir investigate-llvm-ba74bb3a226e1b4660537f274627285b1bf41ee1
cd investigate-llvm-ba74bb3a226e1b4660537f274627285b1bf41ee1
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach ba74bb3a226e1b4660537f274627285b1bf41ee1
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 78cc133c63173a4b5b7a43750cc507d4cff683cf
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit ba74bb3a226e1b4660537f274627285b1bf41ee1
Author: Alexey Bataev <a.bataev@outlook.com>
Date:   Thu Dec 2 04:22:55 2021 -0800
[SLP]Fix reused extracts cost.
If the extractelement instruction is used multiple times in the
  different tree entries (either vectorized, or gathered), need to
  compensate the scalar cost of such instructions. They are completely
  removed if all users are part of the tree but we need to compensate the
  cost only once for each instruction.
Differential Revision: https://reviews.llvm.org/D114958
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp    | 29 +++++++++++++---------
.../X86/extractelement-multiple-uses.ll            | 23 +++++++++--------
2 files changed, 29 insertions(+), 23 deletions(-)

diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 95061e9053fa..335ad6c85387 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -4287,8 +4287,8 @@ bool BoUpSLP::canReuseExtract(ArrayRef<Value *> VL, Value *OpValue,
bool BoUpSLP::areAllUsersVectorized(Instruction *I,
                                   ArrayRef<Value *> VectorizedVals) const {
 return (I->hasOneUse() && is_contained(VectorizedVals, I)) ||

    llvm::all_of(I->users(), [this](User *U) {


      return ScalarToTreeEntry.count(U) > 0;




    all_of(I->users(), [this](User *U) {


      return ScalarToTreeEntry.count(U) > 0 || MustGather.contains(U);
  });



}
@@ -4442,9 +4442,9 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E,
 // FIXME: it tries to fix a problem with MSVC buildbots.
 TargetTransformInfo &TTIRef = *TTI;
 auto &&AdjustExtractsCost = [this, &TTIRef, CostKind, VL, VecTy,

                          VectorizedVals](InstructionCost &Cost,


                                          bool IsGather) {




                          VectorizedVals, E](InstructionCost &Cost) {

DenseMap<Value *, int> ExtractVectorsTys;
SmallPtrSet<Value *, 4> CheckedExtracts;
for (auto *V : VL) {
if (isa<UndefValue>(V))
  continue;

@@ -4452,7 +4452,12 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E,
     // instruction itself is not going to be vectorized, consider this
     // instruction as dead and remove its cost from the final cost of the
     // vectorized tree.

 if (!areAllUsersVectorized(cast<Instruction>(V), VectorizedVals))




 // Also, avoid adjusting the cost for extractelements with multiple uses


 // in different graph entries.


 const TreeEntry *VE = getTreeEntry(V);


 if (!CheckedExtracts.insert(V).second ||


     !areAllUsersVectorized(cast<Instruction>(V), VectorizedVals) ||


     (VE && VE != E))
 continue;

auto *EE = cast<ExtractElementInst>(V);
   Optional<unsigned> EEIdx = getExtractIndex(EE);

@@ -4549,11 +4554,6 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E,
     }
     return GatherCost;
   }

if (isSplat(VL)) {
 // Found the broadcasting of the single scalar, calculate the cost as the


 // broadcast.


 return TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy);


}
if ((E->getOpcode() == Instruction::ExtractElement ||
   all_of(E->Scalars,
          [](Value *V) {

@@ -4571,13 +4571,18 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E,
       // single input vector or of 2 input vectors.
       InstructionCost Cost =
           computeExtractCost(VL, VecTy, *ShuffleKind, Mask, *TTI);

   AdjustExtractsCost(Cost, /*IsGather=*/true);




   AdjustExtractsCost(Cost);
 if (NeedToShuffleReuses)
   Cost += TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc,
                               FinalVecTy, E->ReuseShuffleIndices);
 return Cost;

}
 }
if (isSplat(VL)) {
 // Found the broadcasting of the single scalar, calculate the cost as the


 // broadcast.


 return TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy);


}
InstructionCost ReuseShuffleCost = 0;
if (NeedToShuffleReuses)
ReuseShuffleCost = TTI->getShuffleCost(

@@ -4755,7 +4760,7 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E,
             TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, I);
       }
     } else {

   AdjustExtractsCost(CommonCost, /*IsGather=*/false);




   AdjustExtractsCost(CommonCost);

}
   return CommonCost;
 }

diff --git a/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll b/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll
index c47f255f0bfe..31696752bbb3 100644
--- a/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll
+++ b/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll
@@ -2,24 +2,25 @@
; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -march=core-avx2 -pass-remarks-output=%t | FileCheck %s
; RUN: FileCheck %s --input-file=%t --check-prefix=YAML
-; YAML: --- !Missed
+; YAML: --- !Passed
; YAML: Pass:            slp-vectorizer
-; YAML: Name:            NotBeneficial
+; YAML: Name:            VectorizedList
; YAML: Function:        multi_uses
; YAML: Args:
-; YAML:  - String:          'List vectorization was possible but not beneficial with cost '
-; YAML:  - Cost:            '0'
-; YAML:  - String:          ' >= '
-; YAML:  - Treshold:        '0'
+; YAML:  - String:          'SLP vectorized with cost '
+; YAML:  - Cost:            '-1'
+; YAML:  - String:          ' and with tree size '
+; YAML:  - TreeSize:        '3'
define float @multi_uses(<2 x float> %x, <2 x float> %y) {
; CHECK-LABEL: @multi_uses(
-; CHECK-NEXT:    [[X0:%.*]] = extractelement <2 x float> [[X:%.*]], i32 0
-; CHECK-NEXT:    [[X1:%.*]] = extractelement <2 x float> [[X]], i32 1
; CHECK-NEXT:    [[Y1:%.*]] = extractelement <2 x float> [[Y:%.*]], i32 1
-; CHECK-NEXT:    [[X0X0:%.*]] = fmul float [[X0]], [[Y1]]
-; CHECK-NEXT:    [[X1X1:%.*]] = fmul float [[X1]], [[Y1]]
-; CHECK-NEXT:    [[ADD:%.*]] = fadd float [[X0X0]], [[X1X1]]
+; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x float> poison, float [[Y1]], i32 0
+; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <2 x float> [[TMP1]], float [[Y1]], i32 1
+; CHECK-NEXT:    [[TMP3:%.*]] = fmul <2 x float> [[X:%.*]], [[TMP2]]
+; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <2 x float> [[TMP3]], i32 0
+; CHECK-NEXT:    [[TMP5:%.*]] = extractelement <2 x float> [[TMP3]], i32 1
+; CHECK-NEXT:    [[ADD:%.*]] = fadd float [[TMP4]], [[TMP5]]
; CHECK-NEXT:    ret float [[ADD]]
;
 %x0 = extractelement <2 x float> %x, i32 0
</cut>

    

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

Re: [TCWG CI] 453.povray failed to build after llvm: [SLP]Fix reused extracts cost.

Differential Revision: https://reviews.llvm.org/D114958