Re: [TCWG CI] 433.milc:[.] mult_su3_mat_vec slowed down by 11% after llvm: [AMDGPU] Enable load clustering in the post-RA scheduler

26 Oct 2021

Hi Jay,
This is a false positive.  We'll look into why this report was sent out.
Regards,
--
Maxim Kuvyrkov
https://www.linaro.org
...
On 26 Oct 2021, at 22:19, ci_notify@linaro.org wrote:
After llvm commit 66e13c7f439cf162d7ed1d25883e71a5755ac7ec
Author: Jay Foad <jay.foad@amd.com>
[AMDGPU] Enable load clustering in the post-RA scheduler
the following hot functions slowed down by more than 10% (but their benchmarks slowed down by less than 2%):

433.milc:[.] mult_su3_mat_vec slowed down by 11% from 2163 to 2391 perf samples
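The 11% figure follows from the two sample counts above. A minimal sketch of the arithmetic, assuming the slowdown is the relative increase in perf samples rounded to the nearest percent; the function names are hypothetical, and the CI's actual rounding and threshold logic are not shown in this report:

```python
def slowdown_percent(before: int, after: int) -> float:
    """Relative increase in perf samples, in percent."""
    return (after - before) / before * 100.0


def is_reportable(func_pct: float, bench_pct: float) -> bool:
    # Thresholds taken from the report text: the function slowed down by
    # more than 10% while its benchmark slowed down by less than 2%.
    return func_pct > 10.0 and bench_pct < 2.0


# Sample counts quoted in the report for mult_su3_mat_vec.
pct = slowdown_percent(2163, 2391)
print(f"mult_su3_mat_vec: {pct:.2f}% more samples")
```

The unrounded value is about 10.54%, so the function only just crosses the 10% reporting threshold.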

The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection.  Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:

First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...

Configuration:

Benchmark: SPEC CPU2006
Toolchain: Clang + Glibc + LLVM Linker
Version: all components were built from their tip of trunk
Target: aarch64-linux-gnu
Compiler flags: -O2
Hardware: NVidia TX1 4x Cortex-A57

This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org.  We plan to add support for SPEC CPU2017 benchmarks and to provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF.  BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:

tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a...
Reproduce builds:
<cut>
mkdir investigate-llvm-66e13c7f439cf162d7ed1d25883e71a5755ac7ec
cd investigate-llvm-66e13c7f439cf162d7ed1d25883e71a5755ac7ec
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-a... --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach 66e13c7f439cf162d7ed1d25883e71a5755ac7ec
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 838b4a533e6853d44e0c6d1977bcf0b06557d4ab
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 66e13c7f439cf162d7ed1d25883e71a5755ac7ec
Author: Jay Foad <jay.foad@amd.com>
Date:   Tue Oct 12 15:39:43 2021 +0100
[AMDGPU] Enable load clustering in the post-RA scheduler
This has a couple of benefits:

- It can sometimes fix clusters that got broken apart when the register
  allocator inserted a copy.
- Post-RA scheduling does not have to worry about increasing register
  pressure, which in some cases gives it more freedom to reorder
  instructions.

Testing on a collection of 10,000 graphics shaders compiled for gfx1010
showed:

- The average length of each run of one or more load instructions
  increased by about 1%.
- The number of runs of two or more load instructions increased by
  about 4%.

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp             | 1 +
llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll | 5 ++---
llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll             | 5 +++--
llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll     | 4 ++--
llvm/test/CodeGen/AMDGPU/idiv-licm.ll                      | 2 +-
llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll     | 6 +++---
llvm/test/CodeGen/AMDGPU/sdiv64.ll                         | 2 +-
llvm/test/CodeGen/AMDGPU/srem64.ll                         | 2 +-
llvm/test/CodeGen/AMDGPU/udiv64.ll                         | 2 +-
llvm/test/CodeGen/AMDGPU/urem64.ll                         | 2 +-
10 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index b0902465c592..7b2d56e88b5f 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -825,6 +825,7 @@ public:
  createPostMachineScheduler(MachineSchedContext *C) const override {
    ScheduleDAGMI *DAG = createGenericSchedPostRA(C);
    const GCNSubtarget &ST = C->MF->getSubtarget<GCNSubtarget>();
+    DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
    DAG->addMutation(ST.createFillMFMAShadowMutation(DAG->TII));
    return DAG;
  }

diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll
index fa500054e058..804dea705011 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll
@@ -185,21 +185,20 @@ define i128 @extractelement_vgpr_v4i128_vgpr_idx(<4 x i128> addrspace(1)* %ptr,
; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX8-NEXT:    v_add_u32_e32 v3, vcc, 16, v0
; GFX8-NEXT:    v_addc_u32_e32 v4, vcc, 0, v1, vcc
-; GFX8-NEXT:    flat_load_dwordx4 v[8:11], v[0:1]
; GFX8-NEXT:    flat_load_dwordx4 v[4:7], v[3:4]
+; GFX8-NEXT:    flat_load_dwordx4 v[8:11], v[0:1]
; GFX8-NEXT:    v_lshlrev_b32_e32 v16, 1, v2
; GFX8-NEXT:    v_add_u32_e32 v17, vcc, 1, v16
; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v17
; GFX8-NEXT:    v_cmp_eq_u32_e64 s[4:5], 1, v16
; GFX8-NEXT:    v_cmp_eq_u32_e64 s[6:7], 6, v16
; GFX8-NEXT:    v_cmp_eq_u32_e64 s[8:9], 7, v16
-; GFX8-NEXT:    s_waitcnt vmcnt(1)
+; GFX8-NEXT:    s_waitcnt vmcnt(0)
; GFX8-NEXT:    v_cndmask_b32_e64 v2, v8, v10, s[4:5]
; GFX8-NEXT:    v_cndmask_b32_e64 v3, v9, v11, s[4:5]
; GFX8-NEXT:    v_cndmask_b32_e32 v8, v8, v10, vcc
; GFX8-NEXT:    v_cndmask_b32_e32 v9, v9, v11, vcc
; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 2, v16
-; GFX8-NEXT:    s_waitcnt vmcnt(0)
; GFX8-NEXT:    v_cndmask_b32_e32 v2, v2, v4, vcc
; GFX8-NEXT:    v_cndmask_b32_e32 v3, v3, v5, vcc
; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 2, v17
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll
index 133a224b7437..bd4ecd3a17e5 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll
@@ -830,8 +830,8 @@ define amdgpu_kernel void @udivrem_v4i32(<4 x i32> addrspace(1)* %out0, <4 x i32
; GFX9-LABEL: udivrem_v4i32:
; GFX9:       ; %bb.0:
; GFX9-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x20
-; GFX9-NEXT:    v_mov_b32_e32 v2, 0x4f7ffffe
; GFX9-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x10
+; GFX9-NEXT:    v_mov_b32_e32 v2, 0x4f7ffffe
; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
; GFX9-NEXT:    v_cvt_f32_u32_e32 v0, s0
; GFX9-NEXT:    v_cvt_f32_u32_e32 v1, s1
@@ -926,9 +926,10 @@ define amdgpu_kernel void @udivrem_v4i32(<4 x i32> addrspace(1)* %out0, <4 x i32
;
; GFX10-LABEL: udivrem_v4i32:
; GFX10:       ; %bb.0:
+; GFX10-NEXT:    s_clause 0x1
; GFX10-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x20
-; GFX10-NEXT:    v_mov_b32_e32 v4, 0x4f7ffffe
; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x10
+; GFX10-NEXT:    v_mov_b32_e32 v4, 0x4f7ffffe
; GFX10-NEXT:    v_mov_b32_e32 v8, 0
; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
; GFX10-NEXT:    v_cvt_f32_u32_e32 v0, s8
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
index b033497d3aed..81b055166dd2 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
@@ -11236,8 +11236,8 @@ define amdgpu_kernel void @sdiv_i64_pow2_shl_denom(i64 addrspace(1)* %out, i64 %
; GFX6-LABEL: sdiv_i64_pow2_shl_denom:
; GFX6:       ; %bb.0:
; GFX6-NEXT:    s_load_dword s4, s[0:1], 0xd
-; GFX6-NEXT:    s_mov_b64 s[2:3], 0x1000
; GFX6-NEXT:    s_load_dwordx4 s[8:11], s[0:1], 0x9
+; GFX6-NEXT:    s_mov_b64 s[2:3], 0x1000
; GFX6-NEXT:    s_mov_b32 s7, 0xf000
; GFX6-NEXT:    s_mov_b32 s6, -1
; GFX6-NEXT:    s_waitcnt lgkmcnt(0)
@@ -13358,8 +13358,8 @@ define amdgpu_kernel void @srem_i64_pow2_shl_denom(i64 addrspace(1)* %out, i64 %
; GFX6-LABEL: srem_i64_pow2_shl_denom:
; GFX6:       ; %bb.0:
; GFX6-NEXT:    s_load_dword s4, s[0:1], 0xd
-; GFX6-NEXT:    s_mov_b64 s[2:3], 0x1000
; GFX6-NEXT:    s_load_dwordx4 s[8:11], s[0:1], 0x9
+; GFX6-NEXT:    s_mov_b64 s[2:3], 0x1000
; GFX6-NEXT:    s_mov_b32 s7, 0xf000
; GFX6-NEXT:    s_mov_b32 s6, -1
; GFX6-NEXT:    s_waitcnt lgkmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/idiv-licm.ll b/llvm/test/CodeGen/AMDGPU/idiv-licm.ll
index fb9348bae000..9ea8f101b5e9 100644
--- a/llvm/test/CodeGen/AMDGPU/idiv-licm.ll
+++ b/llvm/test/CodeGen/AMDGPU/idiv-licm.ll
@@ -491,8 +491,8 @@ define amdgpu_kernel void @urem16_invariant_denom(i16 addrspace(1)* nocapture %a
; GFX9-LABEL: urem16_invariant_denom:
; GFX9:       ; %bb.0: ; %bb
; GFX9-NEXT:    s_load_dword s2, s[0:1], 0x2c
-; GFX9-NEXT:    s_mov_b32 s6, 0xffff
; GFX9-NEXT:    s_load_dwordx2 s[4:5], s[0:1], 0x24
+; GFX9-NEXT:    s_mov_b32 s6, 0xffff
; GFX9-NEXT:    v_mov_b32_e32 v1, 0
; GFX9-NEXT:    s_movk_i32 s8, 0x400
; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll b/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll
index e2fbc0bc4af9..ba093ad3771d 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll
@@ -100,14 +100,14 @@ define hidden amdgpu_kernel void @clmem_read(i8 addrspace(1)*  %buffer) {
; GFX900:    global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
;
; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048
-; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048
-; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048
-; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048
; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
+; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
+; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
+; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048
; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
diff --git a/llvm/test/CodeGen/AMDGPU/sdiv64.ll b/llvm/test/CodeGen/AMDGPU/sdiv64.ll
index 0b80b4170316..dbb6d4805495 100644
--- a/llvm/test/CodeGen/AMDGPU/sdiv64.ll
+++ b/llvm/test/CodeGen/AMDGPU/sdiv64.ll
@@ -6,8 +6,8 @@ define amdgpu_kernel void @s_test_sdiv(i64 addrspace(1)* %out, i64 %x, i64 %y) {
; GCN-LABEL: s_test_sdiv:
; GCN:       ; %bb.0:
; GCN-NEXT:    s_load_dwordx2 s[4:5], s[0:1], 0xd
-; GCN-NEXT:    v_mov_b32_e32 v7, 0
; GCN-NEXT:    s_load_dwordx4 s[8:11], s[0:1], 0x9
+; GCN-NEXT:    v_mov_b32_e32 v7, 0
; GCN-NEXT:    s_mov_b32 s7, 0xf000
; GCN-NEXT:    s_mov_b32 s6, -1
; GCN-NEXT:    s_waitcnt lgkmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/srem64.ll b/llvm/test/CodeGen/AMDGPU/srem64.ll
index fac510e8dbda..04f8ea10545e 100644
--- a/llvm/test/CodeGen/AMDGPU/srem64.ll
+++ b/llvm/test/CodeGen/AMDGPU/srem64.ll
@@ -6,8 +6,8 @@ define amdgpu_kernel void @s_test_srem(i64 addrspace(1)* %out, i64 %x, i64 %y) {
; GCN-LABEL: s_test_srem:
; GCN:       ; %bb.0:
; GCN-NEXT:    s_load_dwordx2 s[12:13], s[0:1], 0xd
-; GCN-NEXT:    v_mov_b32_e32 v2, 0
; GCN-NEXT:    s_load_dwordx4 s[8:11], s[0:1], 0x9
+; GCN-NEXT:    v_mov_b32_e32 v2, 0
; GCN-NEXT:    s_mov_b32 s7, 0xf000
; GCN-NEXT:    s_mov_b32 s6, -1
; GCN-NEXT:    s_waitcnt lgkmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/udiv64.ll b/llvm/test/CodeGen/AMDGPU/udiv64.ll
index cc829b8e7eb3..48a86eec9832 100644
--- a/llvm/test/CodeGen/AMDGPU/udiv64.ll
+++ b/llvm/test/CodeGen/AMDGPU/udiv64.ll
@@ -6,8 +6,8 @@ define amdgpu_kernel void @s_test_udiv_i64(i64 addrspace(1)* %out, i64 %x, i64 %
; GCN-LABEL: s_test_udiv_i64:
; GCN:       ; %bb.0:
; GCN-NEXT:    s_load_dwordx2 s[2:3], s[0:1], 0xd
-; GCN-NEXT:    v_mov_b32_e32 v2, 0
; GCN-NEXT:    s_load_dwordx4 s[8:11], s[0:1], 0x9
+; GCN-NEXT:    v_mov_b32_e32 v2, 0
; GCN-NEXT:    s_mov_b32 s7, 0xf000
; GCN-NEXT:    s_mov_b32 s6, -1
; GCN-NEXT:    s_waitcnt lgkmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/urem64.ll b/llvm/test/CodeGen/AMDGPU/urem64.ll
index a0a4b73262a7..296aaf2ed1c6 100644
--- a/llvm/test/CodeGen/AMDGPU/urem64.ll
+++ b/llvm/test/CodeGen/AMDGPU/urem64.ll
@@ -6,8 +6,8 @@ define amdgpu_kernel void @s_test_urem_i64(i64 addrspace(1)* %out, i64 %x, i64 %
; GCN-LABEL: s_test_urem_i64:
; GCN:       ; %bb.0:
; GCN-NEXT:    s_load_dwordx2 s[12:13], s[0:1], 0xd
-; GCN-NEXT:    v_mov_b32_e32 v2, 0
; GCN-NEXT:    s_load_dwordx4 s[8:11], s[0:1], 0x9
+; GCN-NEXT:    v_mov_b32_e32 v2, 0
; GCN-NEXT:    s_mov_b32 s7, 0xf000
; GCN-NEXT:    s_mov_b32 s6, -1
; GCN-NEXT:    s_waitcnt lgkmcnt(0)
</cut>
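The two statistics quoted in the commit message (the average length of each run of one or more loads, and the number of runs of two or more loads) can be illustrated on the GFX6 prologue from the amdgpu-codegenprepare-idiv.ll hunk above. This is only a sketch: the helper names are hypothetical, and treating any mnemonic containing `_load_` as a load is a simplifying assumption, not a description of how the commit's measurements were made.

```python
from itertools import groupby


def is_load(mnemonic: str) -> bool:
    # Simplifying assumption: s_load_*, flat_load_*, global_load_* are loads.
    return "_load_" in mnemonic


def load_runs(mnemonics):
    """Lengths of maximal runs of consecutive load instructions."""
    return [len(list(group)) for key, group in groupby(mnemonics, key=is_load) if key]


# First five instructions of sdiv_i64_pow2_shl_denom (GFX6), before and after.
before = ["s_load_dword", "s_mov_b64", "s_load_dwordx4", "s_mov_b32", "s_mov_b32"]
after = ["s_load_dword", "s_load_dwordx4", "s_mov_b64", "s_mov_b32", "s_mov_b32"]

for name, seq in (("before", before), ("after", after)):
    runs = load_runs(seq)
    average = sum(runs) / len(runs)
    clustered = sum(1 for r in runs if r >= 2)
    print(f"{name}: runs={runs} average={average} runs_of_two_or_more={clustered}")
```

In this small excerpt the two separate one-instruction runs merge into a single run of two, which is the kind of change both statistics are counting.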

    
