Re: Effect of SMS register move scheduling

25 Aug 2011


Richard Sandiford <richard.sandiford@linaro.org> writes:
...
Revital Eres <revital.eres@linaro.org> writes:
...
btw, do you also have numbers of how much SMS (hopefully) improves
performance on top of the vectorized code?
OK, here's a comparison of:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
-fno-auto-inc-dec


vs:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
-fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec

Revital pointed out that I'd forgotten to list:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize
for both cases, which does make quite a big difference :-)

I looked at the mjpegenc regression, and the register pressure looks OK.
I think it maxes out at around 20 vector double registers if you just
consider the loop body.  So I think this is actually a regalloc failure
rather than an SMS one per se.

-fira-algorithm=priority removes all but one spill from the loop.

I ran another test comparing:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize
   -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
   -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
with:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize
   -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
   -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
   -fira-algorithm=priority
(soon this lot won't fit in my emacs window).  I've attached the
results below.  In both cases, the compiler was current trunk with my
move-scheduling patch applied.

I haven't rerun an SMS-vs-non-SMS test, but based on previous results,
mjpegenc and aacsbr-2 become faster with SMS than without.

This doesn't hide the fact that SMS doesn't take register pressure
into account.  But if I haven't completely miscalculated (and I might
have) it seems that even if SMS did have some pressure-tracking
capability, it probably wouldn't have triggered for mjpegenc,
at least not unless it was very conservative.

Richard

a3dec
  before:  500000 runs take 4.61386s
  after:   500000 runs take 4.57584s
  speedup: x1.01
aacsbr-1
  before:  5000000 runs take 4.37384s
  after:   5000000 runs take 4.3739s
  speedup: x1
aacsbr-2
  before:  5000000 runs take 3.09015s
  after:   5000000 runs take 2.30728s
  speedup: x1.34
aacsbr-3
  before:  4000000 runs take 5.63489s
  after:   4000000 runs take 5.63391s
  speedup: x1
aes
  before:  500000 runs take 16.9729s
  after:   500000 runs take 16.9731s
  speedup: x1
avs
  before:  1000000 runs take 2.23682s
  after:   1000000 runs take 2.31372s
  speedup: x0.967
cdgraphics
  before:  1000000 runs take 2.40585s
  after:   1000000 runs take 2.39774s
  speedup: x1
dwt
  before:  2000000 runs take 9.10098s
  after:   2000000 runs take 9.10086s
  speedup: x1
dxa
  before:  2000000 runs take 4.40613s
  after:   2000000 runs take 4.40619s
  speedup: x1
mjpegenc
  before:  500000 runs take 7.31085s
  after:   500000 runs take 3.04492s
  speedup: x2.4
qtrle
  before:  1000000 runs take 4.54471s
  after:   1000000 runs take 4.51578s
  speedup: x1.01
resample
  before:  1000000 runs take 1.91022s
  after:   1000000 runs take 1.92822s
  speedup: x0.991
rgb2rgb-rgb24tobgr16
  before:  1000000 runs take 1.15643s
  after:   1000000 runs take 1.15585s
  speedup: x1
rgb2rgb-rgb24tobgr32
  before:  2000000 runs take 4.5513s
  after:   2000000 runs take 4.5513s
  speedup: x1
rgb2rgb-rgb32tobgr24
  before:  2000000 runs take 3.59665s
  after:   2000000 runs take 3.59671s
  speedup: x1
rgb2rgb-shuffle-bytes
  before:  500000 runs take 2.24115s
  after:   500000 runs take 2.23947s
  speedup: x1
rgb2rgb-yuy2toyv12
  before:  500000 runs take 4.64447s
  after:   500000 runs take 4.51465s
  speedup: x1.03
rgb2rgb-yv12touyvy
  before:  1500000 runs take 3.49857s
  after:   1500000 runs take 4.60797s
  speedup: x0.759
twinvq
  before:  500000 runs take 0.452393s
  after:   500000 runs take 0.4505s
  speedup: x1
wmavoice
  before:  500000 runs take 0.865448s
  after:   500000 runs take 0.868072s
  speedup: x0.997
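
For anyone wanting to recheck the figures: each "speedup" value above is consistent with the "before" time divided by the "after" time, so values below 1 are slowdowns. The following is only a minimal sketch of that calculation, assuming the listing format shown above. The names parse_results, speedup, ENTRY and SAMPLE are hypothetical and are not part of the benchmark harness used for this mail; SAMPLE copies three entries from the results.

```python
import re

# Hypothetical pattern for one entry of the listing above: a benchmark name
# on its own line, followed by indented "before:" and "after:" lines.
ENTRY = re.compile(
    r"^(?P<name>\S+)\s*\n"
    r"\s+before:\s+\d+ runs take (?P<before>[\d.]+)s\s*\n"
    r"\s+after:\s+\d+ runs take (?P<after>[\d.]+)s",
    re.MULTILINE,
)


def parse_results(text):
    """Return {benchmark name: (before_seconds, after_seconds)}."""
    return {
        m["name"]: (float(m["before"]), float(m["after"]))
        for m in ENTRY.finditer(text)
    }


def speedup(before, after):
    """Ratio of old time to new time; > 1 is faster, < 1 is slower."""
    return before / after


# Three entries copied from the results above.
SAMPLE = """\
aacsbr-2
  before:  5000000 runs take 3.09015s
  after:   5000000 runs take 2.30728s
  speedup: x1.34
mjpegenc
  before:  500000 runs take 7.31085s
  after:   500000 runs take 3.04492s
  speedup: x2.4
rgb2rgb-yv12touyvy
  before:  1500000 runs take 3.49857s
  after:   1500000 runs take 4.60797s
  speedup: x0.759
"""

if __name__ == "__main__":
    for name, (before, after) in parse_results(SAMPLE).items():
        ratio = speedup(before, after)
        note = "slower" if ratio < 1 else "faster or unchanged"
        print(f"{name}: x{ratio:.3g} ({note})")
```

Rounded to three significant figures, the ratios for these three entries agree with the speedups listed above; the same function can be run over the full listing to pick out the entries that got slower (avs, resample, rgb2rgb-yv12touyvy and wmavoice).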
