Re: Effect of SMS register move scheduling

25 Aug 2011


Hi Richard,
...
The effect on my flawed libav microbenchmarks was much greater
than I imagined.  I used the options:
Yeah, that indeed looks impressive!
btw, do you also have numbers for how much SMS (hopefully) improves
performance on top of the vectorized code?
Thanks,
Revital
...
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
   -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
The "before" code was from trunk, the "after" code was trunk + the
register scheduling patch alone (not the IV patch).  Only the tests
that have different "before" and "after" code are run.  The results were:
a3dec
 before:  500000 runs take 4.68384s
 after:   500000 runs take 4.61395s
 speedup: x1.02
aes
 before:  500000 runs take 20.0523s
 after:   500000 runs take 16.9722s
 speedup: x1.18
avs
 before:  1000000 runs take 15.4698s
 after:   1000000 runs take 2.23676s
 speedup: x6.92
dxa
 before:  2000000 runs take 18.5848s
 after:   2000000 runs take 4.40607s
 speedup: x4.22
mjpegenc
 before:  500000 runs take 28.6987s
 after:   500000 runs take 7.31342s
 speedup: x3.92
resample
 before:  1000000 runs take 10.418s
 after:   1000000 runs take 1.91016s
 speedup: x5.45
rgb2rgb-rgb24tobgr16
 before:  1000000 runs take 1.60513s
 after:   1000000 runs take 1.15643s
 speedup: x1.39
rgb2rgb-yv12touyvy
 before:  1500000 runs take 3.50122s
 after:   1500000 runs take 3.49887s
 speedup: x1
twinvq
 before:  500000 runs take 0.452423s
 after:   500000 runs take 0.452454s
 speedup: x1
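
For anyone who wants to recheck the speedup figures, here is a minimal sketch of the arithmetic, assuming "speedup" simply means the "before" time divided by the "after" time. The function name and the dictionary are illustrative and not part of the benchmark harness; the timings are copied from three of the results above.

```python
def speedup(before_s, after_s):
    """Ratio of the 'before' run time to the 'after' run time."""
    return before_s / after_s


# (before, after) wall-clock times in seconds, copied from the results above.
timings = {
    "aes": (20.0523, 16.9722),
    "avs": (15.4698, 2.23676),
    "resample": (10.418, 1.91016),
}

for name, (before_s, after_s) in timings.items():
    print(f"{name}: x{speedup(before_s, after_s):.2f}")
```

This should reproduce the x1.18, x6.92 and x5.45 figures quoted for those three tests.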
Taking resample as an example: before the patch we had an ii of 27,
stage count of 6, and 12 vector moves.  Vector moves can't be dual
issued, and there was only one free slot, so even in theory, this loop
takes 27 + 12 - 1 = 38 cycles.  Unfortunately, there were so many new
registers that we spilled quite a few.
After the patch we have an ii of 28, a stage count of 3, and no moves,
so in theory, one iteration should take 28 cycles.  We also don't spill.
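
The back-of-the-envelope cycle estimate can be written down as a small sketch. It assumes, as the text does, that each vector move that does not fit into a free issue slot costs one extra cycle on top of the ii, and it ignores spills entirely. The function and parameter names are made up for illustration.

```python
def estimated_cycles(ii, vector_moves, free_slots):
    """Theoretical cycles per iteration: ii plus one cycle for every
    vector move that cannot be absorbed by a free issue slot."""
    extra = max(0, vector_moves - free_slots)
    return ii + extra


# Before the patch: ii = 27, 12 vector moves, 1 free slot.
before = estimated_cycles(27, 12, 1)
# After the patch: ii = 28 and no moves, so the free-slot count is irrelevant
# (0 is just a placeholder here).
after = estimated_cycles(28, 0, 0)

print(before, after, round(before / after, 2))
```

Under this model the loop goes from 38 to 28 cycles per iteration, a theoretical ratio of about 1.36. The model leaves out the spills that the "before" code suffered from.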
So I think the difference really is genuine.  (The large difference
in moves between ii=27 and ii=28 is because in the ii=27 schedule,
a lot of A--(T,N,0)-->B (intra-cycle true) dependencies were scheduled
with time(B) == time(A) + ii + 1.)
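
To illustrate why such a dependency costs a move, here is a deliberately simplified model, not the exact computation in GCC's modulo scheduler. It assumes that a value produced at time(A) and consumed at time(B) in the same iteration needs one register copy for every full ii that passes between the two, and it ignores the boundary case where the distance is an exact multiple of ii. The function name is hypothetical.

```python
def moves_needed(time_a, time_b, ii):
    """Simplified count of register moves for a same-iteration true
    dependency A -> B: one copy per full ii spanned by the value's lifetime."""
    return (time_b - time_a) // ii


ii = 27
# The case described above: time(B) == time(A) + ii + 1.
print(moves_needed(0, ii + 1, ii))  # lifetime of ii + 1 cycles needs a copy
# A consumer scheduled within the same ii window needs no copy.
print(moves_needed(0, 5, ii))
```

In this model, a lifetime of ii + 1 cycles is just long enough to need one move, while one that stays under ii needs none.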
I also saw benefits in one test in a "real" benchmark, which I can't
post here.
Richard

linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
