Richard Sandiford <richard.sandiford@linaro.org> writes:
> Revital Eres <revital.eres@linaro.org> writes:
>> btw, do you also have numbers of how much SMS (hopefully) improves performance on top of the vectorized code?
> OK, here's a comparison of:
>   -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fno-auto-inc-dec
> vs:
>   -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
Revital pointed out that I'd forgotten to list:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize
for both cases, which does make quite a big difference :-)
I looked at the mjpegenc regression, and the register pressure looks OK. I think it maxes out at around 20 vector double registers if you just consider the loop body. So I think this is actually a regalloc failure rather than an SMS one per se.
-fira-algorithm=priority removes all but one spill from the loop. I ran another test comparing:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
with:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec -fira-algorithm=priority
(soon this lot won't fit in my emacs window). I've attached the results below. In both cases, the compiler was current trunk with my move-scheduling patch applied.
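Since the two command lines differ only in one trailing option, a small sketch like the following can make the comparison easier to read. It is an illustration only: the variable and function names are made up here, and the flags are grouped for readability rather than kept in the exact order listed above (none of them conflict, so the order should not matter).

```python
# Hypothetical grouping of the flags quoted in this message.
BASE = [
    "-O2", "-ffast-math", "-funsafe-loop-optimizations", "-ftree-vectorize",
    "-mcpu=cortex-a8", "-mfpu=neon", "-mfloat-abi=softfp",
    "-mvectorize-with-neon-quad", "-fno-auto-inc-dec",
]
SMS = ["-fmodulo-sched", "-fmodulo-sched-allow-regmoves"]
PRIORITY = ["-fira-algorithm=priority"]

# "before" and "after" match the labels used in the attached results.
configs = {
    "before": BASE + SMS,
    "after": BASE + SMS + PRIORITY,
}

def extra_flags(old, new):
    """Return the flags present in `new` but not in `old`, in order."""
    return [flag for flag in new if flag not in old]

if __name__ == "__main__":
    for name, flags in configs.items():
        print(name + ":", " ".join(flags))
    print("difference:", " ".join(extra_flags(configs["before"], configs["after"])))
```

The last line printed is the only option that changes between the two runs.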
I haven't rerun an SMS-vs-non-SMS test, but based on previous results, mjpegenc and aacsbr-2 become faster with SMS than without.
This doesn't hide the fact that SMS doesn't take register pressure into account. But if I haven't completely miscalculated (and I might have) it seems that even if SMS did have some pressure-tracking capability, it probably wouldn't have triggered for mjpegenc, at least not unless it was very conservative.
Richard
a3dec
  before:  500000 runs take 4.61386s
  after:   500000 runs take 4.57584s
  speedup: x1.01
aacsbr-1
  before:  5000000 runs take 4.37384s
  after:   5000000 runs take 4.3739s
  speedup: x1
aacsbr-2
  before:  5000000 runs take 3.09015s
  after:   5000000 runs take 2.30728s
  speedup: x1.34
aacsbr-3
  before:  4000000 runs take 5.63489s
  after:   4000000 runs take 5.63391s
  speedup: x1
aes
  before:  500000 runs take 16.9729s
  after:   500000 runs take 16.9731s
  speedup: x1
avs
  before:  1000000 runs take 2.23682s
  after:   1000000 runs take 2.31372s
  speedup: x0.967
cdgraphics
  before:  1000000 runs take 2.40585s
  after:   1000000 runs take 2.39774s
  speedup: x1
dwt
  before:  2000000 runs take 9.10098s
  after:   2000000 runs take 9.10086s
  speedup: x1
dxa
  before:  2000000 runs take 4.40613s
  after:   2000000 runs take 4.40619s
  speedup: x1
mjpegenc
  before:  500000 runs take 7.31085s
  after:   500000 runs take 3.04492s
  speedup: x2.4
qtrle
  before:  1000000 runs take 4.54471s
  after:   1000000 runs take 4.51578s
  speedup: x1.01
resample
  before:  1000000 runs take 1.91022s
  after:   1000000 runs take 1.92822s
  speedup: x0.991
rgb2rgb-rgb24tobgr16
  before:  1000000 runs take 1.15643s
  after:   1000000 runs take 1.15585s
  speedup: x1
rgb2rgb-rgb24tobgr32
  before:  2000000 runs take 4.5513s
  after:   2000000 runs take 4.5513s
  speedup: x1
rgb2rgb-rgb32tobgr24
  before:  2000000 runs take 3.59665s
  after:   2000000 runs take 3.59671s
  speedup: x1
rgb2rgb-shuffle-bytes
  before:  500000 runs take 2.24115s
  after:   500000 runs take 2.23947s
  speedup: x1
rgb2rgb-yuy2toyv12
  before:  500000 runs take 4.64447s
  after:   500000 runs take 4.51465s
  speedup: x1.03
rgb2rgb-yv12touyvy
  before:  1500000 runs take 3.49857s
  after:   1500000 runs take 4.60797s
  speedup: x0.759
twinvq
  before:  500000 runs take 0.452393s
  after:   500000 runs take 0.4505s
  speedup: x1
wmavoice
  before:  500000 runs take 0.865448s
  after:   500000 runs take 0.868072s
  speedup: x0.997
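
The speedup figures above appear to be the "before" time divided by the "after" time, printed to three significant digits, so values below 1 are slowdowns. A minimal sketch of that arithmetic follows. It assumes that formatting rule, and the function name is made up for illustration; the timings are copied from the results above.

```python
def speedup(before_s, after_s):
    """Format before/after as a speedup factor with 3 significant digits."""
    return "x{:.3g}".format(before_s / after_s)

# (before, after) times in seconds for a few of the benchmarks listed above.
timings = {
    "aacsbr-1": (4.37384, 4.3739),
    "aacsbr-2": (3.09015, 2.30728),
    "mjpegenc": (7.31085, 3.04492),
    "rgb2rgb-yv12touyvy": (3.49857, 4.60797),
}

if __name__ == "__main__":
    for name, (before_s, after_s) in timings.items():
        print(name, speedup(before_s, after_s))
```

Under that assumption the sketch reproduces the listed factors for these four tests, including the x0.759 slowdown in rgb2rgb-yv12touyvy.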