Hi Richard,
The effect on my flawed libav microbenchmarks was much greater than I imagined. I used the options:
Yeah, thats indeed looks impressive!
btw, do you also have numbers of how much SMS (hopefully) improves performance on top of the vectorized code?
Thanks, Revital
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
The "before" code was from trunk, the "after" code was trunk + the register scheduling patch alone (not the IV patch). Only the tests that have different "before" and "after" code are run. The results were:
a3dec before: 500000 runs take 4.68384s after: 500000 runs take 4.61395s speedup: x1.02 aes before: 500000 runs take 20.0523s after: 500000 runs take 16.9722s speedup: x1.18 avs before: 1000000 runs take 15.4698s after: 1000000 runs take 2.23676s speedup: x6.92 dxa before: 2000000 runs take 18.5848s after: 2000000 runs take 4.40607s speedup: x4.22 mjpegenc before: 500000 runs take 28.6987s after: 500000 runs take 7.31342s speedup: x3.92 resample before: 1000000 runs take 10.418s after: 1000000 runs take 1.91016s speedup: x5.45 rgb2rgb-rgb24tobgr16 before: 1000000 runs take 1.60513s after: 1000000 runs take 1.15643s speedup: x1.39 rgb2rgb-yv12touyvy before: 1500000 runs take 3.50122s after: 1500000 runs take 3.49887s speedup: x1 twinvq before: 500000 runs take 0.452423s after: 500000 runs take 0.452454s speedup: x1
Taking resample as an example: before the patch we had an ii of 27, stage count of 6, and 12 vector moves. Vector moves can't be dual issued, and there was only one free slot, so even in theory, this loop takes 27 + 12 - 1 = 38 cycles. Unfortunately, there were so many new registers that we spilled quite a few.
After the patch we have an ii of 28, a stage count of 3, and no moves, so in theory, one iteration should take 28 cycles. We also don't spill. So I think the difference really is genuine. (The large difference in moves between ii=27 and ii=28 is because in the ii=27 schedule, a lot of A--(T,N,0)-->B (intra-cycle true) dependencies were scheduled with time(B) == time(A) + ii + 1.)
I also saw benefits in one test in a "real" benchmark, which I can't post here.
Richard
linaro-toolchain mailing list linaro-toolchain@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-toolchain