Following on from yesterday's call about what it would take to enable SMS by default: one of the problems I was seeing with the SMS+IV patch was that we ended up with excessive moves. E.g. a loop such as:
void foo (int *__restrict a, int n)
{
  int i;
  for (i = 0; i < n; i += 2)
    a[i] = a[i] * a[i + 1];
}
would end up being scheduled with an ii (initiation interval) of 3, which means that in the ideal case, each loop iteration would take 3 cycles. However, we then added ~8 register moves to the loop in order to satisfy dependencies. Obviously those 8 moves add considerably to the iteration time.
I played around with a heuristic to see whether there were enough free slots in the original schedule to accommodate the moves. That avoided the problem, but it was a hack: the moves weren't actually scheduled in those slots. (In current trunk, the moves generated for an instruction are inserted immediately before that instruction.)
I mentioned this to Revital, who told me that Mustafa Hagog had tried a more complete approach that really did schedule the moves. That patch was quite old, so I ended up reimplementing the same kind of idea in a slightly different way. (The main functional changes from Mustafa's version were to schedule from the end of the window rather than the start, and to use a cyclic window. E.g. moves for an instruction in row 0 column 0 should be scheduled starting at row ii-1 downwards.)
The effect on my flawed libav microbenchmarks was much greater than I imagined. I used the options:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
The "before" code was from trunk, the "after" code was trunk + the register scheduling patch alone (not the IV patch). Only the tests that have different "before" and "after" code are run. The results were:
a3dec                   before: 500000 runs take 4.68384s    after: 500000 runs take 4.61395s    speedup: x1.02
aes                     before: 500000 runs take 20.0523s    after: 500000 runs take 16.9722s    speedup: x1.18
avs                     before: 1000000 runs take 15.4698s   after: 1000000 runs take 2.23676s   speedup: x6.92
dxa                     before: 2000000 runs take 18.5848s   after: 2000000 runs take 4.40607s   speedup: x4.22
mjpegenc                before: 500000 runs take 28.6987s    after: 500000 runs take 7.31342s    speedup: x3.92
resample                before: 1000000 runs take 10.418s    after: 1000000 runs take 1.91016s   speedup: x5.45
rgb2rgb-rgb24tobgr16    before: 1000000 runs take 1.60513s   after: 1000000 runs take 1.15643s   speedup: x1.39
rgb2rgb-yv12touyvy      before: 1500000 runs take 3.50122s   after: 1500000 runs take 3.49887s   speedup: x1
twinvq                  before: 500000 runs take 0.452423s   after: 500000 runs take 0.452454s   speedup: x1
Taking resample as an example: before the patch we had an ii of 27, stage count of 6, and 12 vector moves. Vector moves can't be dual issued, and there was only one free slot, so even in theory, this loop takes 27 + 12 - 1 = 38 cycles. Unfortunately, there were so many new registers that we spilled quite a few.
After the patch we have an ii of 28, a stage count of 3, and no moves, so in theory, one iteration should take 28 cycles. We also don't spill. So I think the difference really is genuine. (The large difference in moves between ii=27 and ii=28 is because in the ii=27 schedule, a lot of A--(T,N,0)-->B (intra-cycle true) dependencies were scheduled with time(B) == time(A) + ii + 1.)
I also saw benefits in one test in a "real" benchmark, which I can't post here.
Richard
Hi Richard,
The effect on my flawed libav microbenchmarks was much greater than I imagined. I used the options:
Yeah, that indeed looks impressive!
btw, do you also have numbers of how much SMS (hopefully) improves performance on top of the vectorized code?
Thanks, Revital
linaro-toolchain mailing list linaro-toolchain@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-toolchain
Revital Eres revital.eres@linaro.org writes:
btw, do you also have numbers of how much SMS (hopefully) improves performance on top of the vectorized code?
OK, here's a comparison of:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fno-auto-inc-dec
vs:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
(including the register-scheduling patch). As you can see, it's a bit of a mixed bag.
mjpegenc is another case where SMS generates lots of spilling while the normal scheduler doesn't.
Richard
a3dec                    before: 500000 runs take 4.61447s    after: 500000 runs take 4.61377s    speedup: x1
aacsbr-1                 before: 5000000 runs take 4.08304s   after: 5000000 runs take 4.37424s   speedup: x0.933
aacsbr-2                 before: 5000000 runs take 3.01974s   after: 5000000 runs take 3.08987s   speedup: x0.977
aacsbr-3                 before: 4000000 runs take 5.77838s   after: 4000000 runs take 5.63406s   speedup: x1.03
aes                      before: 500000 runs take 24.6801s    after: 500000 runs take 16.9731s    speedup: x1.45
avs                      before: 1000000 runs take 2.26315s   after: 1000000 runs take 2.23679s   speedup: x1.01
cdgraphics               before: 1000000 runs take 2.40573s   after: 1000000 runs take 2.40582s   speedup: x1
dwt                      before: 2000000 runs take 9.02847s   after: 2000000 runs take 9.1022s    speedup: x0.992
dxa                      before: 2000000 runs take 4.55194s   after: 2000000 runs take 4.40613s   speedup: x1.03
mjpegenc                 before: 500000 runs take 3.28186s    after: 500000 runs take 7.31247s    speedup: x0.449
qtrle                    before: 1000000 runs take 4.52829s   after: 1000000 runs take 4.54483s   speedup: x0.996
resample                 before: 1000000 runs take 2.32559s   after: 1000000 runs take 1.91016s   speedup: x1.22
rgb2rgb-rgb24tobgr16     before: 1000000 runs take 1.15713s   after: 1000000 runs take 1.1557s    speedup: x1
rgb2rgb-rgb24tobgr32     before: 2000000 runs take 4.55701s   after: 2000000 runs take 4.55148s   speedup: x1
rgb2rgb-rgb32tobgr24     before: 2000000 runs take 3.59705s   after: 2000000 runs take 3.59683s   speedup: x1
rgb2rgb-shuffle-bytes    before: 500000 runs take 2.23944s    after: 500000 runs take 2.24091s    speedup: x0.999
rgb2rgb-yuy2toyv12       before: 500000 runs take 4.51581s    after: 500000 runs take 4.51593s    speedup: x1
rgb2rgb-yv12touyvy       before: 1500000 runs take 3.52603s   after: 1500000 runs take 3.49863s   speedup: x1.01
twinvq                   before: 500000 runs take 0.446442s   after: 500000 runs take 0.452545s   speedup: x0.987
wmavoice                 before: 500000 runs take 0.864716s   after: 500000 runs take 0.864685s   speedup: x1
Hi,
btw, do you also have numbers of how much SMS (hopefully) improves performance on top of the vectorized code?
OK, here's a comparison of:
Thanks. I expected more improvement in aacsbr-2, similar to what I see without the vectorizer options... I will look into that.
mjpegenc is another case where SMS generates lots of spilling while the normal scheduler doesn't.
Yes, I also noticed that. When I tested it, only one reg-move was created, so the scheduling patch would not have an effect on it.
Thanks again, Revital
Revital Eres revital.eres@linaro.org writes:
mjpegenc is another case where SMS generates lots of spilling while the normal scheduler doesn't.
Yes, I also noticed that. When I tested it, only one reg-move was created, so the scheduling patch would not have an effect on it.
FWIW, looking at the results I posted yesterday, the scheduling patch did improve the results compared with the non-scheduling patch:
mjpegenc before: 500000 runs take 28.6987s after: 500000 runs take 7.31342s speedup: x3.92
That single register move wasn't schedulable within the current ii, so the patch used a higher ii without the move. Unfortunately, while the new loop needs fewer spills, it doesn't avoid them completely.
So I think mjpegenc needs both: the scheduling patch and a fix for the register pressure problem.
Richard
Hi,
Yes, I also noticed that. When I tested it, only one reg-move was created, so the scheduling patch would not have an effect on it.
FWIW, looking at the results I posted yesterday, the scheduling patch did improve the results compared with the non-scheduling patch:
You are right! This was my mistake, sorry about that...
mjpegenc before: 500000 runs take 28.6987s after: 500000 runs take 7.31342s speedup: x3.92
That single register move wasn't schedulable within the current ii, so the patch used a higher ii without the move. Unfortunately, while the new loop needs fewer spills, it doesn't avoid them completely.
Right, this can also be seen in the results you posted today evaluating the effect of SMS on the vectorized code.
Thanks, Revital
Richard Sandiford richard.sandiford@linaro.org writes:
Revital Eres revital.eres@linaro.org writes:
btw, do you also have numbers of how much SMS (hopefully) improves performance on top of the vectorized code?
OK, here's a comparison of:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fno-auto-inc-dec
vs:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
Revital pointed out that I'd forgotten to list:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize
for both cases, which does make quite a big difference :-)
I looked at the mjpegenc regression, and the register pressure looks OK. I think it maxes out at around 20 vector double registers if you just consider the loop body. So I think this is actually a regalloc failure rather than an SMS one per se.
-fira-algorithm=priority removes all but one spill from the loop. I ran another test comparing:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
with:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec -fira-algorithm=priority
(soon this lot won't fit in my emacs window). I've attached the results below. In both cases, the compiler was current trunk with my move-scheduling patch applied.
I haven't rerun an SMS-vs-non-SMS test, but based on previous results, mjpegenc and aacsbr-2 become faster with SMS than without.
This doesn't hide the fact that SMS doesn't take register pressure into account. But if I haven't completely miscalculated (and I might have) it seems that even if SMS did have some pressure-tracking capability, it probably wouldn't have triggered for mjpegenc, at least not unless it was very conservative.
Richard
a3dec                    before: 500000 runs take 4.61386s    after: 500000 runs take 4.57584s    speedup: x1.01
aacsbr-1                 before: 5000000 runs take 4.37384s   after: 5000000 runs take 4.3739s    speedup: x1
aacsbr-2                 before: 5000000 runs take 3.09015s   after: 5000000 runs take 2.30728s   speedup: x1.34
aacsbr-3                 before: 4000000 runs take 5.63489s   after: 4000000 runs take 5.63391s   speedup: x1
aes                      before: 500000 runs take 16.9729s    after: 500000 runs take 16.9731s    speedup: x1
avs                      before: 1000000 runs take 2.23682s   after: 1000000 runs take 2.31372s   speedup: x0.967
cdgraphics               before: 1000000 runs take 2.40585s   after: 1000000 runs take 2.39774s   speedup: x1
dwt                      before: 2000000 runs take 9.10098s   after: 2000000 runs take 9.10086s   speedup: x1
dxa                      before: 2000000 runs take 4.40613s   after: 2000000 runs take 4.40619s   speedup: x1
mjpegenc                 before: 500000 runs take 7.31085s    after: 500000 runs take 3.04492s    speedup: x2.4
qtrle                    before: 1000000 runs take 4.54471s   after: 1000000 runs take 4.51578s   speedup: x1.01
resample                 before: 1000000 runs take 1.91022s   after: 1000000 runs take 1.92822s   speedup: x0.991
rgb2rgb-rgb24tobgr16     before: 1000000 runs take 1.15643s   after: 1000000 runs take 1.15585s   speedup: x1
rgb2rgb-rgb24tobgr32     before: 2000000 runs take 4.5513s    after: 2000000 runs take 4.5513s    speedup: x1
rgb2rgb-rgb32tobgr24     before: 2000000 runs take 3.59665s   after: 2000000 runs take 3.59671s   speedup: x1
rgb2rgb-shuffle-bytes    before: 500000 runs take 2.24115s    after: 500000 runs take 2.23947s    speedup: x1
rgb2rgb-yuy2toyv12       before: 500000 runs take 4.64447s    after: 500000 runs take 4.51465s    speedup: x1.03
rgb2rgb-yv12touyvy       before: 1500000 runs take 3.49857s   after: 1500000 runs take 4.60797s   speedup: x0.759
twinvq                   before: 500000 runs take 0.452393s   after: 500000 runs take 0.4505s     speedup: x1
wmavoice                 before: 500000 runs take 0.865448s   after: 500000 runs take 0.868072s   speedup: x0.997
Hi,
Thanks again for measuring this.
mjpegenc before: 500000 runs take 7.31085s after: 500000 runs take 3.04492s speedup: x2.4
mjpegenc and aacsbr-2 contain simple accumulation without load/store dependence, and thus SMS succeeds in improving them. aacsbr-1 also contains such accumulation; however, doloop fails on it. I will try to run it with the recent patch to avoid using doloop (http://gcc.gnu.org/ml/gcc-patches/2011-07/msg01807.html) and with -fira-algorithm=priority, which you discovered avoids the spill issue.
Thanks, Revital
Richard Sandiford richard.sandiford@linaro.org writes:
I looked at the mjpegenc regression, and the register pressure looks OK. I think it maxes out at around 20 vector double registers if you just consider the loop body. So I think this is actually a regalloc failure rather than an SMS one per se.
-fira-algorithm=priority removes all but one spill from the loop. I ran another test comparing:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
with:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec -fira-algorithm=priority
(soon this lot won't fit in my emacs window). I've attached the results below. In both cases, the compiler was current trunk with my move-scheduling patch applied.
Sorry for yet another missive on this subject, but I think part of the problem is that we still have moves such as:
(insn 106 105 107 5 (set (reg:V4SI 347 [ vect_var_.89 ])
        (subreg:V4SI (reg:XI 391 [ vect_array.88 ]) 0))
     src/mjpegenc.c:35 754 {*neon_movv4si}
     (nil))

(insn 107 106 108 5 (set (reg:V4SI 348 [ vect_var_.90 ])
        (subreg:V4SI (reg:XI 391 [ vect_array.88 ]) 16))
     src/mjpegenc.c:35 754 {*neon_movv4si}
     (nil))

(insn 108 107 109 5 (set (reg:V4SI 349 [ vect_var_.91 ])
        (subreg:V4SI (reg:XI 391 [ vect_array.88 ]) 32))
     src/mjpegenc.c:35 754 {*neon_movv4si}
     (nil))

(insn 109 108 111 5 (set (reg:V4SI 350 [ vect_var_.92 ])
        (subreg:V4SI (reg:XI 391 [ vect_array.88 ]) 48))
     src/mjpegenc.c:35 754 {*neon_movv4si}
     (expr_list:REG_DEAD (reg:XI 391 [ vect_array.88 ])
        (nil)))
The hope is that these moves won't actually generate code. We want the register allocator to tie the target vector registers to the corresponding parts of the source "structure" register, so that the move becomes a no-op. And that does happen in most cases. It seems to happen more rarely when there's high register pressure though.
Even so, it's a bad idea to still have those moves around during scheduling, because the scheduler will have to assume that the moves will produce real code. That's true of both SMS and the normal scheduler. It's worse for SMS because the normal scheduler gets a second chance (after reload) to fix things up.
Two things stop us from propagating the subreg directly into the pattern:
- ARM's MODES_TIEABLE_P says that structure modes and vector modes can't be tied. One (correct) effect of this is to give the subreg a much higher cost than a plain register.
I think ARM's MODES_TIEABLE_P should be relaxed. In fact, I think this is really a piece that was missing from my CLASS_CANNOT_CHANGE_MODE VFP patch from earlier in the year.
- Even though rtx_costs treats "good" subregs as being as cheap as registers, passes like fwprop.c treat registers as being more desirable. E.g. fwprop propagates simple register copies, but not copies of a subreg. That probably made sense with the old flow pass and the old register allocators, but I don't think it should make much difference with df and IRA.
The patches below give good results even without -fira-algorithm=priority. I'm going to submit the ARM one after testing. The fwprop.c one is just a proof of concept though.
Richard
Index: gcc/config/arm/arm-protos.h
===================================================================
--- gcc/config/arm/arm-protos.h	2011-08-26 09:10:25.387050720 +0100
+++ gcc/config/arm/arm-protos.h	2011-08-26 09:54:46.824238974 +0100
@@ -46,6 +46,7 @@ extern void arm_output_fn_unwind (FILE *
 extern bool arm_vector_mode_supported_p (enum machine_mode);
 extern bool arm_small_register_classes_for_mode_p (enum machine_mode);
 extern int arm_hard_regno_mode_ok (unsigned int, enum machine_mode);
+extern bool arm_modes_tieable_p (enum machine_mode, enum machine_mode);
 extern int const_ok_for_arm (HOST_WIDE_INT);
 extern int arm_split_constant (RTX_CODE, enum machine_mode, rtx,
 			       HOST_WIDE_INT, rtx, rtx, int);
Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	2011-08-26 09:10:25.373050768 +0100
+++ gcc/config/arm/arm.c	2011-08-26 09:56:12.093969547 +0100
@@ -18109,6 +18109,29 @@ arm_hard_regno_mode_ok (unsigned int reg
 	  && regno <= LAST_FPA_REGNUM);
 }
 
+/* Implement MODES_TIEABLE_P.  */
+
+bool
+arm_modes_tieable_p (enum machine_mode mode1, enum machine_mode mode2)
+{
+  if (GET_MODE_CLASS (mode1) == GET_MODE_CLASS (mode2))
+    return true;
+
+  /* We specifically want to allow elements of "structure" modes to
+     be tieable to the structure.  This more general condition allows
+     other rarer situations too.  */
+  if (TARGET_NEON
+      && (VALID_NEON_DREG_MODE (mode1)
+	  || VALID_NEON_QREG_MODE (mode1)
+	  || VALID_NEON_STRUCT_MODE (mode1))
+      && (VALID_NEON_DREG_MODE (mode2)
+	  || VALID_NEON_QREG_MODE (mode2)
+	  || VALID_NEON_STRUCT_MODE (mode2)))
+    return true;
+
+  return false;
+}
+
 /* For efficiency and historical reasons LO_REGS, HI_REGS and CC_REGS are
    not used in arm mode.  */
Index: gcc/config/arm/arm.h
===================================================================
--- gcc/config/arm/arm.h	2011-08-26 09:10:25.387050720 +0100
+++ gcc/config/arm/arm.h	2011-08-26 09:54:46.839238927 +0100
@@ -962,12 +962,7 @@ #define HARD_REGNO_NREGS(REGNO, MODE)
 #define HARD_REGNO_MODE_OK(REGNO, MODE)	\
   arm_hard_regno_mode_ok ((REGNO), (MODE))
 
-/* Value is 1 if it is a good idea to tie two pseudo registers
-   when one has mode MODE1 and one has mode MODE2.
-   If HARD_REGNO_MODE_OK could produce different values for MODE1 and MODE2,
-   for any hard reg, then this must be 0 for correct output.  */
-#define MODES_TIEABLE_P(MODE1, MODE2)  \
-  (GET_MODE_CLASS (MODE1) == GET_MODE_CLASS (MODE2))
+#define MODES_TIEABLE_P(MODE1, MODE2) arm_modes_tieable_p (MODE1, MODE2)
 
 #define VALID_IWMMXT_REG_MODE(MODE)	\
   (arm_vector_mode_supported_p (MODE) || (MODE) == DImode)
Index: gcc/fwprop.c
===================================================================
--- gcc/fwprop.c	2011-08-26 09:58:28.829540497 +0100
+++ gcc/fwprop.c	2011-08-26 10:14:03.767707504 +0100
@@ -664,7 +664,7 @@ propagate_rtx (rtx x, enum machine_mode
     return NULL_RTX;
 
   flags = 0;
-  if (REG_P (new_rtx) || CONSTANT_P (new_rtx))
+  if (REG_P (new_rtx) || CONSTANT_P (new_rtx) || GET_CODE (new_rtx) == SUBREG)
     flags |= PR_CAN_APPEAR;
   if (!for_each_rtx (&new_rtx, varying_mem_p, NULL))
     flags |= PR_HANDLE_MEM;
On Thu, Aug 25, 2011 at 09:17:59AM +0100, Richard Sandiford wrote:
Revital Eres revital.eres@linaro.org writes:
btw, do you also have numbers of how much SMS (hopefully) improves performance on top of the vectorized code?
OK, here's a comparison of:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fno-auto-inc-dec
vs:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
(including the register-scheduling patch). As you can see, it's a bit of a mixed bag.
Hmm, a mixed bag, really? It looks like only aes and resample truly benefit...