Richard Sandiford richard.sandiford@linaro.org writes:
Richard Sandiford richard.sandiford@linaro.org writes:
Revital Eres revital.eres@linaro.org writes:
btw, do you also have numbers of how much SMS (hopefully) improves performance on top of the vectorized code?
OK, here's a comparison of:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fno-auto-inc-dec
vs:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
Revital pointed out that I'd forgotten to list:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize
for both cases, which does make quite a big difference :-)
I looked at the mjpegenc regression, and the register pressure looks OK. I think it maxes out at around 20 vector double registers if you just consider the loop body. So I think this is actually a regalloc failure rather than an SMS one per se.
-fira-algorithm=priority removes all but one spill from the loop. I ran another test comparing:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
with:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec -fira-algorithm=priority
(soon this lot won't fit in my emacs window). I've attached the results below. In both cases, the compiler was current trunk with my move-scheduling patch applied.
Sorry for yet another missive on this subject, but I think part of the problem is that we still have moves such as:
(insn 106 105 107 5 (set (reg:V4SI 347 [ vect_var_.89 ]) (subreg:V4SI (reg:XI 391 [ vect_array.88 ]) 0)) src/mjpegenc.c:35 754 {*neon_movv4si} (nil))
(insn 107 106 108 5 (set (reg:V4SI 348 [ vect_var_.90 ]) (subreg:V4SI (reg:XI 391 [ vect_array.88 ]) 16)) src/mjpegenc.c:35 754 {*neon_movv4si} (nil))
(insn 108 107 109 5 (set (reg:V4SI 349 [ vect_var_.91 ]) (subreg:V4SI (reg:XI 391 [ vect_array.88 ]) 32)) src/mjpegenc.c:35 754 {*neon_movv4si} (nil))
(insn 109 108 111 5 (set (reg:V4SI 350 [ vect_var_.92 ]) (subreg:V4SI (reg:XI 391 [ vect_array.88 ]) 48)) src/mjpegenc.c:35 754 {*neon_movv4si} (expr_list:REG_DEAD (reg:XI 391 [ vect_array.88 ]) (nil)))
The hope is that these moves won't actually generate code. We want the register allocator to tie the target vector registers to the corresponding parts of the source "structure" register, so that the move becomes a no-op. And that does happen in most cases. It seems to happen more rarely when there's high register pressure though.
Even so, it's a bad idea to still have those moves around during scheduling, because the scheduler will have to assume that the moves will produce real code. That's true of both SMS and the normal scheduler. It's worse for SMS because the normal scheduler gets a second chance (after reload) to fix things up.
Two things stop us from propagating the subreg directly into the pattern:
- ARM's MODES_TIEABLE_P says that structure modes and vector modes can't be tied. One (correct) effect of this is to give the subreg a much higher cost than a plain register.
I think ARM's MODES_TIEABLE_P should be relaxed. In fact, I think this is really a piece that was missing from my CLASS_CANNOT_CHANGE_MODE VFP patch from earlier in the year.
- Even though rtx_costs treats "good" subregs as being as cheap as registers, passes like fwprop.c treat registers as being more desirable. E.g. fwprop propagates simple register copies, but not copies of a subreg. That probably made sense with the old flow pass and the old register allocators, but I don't think it should make much difference with df and IRA.
The patches below give good results even without -fira-algorithm=priority. I'm going to submit the ARM one after testing. The fwprop.c one is just a proof of concept though.
Richard
Index: gcc/config/arm/arm-protos.h =================================================================== --- gcc/config/arm/arm-protos.h 2011-08-26 09:10:25.387050720 +0100 +++ gcc/config/arm/arm-protos.h 2011-08-26 09:54:46.824238974 +0100 @@ -46,6 +46,7 @@ extern void arm_output_fn_unwind (FILE * extern bool arm_vector_mode_supported_p (enum machine_mode); extern bool arm_small_register_classes_for_mode_p (enum machine_mode); extern int arm_hard_regno_mode_ok (unsigned int, enum machine_mode); +extern bool arm_modes_tieable_p (enum machine_mode, enum machine_mode); extern int const_ok_for_arm (HOST_WIDE_INT); extern int arm_split_constant (RTX_CODE, enum machine_mode, rtx, HOST_WIDE_INT, rtx, rtx, int); Index: gcc/config/arm/arm.c =================================================================== --- gcc/config/arm/arm.c 2011-08-26 09:10:25.373050768 +0100 +++ gcc/config/arm/arm.c 2011-08-26 09:56:12.093969547 +0100 @@ -18109,6 +18109,29 @@ arm_hard_regno_mode_ok (unsigned int reg && regno <= LAST_FPA_REGNUM); }
+/* Implement MODES_TIEABLE_P. */ + +bool +arm_modes_tieable_p (enum machine_mode mode1, enum machine_mode mode2) +{ + if (GET_MODE_CLASS (mode1) == GET_MODE_CLASS (mode2)) + return true; + + /* We specifically want to allow elements of "structure" modes to + be tieable to the structure. This more general condition allows + other rarer situations too. */ + if (TARGET_NEON + && (VALID_NEON_DREG_MODE (mode1) + || VALID_NEON_QREG_MODE (mode1) + || VALID_NEON_STRUCT_MODE (mode1)) + && (VALID_NEON_DREG_MODE (mode2) + || VALID_NEON_QREG_MODE (mode2) + || VALID_NEON_STRUCT_MODE (mode2))) + return true; + + return false; +} + /* For efficiency and historical reasons LO_REGS, HI_REGS and CC_REGS are not used in arm mode. */
Index: gcc/config/arm/arm.h =================================================================== --- gcc/config/arm/arm.h 2011-08-26 09:10:25.387050720 +0100 +++ gcc/config/arm/arm.h 2011-08-26 09:54:46.839238927 +0100 @@ -962,12 +962,7 @@ #define HARD_REGNO_NREGS(REGNO, MODE) #define HARD_REGNO_MODE_OK(REGNO, MODE) \ arm_hard_regno_mode_ok ((REGNO), (MODE))
-/* Value is 1 if it is a good idea to tie two pseudo registers - when one has mode MODE1 and one has mode MODE2. - If HARD_REGNO_MODE_OK could produce different values for MODE1 and MODE2, - for any hard reg, then this must be 0 for correct output. */ -#define MODES_TIEABLE_P(MODE1, MODE2) \ - (GET_MODE_CLASS (MODE1) == GET_MODE_CLASS (MODE2)) +#define MODES_TIEABLE_P(MODE1, MODE2) arm_modes_tieable_p (MODE1, MODE2)
#define VALID_IWMMXT_REG_MODE(MODE) \ (arm_vector_mode_supported_p (MODE) || (MODE) == DImode)
Index: gcc/fwprop.c =================================================================== --- gcc/fwprop.c 2011-08-26 09:58:28.829540497 +0100 +++ gcc/fwprop.c 2011-08-26 10:14:03.767707504 +0100 @@ -664,7 +664,7 @@ propagate_rtx (rtx x, enum machine_mode return NULL_RTX;
flags = 0; - if (REG_P (new_rtx) || CONSTANT_P (new_rtx)) + if (REG_P (new_rtx) || CONSTANT_P (new_rtx) || GET_CODE (new_rtx) == SUBREG) flags |= PR_CAN_APPEAR; if (!for_each_rtx (&new_rtx, varying_mem_p, NULL)) flags |= PR_HANDLE_MEM;