Re: Effect of SMS register move scheduling

26 Aug 2011

Richard Sandiford richard.sandiford@linaro.org writes:
...
Richard Sandiford richard.sandiford@linaro.org writes:
...
Revital Eres revital.eres@linaro.org writes:
...
btw, do you also have numbers of how much SMS (hopefully) improves
performance on top of the vectorized code?
OK, here's a comparison of:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
-fno-auto-inc-dec


vs:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
-fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec

Revital pointed out that I'd forgotten to list:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize


for both cases, which does make quite a big difference :-)
I looked at the mjpegenc regression, and the register pressure looks OK.
I think it maxes out at around 20 vector double registers if you just
consider the loop body.  So I think this is actually a regalloc failure
rather than an SMS one per se.
-fira-algorithm=priority removes all but one spill from the loop.
I ran another test comparing:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize
   -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
   -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
with:
-O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize
   -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
   -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
   -fira-algorithm=priority
(soon this lot won't fit in my emacs window).  I've attached the
results below.  In both cases, the compiler was current trunk with my
move-scheduling patch applied.
Sorry for yet another missive on this subject, but I think part of
the problem is that we still have moves such as:
(insn 106 105 107 5 (set (reg:V4SI 347 [ vect_var_.89 ])
        (subreg:V4SI (reg:XI 391 [ vect_array.88 ]) 0)) src/mjpegenc.c:35 754 {*neon_movv4si}
     (nil))
(insn 107 106 108 5 (set (reg:V4SI 348 [ vect_var_.90 ])
        (subreg:V4SI (reg:XI 391 [ vect_array.88 ]) 16)) src/mjpegenc.c:35 754 {*neon_movv4si}
     (nil))
(insn 108 107 109 5 (set (reg:V4SI 349 [ vect_var_.91 ])
        (subreg:V4SI (reg:XI 391 [ vect_array.88 ]) 32)) src/mjpegenc.c:35 754 {*neon_movv4si}
     (nil))
(insn 109 108 111 5 (set (reg:V4SI 350 [ vect_var_.92 ])
        (subreg:V4SI (reg:XI 391 [ vect_array.88 ]) 48)) src/mjpegenc.c:35 754 {*neon_movv4si}
     (expr_list:REG_DEAD (reg:XI 391 [ vect_array.88 ])
        (nil)))
The hope is that these moves won't actually generate code.  We want the
register allocator to tie the target vector registers to the corresponding
parts of the source "structure" register, so that the move becomes a no-op.
And that does happen in most cases.  It seems to happen more rarely when
there's high register pressure though.
Even so, it's a bad idea to still have those moves around during scheduling,
because the scheduler will have to assume that the moves will produce
real code.  That's true of both SMS and the normal scheduler.
It's worse for SMS because the normal scheduler gets a second chance
(after reload) to fix things up.
Two things stop us from propagating the subreg directly into
the pattern:
- ARM's MODES_TIEABLE_P says that structure modes and vector modes
    can't be tied.  One (correct) effect of this is to give the subreg
    a much higher cost than a plain register.
I think ARM's MODES_TIEABLE_P should be relaxed.  In fact, I think this
    is really a piece that was missing from my CLASS_CANNOT_CHANGE_MODE 
    VFP patch from earlier in the year.
- Even though rtx_costs treats "good" subregs as being as cheap as
    registers, passes like fwprop.c treat registers as being more
    desirable.  E.g. fwprop propagates simple register copies,
    but not copies of a subreg.  That probably made sense with
    the old flow pass and the old register allocators, but I don't
    think it should make much difference with df and IRA.
The patches below give good results even without -fira-algorithm=priority.
I'm going to submit the ARM one after testing.  The fwprop.c one is just
a proof of concept though.
Richard
Index: gcc/config/arm/arm-protos.h
===================================================================

--- gcc/config/arm/arm-protos.h	2011-08-26 09:10:25.387050720 +0100
+++ gcc/config/arm/arm-protos.h	2011-08-26 09:54:46.824238974 +0100
@@ -46,6 +46,7 @@ extern void arm_output_fn_unwind (FILE *
 extern bool arm_vector_mode_supported_p (enum machine_mode);
 extern bool arm_small_register_classes_for_mode_p (enum machine_mode);
 extern int arm_hard_regno_mode_ok (unsigned int, enum machine_mode);
+extern bool arm_modes_tieable_p (enum machine_mode, enum machine_mode);
 extern int const_ok_for_arm (HOST_WIDE_INT);
 extern int arm_split_constant (RTX_CODE, enum machine_mode, rtx,
    		       HOST_WIDE_INT, rtx, rtx, int);
Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	2011-08-26 09:10:25.373050768 +0100
+++ gcc/config/arm/arm.c	2011-08-26 09:56:12.093969547 +0100
@@ -18109,6 +18109,29 @@ arm_hard_regno_mode_ok (unsigned int reg
      && regno <= LAST_FPA_REGNUM);
 }
+/* Implement MODES_TIEABLE_P.  */
+
+bool
+arm_modes_tieable_p (enum machine_mode mode1, enum machine_mode mode2)
+{
+  if (GET_MODE_CLASS (mode1) == GET_MODE_CLASS (mode2))
+    return true;
+
+  /* We specifically want to allow elements of "structure" modes to
+     be tieable to the structure.  This more general condition allows
+     other rarer situations too.  */
+  if (TARGET_NEON
+      && (VALID_NEON_DREG_MODE (mode1)
+	  || VALID_NEON_QREG_MODE (mode1)
+	  || VALID_NEON_STRUCT_MODE (mode1))
+      && (VALID_NEON_DREG_MODE (mode2)
+	  || VALID_NEON_QREG_MODE (mode2)
+	  || VALID_NEON_STRUCT_MODE (mode2)))
+    return true;
+
+  return false;
+}
+
 /* For efficiency and historical reasons LO_REGS, HI_REGS and CC_REGS are
    not used in arm mode.  */
Index: gcc/config/arm/arm.h
===================================================================
--- gcc/config/arm/arm.h	2011-08-26 09:10:25.387050720 +0100
+++ gcc/config/arm/arm.h	2011-08-26 09:54:46.839238927 +0100
@@ -962,12 +962,7 @@ #define HARD_REGNO_NREGS(REGNO, MODE)  	
 #define HARD_REGNO_MODE_OK(REGNO, MODE)					\
   arm_hard_regno_mode_ok ((REGNO), (MODE))
-/* Value is 1 if it is a good idea to tie two pseudo registers
-   when one has mode MODE1 and one has mode MODE2.
-   If HARD_REGNO_MODE_OK could produce different values for MODE1 and MODE2,
-   for any hard reg, then this must be 0 for correct output.  */
-#define MODES_TIEABLE_P(MODE1, MODE2)  \
-  (GET_MODE_CLASS (MODE1) == GET_MODE_CLASS (MODE2))
+#define MODES_TIEABLE_P(MODE1, MODE2) arm_modes_tieable_p (MODE1, MODE2)
#define VALID_IWMMXT_REG_MODE(MODE) \
  (arm_vector_mode_supported_p (MODE) || (MODE) == DImode)
Index: gcc/fwprop.c
===================================================================
--- gcc/fwprop.c	2011-08-26 09:58:28.829540497 +0100
+++ gcc/fwprop.c	2011-08-26 10:14:03.767707504 +0100
@@ -664,7 +664,7 @@ propagate_rtx (rtx x, enum machine_mode 
     return NULL_RTX;
flags = 0;
-  if (REG_P (new_rtx) || CONSTANT_P (new_rtx))
+  if (REG_P (new_rtx) || CONSTANT_P (new_rtx) || GET_CODE (new_rtx) == SUBREG)
     flags |= PR_CAN_APPEAR;
   if (!for_each_rtx (&new_rtx, varying_mem_p, NULL))
     flags |= PR_HANDLE_MEM;

    

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

Re: Effect of SMS register move scheduling