 
Hi Richard,
chain, so what makes the SMS version of it worse than the non-SMS version?
I attached the SMS dump file. The problematic loop is the one with "SMS succeeded 36 2" (there are three loops in total in this file). Because of these accumulators the min ii is 36, which seems to cause SMS to make wrong decisions.
SMS iis 36 36 72 (rec_mii, mii, maxii)
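As a rough illustration of how the three values in that dump line could relate: the sketch below assumes that mii is the larger of a resource bound and the recurrence bound, and that maxii is twice mii unless a larger ASAP bound applies. That is consistent with 36 36 72, but the function names, the factor of 2 and the res_mii / max_asap inputs are assumptions made for illustration, not values taken from the dump.

```c
/* Sketch only: an assumed relationship between rec_mii, mii and maxii.
   Names and the factor of 2 are illustrative assumptions. */
static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* mii: the tighter of the resource bound and the recurrence bound. */
int sms_mii(int res_mii, int rec_mii)
{
    return max_int(res_mii, rec_mii);
}

/* maxii: upper limit on the ii values tried, assumed to be 2 * mii
   unless the (hypothetical) max_asap input is larger. */
int sms_maxii(int mii, int max_asap)
{
    return max_int(max_asap, 2 * mii);
}
```

With rec_mii = 36 and any smaller resource bound, this gives mii = 36 and maxii = 72, i.e. the recurrence through the accumulator alone determines the minimum ii.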
Btw, examining the following loop without SMS, compiled with:
-c -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp -O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize
I see two vmla.i32 in a row at the end of it, and I wonder why they end up so close together. Isn't there a delay between them that could be filled by moving a vsub between them?
Thanks, Revital
.L7:
	mov	r1, ip
	vldmia	r5!, {d18-d19}
	vmovl.s8	q11, d18
	add	r0, r0, #1
	vld1.16	{q12}, [r1]!
	cmp	r0, r7
	vmovl.s8	q9, d19
	add	ip, ip, #32
	vmovl.s16	q14, d22
	vmovl.s16	q10, d24
	vmovl.s16	q13, d25
	vmovl.s16	q11, d23
	vsub.i32	q10, q14, q10
	vld1.16	{q12}, [r1]
	vsub.i32	q11, q11, q13
	vmla.i32	q8, q10, q10
	vmovl.s16	q13, d18
	vmovl.s16	q10, d24
	vmovl.s16	q9, d19
	vmovl.s16	q12, d25
	vsub.i32	q10, q13, q10
	vmla.i32	q8, q11, q11
	vsub.i32	q9, q9, q12
	vmla.i32	q8, q10, q10
	vmla.i32	q8, q9, q9
	bcc	.L7
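
For readers following the assembly, here is a scalar C sketch of what the loop appears to compute: 8-bit and 16-bit signed inputs are widened to 32 bits, subtracted, squared and accumulated into a single sum. It is reconstructed from the instruction sequence only. The function name, parameter names and element types are assumptions, and the original test case may look different.

```c
#include <stddef.h>

/* Hypothetical scalar source reconstructed from the assembly:
   sum of squared differences between an 8-bit and a 16-bit
   signed array, accumulated in 32 bits. */
int sum_sq_diff(const signed char *a, const short *b, size_t n)
{
    int acc = 0;                        /* single accumulator (q8 in the asm) */
    for (size_t i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];  /* widen (vmovl), then vsub.i32 */
        acc += d * d;                   /* vmla.i32 into the accumulator */
    }
    return acc;
}
```

Every vmla.i32 in the vectorized loop reads and writes the same register q8, so the four multiply-accumulates form one dependence chain through that accumulator.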