Michael Hope <michael.hope@linaro.org> writes:
While out benchmarking today, I ran across code similar to this:
int *a;
int *b;
int *c;

const int ad[320];
const int bd[320];
const int cd[320];

void fill()
{
  for (int i = 0; i < 320; i++)
    {
      a[i] = ad[i];
      b[i] = bd[i];
      c[i] = cd[i];
    }
}
I was surprised and happy to see the vectoriser kick in for the copy. The inner loop looks like:
    add     r5, r3, ip
    adds    r4, r3, r7
    vldmia  r2!, {d16-d17}
    vldmia  r1!, {d18-d19}
    adds    r0, r3, r6
    vst1.32 {q9}, [r5]
    vst1.32 {q8}, [r4]
    vldmia  r3, {d16-d17}
    adds    r3, r3, #16
    cmp     r3, r8
    vst1.32 {q8}, [r0]
    bne     .L3
so r3 is the loop variable and {ip,r7,r6} are the offsets from r3 to the destination pointers. Adding a __restrict doesn't change the code.
FWIW, this comes from ivopts. I raised the "problem" on gcc@ a few months back, but it seems to be intentional behaviour:
http://gcc.gnu.org/ml/gcc/2011-07/msg00050.html
That is, all things being equal, the current code tends to prefer cases where it can hoist the difference between potential ivs rather than creating separate ivs.
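To make the two shapes concrete, here is a rough C sketch of a single copy written both ways. It is only an illustration of the idea, not what ivopts literally produces: the function names are made up, and the uintptr_t arithmetic stands in for the address arithmetic the compiler does at the machine level (in source form it relies on implementation-defined pointer/integer conversions, which behave as expected on flat-address targets such as ARM).

```c
#include <stddef.h>
#include <stdint.h>

/* Shape 1: one induction variable (the source pointer) plus a
   loop-invariant difference hoisted out of the loop.  This mirrors
   "add r5, r3, ip": the destination address is recomputed from the
   single iv on every iteration.  Hypothetical name, for illustration. */
void copy_shared_iv(int *dst, const int *src, size_t n)
{
    /* Hoisted difference between the two candidate ivs. */
    uintptr_t delta = (uintptr_t)dst - (uintptr_t)src;
    const int *end = src + n;

    for (const int *s = src; s != end; s++) {
        int *d = (int *)((uintptr_t)s + delta);  /* iv + invariant offset */
        *d = *s;
    }
}

/* Shape 2: a separate induction variable per pointer.  Each pointer is
   bumped independently, which is the form that lends itself to
   post-increment (writeback) addressing.  Hypothetical name. */
void copy_separate_ivs(int *dst, const int *src, size_t n)
{
    const int *end = src + n;

    while (src != end)
        *dst++ = *src++;
}
```

Both functions copy the same data; the difference is only in how many values are updated per iteration versus how many invariants have to be kept live in registers across the loop.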
As far as the end of today's meeting goes: ivopts is one of those things on my unwritten list of areas that it would be nice to look at. I posted some benchmarks comparing -fivopts with -fno-ivopts to the benchmark list in July. As expected, ivopts does help in a lot of cases, but there were also a fair number of cases where turning it off significantly improved performance.
Richard, will your auto-inc/dec changes combine the final vldmia r3, add r3 into a vldmia r3! ?
Yeah, it should do.
Richard