While out benchmarking today, I ran across code similar to this:
int *a; int *b; int *c;
const int ad[320]; const int bd[320]; const int cd[320];
void fill() { for (int i = 0; i < 320; i++) { a[i] = ad[i]; b[i] = bd[i]; c[i] = cd[i]; } }
I was surprised and happy to see the vectoriser kick in for the copy. The inner loop looks like:
add r5, r3, ip adds r4, r3, r7 vldmia r2!, {d16-d17} vldmia r1!, {d18-d19} adds r0, r3, r6 vst1.32 {q9}, [r5] vst1.32 {q8}, [r4] vldmia r3, {d16-d17} adds r3, r3, #16 cmp r3, r8 vst1.32 {q8}, [r0] bne .L3
so r3 is the loop variable and {ip,r7} are the offsets from r3 to the destination pointers. Adding a __restrict doesn't change the code.
Richard, will your auto-inc/dec changes combine the final vldmia r3, add r3 into a vldmia r3! ?
Changing the int *a into in-file arrays like int a[320] gives:
vldmia r0!, {d16-d17} vldmia r5!, {d18-d19} vstmia r4!, {d18-d19} vstmia r1!, {d16-d17} vldmia r2!, {d16-d17} vstmia r3!, {d16-d17} cmp r3, r6 bne .L2
Marking them as extern int a[320] goes back to the first form.
Can we always use the second form? What optimisation is preventing it?
-- Michael