I had a play with the vectoriser to see how peeling, unrolling, and alignment affect the performance of simple memory-bound loops.
The short story is:

* For fixed-length loops, don't peel
* Performance is the same for arrays aligned to 8 bytes and up
* Performance is very similar for unaligned arrays
* vld1 is as fast as vldmia
* vld1 with a specified alignment is much faster than plain vld1
The loop is rather ugly and artificial::

    void op(struct aints * __restrict out, const struct aints * __restrict in)
    {
        for (int i = 0; i < COUNT; i++) {
            out->v[i] = (in->v[i] * 173) | in->v[i];
        }
    }
where `struct aints` is an aligned structure. I couldn't figure out how to use an aligned typedef of int without still introducing a runtime check; I assume I was running into some type of runtime alias checking.
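For reference, the structure was along these lines (a sketch assuming GCC's aligned attribute; the COUNT and alignment values shown here are placeholders that were varied between runs)::

    /* Sketch of the test structure.  COUNT and the alignment value
       are assumptions; both were varied between runs. */
    #define COUNT 200

    struct aints {
        int v[COUNT];
    } __attribute__((aligned(32)));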
This compiled into::
        vmov.i32 q10, #173
        add r3, r0, #5
    0:  vldmia r1!, {d16-d17}
        vmul.i32 q9, q8, q10
        vorr q8, q9, q8
        vstmia r0!, {d16-d17}
        cmp r0, r3
        bne 0b
I then lied to the compiler by changing the actual alignment at runtime. See: http://people.linaro.org/~michaelh/incoming/runtime-offset.png
The performance didn't change for actual alignments of 8, 16, or 32 bytes.
I then converted the loop into one using vld1 and fed it smaller alignments. See: http://people.linaro.org/~michaelh/incoming/small-offsets.png
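The converted loop was along these lines (a sketch using NEON intrinsics rather than the exact test source; vld1q_s32/vst1q_s32 compile to vld1.32/vst1.32 with no alignment hint, so they accept the smaller alignments)::

    #include <arm_neon.h>

    /* Sketch of the vld1-based variant.  The intrinsic loads and
       stores carry no alignment hint, so they work for any actual
       alignment; COUNT is assumed to be a multiple of 4. */
    void op_vld1(int32_t * __restrict out, const int32_t * __restrict in)
    {
        int32x4_t k = vdupq_n_s32(173);

        for (int i = 0; i < COUNT; i += 4) {
            int32x4_t x = vld1q_s32(in + i);
            vst1q_s32(out + i, vorrq_s32(vmulq_s32(x, k), x));
        }
    }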
The throughput falls into two camps: one for alignments of 1, 2, or 4 bytes and one for 8, 16, or 32. The throughput is very similar for both camps but has some strange drop-offs at 24 words, around 48 words, and around 96 words. The terminal throughput at 300 words and above is within 0.5 %.
I then converted the vld1 and vst1 to specify an alignment of 64 bits. See: http://people.linaro.org/~michaelh/incoming/set-alignment.png
This improved the throughput in all cases, and by 14 % for loops of more than 50 words. This graph also shows the overhead of the runtime peeling check: the blue line is the vectoriser version, which is slower to pick up due to the greater per-call overhead.
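For reference, the hinted kernel was along these lines (a hypothetical reconstruction using GCC extended asm, not the exact harness source; op_hinted is a made-up name, and the :64 address qualifier is in bits)::

    #include <stdint.h>

    /* Sketch: the same kernel with explicit :64 alignment hints.
       The :64 qualifier promises the addresses are 64-bit aligned;
       count is assumed to be a non-zero multiple of 4. */
    void op_hinted(int32_t *out, const int32_t *in, int count)
    {
        asm volatile(
            "vmov.i32 q10, #173\n"
            "0:\n\t"
            "vld1.i64 {d16-d17}, [%1:64]!\n\t"
            "vmul.i32 q9, q8, q10\n\t"
            "vorr q8, q9, q8\n\t"
            "vst1.i64 {d16-d17}, [%0:64]!\n\t"
            "subs %2, %2, #4\n\t"
            "bne 0b"
            : "+r"(out), "+r"(in), "+r"(count)
            :
            : "d16", "d17", "d18", "d19", "d20", "d21", "cc", "memory");
    }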
I then went back to the vectoriser and changed the alignment of the struct to cause peeling to turn on and off. See: http://people.linaro.org/~michaelh/incoming/unroll.png
At 200 words, the version without peeling is 2.9 % faster. This is partly due to a fixed-count loop turning into a runtime count because of the unknown alignment.
This run also showed the effect of loop unrolling. The loop seems to be unrolled for loops of <= 64 words, and performance drops off past around 8 words. When the unrolling finally drops out, performance increases by 101 %.
Raw results and the test cases are available in lp:~linaro-toolchain-dev/linaro-toolchain-benchmarks/private-runs
A graph of all results is at: http://people.linaro.org/~michaelh/incoming/everything.png
The usual caveats apply: this test was all in L1, only on the A9, and very artificial.
-- Michael
On 30 November 2011 02:33, Michael Hope michael.hope@linaro.org wrote:
> I then converted the vld1 and vst1 to specify an alignment of 64 bits. See: http://people.linaro.org/~michaelh/incoming/set-alignment.png
>
> This improved the throughput in all cases, and by 14 % for loops of more than 50 words. This graph also shows the overhead of the runtime peeling check: the blue line is the vectoriser version, which is slower to pick up due to the greater per-call overhead.
So, the auto-vectorized code doesn't have the alignment hints (peeling or not peeling), right? Is this what a hint is supposed to look like: vld1.i64 {d16-d17}, [r1:128], or am I looking for the wrong thing?
I thought that peeling should be useful at least for the hints.
> I then went back to the vectoriser and changed the alignment of the struct to cause peeling to turn on and off. See: http://people.linaro.org/~michaelh/incoming/unroll.png
>
> At 200 words, the version without peeling is 2.9 % faster. This is partly due to a fixed-count loop turning into a runtime count because of the unknown alignment.
>
> This run also showed the effect of loop unrolling. The loop seems to be unrolled for loops of <= 64 words, and performance drops off past around 8 words. When the unrolling finally drops out, performance increases by 101 %.
I see register spills starting from COUNT=36.
Ira
On Thu, Dec 1, 2011 at 12:20 AM, Ira Rosen ira.rosen@linaro.org wrote:
> On 30 November 2011 02:33, Michael Hope michael.hope@linaro.org wrote:
>
>> I then converted the vld1 and vst1 to specify an alignment of 64 bits. See: http://people.linaro.org/~michaelh/incoming/set-alignment.png
>>
>> This improved the throughput in all cases, and by 14 % for loops of more than 50 words. This graph also shows the overhead of the runtime peeling check: the blue line is the vectoriser version, which is slower to pick up due to the greater per-call overhead.
>
> So, the auto-vectorized code doesn't have the alignment hints (peeling or not peeling), right? Is this what a hint is supposed to look like: vld1.i64 {d16-d17}, [r1:128], or am I looking for the wrong thing?
Yip. We currently use a vldmia r1!, {d16-d17}, which (on the A9 at least) only works for aligned values and takes the same time as the unaligned-friendly vld1.i64 {d16-d17}, [r1]!
> I thought that peeling should be useful at least for the hints.
Peeling and using the vld1.i64 {d16-d17}, [r1:64]! form should be faster for larger loops. For some reason vld1.i64 ..., [r1:128] gives an illegal instruction trap on my board. Note that the :128 is in bits.
>> I then went back to the vectoriser and changed the alignment of the struct to cause peeling to turn on and off. See: http://people.linaro.org/~michaelh/incoming/unroll.png
>>
>> At 200 words, the version without peeling is 2.9 % faster. This is partly due to a fixed-count loop turning into a runtime count because of the unknown alignment.
>>
>> This run also showed the effect of loop unrolling. The loop seems to be unrolled for loops of <= 64 words, and performance drops off past around 8 words. When the unrolling finally drops out, performance increases by 101 %.
>
> I see register spills starting from COUNT=36.
Ah. Does the vectoriser cost model take register pressure into account? How can I turn this on?
-- Michael
On 30 November 2011 20:28, Michael Hope michael.hope@linaro.org wrote:
> On Thu, Dec 1, 2011 at 12:20 AM, Ira Rosen ira.rosen@linaro.org wrote:
>
>> On 30 November 2011 02:33, Michael Hope michael.hope@linaro.org wrote:
>
> Peeling and using the vld1.i64 {d16-d17}, [r1:64]! form should be faster for larger loops. For some reason vld1.i64 ..., [r1:128] gives an illegal instruction trap on my board. Note that the :128 is in bits.
Are you sure the address is 128-bit aligned? I think the reason for the failure is the behaviour of memalign. Changing the memaligns at the top from 8 to ALIGN appears to fix the problem - or was that deliberate?
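That is, something along these lines in the harness setup (a sketch; alloc_words is a hypothetical helper, though ALIGN is the macro mentioned above)::

    #include <stddef.h>
    #include <stdint.h>
    #include <malloc.h>

    #define ALIGN 16   /* bytes, i.e. 128 bits */

    /* Sketch: the buffers must really have the alignment that the
       :128 hint promises.  memalign(8, ...) only guarantees 8-byte
       alignment, which traps when used with a :128-hinted vld1. */
    static int32_t *alloc_words(size_t count)
    {
        return memalign(ALIGN, count * sizeof(int32_t));
    }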
Regards, Ramana
On Thu, Dec 1, 2011 at 12:20 AM, Ira Rosen ira.rosen@linaro.org wrote:
> On 30 November 2011 02:33, Michael Hope michael.hope@linaro.org wrote:
>
>> I then converted the vld1 and vst1 to specify an alignment of 64 bits. See: http://people.linaro.org/~michaelh/incoming/set-alignment.png
>>
>> This improved the throughput in all cases, and by 14 % for loops of more than 50 words. This graph also shows the overhead of the runtime peeling check: the blue line is the vectoriser version, which is slower to pick up due to the greater per-call overhead.
>
> So, the auto-vectorized code doesn't have the alignment hints (peeling or not peeling), right? Is this what a hint is supposed to look like: vld1.i64 {d16-d17}, [r1:128], or am I looking for the wrong thing?
I had a look in the backend and the vld1/vst1 %A operand adds the alignment if known. It correctly adds [r1:64] if I feed in an array of int64s. The checks are based on the MEM_ALIGN and MEM_SIZE of the operand::

    align = MEM_ALIGN (x) >> 3;
    memsize = INTVAL (MEM_SIZE (x));
Not sure why the backend generates a vldmia instead of a vld1 though.
-- Michael
On 30 November 2011 22:28, Michael Hope michael.hope@linaro.org wrote:
>>> This run also showed the effect of loop unrolling. The loop seems to be unrolled for loops of <= 64 words, and performance drops off past around 8 words. When the unrolling finally drops out, performance increases by 101 %.
>>
>> I see register spills starting from COUNT=36.
>
> Ah. Does the vectoriser cost model take register pressure into account? How can I turn this on?
No, but the vectorizer doesn't perform loop unrolling either. The unrolling here is done by the complete_unroll pass after vectorization, and AFAIK it doesn't take register pressure into account.
On 1 December 2011 02:40, Michael Hope michael.hope@linaro.org wrote:
> I had a look in the backend and the vld1/vst1 %A operand adds the alignment if known. It correctly adds [r1:64] if I feed in an array of int64s. The checks are based on the MEM_ALIGN and MEM_SIZE of the operand: align = MEM_ALIGN (x) >> 3; memsize = INTVAL (MEM_SIZE (x));
>
> Not sure why the backend generates a vldmia instead of a vld1 though.
I don't see how the alignment info set by the vectorizer influences MEM_ALIGN. The vectorizer sets the align and misalign fields of struct ptr_info_def. I see it used in expand_expr_real_1, for MEM_REF, only to decide whether a movmisalign is needed (for unaligned accesses).
MEM_ALIGN is determined in set_mem_attributes_minus_bitpos from DECL_ALIGN or TYPE_ALIGN. For the cases where the vectorizer forces alignment this should work, since we then set DECL_ALIGN (in vect_compute_data_ref_alignment). But peeling obviously doesn't change DECL_ALIGN, so I don't understand how we can create an alignment hint in this case with the current code.
Ira