On 30 November 2011 02:33, Michael Hope <michael.hope@linaro.org> wrote:
I then converted the vld1 and vst1 to specify an alignment of 64 bits. See: http://people.linaro.org/~michaelh/incoming/set-alignment.png
This improved the throughput in all cases, and by 14 % for more than 50 words. This graph also shows the overhead of the runtime peeling check. The blue line is the vectoriser version, which is slower to pick up due to the greater per-call overhead.
So, the auto-vectorized code doesn't have the alignment hints (peeling or not peeling), right? Is this what a hint is supposed to look like: vld1.i64 {d16-d17}, [r1 :"#_128"] , or am I looking for the wrong thing?
I thought that peeling should be useful at least for the hints.
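To make clear what I mean by peeling, here is a rough C sketch, not the vectoriser's actual output. The names add_one and peel_count are made up for illustration, and __builtin_assume_aligned is a GCC/Clang builtin that may not exist in older compilers. A scalar prologue runs until the address is 16-byte aligned, so the main loop can use aligned accesses, but its trip count is then only known at run time:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of scalar iterations needed before
   p reaches an 'align'-byte boundary, clamped to n elements.
   Assumes align is a power of two and p is element-aligned. */
size_t peel_count(const void *p, size_t align, size_t elem_size, size_t n)
{
    uintptr_t mis = (uintptr_t)p & (align - 1);
    size_t peel = mis ? (align - mis) / elem_size : 0;
    return peel < n ? peel : n;
}

/* Hypothetical kernel: add 1 to each of n words. */
void add_one(uint32_t *a, size_t n)
{
    size_t i = 0;
    size_t peel = peel_count(a, 16, sizeof *a, n);

    /* Prologue (the peeled iterations): runs until a + i is aligned. */
    for (; i < peel; i++)
        a[i] += 1;

    if (i < n) {
        /* Main loop: a + i is now 16-byte aligned, so the compiler
           may emit aligned vector loads and stores here. The trip
           count depends on 'peel', so it is a runtime value even
           when n is a compile-time constant. */
        uint32_t *p = __builtin_assume_aligned(a + i, 16);
        size_t m = (n - i) & ~(size_t)3;   /* whole 4-word vectors */
        for (size_t j = 0; j < m; j++)
            p[j] += 1;
        i += m;
    }

    /* Epilogue: leftover words that do not fill a vector. */
    for (; i < n; i++)
        a[i] += 1;
}
```

The runtime check lives in peel_count; without peeling, the main loop would keep its fixed count but could not assume alignment.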
I then went back to the vectoriser and changed the alignment of the struct to cause peeling to turn on and off. See: http://people.linaro.org/~michaelh/incoming/unroll.png
At 200 words, the version without peeling is 2.9 % faster. This is partly due to a fixed count loop turning into a runtime count due to unknown alignment.
This run also showed the effect of loop unrolling. The loop seems to be unrolled for loops of <= 64 words and drops off in performance past around 8 words. When the unrolling finally drops out, performance increases by 101 %.
I see register spills starting from COUNT=36.
Ira