On 11 June 2012 17:34, Ulrich Weigand Ulrich.Weigand@de.ibm.com wrote:
Mans Rullgard mans.rullgard@linaro.org wrote:
static void ps_hybrid_analysis_ileave_c(float (*out)[32][2], float L[2][38][64],
                                        int i, int len)
{
    int j;

    for (; i < 64; i++) {
        for (j = 0; j < len; j++) {
            out[i][j][0] = L[0][j][i];
            out[i][j][1] = L[1][j][i];
        }
    }
}
While gcc 4.6 does not attempt to vectorise this at all, 4.7 goes crazy and vectorises it with a massive slowdown: the result is about 20x slower than the non-vectorised version with Linaro 4.7, and much worse still with FSF 4.7.
Let me know if you need more information.
Thanks for the report; I can reproduce the problem.
There are a number of issues with how GCC chooses to vectorize this loop that we could potentially improve upon. However, it appears that no matter what, it probably isn't actually helpful to vectorize this loop in the first place.
It could be beneficial to merge the two adjacent 32-bit stores into a single 64-bit store. In this particular case the destination is actually 64-bit aligned, although there's no way for gcc to know this.
Fortunately, the vectorizer cost model clearly recognizes this fact (and will classify this loop as "not vectorized: vector version will never be profitable").
Unfortunately, it seems that on ARM, the cost model is actually off by default (it is enabled by default only on i386).
We'll have to enable the cost model on ARM by default as well (and probably tune it a bit to avoid regressions on other benchmarks).
However, for now, I'd recommend you use -fvect-cost-model when testing the vectorizer on libav.
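For reference, an invocation of a single file might look like the following; the cross-compiler triplet, -mfpu setting, and file name are illustrative, and only -fvect-cost-model is the flag recommended above:

```shell
# Illustrative cross-build of one translation unit; adjust the
# toolchain triplet, FPU setting, and file name to your setup.
arm-linux-gnueabihf-gcc -O3 -mfpu=neon -ftree-vectorize \
    -fvect-cost-model -c file.c -o file.o
```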
I'll add that flag and see what happens. Any other flags I should be using?