While benchmarking the auto-vectoriser on Libav, I noticed a performance regression in gcc 4.7 (both FSF and Linaro) compared to gcc 4.6 in the AAC decoder. I narrowed it down to this function:
static void ps_hybrid_analysis_ileave_c(float (*out)[32][2], float L[2][38][64],
                                        int i, int len)
{
    int j;

    for (; i < 64; i++) {
        for (j = 0; j < len; j++) {
            out[i][j][0] = L[0][j][i];
            out[i][j][1] = L[1][j][i];
        }
    }
}
While gcc 4.6 does not attempt to vectorise this at all, 4.7 goes crazy with a massive slowdown, about 20x slower than non-vectorised with Linaro 4.7 and much worse with FSF 4.7.
Let me know if you need more information.
Mans Rullgard mans.rullgard@linaro.org wrote:
> static void ps_hybrid_analysis_ileave_c(float (*out)[32][2], float L[2][38][64],
>                                         int i, int len)
> {
>     int j;
>
>     for (; i < 64; i++) {
>         for (j = 0; j < len; j++) {
>             out[i][j][0] = L[0][j][i];
>             out[i][j][1] = L[1][j][i];
>         }
>     }
> }
>
> While gcc 4.6 does not attempt to vectorise this at all, 4.7 goes crazy with a massive slowdown, about 20x slower than non-vectorised with Linaro 4.7 and much worse with FSF 4.7.
>
> Let me know if you need more information.
Thanks for the report; I can reproduce the problem.
There are a number of issues with how GCC chooses to vectorize this loop that we can potentially improve upon. However, it would appear that no matter what, it probably isn't actually helpful to try to vectorize this loop in the first place.
Fortunately, the vectorizer cost model clearly recognizes this fact (and will classify this loop as "not vectorized: vector version will never be profitable").
Unfortunately, it seems that on ARM, the cost model is actually off by default (it is enabled by default only on i386).
We'll have to enable the cost model on ARM by default as well (and probably tune it a bit to avoid regressions on other benchmarks).
However for now, I'd recommend you use -fvect-cost-model when testing the vectorizer on libav.
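For anyone who wants to try this outside of libav, a minimal standalone harness might look like the sketch below. The function body is the one from the report; everything else (the global buffers, the `run_benchmark` helper, the fill pattern, and the build lines in the comment) is an assumption made for illustration and is not libav code. The `noinline` attribute and the empty `asm` barrier are GCC-specific.

```c
/*
 * Standalone harness for the loop from the report (a sketch, not libav code).
 *
 * Assumed build lines; add whatever target flags you normally use:
 *   gcc -O3 -c harness.c                     (vectorizer on, cost model off)
 *   gcc -O3 -fvect-cost-model -c harness.c   (vectorizer on, cost model on)
 */
#include <assert.h>
#include <time.h>

static float L[2][38][64];
static float out[64][32][2];

/* noinline keeps the loop in its own function, as it is in the decoder. */
__attribute__((noinline))
static void ps_hybrid_analysis_ileave_c(float (*out)[32][2], float L[2][38][64],
                                        int i, int len)
{
    int j;

    for (; i < 64; i++) {
        for (j = 0; j < len; j++) {
            out[i][j][0] = L[0][j][i];
            out[i][j][1] = L[1][j][i];
        }
    }
}

/* Fill L with a recognisable pattern, call the function `iterations` times,
 * spot-check the result, and return the elapsed CPU time in seconds. */
double run_benchmark(int iterations)
{
    int k, j, i, n;
    clock_t start, end;

    for (k = 0; k < 2; k++)
        for (j = 0; j < 38; j++)
            for (i = 0; i < 64; i++)
                L[k][j][i] = (float)(k * 10000 + j * 100 + i);

    start = clock();
    for (n = 0; n < iterations; n++) {
        ps_hybrid_analysis_ileave_c(out, L, 0, 32);
        /* Compiler barrier so repeated calls are not optimised away. */
        __asm__ volatile("" ::: "memory");
    }
    end = clock();

    /* out[i][j][k] must equal L[k][j][i] for the range that was processed. */
    assert(out[5][7][0] == L[0][7][5]);
    assert(out[5][7][1] == L[1][7][5]);

    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `run_benchmark()` from a small `main` and printing the result for each of the two builds should make the difference visible; the iteration count is arbitrary and will need adjusting to the speed of the board.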
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
  Dr. Ulrich Weigand | Phone: +49-7031/16-3727
  STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
  IBM Deutschland Research & Development GmbH
  Vorsitzende des Aufsichtsrats: Martina Koederitz | Geschäftsführung: Dirk Wittkopp
  Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht Stuttgart, HRB 243294
On 11 June 2012 17:34, Ulrich Weigand Ulrich.Weigand@de.ibm.com wrote:
> Thanks for the report; I can reproduce the problem.
>
> There are a number of issues with how GCC chooses to vectorize this loop that we can potentially improve upon. However, it would appear that no matter what, it probably isn't actually helpful to try to vectorize this loop in the first place.
It could be beneficial to merge the stores into a single 64-bit store. In this particular case, it is actually 64-bit aligned, although there's no way for gcc to know this.
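To make the idea concrete, here is a rough sketch of what merging the two float stores into one 64-bit store could look like when written by hand. The names `ileave_ref`, `ileave_paired` and `ileave_versions_agree` are made up for illustration, and a `memcpy` of an 8-byte pair is just one portable way to express the combined store; whether a given gcc actually turns it into a single 64-bit store instruction is not something this sketch guarantees.

```c
#include <assert.h>
#include <string.h>

/* Reference version: the loop from the report, two 32-bit stores per step. */
static void ileave_ref(float (*out)[32][2], float L[2][38][64], int i, int len)
{
    int j;

    for (; i < 64; i++) {
        for (j = 0; j < len; j++) {
            out[i][j][0] = L[0][j][i];
            out[i][j][1] = L[1][j][i];
        }
    }
}

/* Hypothetical variant: build the pair in a local, then copy 8 bytes at once. */
static void ileave_paired(float (*out)[32][2], float L[2][38][64], int i, int len)
{
    int j;

    for (; i < 64; i++) {
        for (j = 0; j < len; j++) {
            float pair[2];

            pair[0] = L[0][j][i];
            pair[1] = L[1][j][i];
            memcpy(out[i][j], pair, sizeof pair);
        }
    }
}

/* Returns 1 if both versions produce byte-identical output for a test pattern. */
int ileave_versions_agree(void)
{
    static float L[2][38][64];
    static float a[64][32][2], b[64][32][2];
    int k, j, i;

    /* The "64-bit store" wording assumes a 32-bit float. */
    assert(sizeof(float[2]) == 8);

    for (k = 0; k < 2; k++)
        for (j = 0; j < 38; j++)
            for (i = 0; i < 64; i++)
                L[k][j][i] = (float)(k * 10000 + j * 100 + i);

    ileave_ref(a, L, 0, 32);
    ileave_paired(b, L, 0, 32);

    return memcmp(a, b, sizeof a) == 0;
}
```

The check at the end only confirms that the two versions write the same data; it says nothing about which one is faster on a particular core.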
> Fortunately, the vectorizer cost model clearly recognizes this fact (and will classify this loop as "not vectorized: vector version will never be profitable").
>
> Unfortunately, it seems that on ARM, the cost model is actually off by default (it is enabled by default only on i386).
>
> We'll have to enable the cost model on ARM by default as well (and probably tune it a bit to avoid regressions on other benchmarks).
>
> However for now, I'd recommend you use -fvect-cost-model when testing the vectorizer on libav.
I'll add that flag and see what happens. Any other flags I should be using?
Mans Rullgard mans.rullgard@linaro.org wrote on 11.06.2012 19:23:53:
> On 11 June 2012 17:34, Ulrich Weigand Ulrich.Weigand@de.ibm.com wrote:
>> There are a number of issues with how GCC chooses to vectorize this loop that we can potentially improve upon. However, it would appear that no matter what, it probably isn't actually helpful to try to vectorize this loop in the first place.
>
> It could be beneficial to merge the stores into a single 64-bit store.
Right, I guess this could still be done by SLP even if the loop isn't vectorized ...
>> However for now, I'd recommend you use -fvect-cost-model when testing the vectorizer on libav.
>
> I'll add that flag and see what happens. Any other flags I should be using?
I can't think of anything else right now ...
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
linaro-toolchain@lists.linaro.org