While benchmarking the auto-vectoriser on Libav, I noticed a performance regression in gcc 4.7 (both FSF and Linaro) compared to gcc 4.6 in the AAC decoder. I narrowed it down to this function:

    static void ps_hybrid_analysis_ileave_c(float (*out)[32][2], float L[2][38][64],
                                            int i, int len)
    {
        int j;

        for (; i < 64; i++) {
            for (j = 0; j < len; j++) {
                out[i][j][0] = L[0][j][i];
                out[i][j][1] = L[1][j][i];
            }
        }
    }

gcc 4.6 does not attempt to vectorise this loop at all. gcc 4.7 does vectorise it, and the result is a massive slowdown: about 20x slower than the non-vectorised code with Linaro 4.7, and much worse with FSF 4.7.
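For anyone who wants to look at this outside Libav, here is a minimal sketch of a standalone harness around the function. Only the kernel itself comes from the code above. The buffer names (`in_buf`, `out_buf`), the fill pattern, the helpers `run_and_check` and `time_kernel`, and the `noinline` attribute are assumptions added for illustration, and the start index and length are whatever the caller passes, not the values the decoder uses.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical buffers, sized to match the function signature. */
static float in_buf[2][38][64];
static float out_buf[64][32][2];

/* Kernel as in the report. The noinline attribute is an addition, so that
 * the compiler does not specialise the loop for constant arguments. */
static void __attribute__((noinline))
ps_hybrid_analysis_ileave_c(float (*out)[32][2], float L[2][38][64],
                            int i, int len)
{
    int j;

    for (; i < 64; i++) {
        for (j = 0; j < len; j++) {
            out[i][j][0] = L[0][j][i];
            out[i][j][1] = L[1][j][i];
        }
    }
}

/* Fill the input with distinct values that are exact in single precision. */
static void fill_input(void)
{
    int k, j, i;

    for (k = 0; k < 2; k++)
        for (j = 0; j < 38; j++)
            for (i = 0; i < 64; i++)
                in_buf[k][j][i] = (float)(k * 10000 + j * 100 + i);
}

/* Run the kernel once; return 1 if every element it wrote is correct. */
static int run_and_check(int start, int len)
{
    int i, j;

    assert(start >= 0 && len >= 0 && len <= 32);
    fill_input();
    ps_hybrid_analysis_ileave_c(out_buf, in_buf, start, len);

    for (i = start; i < 64; i++)
        for (j = 0; j < len; j++)
            if (out_buf[i][j][0] != in_buf[0][j][i] ||
                out_buf[i][j][1] != in_buf[1][j][i])
                return 0;
    return 1;
}

/* Call the kernel reps times and return the CPU time used, in seconds. */
static double time_kernel(long reps, int start, int len)
{
    long r;
    clock_t t0, t1;

    assert(start >= 0 && len >= 0 && len <= 32);
    fill_input();
    t0 = clock();
    for (r = 0; r < reps; r++)
        ps_hybrid_analysis_ileave_c(out_buf, in_buf, start, len);
    t1 = clock();

    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Calling `time_kernel` from a small `main` and building the file twice, once with `-O3` and once with `-O3 -fno-tree-vectorize`, should give a vectorised and a non-vectorised timing to compare on the same compiler.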
Let me know if you need more information.