Hi,
This week I looked into DENBench:
* sad8_c (a hot function from mp4encode) needs SLP reduction, but it also contains a cond_expr, which cannot be vectorized as a reduction, so I don't think there is anything I can do here.
* fdct_int32 (another hot function from mp4encode) now gets vectorized with the vzip/vuzp patch, but vectorization causes a performance degradation here because of multiple register spills.
I also noticed that vectorizer costs are not set for NEON, i.e., the default costs are used. So I am now working on costs for NEON and on taking registers into account in the vectorizer's cost model.
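To illustrate the sad8_c case, here is a minimal sketch of an 8x8 sum-of-absolute-differences loop. This is not the DENBench source: the function name sad8_sketch, its signature, and the way the absolute value is written are all assumptions, chosen only to show a reduction whose update goes through a conditional.

```c
#include <stdint.h>

/* Hypothetical 8x8 SAD kernel (an assumption, not the mp4encode code).
 * The accumulation into 'sad' is the reduction; the absolute value is
 * written as a conditional, which is the kind of construct that can
 * appear as a cond_expr inside the reduction chain. */
uint32_t sad8_sketch(const uint8_t *cur, const uint8_t *ref, uint32_t stride)
{
    uint32_t sad = 0;

    for (uint32_t j = 0; j < 8; j++) {
        for (uint32_t i = 0; i < 8; i++) {
            int d = cur[i] - ref[i];              /* difference of two pixels */
            sad += (uint32_t)((d < 0) ? -d : d);  /* conditional absolute value */
        }
        cur += stride;                            /* advance to the next row */
        ref += stride;
    }
    return sad;
}
```

Calling it on two 8x8 blocks stored contiguously (stride 8) returns the sum of the per-pixel absolute differences. Whether the compiler keeps the conditional as a cond_expr or folds it into an absolute-value operation depends on how the source is written and on the optimization passes that run before the vectorizer.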
I also did some general vectorization research, looking at opportunities for collaboration with the GRAPHITE pass and with auto-parallelization.
Ira