I've updated:
https://wiki.linaro.org/RichardSandiford/Sandbox/NeonLibAv
so that it gives the output for current trunk, including Ira's commit yesterday to reduce the amount of overpromotion. I also reran the microbenchmarks. The good news is that the vectorised code is now better in all cases than the non-vectorised code.
The biggest winner since last time is rgb24tobgr16_C(). It used to be much slower with vectorisation than without, because of lots of excessive widening. Thanks to Ira's patch, the loop now looks pretty respectable, and is ~3.25x faster than the non-vectorised code.
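For anyone who doesn't have the source to hand, here is a rough sketch of the kind of loop involved. It is a simplified stand-in written for illustration, not the libav code itself: the name rgb24_to_bgr16_sketch, the signature and the assert are my own assumptions. The point is that each 8-bit input is promoted to int by the usual C rules, even though the packed result only needs 16 bits, which is the sort of widening the vectoriser has to avoid carrying through the whole loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 24-bit-to-16-bit pixel packing loop.
   Not the libav source; name and signature are illustrative only. */
void rgb24_to_bgr16_sketch(const uint8_t *src, uint16_t *dst, size_t src_size)
{
    /* Assumption: the input is a whole number of 3-byte pixels. */
    assert(src_size % 3 == 0);

    const uint8_t *end = src + src_size;
    while (src < end) {
        /* Each byte is promoted to int here, although the final
           value fits comfortably in 16 bits. */
        const int r = *src++;
        const int g = *src++;
        const int b = *src++;

        /* Pack into 5:6:5 bits: first byte in the top 5 bits,
           second in the middle 6, third in the low 5. */
        *dst++ = (uint16_t)((b >> 3) | ((g & 0xFC) << 3) | ((r & 0xF8) << 8));
    }
}
```

If the vectoriser keeps every intermediate value at 32 bits, it needs several unpack steps per input vector and then has to narrow again for the store. Doing the arithmetic in 16-bit lanes avoids most of that.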
As well as using a more recent compiler, the new results were also generated with -mvectorize-with-neon-quad. Once again it shows a significant improvement over the default.
Richard