Michael mentioned that some users reported seeing better performance from RVCT using arm_neon.h than they did when coding directly in assembler. He suggested we try the same thing for GCC. Here's an experiment using the example that Jim Huang posted to the dev list recently:
https://wiki.linaro.org/RichardSandiford/Sandbox/IntrinsicsPerformance
The summary is that the C version needs to borrow a trick from the assembly code in order to be competitive. If it does that, though, the C code can be faster. I think this is mostly down to scheduling, although I haven't checked in detail yet.
Richard