On 21 April 2011 23:38, Richard Sandiford <richard.sandiford@linaro.org> wrote:
Michael mentioned that some users reported seeing better performance from RVCT using arm_neon.h than they did when coding directly in assembler. He suggested we try the same thing for GCC. Here's an experiment using the example that Jim Huang posted to the dev list recently:
https://wiki.linaro.org/RichardSandiford/Sandbox/IntrinsicsPerformance
Hi Richard,
I appreciate your analysis very much! In fact, that was the practice when I learned ARM NEON.
The summary is that the C version needs to borrow a trick from the assembly code in order to be competitive. If it does that, though, the C code can be faster. I think this is mostly down to scheduling, although I haven't checked in detail yet.
Thanks for the conclusion. Indeed, GCC needs extra hints for NEON iterative modulo scheduling. Do you have any further plans to improve this?
Sincerely, -jserv