On Wed, Sep 15, 2010 at 5:19 AM, Guillaume Letellier <Guillaume.Letellier@arm.com> wrote:
Hi,
Also available is an early release of optimised string routines for the Cortex-A series, including a mix of NEON and Thumb-2 versions of memcpy(), memset(), strcpy(), strcmp(), and strlen(). For more information see: https://launchpad.net/cortex-strings
My understanding is that the NEON optimisation will give some performance gain *ONLY* on Cortex-A8, but it will also burn more energy. On other CPUs, e.g. Cortex-A9, there is no performance gain, but it will still cost more energy.
I've heard that too but never had it confirmed. I will ask. The output of this project will be a set of routines specialised for Thumb-2, NEON, Cortex-A8, and Cortex-A9, where there is a benefit in doing variants for each. We need good non-NEON versions as NEON is optional and it can't be used in the Linux kernel.
The Linaro toolchain doesn't target a specific platform but is generic for ARMv7 platforms. Are you expecting to see those optimisations turned on in the Linaro toolchain?
Sorry, I don't understand the question. We want to spread these routines out and get them integrated into all of the upstream C libraries, including Newlib, Bionic, and glibc.
The NEON-optimised version is beneficial for large copies, but not for short copies when the NEON unit first has to be powered up (the Linux kernel takes an exception to turn it on). I guess your benchmark didn't take that into account. Can the NEON-optimised version be changed so that it is not used for small copies?
My understanding is that the NEON unit is enabled per process, so once it has been turned on it should stay on. I assume the turn-on cost is amortised across a run. Note that if the data is not in the L1 cache then the NEON unit wins even for small-ish (~64 byte) copies.
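To make the small-copy question concrete, here is a rough sketch of a size-threshold dispatch in plain C. It is not code from cortex-strings: the names copy_scalar, copy_vector, copy_dispatch, and COPY_VECTOR_THRESHOLD are hypothetical, both copy routines are portable stand-ins rather than real Thumb-2 or NEON code, and the 64-byte cut-over is only borrowed from the figure mentioned above. A real threshold would have to be measured on each core.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-over point in bytes. The value is an assumption taken
   from the ~64 byte figure in the discussion, not a measured result. */
#define COPY_VECTOR_THRESHOLD 64u

/* Stand-in for an integer-only (e.g. Thumb-2) routine: simple byte loop. */
static void *copy_scalar(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Stand-in for a NEON routine. It just calls memcpy() so that the sketch
   builds and runs on any platform. */
static void *copy_vector(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Returns 1 if a copy of n bytes would take the vector path, else 0. */
static int uses_vector_path(size_t n)
{
    return n >= COPY_VECTOR_THRESHOLD;
}

/* Small copies stay on the scalar path, so they never touch the vector
   unit; larger copies go to the vector routine. */
static void *copy_dispatch(void *dst, const void *src, size_t n)
{
    return uses_vector_path(n) ? copy_vector(dst, src, n)
                               : copy_scalar(dst, src, n);
}
```

With this shape the threshold is the only tuning knob, so a per-core variant could change just that constant.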
-- Michael