On 11 June 2012 21:53, Mans Rullgard <mans.rullgard@linaro.org> wrote:
> On 11 June 2012 02:14, Michael Hope <michael.hope@linaro.org> wrote:
>> We talked at Connect about finishing up the cortex-strings work by upstreaming them into Bionic, Newlib, and GLIBC. I've written up one of our standard 'Output' pages:
>> https://wiki.linaro.org/WorkingGroups/ToolChain/Outputs/CortexStrings
>> with a summary of what we did, what else exists, benchmark results, and next steps. This can be used to justify the routines to the different upstreams.
>> The Android guys are going to upstream these to Bionic. I need a volunteer to do Newlib and GLIBC.
>> One surprise was that the Newlib plain C routines are very good on strings - probably due to a good end-of-string detector.
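For reference, the usual end-of-string detector in plain C routines is the word-at-a-time zero-byte test: read an aligned word and check all four bytes for zero in a couple of ALU operations. A minimal sketch (function names are mine, not Newlib's):

```c
#include <stdint.h>
#include <stddef.h>

/* Returns nonzero iff some byte of x is zero. The ~x term makes the
 * test exact (no false positives on 0x80 bytes). */
static int has_zero_byte(uint32_t x)
{
    return ((x - 0x01010101u) & ~x & 0x80808080u) != 0;
}

/* Word-at-a-time strlen sketch: scan byte-wise until word-aligned,
 * then a word at a time until a zero byte is seen, then finish
 * byte-wise within that word. */
size_t strlen_words(const char *s)
{
    const char *p = s;
    while ((uintptr_t)p & 3) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }
    const uint32_t *w = (const uint32_t *)p;
    while (!has_zero_byte(*w))
        w++;
    p = (const char *)w;
    while (*p)
        p++;
    return (size_t)(p - s);
}
```

The aligned word read may look past the terminator, but an aligned access never crosses a page boundary, which is why these routines can get away with it.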
> Those graphs end at 4k, which is well within even L1 cache.
Yip, that's deliberate. Larger sizes are only relevant for memcpy() and memset(), and past results show little change once you go outside the L1. I also skipped other alignments, as profiling SPEC showed that the blocks were almost always eight-byte aligned.
> How do these functions compare for sizes that hit L2 or external memory? I would expect functions doing some prefetching to perform better there.
The routines don't use explicit preload, as the memory access pattern is obvious and better left to the hardware. Having said that, these were run on an OMAP4460, which has the automatic prefetcher turned off by default. I'll add checking memcpy() on large blocks with and without preloads to my list.
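As a concrete sketch of that experiment, a copy loop with an explicit software prefetch can be compared against the same loop without it. __builtin_prefetch compiles down to PLD on ARM; the 256-byte prefetch distance here is an assumption to be tuned per core, not a measured value:

```c
#include <stddef.h>
#include <string.h>

/* Assumed prefetch distance: a few cache lines ahead of the read
 * pointer. Real code would tune this per core. */
#define PF_DIST 256

/* Copy loop issuing a software prefetch (PLD on ARM) ahead of the
 * reads; the 32-byte inner memcpy lets the compiler pick wide loads
 * and stores. Drop the __builtin_prefetch line for the baseline. */
void *memcpy_pld(void *restrict dst, const void *restrict src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 32) {
        __builtin_prefetch(s + PF_DIST);
        memcpy(d, s, 32);
        d += 32;
        s += 32;
        n -= 32;
    }
    while (n--)
        *d++ = *s++;
    return dst;
}
```

Timing the two variants over block sizes from a few KB up past the L2 size would show whether the explicit PLD helps once the auto prefetcher is out of the picture.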
> Some time ago, I compared a few memcpy() implementations on large blocks, and the Bionic NEON-optimised one was several times faster than glibc. It is of course possible that glibc has improved since then.
A NEON-based memcpy() is twice as fast on the A8 for both a cold L1 and larger blocks, as the NEON unit has a wider path directly into the L2 cache. The same effect doesn't occur on the A9.
-- Michael