Hi Dave. I've been hacking away and have checked in a couple of benchmarking and plotting scripts to lp:cortex-strings. The current results are at: http://people.linaro.org/~michaelh/incoming/strings-performance/
All are done on an A9. The results are very incomplete due to how long things take to run. I'll leave ursa3 doing these over the weekend which should flesh this out for the other routines.
Your new memcpy() is looking good as well - as fast as GLIBC.
-- Michael
On Fri, Sep 2, 2011 at 4:08 PM, Michael Hope michael.hope@linaro.org wrote:
Hi Dave. I've been hacking away and have checked in a couple of benchmarking and plotting scripts to lp:cortex-strings. The current results are at: http://people.linaro.org/~michaelh/incoming/strings-performance/
All are done on an A9. The results are very incomplete due to how long things take to run. I'll leave ursa3 doing these over the weekend which should flesh this out for the other routines.
Right, that's done. The new graphs are up at: http://people.linaro.org/~michaelh/incoming/strings-performance/
The original data is at: http://people.linaro.org/~michaelh/incoming/strings-performance/epic.txt
Here's the relative performance for all routines with eight byte aligned data and 128 byte blocks: http://people.linaro.org/~michaelh/incoming/strings-performance/top-000128.p...
memchr, memcpy, strcpy, and strlen all look good at this block size.
Here's the speed versus block size for eight byte aligned data:
http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-memchr...
http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-memset...
http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strchr...
http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strchr...
http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strcmp...
http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strcpy...
http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strlen...
memchr is good. memset could be better for blocks of less than 1k. strchr gets second place but is eclipsed by newlib's version. strcmp needs work. strcmp is good. strlen is good but the wobbles could use some investigation.
And here's a graph of how the performance changes with initial alignment: http://people.linaro.org/~michaelh/incoming/strings-performance/alignment-me...
-- Michael
On 5 September 2011 04:21, Michael Hope michael.hope@linaro.org wrote:
On Fri, Sep 2, 2011 at 4:08 PM, Michael Hope michael.hope@linaro.org wrote:
Hi Dave. I've been hacking away and have checked in a couple of benchmarking and plotting scripts to lp:cortex-strings. The current results are at: http://people.linaro.org/~michaelh/incoming/strings-performance/
All are done on an A9. The results are very incomplete due to how long things take to run. I'll leave ursa3 doing these over the weekend which should flesh this out for the other routines.
Right, that's done. The new graphs are up at: http://people.linaro.org/~michaelh/incoming/strings-performance/
The original data is at: http://people.linaro.org/~michaelh/incoming/strings-performance/epic.txt
Here's the relative performance for all routines with eight byte aligned data and 128 byte blocks: http://people.linaro.org/~michaelh/incoming/strings-performance/top-000128.p...
memchr, memcpy, strcpy, and strlen all look good at this block size.
Good.
Here's the speed versus block size for eight byte aligned data: http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-memchr...
Nice; odd dip between 8 and 16 chars - I don't switch to the smarter stuff until 16 bytes.
http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-memset...
Hmm yes, the short ones could be a bit faster - I always tended to use log X scales :-) The really small ones I wouldn't worry too much about; the interesting stuff is 32-512, where I'd have expected it to have got its act in gear.
http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strchr... http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strchr...
The version of strchr that's in there is the simple-as-possible strchr; it's byte at a time - I also have a version that uses similar code to memchr that goes fast at large sizes but is slower for small matches:
See: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrchr?act...
I'd made the call that performance at smaller strings was probably more important.
http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strcmp...
Huh? I haven't written a strcmp - that looks like newlib's?
http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strcpy...
Ditto.
http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strlen...
That's very nice - although quite bizarre; even the lower end of the steps is suitably fast, so it's not really anything to worry about; but it would be great to understand where the 1500 cycle difference is going at the large end.
Dave
On Mon, Sep 5, 2011 at 9:32 PM, David Gilbert david.gilbert@linaro.org wrote:
On 5 September 2011 04:21, Michael Hope michael.hope@linaro.org wrote:
On Fri, Sep 2, 2011 at 4:08 PM, Michael Hope michael.hope@linaro.org wrote:
http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strlen...
That's very nice - although quite bizarre; even the lower end of the steps is suitably fast, so it's not really anything to worry about; but it would be great to understand where the 1500 cycle difference is going at the large end.
I've re-run the strlen tests on four different A9 chips which cover four different revisions of the A9 core. See: http://people.linaro.org/~michaelh/incoming/variants-strlen-08.png
I'm afraid I don't know how to turn the /proc/cpuinfo variant and revision into an ARM rxpy. vela is v1:r0. ursa is a v1:r2. leo is a v2:r1. silverbell is a v0:r1. All machines have different clock speeds so I've normalised the graphs at their 64 byte loop performance.
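The normalisation mentioned above could be sketched roughly as follows. This is only an illustration: it assumes each machine's results are held as a mapping from block size to a measured rate, and the function name, data layout, and sample numbers are all made up rather than taken from the cortex-strings scripts or from epic.txt.

```python
def normalise_at(results, reference_size=64):
    """Scale one machine's results so the value at reference_size is 1.0.

    results maps block size in bytes to a measured rate. This layout is
    an assumption for the sketch, not the format of the real data.
    """
    reference = results[reference_size]
    if reference == 0:
        raise ValueError("reference measurement is zero")
    return {size: value / reference for size, value in sorted(results.items())}


# Made-up numbers for two hypothetical boards with different clock speeds.
fast_board = {64: 400.0, 256: 800.0, 1024: 1000.0}
slow_board = {64: 200.0, 256: 400.0, 1024: 300.0}

for name, results in (("fast_board", fast_board), ("slow_board", slow_board)):
    print(name, normalise_at(results))
```

After scaling, curves from machines with different clock speeds share a common point at 64 bytes, so differences in shape at larger sizes are easier to compare.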
The two v1 devices have the funny response past 256 bytes. The v0 and v2 devices don't. It could be something to do with the variant or the memory subsystems.
The good news is that the new strlen() is much faster than the alternatives so we're not blocked.
-- Michael
On 8 September 2011 05:35, Michael Hope michael.hope@linaro.org wrote:
On Mon, Sep 5, 2011 at 9:32 PM, David Gilbert david.gilbert@linaro.org wrote:
On 5 September 2011 04:21, Michael Hope michael.hope@linaro.org wrote:
On Fri, Sep 2, 2011 at 4:08 PM, Michael Hope michael.hope@linaro.org wrote:
http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strlen...
That's very nice - although quite bizarre; even the lower end of the steps is suitably fast, so it's not really anything to worry about; but it would be great to understand where the 1500 cycle difference is going at the large end.
I've re-run the strlen tests on four different A9 chips which cover four different revisions of the A9 core. See: http://people.linaro.org/~michaelh/incoming/variants-strlen-08.png
I'm afraid I don't know how to turn the /proc/cpuinfo variant and revision into an ARM rxpy. vela is v1:r0. ursa is a v1:r2. leo is a v2:r1. silverbell is a v0:r1.
The way I've interpreted this is to do
s/r/p and s/v/r in the v[0-9]:r[0-9] strings above.
Someone from the kernel team can correct me if I am wrong.
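Applied to the four machines listed above, that substitution could be sketched as below. This follows the interpretation given here (variant becomes the r number, revision becomes the p number) and has not been confirmed by the kernel team; the helper name and the "vN:rM" shorthand parsing are assumptions for illustration only.

```python
import re

# Matches the "vN:rM" shorthand used in this thread (variant:revision).
_PAIR = re.compile(r"v(\d+):r(\d+)")


def to_rxpy(pair):
    """Convert a 'vN:rM' variant/revision pair into the 'rNpM' form.

    Equivalent to s/r/p followed by s/v/r, then dropping the colon.
    """
    match = _PAIR.fullmatch(pair.strip())
    if match is None:
        raise ValueError("expected something like 'v1:r2', got %r" % pair)
    variant, revision = match.groups()
    return "r%sp%s" % (variant, revision)


machines = {"vela": "v1:r0", "ursa": "v1:r2", "leo": "v2:r1", "silverbell": "v0:r1"}
for name, pair in machines.items():
    print(name, to_rxpy(pair))
```

Under this reading the two machines with the odd response past 256 bytes would be r1p0 and r1p2, and the two without it would be r2p1 and r0p1.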
Ramana
On Mon, Sep 05, 2011 at 03:21:49PM +1200, Michael Hope wrote:
memchr is good. memset could be better for blocks of less than 1k. strchr gets second place but is eclipsed by newlib's version. strcmp needs work. strcmp is good.
It's strcpy which is good in this last sentence, though it basically matches newlib's version.
I'm curious about the "political" side of cortex-strings -- is there active interest by the library maintainers in picking up our versions?
On 5 September 2011 17:40, Christian Robottom Reis kiko@linaro.org wrote:
On Mon, Sep 05, 2011 at 03:21:49PM +1200, Michael Hope wrote:
memchr is good. memset could be better for blocks of less than 1k. strchr gets second place but is eclipsed by newlib's version. strcmp needs work. strcmp is good.
It's strcpy which is good in this last sentence, though it basically matches newlib's version.
I think that's because it IS newlib's version - I've not done a strcpy.
I'm curious about the "political" side of cortex-strings -- is there active interest by the library maintainers in picking up our versions?
There is interest from partners in having optimised versions; I think the library maintainers are happy to take them if you can convince them that they are improvements.
Dave