Re: Memcpy and memset - linaro-toolchain

6 Sep 2011

On Fri, Sep 2, 2011 at 4:51 AM, David Gilbert david.gilbert@linaro.org wrote:
...
Hi Michael,
I've just committed a pair of memcpy's into src/linaro-a9 - memcpy.S
that is armv7 and memcpy-hybrid.S that is a Neon hybrid which uses neon
for non-aligned cases and for large (128K or larger) copies.  I've also
(accidentally) wired the memcpy-hybrid one into the Makefile.am (I
wasn't sure what the right way to do this was - the neon_sources seemed
a good place for it, but there is nothing currently in there that turns
off the non-neon version).
I'd be interested in seeing the results for both; I've got a bit of a
soft spot for the hybrid solution.
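
For readers who want the selection rule spelled out, here is a minimal C
sketch of the choice the hybrid is described as making: Neon for
non-aligned cases and for copies of 128K or larger, the plain ARM path
otherwise. The enum and function names are hypothetical, and reading
"non-aligned" as "source or destination not 4-byte aligned" is an
assumption; memcpy-hybrid.S itself may test alignment differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; not taken from memcpy-hybrid.S. */
enum copy_path { COPY_PATH_ARM, COPY_PATH_NEON };

#define LARGE_COPY_BYTES (128u * 1024u)   /* "128K or larger" */

static enum copy_path choose_copy_path(const void *dst, const void *src,
                                       size_t n)
{
    /* A copy needs valid pointers; catch misuse of the sketch early. */
    assert(dst != NULL && src != NULL);

    /* Assumption: "non-aligned" means either pointer is not word aligned. */
    uintptr_t misaligned = ((uintptr_t)dst | (uintptr_t)src) & 3u;

    if (n >= LARGE_COPY_BYTES || misaligned != 0)
        return COPY_PATH_NEON;
    return COPY_PATH_ARM;
}
```

The function only picks a path; it does not perform the copy.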
On the memset, yes the 'and' that you added is fine - but I started
having a play and have some performance results (on -t 128) that I
don't really understand:

    and  r1,#0xff
    orr  r1,r1,r1,lsl#8
    orr  r1,r1,r1,lsl#16

That's your solution - and fastest at somewhere around 2270MB/s for me -
by the TRM I reckon that should be 3 cycles.

    lsls r1,#24
    orr  r1,r1,r1,lsr#8
    orr  r1,r1,r1,lsr#16

lsl isn't explicitly listed in the TRM, so I assumed it was the same as
a move with a constant shift, which by my reading is a single cycle; and
the lsls is 2 bytes - so you would think that should be as fast as yours
but 2 bytes smaller - except it's reliably down at 2228MB/s - so it is
slower.

Thinking it was an alignment issue I tried adding a mov r5,r1 to the
front of that, and got 2248MB/s - so being faster with an extra
instruction it probably was an alignment issue?

I also tried a pair of bfi's:

    bfi  r1,r1,#8,#8
    bfi  r1,r1,#16,#16

That came out at 2228MB/s - and is 4 cycles by the book.
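
As a cross-check that the three sequences above (and/orr, lsls/orr, and
the bfi pair) all replicate the low byte of r1 across the word, here is
a small C model of each. It is only a sketch of what the instructions
compute, not of their timing; the function names are made up for
illustration, and bfi() is a hand-written stand-in for the instruction,
valid for widths below 32.

```c
#include <assert.h>
#include <stdint.h>

/* and r1,#0xff ; orr r1,r1,r1,lsl#8 ; orr r1,r1,r1,lsl#16 */
static uint32_t splat_and(uint32_t r1)
{
    r1 &= 0xff;
    r1 |= r1 << 8;
    r1 |= r1 << 16;
    return r1;
}

/* lsls r1,#24 ; orr r1,r1,r1,lsr#8 ; orr r1,r1,r1,lsr#16 */
static uint32_t splat_lsl(uint32_t r1)
{
    r1 <<= 24;
    r1 |= r1 >> 8;
    r1 |= r1 >> 16;
    return r1;
}

/* Model of BFI: copy the low `width` bits of rn into rd at bit `lsb`. */
static uint32_t bfi(uint32_t rd, uint32_t rn, unsigned lsb, unsigned width)
{
    uint32_t mask = ((1u << width) - 1u) << lsb;   /* width < 32 only */
    return (rd & ~mask) | ((rn << lsb) & mask);
}

/* bfi r1,r1,#8,#8 ; bfi r1,r1,#16,#16 */
static uint32_t splat_bfi(uint32_t r1)
{
    r1 = bfi(r1, r1, 8, 8);
    r1 = bfi(r1, r1, 16, 16);
    return r1;
}

/* All three must agree for every byte, even with junk in the upper bits. */
static void check_splats(void)
{
    for (uint32_t b = 0; b < 256; b++) {
        uint32_t in = 0xAABBCC00u | b;
        uint32_t want = b * 0x01010101u;
        assert(splat_and(in) == want);
        assert(splat_lsl(in) == want);
        assert(splat_bfi(in) == want);
    }
}
```

Calling check_splats() runs the comparison over all 256 byte values.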
Unfortunately you can't tell the performance from the latency.
Attached is a micro benchmark that has the three different versions
(and, ubx, lsl).  After compensating for the loop time, I got:
* lsl: 1.006 s
* ubx: 0.876 s
* and: 0.918 s
even though ubx has a latency of two cycles.
I then took the AND version and shifted it to the start of the file.
This small change in alignment pushed it up to 1.048 s, which is 14%
slower.
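
For anyone without the attachment, the "compensating for the loop time"
step and the 14% figure can be sketched in C roughly as below. This is
not the attached benchmark: the names are hypothetical, the kernel is a
C rendering of the 'and' variant rather than hand-placed assembly, and
clock() is used only because it is standard C.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t (*kernel_fn)(uint32_t);

/* Baseline: same call and loop cost, no byte-replication work. */
static uint32_t kernel_empty(uint32_t x) { return x; }

/* C rendering of the 'and' variant; stands in for any kernel under test. */
static uint32_t kernel_and(uint32_t x)
{
    x &= 0xff;
    x |= x << 8;
    x |= x << 16;
    return x;
}

/* CPU seconds spent calling fn `iters` times. */
static double time_loop(kernel_fn fn, long iters)
{
    volatile uint32_t sink = 0;   /* keeps the stores from being dropped */
    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        sink = fn((uint32_t)i);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Kernel time minus loop overhead; noise can make this slightly negative. */
static double compensated_seconds(kernel_fn fn, long iters)
{
    return time_loop(fn, iters) - time_loop(kernel_empty, iters);
}

/* How much slower `slow` is than `fast`, in percent. */
static double percent_slower(double slow, double fast)
{
    assert(fast > 0.0);
    return (slow / fast - 1.0) * 100.0;
}
```

With the figures above, percent_slower(1.048, 0.918) comes out at
roughly 14, matching the quoted slowdown. A C harness like this will not
reproduce the alignment effect itself, since the compiler decides where
the code lands.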
-- Michael