On Fri, Sep 2, 2011 at 4:51 AM, David Gilbert david.gilbert@linaro.org wrote:
Hi Michael, I've just committed a pair of memcpy's into src/linaro-a9 - memcpy.S that is armv7 and memcpy-hybrid.S that is a Neon hybrid which uses neon for non-aligned cases and for large (128K or larger) copies. I've also (accidentally) wired the memcpy-hybrid one into the Makefile.am (I wasn't sure what the right way to do this was - the neon_sources seemed a good place for it, but there is nothing currently in there that turns off the non-neon version).
I'd be interested in seeing the results for both; I've got a bit of a soft spot for the hybrid solution.
On the memset, yes the 'and' that you added is fine - but I started having a play and have some performance results (on -t 128) that I don't really understand:
- and r1,#0xff
  orr r1,r1,r1,lsl#8
  orr r1,r1,r1,lsl#16
That's your solution - and fastest at somewhere around 2270MB/s for me - by the TRM I reckon that should be 3 cycles.
- lsls r1,#24
  orr r1,r1,r1,lsr#8
  orr r1,r1,r1,lsr#16
lsl isn't explicitly listed in the TRM, so I assumed it was the same as a move with a constant shift, which by my reading is a single cycle; and the lsls is 2 bytes - so you would think it should be as fast as yours but 2 bytes smaller. Except it's reliably down at 2228MB/s - so it is slower.
- Thinking it was an alignment issue, I tried adding a mov r5,r1 to the front of that, and got 2248MB/s - so being faster with an extra instruction, it probably was an alignment issue?
- I also tried a pair of bfi's:
  bfi r1,r1, #8, #8
  bfi r1,r1, #16, #16
That came out at 2228MB/s - and is 4 cycles by the book.
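For reference, the three sequences above should all leave the same value in r1: the low byte of the fill character copied into all four byte lanes. The C below is only a rough model of that, not the code under test. The function names and the bfi() helper are made up for illustration, and it says nothing about cycle counts or alignment.

```c
#include <stdint.h>

/* Hypothetical model of ARM "bfi dst, src, #lsb, #width" for width < 32:
   copy the low `width` bits of src into dst starting at bit `lsb`. */
static uint32_t bfi(uint32_t dst, uint32_t src, unsigned lsb, unsigned width)
{
    uint32_t mask = ((1u << width) - 1u) << lsb;
    return (dst & ~mask) | ((src << lsb) & mask);
}

/* and r1,#0xff ; orr r1,r1,r1,lsl#8 ; orr r1,r1,r1,lsl#16 */
static uint32_t splat_and(uint32_t r1)
{
    r1 &= 0xffu;
    r1 |= r1 << 8;
    r1 |= r1 << 16;
    return r1;
}

/* lsls r1,#24 ; orr r1,r1,r1,lsr#8 ; orr r1,r1,r1,lsr#16 */
static uint32_t splat_lsl(uint32_t r1)
{
    r1 <<= 24;          /* the shift discards the upper three bytes */
    r1 |= r1 >> 8;
    r1 |= r1 >> 16;
    return r1;
}

/* bfi r1,r1,#8,#8 ; bfi r1,r1,#16,#16 */
static uint32_t splat_bfi(uint32_t r1)
{
    r1 = bfi(r1, r1, 8, 8);    /* low byte copied into bits 8..15 */
    r1 = bfi(r1, r1, 16, 16);  /* low halfword copied into bits 16..31 */
    return r1;
}
```

With an input such as 0x123456AB, each function returns 0xABABABAB. The lsl and bfi forms need no separate mask because they overwrite or shift out the upper bits as they go.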
Unfortunately you can't tell the performance from the latency. Attached is a micro benchmark that has the three different versions (and, ubx, lsl). After compensating for the loop time, I got:
* lsl: 1.006 s
* ubx: 0.876 s
* and: 0.918 s
even though ubx has a latency of two cycles.
I then took the AND version and shifted it to the start of the file. This small change in alignment pushed it up to 1.048 s which is 14 % slower.
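The "compensating for the loop time" step can be sketched roughly as below. This is not the attached benchmark: every name here is hypothetical, it uses the standard clock() rather than whatever timer the attachment uses, and plain C cannot control code alignment, which is the effect that matters above. It only shows the subtraction of a baseline loop and how the 14 % figure falls out of the two timings.

```c
#include <stdint.h>
#include <time.h>

/* CPU seconds used so far, via the standard clock(). */
static double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Time `iters` calls of fn; the volatile sink keeps the results live. */
static double time_loop(uint32_t (*fn)(uint32_t), unsigned long iters)
{
    volatile uint32_t sink = 0;
    double start = cpu_seconds();
    for (unsigned long i = 0; i < iters; i++)
        sink = fn((uint32_t)i);
    (void)sink;
    return cpu_seconds() - start;
}

/* Baseline body: same call overhead, no splat work. */
static uint32_t baseline(uint32_t c)
{
    return c;
}

/* C stand-in for the and/orr/orr variant (assumption: the real test is asm). */
static uint32_t splat_and(uint32_t c)
{
    c &= 0xffu;
    c |= c << 8;
    c |= c << 16;
    return c;
}

/* Loop time with the baseline loop's time subtracted out. */
static double compensated_seconds(uint32_t (*fn)(uint32_t), unsigned long iters)
{
    return time_loop(fn, iters) - time_loop(baseline, iters);
}

/* Relative slowdown of `after` against `before`, in percent. */
static double slowdown_percent(double before, double after)
{
    return (after / before - 1.0) * 100.0;
}
```

Calling slowdown_percent(0.918, 1.048) gives a little over 14, matching the figure above. The compensated time for short runs can come out as noise, or even slightly negative, so it needs a large iteration count to mean anything.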
-- Michael
linaro-toolchain@lists.linaro.org