Hi Dave. I've been hacking away and have checked in a couple of
benchmarking and plotting scripts to lp:cortex-strings. The current
results are at:
http://people.linaro.org/~michaelh/incoming/strings-performance/
All are done on an A9. The results are very incomplete due to how
long things take to run. I'll leave ursa3 doing these over the
weekend which should flesh this out for the other routines.
Your new memcpy() is looking good as well - as fast as GLIBC.
-- Michael
While out benchmarking today, I ran across code similar to this:
int *a;
int *b;
int *c;
const int ad[320];
const int bd[320];
const int cd[320];
void fill()
{
  for (int i = 0; i < 320; i++)
    {
      a[i] = ad[i];
      b[i] = bd[i];
      c[i] = cd[i];
    }
}
I was surprised and happy to see the vectoriser kick in for the copy.
The inner loop looks like:
        add     r5, r3, ip
        adds    r4, r3, r7
        vldmia  r2!, {d16-d17}
        vldmia  r1!, {d18-d19}
        adds    r0, r3, r6
        vst1.32 {q9}, [r5]
        vst1.32 {q8}, [r4]
        vldmia  r3, {d16-d17}
        adds    r3, r3, #16
        cmp     r3, r8
        vst1.32 {q8}, [r0]
        bne     .L3
so r3 is the loop variable and {ip,r7,r6} are the offsets from r3 to the
destination pointers. Adding a __restrict doesn't change the code.
Richard, will your auto-inc/dec changes combine the final vldmia r3 and
adds r3 into a single vldmia r3! ?
Changing the int *a into in-file arrays like int a[320] gives:
        vldmia  r0!, {d16-d17}
        vldmia  r5!, {d18-d19}
        vstmia  r4!, {d18-d19}
        vstmia  r1!, {d16-d17}
        vldmia  r2!, {d16-d17}
        vstmia  r3!, {d16-d17}
        cmp     r3, r6
        bne     .L2
Marking them as extern int a[320] goes back to the first form.
Can we always use the second form? What optimisation is preventing it?
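
For reference, the forms being compared can be written side by side. This
is only a sketch: the function names, the N constant, the a2/b2/c2 arrays
and the sample initialisers are made up so that the file builds on its own,
and the placement of restrict is an assumption, since the mail does not say
where __restrict was added.

```c
#include <assert.h>

enum { N = 320 };

/* Sources. The original snippet leaves these uninitialised; the values
   here are made up so that the copies can be checked. */
const int ad[N] = {1, 2, 3};
const int bd[N] = {4, 5, 6};
const int cd[N] = {7, 8, 9};

/* Form 1: destinations reached through global pointers. */
int *a, *b, *c;

void fill_ptr(void)
{
  for (int i = 0; i < N; i++)
    {
      a[i] = ad[i];
      b[i] = bd[i];
      c[i] = cd[i];
    }
}

/* Form 1 with restrict-qualified destinations (assumed placement). */
void fill_restrict(int *restrict pa, int *restrict pb, int *restrict pc)
{
  for (int i = 0; i < N; i++)
    {
      pa[i] = ad[i];
      pb[i] = bd[i];
      pc[i] = cd[i];
    }
}

/* Form 2: destinations are arrays defined in this file. */
int a2[N], b2[N], c2[N];

void fill_array(void)
{
  for (int i = 0; i < N; i++)
    {
      a2[i] = ad[i];
      b2[i] = bd[i];
      c2[i] = cd[i];
    }
}

/* Runs all three forms and checks that they copy the same data. */
void fill_demo(void)
{
  static int x[N], y[N], z[N];
  static int p[N], q[N], r[N];

  a = x;
  b = y;
  c = z;
  fill_ptr();
  fill_array();
  fill_restrict(p, q, r);

  for (int i = 0; i < N; i++)
    {
      assert(x[i] == a2[i] && y[i] == b2[i] && z[i] == c2[i]);
      assert(p[i] == ad[i] && q[i] == bd[i] && r[i] == cd[i]);
    }
}
```

Building each function with the same options (for instance GCC's -O3
-mfpu=neon -S; the options used for the listings above are not stated) and
comparing the inner loops is one way to repeat the comparison.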
-- Michael
On Fri, Sep 2, 2011 at 4:51 AM, David Gilbert <david.gilbert(a)linaro.org> wrote:
> Hi Michael,
> I've just committed a pair of memcpy's into src/linaro-a9 - memcpy.S that
> is armv7 and memcpy-hybrid.S that is a Neon hybrid which uses neon for
> non-aligned cases and for large (128K or larger) copies. I've also
> (accidentally) wired the memcpy-hybrid one into the Makefile.am (I wasn't
> sure what the right way to do this was - the neon_sources seemed a good
> place for it, but there is nothing currently in there that turns off the
> non-neon version).
>
> I'd be interested in seeing the results for both; I've got a bit of a soft
> spot for the hybrid solution.
>
> On the memset, yes the 'and' that you added is fine - but I started having
> a play and have some performance results (on -t 128) that I don't really
> understand:
>
> 1) and r1,#0xff
>    orr r1,r1,r1,lsl#8
>    orr r1,r1,r1,lsl#16
>
> That's your solution - and fastest at somewhere around 2270MB/s for me - by
> the TRM I reckon that should be 3 cycles.
>
> 2) lsls r1,#24
>    orr r1,r1,r1,lsr#8
>    orr r1,r1,r1,lsr#16
>
> lsl isn't explicitly listed in the TRM, so I assumed that was the same as a
> move with a constant shift, which my reading is that it's a single cycle;
> and the lsls is 2 bytes - so you would think that should be as fast as
> yours but 2 bytes smaller - except it's reliably down at 2228MB/s - so it
> is slower.
>
> 3) Thinking it was an alignment issue I tried adding a mov r5,r1 to the
> front of that, and got 2248MB/s - so being faster with an extra instruction
> it probably was an alignment issue?
>
> 4) I also tried a pair of bfi's:
>    bfi r1,r1, #8, #8
>    bfi r1,r1, #16, #16
>
> That came out at 2228MB/s - and is 4 cycles by the book.
Unfortunately you can't tell the performance from the latency.
Attached is a micro benchmark that has the three different versions
(and, ubx, lsl). After compensating for the loop time, I got:
* lsl: 1.006 s
* ubx: 0.876 s
* and: 0.918 s
even though ubx has a latency of two cycles.
I then took the AND version and shifted it to the start of the file.
This small change in alignment pushed it up to 1.048 s which is 14 %
slower.
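
For reference, the and, lsl and bfi sequences quoted above all do the same
job: they replicate the low byte of r1 into all four bytes of the word. A
C rendering of each is sketched below. The function names are made up for
illustration, and a compiler is free to emit different instructions for
them, so this shows what is computed rather than what is timed.

```c
#include <assert.h>
#include <stdint.h>

/* 1) and r1,#0xff / orr r1,r1,r1,lsl#8 / orr r1,r1,r1,lsl#16 */
uint32_t rep_and(uint32_t c)
{
  c &= 0xff;
  c |= c << 8;
  c |= c << 16;
  return c;
}

/* 2) lsls r1,#24 / orr r1,r1,r1,lsr#8 / orr r1,r1,r1,lsr#16 */
uint32_t rep_lsl(uint32_t c)
{
  c <<= 24;
  c |= c >> 8;
  c |= c >> 16;
  return c;
}

/* 4) bfi r1,r1,#8,#8 / bfi r1,r1,#16,#16 */
uint32_t rep_bfi(uint32_t c)
{
  /* bits 8..15 <- bits 0..7 */
  c = (c & ~UINT32_C(0xff00)) | ((c & 0xff) << 8);
  /* bits 16..31 <- bits 0..15 */
  c = (c & UINT32_C(0xffff)) | (c << 16);
  return c;
}

/* Checks that the three forms agree, including for inputs that carry
   junk above bit 7. */
void check_replication(void)
{
  for (uint32_t v = 0; v < 0x400; v++)
    {
      uint32_t want = (v & 0xff) * UINT32_C(0x01010101);
      assert(rep_and(v) == want);
      assert(rep_lsl(v) == want);
      assert(rep_bfi(v) == want);
    }
}
```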
-- Michael
* Linaro GCC
Fixed up, committed and posted two bug fixes to my thumb2 constants
patches, found by other people running FSF trunk.
Analysed bug lp:836401 / pr50193, developed a fix, and posted it both
upstream and to launchpad for testing. The launchpad tests have come
back clean, and the patch is approved, but upstream have not approved it
yet.
Posted a query to the linaro-dev mailing list asking for ARM CPU ID register
numbers, and got lots back. Entered these into the patch, and began some
test builds. I will post the new version upstream, if all's well, next week.
Started looking at an optimization I discussed with Richard Earnshaw in
Cambourne, in which GCC attempts to synthesize constants more
efficiently by reusing constant values already in registers. I've made a
start, but not much more to say just yet.
* Other
- Public holiday Monday.
- Half day leave on Wednesday.
- Internal training session
Continue looking at Richard's micro benchmarks w.r.t SMS.
Examining Ayal's comments to the patch to support instructions with
REG_INC_NOTE in SMS.
(http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01216.html)
Took one day off yesterday (4/9)
Do we know anything about "Csmith"?
Maybe we should try it?
Andrew
-------- Original Message --------
Subject: Re: [PATCH][ARM] pr50193: ICE on a | (b << negative-constant)
Date: Thu, 1 Sep 2011 13:21:38 +0000 (UTC)
From: Joseph S. Myers <joseph(a)codesourcery.com>
To: Andrew Stubbs <ams(a)codesourcery.com>
CC: gcc-patches(a)gcc.gnu.org, patches(a)linaro.org
Newsgroups: gmane.comp.gcc.patches
References: <4E5F6B5F.2020207(a)codesourcery.com>
On Thu, 1 Sep 2011, Andrew Stubbs wrote:
> This patch fixes the problem by merely checking that the constant is positive.
> I've confirmed that values larger than the mode-size are not a problem because
> the compiler optimizes those away earlier, even at -O0.
Do you mean that you have observed for some testcases that they get
optimized away - or do you have reasons (if so, please state them) to
believe that any possible path through the compiler that would result in a
larger constant here (possibly as a result of constant propagation and
other optimizations) will always result in it being optimized away as
well? If it's just observation it would be better to put the complete
check in here.
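
The "complete check" asked for here would reject constants at both ends of
the range, not only negative ones. A minimal sketch of that idea in plain
C follows. It is not taken from the patch or from GCC: the helper name and
its parameters are hypothetical, and it assumes the usable shift amounts
are 0 up to one less than the operand width.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not GCC source: true if a constant shift amount
   is usable for an operand that is mode_bits wide. Assumes the valid
   range is 0 .. mode_bits - 1. */
bool shift_amount_ok(long amount, unsigned mode_bits)
{
  return amount >= 0 && (unsigned long) amount < mode_bits;
}

/* A positive-only test accepts 32 and 40 for a 32-bit operand; the
   complete test rejects them along with negative values. */
void check_shift_amount(void)
{
  assert(shift_amount_ok(0, 32));
  assert(shift_amount_ok(31, 32));
  assert(!shift_amount_ok(-1, 32));
  assert(!shift_amount_ok(32, 32));
  assert(!shift_amount_ok(40, 32));
}
```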
Quite a few of the Csmith-generated bug reports from John Regehr have
involved constants appearing in unexpected places as a result of
transformations in the compiler. It would probably be a good idea for
someone to try using Csmith to find ARM compiler bugs (both ICEs and
wrong-code); pretty much all the bugs reported have been testing on x86
and x86_64, so it's likely there are quite a few bugs in the ARM back end
that could be found that way.
--
Joseph S. Myers
joseph(a)codesourcery.com
Hi,
libunwind:
* improvements in case the user doesn't use ARM unwind tables but DWARF info
* code used to pick ARM unwind from the crt files which says "cantunwind"
android:
* upgraded working base to 11.08
* continue to port libunwind to android
* noticed a header file clash that causes errors
* finished an android app that uses a native part to crash the process
* as a vehicle to test the modified debuggerd
Regards
Ken
== GDB ==
* Russell King now wants to revert my kernel patch that
fixed #615974; discussed alternative options.
== GCC ==
* Patch review week.
* Analyzed root cause of ICE when building Linux kernel
with mainline GCC (reported by Arnd).
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Chairman of the Supervisory Board: Martin Jetter | Management: Dirk
Wittkopp
Registered office: Böblingen | Court of registration: Amtsgericht
Stuttgart, HRB 243294
== QEmu ==
* Sent 64bit atomic helper fix upstream
* Basic boot time and simple benchmarks vs Panda board
* Tested prebuilt images and Peter's latest post-merge QEmu tree
- The full Ubuntu desktop on an emulated Overo is a bit slow -
it's rather short on RAM
- The full Ubuntu desktop on an emulated VExpress isn't bad; it's
got the full 1G (with a particularly grim line of awk to mount
vexpress images, based on Peter's suggestion of using 'file')
== String routines ==
* Pushed memcpy and memset up to cortex-strings bzr
* Working through memset issue with Michael
- Made my code a little less sensitive to initial alignment
== Hard float ==
* Testing libffi 3.0.11rc1 - still hasn't got the variadic patch in, but
hoping it will land later in the cycle.
== Other ==
* Excavating inbox after week off.
* Built LMbench and kicked a run off on Panda. (Got stuck in some
heuristics under emulation)
Dave