Just as an FYI, I've added these loops to the libav microbenchmarks:
avg-h264-chroma-mc8-8.txt
avg-pixels8-8.txt
ff-h264-idct-add-8-8.txt
ff-put-pixels8x16-8.txt
h264-loop-filter-luma-8.txt
idct-internal-8.txt
put-h264-chroma-mc8-8.txt
put-h264-qpel8-h-lowpass-8.txt
put-h264-qpel8-hv-lowpass-8.txt
put-h264-qpel8-v-lowpass-8.txt
based on Michael's h264 profile. These loops:
decode_residual
ff_h264_decode_mb_cavlc
fill_decode_caches
aren't really the kind of thing that the microbenchmark is designed for;
running the whole h264 benchmark is probably a better test. Some of the
functions in the profile just consist of two copies of a simpler loop,
one after the other, so for those I just used the simpler loop.
Usual microbenchmark caveats apply.
Richard
Hi,
* merged vector over-promotion patch to linaro-gcc-4.6
* committed upstream the change of the default vector size for NEON
* continued working on widening shifts
Ira
Hi Dave. I've been hacking away and have checked in a couple of
benchmarking and plotting scripts to lp:cortex-strings. The current
results are at:
http://people.linaro.org/~michaelh/incoming/strings-performance/
All are done on an A9. The results are very incomplete due to how
long things take to run. I'll leave ursa3 running these over the
weekend, which should flesh out the results for the other routines.
Your new memcpy() is looking good as well - as fast as GLIBC.
-- Michael
While out benchmarking today, I ran across code similar to this:
int *a;
int *b;
int *c;
const int ad[320];
const int bd[320];
const int cd[320];
void fill()
{
    for (int i = 0; i < 320; i++)
    {
        a[i] = ad[i];
        b[i] = bd[i];
        c[i] = cd[i];
    }
}
I was surprised and happy to see the vectoriser kick in for the copy.
The inner loop looks like:
add r5, r3, ip
adds r4, r3, r7
vldmia r2!, {d16-d17}
vldmia r1!, {d18-d19}
adds r0, r3, r6
vst1.32 {q9}, [r5]
vst1.32 {q8}, [r4]
vldmia r3, {d16-d17}
adds r3, r3, #16
cmp r3, r8
vst1.32 {q8}, [r0]
bne .L3
so r3 is the loop variable and {ip, r7, r6} are the offsets from r3 to
the destination pointers. Adding a __restrict doesn't change the code.
Richard, will your auto-inc/dec changes combine the final vldmia r3 /
add r3 pair into a single vldmia r3! form?
Changing the int *a into in-file arrays like int a[320] gives:
vldmia r0!, {d16-d17}
vldmia r5!, {d18-d19}
vstmia r4!, {d18-d19}
vstmia r1!, {d16-d17}
vldmia r2!, {d16-d17}
vstmia r3!, {d16-d17}
cmp r3, r6
bne .L2
Marking them as extern int a[320] goes back to the first form.
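To make the variants concrete, here is a reconstruction of the three
declaration forms being compared (the names are illustrative, not from
the original code):

int *p;                /* original form: a pointer, target unknown    */
int arr[320];          /* in-file array: base and alignment known     */
extern int earr[320];  /* extern array: defined in another file       */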
Can we always use the second form? What optimisation is preventing it?
-- Michael
On Fri, Sep 2, 2011 at 4:51 AM, David Gilbert <david.gilbert@linaro.org> wrote:
> Hi Michael,
> I've just committed a pair of memcpy's into src/linaro-a9: memcpy.S,
> which is armv7, and memcpy-hybrid.S, which is a Neon hybrid that uses
> Neon for non-aligned cases and for large (128K or larger) copies.
> I've also (accidentally) wired the memcpy-hybrid one into the
> Makefile.am (I wasn't sure what the right way to do this was - the
> neon_sources seemed a good place for it, but there is nothing
> currently in there that turns off the non-neon version).
>
> I'd be interested in seeing the results for both; I've got a bit of
> a soft spot for the hybrid solution.
>
> On the memset, yes, the 'and' that you added is fine - but I started
> having a play and have some performance results (on -t 128) that I
> don't really understand:
>
>
> 1) and r1,#0xff
> orr r1,r1,r1,lsl#8
> orr r1,r1,r1,lsl#16
>
> That's your solution - and the fastest, at somewhere around 2270MB/s
> for me. By the TRM I reckon that should be 3 cycles.
>
> 2) lsls r1,#24
> orr r1,r1,r1,lsr#8
> orr r1,r1,r1,lsr#16
>
> lsl isn't explicitly listed in the TRM, so I assumed it is the same
> as a move with a constant shift, which my reading says is a single
> cycle; and the lsls encoding is 2 bytes. So you would think this
> version should be as fast as yours but 2 bytes smaller - except it's
> reliably down at 2228MB/s, so it is slower.
>
> 3) Thinking it was an alignment issue, I tried adding a mov r5,r1 to
> the front of that, and got 2248MB/s - so, being faster with an extra
> instruction, it probably was an alignment issue?
>
> 4) I also tried a pair of bfi's:
> bfi r1,r1, #8, #8
> bfi r1,r1, #16, #16
>
> That came out at 2228MB/s - and is 4 cycles by the book.
Unfortunately you can't tell the performance from the latency.
Attached is a microbenchmark that has the three different versions
(and, ubx, lsl). After compensating for the loop time, I got:
* lsl: 1.006 s
* ubx: 0.876 s
* and: 0.918 s
even though ubx has a latency of two cycles.
I then took the AND version and shifted it to the start of the file.
This small change in alignment pushed it up to 1.048 s, which is 14%
slower.
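For reference, all three sequences compute the same byte-to-word
splat. A minimal C sketch of the operation (the function is my
reconstruction, not the attached benchmark):

#include <stdint.h>

/* Replicate the low byte of c across all four bytes of a 32-bit word,
   mirroring the and/orr/orr sequence above. */
static uint32_t splat_byte(uint32_t c)
{
    c &= 0xff;       /* and r1, #0xff           */
    c |= c << 8;     /* orr r1, r1, r1, lsl #8  */
    c |= c << 16;    /* orr r1, r1, r1, lsl #16 */
    return c;
}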
-- Michael
* Linaro GCC
Fixed up, committed and posted two bug fixes to my thumb2 constants
patches, found by other people running FSF trunk.
Analysed bug lp:836401 / pr50193, developed a fix, and posted it both
upstream and to Launchpad for testing. The Launchpad tests have come
back clean, and the patch is approved, but upstream has not approved
it yet.
Posted a query to the linaro-dev mailing list asking for ARM CPU ID
register numbers, and got lots back. Entered these into the patch, and
began some test builds. I will post the new version upstream, if all's
well, next week.
Started looking at an optimization I discussed with Richard Earnshaw in
Cambourne, in which GCC attempts to synthesize constants more
efficiently by reusing constant values already in registers. I've made a
start, but not much more to say just yet.
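As a hypothetical illustration of the idea (the function name and
constants are mine, not from the report): when one constant is already
live in a register, a nearby constant can be derived from it instead
of being materialised from scratch.

void store_pair(unsigned int *p)
{
    p[0] = 0x12345678u;  /* on its own, needs a movw/movt pair       */
    p[1] = 0x12345679u;  /* could be synthesized as the previous
                            constant plus one: a single add          */
}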
* Other
- Public holiday Monday.
- Half day leave on Wednesday.
- Internal training session
Continued looking at Richard's microbenchmarks w.r.t. SMS.
Examined Ayal's comments on the patch to support instructions with
REG_INC_NOTE in SMS.
(http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01216.html)
Took one day off yesterday (4/9).
Do we know anything about "Csmith"?
Maybe we should try it?
Andrew
-------- Original Message --------
Subject: Re: [PATCH][ARM] pr50193: ICE on a | (b << negative-constant)
Date: Thu, 1 Sep 2011 13:21:38 +0000 (UTC)
From: Joseph S. Myers <joseph@codesourcery.com>
To: Andrew Stubbs <ams@codesourcery.com>
CC: gcc-patches@gcc.gnu.org, patches@linaro.org
Newsgroups: gmane.comp.gcc.patches
References: <4E5F6B5F.2020207@codesourcery.com>
On Thu, 1 Sep 2011, Andrew Stubbs wrote:
> This patch fixes the problem by merely checking that the constant is positive.
> I've confirmed that values larger than the mode-size are not a problem because
> the compiler optimizes those away earlier, even at -O0.
Do you mean that you have observed for some testcases that they get
optimized away - or do you have reasons (if so, please state them) to
believe that any possible path through the compiler that would result in a
larger constant here (possibly as a result of constant propagation and
other optimizations) will always result in it being optimized away as
well? If it's just observation, it would be better to put the complete
check in here.
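For context, a minimal reproducer of the shape named in the subject
line (a reconstruction, not the actual pr50193 testcase):

int f(int a, int b)
{
    /* The shift count is a negative constant: undefined behaviour in C,
       and the pattern that triggered the ICE in the ARM back end. */
    return a | (b << -1);
}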
Quite a few of the Csmith-generated bug reports from John Regehr have
involved constants appearing in unexpected places as a result of
transformations in the compiler. It would probably be a good idea for
someone to try using Csmith to find ARM compiler bugs (both ICEs and
wrong-code); pretty much all the bugs reported so far have come from
testing on x86 and x86_64, so it's likely there are quite a few bugs
in the ARM back end
that could be found that way.
--
Joseph S. Myers
joseph@codesourcery.com