Just as an FYI, I've added these loops to the libav microbenchmarks:
avg-h264-chroma-mc8-8.txt
avg-pixels8-8.txt
ff-h264-idct-add-8-8.txt
ff-put-pixels8x16-8.txt
h264-loop-filter-luma-8.txt
idct-internal-8.txt
put-h264-chroma-mc8-8.txt
put-h264-qpel8-h-lowpass-8.txt
put-h264-qpel8-hv-lowpass-8.txt
put-h264-qpel8-v-lowpass-8.txt
based on Michael's h264 profile. These loops:
decode_residual
ff_h264_decode_mb_cavlc
fill_decode_caches
aren't really the kind of thing that the microbenchmark is designed for;
running the whole h264 benchmark is probably a better test. Some of the
functions in the profile just consist of two copies of a simpler loop,
one after the other, so for those I just used the simpler loop.
Usual microbenchmark caveats apply.
Richard
Hi,
* merged vector over-promotion patch to linaro-gcc-4.6
* committed upstream the change of the default vector size for NEON
* continued working on widening shifts
Ira
Hi Dave. I've been hacking away and have checked in a couple of
benchmarking and plotting scripts to lp:cortex-strings. The current
results are at:
http://people.linaro.org/~michaelh/incoming/strings-performance/
All are done on an A9. The results are very incomplete due to how
long things take to run. I'll leave ursa3 running these over the
weekend, which should flesh out the results for the other routines.
Your new memcpy() is looking good as well - as fast as GLIBC.
-- Michael
While out benchmarking today, I ran across code similar to this:
int *a;
int *b;
int *c;
const int ad[320];
const int bd[320];
const int cd[320];
void fill()
{
    for (int i = 0; i < 320; i++)
    {
        a[i] = ad[i];
        b[i] = bd[i];
        c[i] = cd[i];
    }
}
I was surprised and happy to see the vectoriser kick in for the copy.
The inner loop looks like:
add r5, r3, ip
adds r4, r3, r7
vldmia r2!, {d16-d17}
vldmia r1!, {d18-d19}
adds r0, r3, r6
vst1.32 {q9}, [r5]
vst1.32 {q8}, [r4]
vldmia r3, {d16-d17}
adds r3, r3, #16
cmp r3, r8
vst1.32 {q8}, [r0]
bne .L3
so r3 is the loop variable and {ip, r7, r6} are the offsets from r3 to
the destination pointers. Adding a __restrict doesn't change the code.
Richard, will your auto-inc/dec changes combine the final vldmia r3 /
add r3 pair into a single vldmia r3! form?
Changing the int *a into in-file arrays like int a[320] gives:
vldmia r0!, {d16-d17}
vldmia r5!, {d18-d19}
vstmia r4!, {d18-d19}
vstmia r1!, {d16-d17}
vldmia r2!, {d16-d17}
vstmia r3!, {d16-d17}
cmp r3, r6
bne .L2
Marking them as extern int a[320] goes back to the first form.
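To make the variants concrete, here is a reconstruction of the three
declaration forms being compared (the names are illustrative, not from
the original code):

int *p;                /* original form: a pointer, target unknown    */
int arr[320];          /* in-file array: base and alignment known     */
extern int earr[320];  /* extern array: defined in another file       */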
Can we always use the second form? What optimisation is preventing it?
-- Michael
On Fri, Sep 2, 2011 at 4:51 AM, David Gilbert <david.gilbert@linaro.org> wrote:
> Hi Michael,
> I've just committed a pair of memcpy's into src/linaro-a9: memcpy.S,
> which is armv7, and memcpy-hybrid.S, which is a Neon hybrid that uses
> Neon for non-aligned cases and for large (128K or larger) copies.
> I've also (accidentally) wired the memcpy-hybrid one into the
> Makefile.am (I wasn't sure what the right way to do this was - the
> neon_sources seemed a good place for it, but there is nothing
> currently in there that turns off the non-neon version).
>
> I'd be interested in seeing the results for both; I've got a bit of
> a soft spot for the hybrid solution.
>
> On the memset, yes, the 'and' that you added is fine - but I started
> having a play and have some performance results (on -t 128) that I
> don't really understand:
>
>
> 1) and r1,#0xff
> orr r1,r1,r1,lsl#8
> orr r1,r1,r1,lsl#16
>
> That's your solution - and the fastest, at somewhere around 2270MB/s
> for me. By the TRM I reckon that should be 3 cycles.
>
> 2) lsls r1,#24
> orr r1,r1,r1,lsr#8
> orr r1,r1,r1,lsr#16
>
> lsl isn't explicitly listed in the TRM, so I assumed it is the same
> as a move with a constant shift, which my reading says is a single
> cycle; and the lsls encoding is 2 bytes. So you would think this
> version should be as fast as yours but 2 bytes smaller - except it's
> reliably down at 2228MB/s, so it is slower.
>
> 3) Thinking it was an alignment issue, I tried adding a mov r5,r1 to
> the front of that, and got 2248MB/s - so, being faster with an extra
> instruction, it probably was an alignment issue?
>
> 4) I also tried a pair of bfi's:
> bfi r1,r1, #8, #8
> bfi r1,r1, #16, #16
>
> That came out at 2228MB/s - and is 4 cycles by the book.
Unfortunately you can't tell the performance from the latency.
Attached is a microbenchmark that has the three different versions
(and, ubx, lsl). After compensating for the loop time, I got:
* lsl: 1.006 s
* ubx: 0.876 s
* and: 0.918 s
even though ubx has a latency of two cycles.
I then took the AND version and shifted it to the start of the file.
This small change in alignment pushed it up to 1.048 s, which is 14%
slower.
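For reference, all three sequences compute the same byte-to-word
splat. A minimal C sketch of the operation (the function is my
reconstruction, not the attached benchmark):

#include <stdint.h>

/* Replicate the low byte of c across all four bytes of a 32-bit word,
   mirroring the and/orr/orr sequence above. */
static uint32_t splat_byte(uint32_t c)
{
    c &= 0xff;       /* and r1, #0xff           */
    c |= c << 8;     /* orr r1, r1, r1, lsl #8  */
    c |= c << 16;    /* orr r1, r1, r1, lsl #16 */
    return c;
}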
-- Michael
* Linaro GCC
Fixed up, committed and posted two bug fixes to my thumb2 constants
patches, found by other people running FSF trunk.
Analysed bug lp:836401 / pr50193, developed a fix, and posted it both
upstream and to Launchpad for testing. The Launchpad tests have come
back clean, and the patch is approved, but upstream has not approved
it yet.
Posted a query to the linaro-dev mailing list asking for ARM CPU ID
register numbers, and got lots back. Entered these into the patch, and
began some test builds. I will post the new version upstream, if all's
well, next week.
Started looking at an optimization I discussed with Richard Earnshaw in
Cambourne, in which GCC attempts to synthesize constants more
efficiently by reusing constant values already in registers. I've made a
start, but not much more to say just yet.
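As a hypothetical illustration of the idea (the function name and
constants are mine, not from the report): when one constant is already
live in a register, a nearby constant can be derived from it instead
of being materialised from scratch.

void store_pair(unsigned int *p)
{
    p[0] = 0x12345678u;  /* on its own, needs a movw/movt pair       */
    p[1] = 0x12345679u;  /* could be synthesized as the previous
                            constant plus one: a single add          */
}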
* Other
- Public holiday Monday.
- Half day leave on Wednesday.
- Internal training session
Continued looking at Richard's microbenchmarks w.r.t. SMS.
Examined Ayal's comments on the patch to support instructions with
REG_INC_NOTE in SMS.
(http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01216.html)
Took one day off yesterday (4/9).
Do we know anything about "Csmith"?
Maybe we should try it?
Andrew
-------- Original Message --------
Subject: Re: [PATCH][ARM] pr50193: ICE on a | (b << negative-constant)
Date: Thu, 1 Sep 2011 13:21:38 +0000 (UTC)
From: Joseph S. Myers <joseph@codesourcery.com>
To: Andrew Stubbs <ams@codesourcery.com>
CC: gcc-patches@gcc.gnu.org, patches@linaro.org
Newsgroups: gmane.comp.gcc.patches
References: <4E5F6B5F.2020207@codesourcery.com>
On Thu, 1 Sep 2011, Andrew Stubbs wrote:
> This patch fixes the problem by merely checking that the constant is positive.
> I've confirmed that values larger than the mode-size are not a problem because
> the compiler optimizes those away earlier, even at -O0.
Do you mean that you have observed for some testcases that they get
optimized away - or do you have reasons (if so, please state them) to
believe that any possible path through the compiler that would result in a
larger constant here (possibly as a result of constant propagation and
other optimizations) will always result in it being optimized away as
well? If it's just observation, it would be better to put the complete
check in here.
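For context, a minimal reproducer of the shape named in the subject
line (a reconstruction, not the actual pr50193 testcase):

int f(int a, int b)
{
    /* The shift count is a negative constant: undefined behaviour in C,
       and the pattern that triggered the ICE in the ARM back end. */
    return a | (b << -1);
}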
Quite a few of the Csmith-generated bug reports from John Regehr have
involved constants appearing in unexpected places as a result of
transformations in the compiler. It would probably be a good idea for
someone to try using Csmith to find ARM compiler bugs (both ICEs and
wrong-code); pretty much all the bugs reported so far have come from
testing on x86 and x86_64, so it's likely there are quite a few bugs
in the ARM back end
that could be found that way.
--
Joseph S. Myers
joseph@codesourcery.com