== This week ==
* Wrote some patches to make SMS schedule register moves. They made a
significant difference to some libav loops. I'm running a regression
test on powerpc-ibm-aix5.3.0 and will submit upstream next week if
all goes OK.
* Looked at why mjpegenc was so much worse with SMS. Turned out to be
a register spilling problem. Found that -fira-algorithm=priority
avoids the regression and makes several other tests better too.
(I just tested that to see whether there was a feasible register
allocation for these cases; -fira-algorithm=priority isn't the
way to go.)
* Saw that the register allocator seemed to be tripping over the
XImode "structure" values, and that we still had one vector move
per structure element by the time we got to the scheduling passes.
Eliminated those with a combination of one fix and one hack.
That seemed to avoid the allocation problems.
* Patch review (Linaro and upstream).
* Backported libgcc visibility fix to 4.6 and 4.5.
== Next week ==
* Submit register-scheduling patch.
* Submit memory cost patch (from auto-inc-dec changes)
* Possibly submit the auto-inc-dec changes themselves, depending on
how the rtx cost discussion goes.
Richard
== GCC ==
=== Progress ===
* Looked at the vectorize_with_neon_quad failure again and decided
that I had to handle another case, but I wasn't convinced that the
extra stall we'd get in this case was worth it. In any case it would
only have been a workaround; Richard Sandiford fixed this by getting
df to do the right thing, which is the proper fix.
* Backported tbh patch.
* Backported conditional execution improvements patch from Jiangning
to Linaro 4.6 branch.
* Committed the LTO + Neon / Android intrinsics patch.
* Panda seems more reliable this week, but I suspect that's because
the room is cooler.
* Broke up a few blueprints and marked some as done.
* BRANCH_COST results don't show a huge variation in SPEC, and some
of the results are inconsistent. Need to run a few benchmarks
again. Sigh :(
* Finished the A9 scheduler patch for smull and friends and committed
upstream and into Linaro 4.6.
* Briefly reviewed the shrink-wrapping patch and the widening
multiplies patch.
* Looked at the failures in the "popular embedded benchmark" for
some time with Åsa.
* Tried one of the ICE patches and that seemed to work just fine with
bootstrap on FSF trunk. Need to figure out why this was breaking in
the Linaro 4.6 tree. https://bugs.launchpad.net/gcc-linaro/+bug/689887
=== Plans ===
Next Week - Holiday :) Feet not up but walking in what looks like
typical bank holiday weather ... Might check email later in the week.
Meetings:
* 1-1s
* TCWG calls
* Thumb2 performance call.
Absences.
* 29th Aug - Sept. 2 - Holiday booked and approved.
* 31st Oct - 4th Nov - Linaro Summit Orlando - Travel booked - hotel
to be booked.
* Investigated the errors in the automotive test and concluded that they are
CRC errors, but they do not depend on the test case result (non-intrusive
CRC check). We decided these errors need to be cleared out once and for all.
Michael and Ramana are helping out with the continued investigation.
* Ran EEMBC on both Panda and Snowball with GCC 4.5.2. The results look
reasonable, but Michael will also have a look. I will spend a little more
time comparing the results from the two boards.
* Started to run SPEC2K on the Panda board.
Best Regards
Åsa
Following on from yesterday's call about what it would take to enable
SMS by default: one of the problems I was seeing with the SMS+IV patch
was that we ended up with excessive moves. E.g. a loop such as:
void
foo (int *__restrict a, int n)
{
  int i;
  for (i = 0; i < n; i += 2)
    a[i] = a[i] * a[i + 1];
}
would end up being scheduled with an ii of 3, which means that in the
ideal case, each loop iteration would take 3 cycles. However, we then
added ~8 register moves to the loop in order to satisfy dependencies.
Obviously those 8 moves add considerably to the iteration time.
I played around with a heuristic to see whether there were enough
free slots in the original schedule to accommodate the moves.
That avoided the problem, but it was a hack: the moves weren't
actually scheduled in those slots. (In current trunk, the moves
generated for an instruction are inserted immediately before that
instruction.)
I mentioned this to Revital, who told me that Mustafa Hagog had
tried a more complete approach that really did schedule the moves.
That patch was quite old, so I ended up reimplementing the same kind
of idea in a slightly different way. (The main functional changes
from Mustafa's version were to schedule from the end of the window
rather than the start, and to use a cyclic window. E.g. moves for
an instruction in row 0 column 0 should be scheduled starting at
row ii-1 downwards.)
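To make that concrete, here is a minimal standalone sketch of the
placement idea (this is not the GCC implementation, and the ii and the
number of issue slots per row are made-up values):

/* Sketch only: place the moves generated for an instruction by walking
   the rows of the modulo schedule cyclically, starting one row "above"
   the defining instruction (so row ii-1 for a definition in row 0) and
   moving downwards until a row with a free issue slot is found.  */

#include <stdio.h>

#define II 4            /* initiation interval (assumed value)     */
#define SLOTS_PER_ROW 2 /* issue slots per row (assumed value)     */

static int used[II];    /* slots already taken in each row         */

/* Return the row chosen for a move whose definition is in DEF_ROW,
   or -1 if every row in the window is already full.  */
static int
place_move (int def_row)
{
  int d, row;

  for (d = 1; d <= II; d++)
    {
      row = (def_row - d + II) % II;  /* cyclic, end of window first */
      if (used[row] < SLOTS_PER_ROW)
        {
          used[row]++;
          return row;
        }
    }
  return -1;
}

int
main (void)
{
  int r;

  /* Pretend the original schedule already uses one slot per row.  */
  for (r = 0; r < II; r++)
    used[r] = 1;

  /* Moves for a definition in row 0 go into rows ii-1, ii-2, ...  */
  printf ("first move placed in row %d\n", place_move (0));
  printf ("second move placed in row %d\n", place_move (0));
  return 0;
}

With these made-up numbers, the two moves for a definition in row 0 end
up in rows ii-1 and ii-2, which is the behaviour described above.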
The effect on my flawed libav microbenchmarks was much greater
than I imagined. I used the options:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
-fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
The "before" code was from trunk, the "after" code was trunk + the
register scheduling patch alone (not the IV patch). Only the tests
that have different "before" and "after" code are run. The results were:
test                  runs      before      after       speedup
a3dec                 500000    4.68384s    4.61395s    x1.02
aes                   500000    20.0523s    16.9722s    x1.18
avs                   1000000   15.4698s    2.23676s    x6.92
dxa                   2000000   18.5848s    4.40607s    x4.22
mjpegenc              500000    28.6987s    7.31342s    x3.92
resample              1000000   10.418s     1.91016s    x5.45
rgb2rgb-rgb24tobgr16  1000000   1.60513s    1.15643s    x1.39
rgb2rgb-yv12touyvy    1500000   3.50122s    3.49887s    x1
twinvq                500000    0.452423s   0.452454s   x1
Taking resample as an example: before the patch we had an ii of 27,
stage count of 6, and 12 vector moves. Vector moves can't be dual
issued, and there was only one free slot, so even in theory, this loop
takes 27 + 12 - 1 = 38 cycles. Unfortunately, there were so many new
registers that we spilled quite a few.
After the patch we have an ii of 28, a stage count of 3, and no moves,
so in theory, one iteration should take 28 cycles. We also don't spill.
So I think the difference really is genuine. (The large difference
in moves between ii=27 and ii=28 is because in the ii=27 schedule,
a lot of A--(T,N,0)-->B (intra-cycle true) dependencies were scheduled
with time(B) == time(A) + ii + 1.)
I also saw benefits in one test in a "real" benchmark, which I can't
post here.
Richard
Hello,
Following today's performance call
(https://wiki.linaro.org/WorkingGroups/ToolChain/Meetings/2011-08-23)
here are some points raised regarding the steps towards enabling SMS by default:
* Benchmarks testing:
-- Running benchmarks such as EEMBC and SPEC2006 with SMS enabled is
crucial to expose loops where SMS degrades performance. Those
loops need to be analysed to construct a cost model.
-- SMS increases code size by introducing a prologue and an epilogue
around the loop kernel. This should also be measured.
-- Measure the increase in compile time: on a native or a cross build?
Currently SMS fails to bootstrap trunk on an ARM machine; this should
also be taken into account when considering enabling it by default.
Should it be turned on with -O2 or -O3?
SMS flags to use for testing:
-O3 -fmodulo-sched-allow-regmoves -fmodulo-sched
-funsafe-loop-optimizations -fno-auto-inc-dec
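For concreteness, a full command line might look something like the
following (the target triplet and the -mcpu value are only examples and
would need adjusting to the compiler and board being tested):

$ arm-linux-gnueabi-gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp \
    -fmodulo-sched -fmodulo-sched-allow-regmoves \
    -funsafe-loop-optimizations -fno-auto-inc-dec -c loop.c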
Thanks,
Revital
Hi
Some time ago we agreed that not everyone here uses the Ubuntu
distribution and decided to provide a so-called 'generic linux' cross
toolchain. Recently I managed to get it done, and now I need brave
testers to tell me whether it works or not.
Get it here: http://people.linaro.org/~hrw/generic-linux/ (64bit only)
The needed files are toolchain-11.07.tar.xz and the init.sh script.
Unpack the tarball from / so that /opt/linaro/11.07/ gets populated, and
put init.sh anywhere you want (it will be integrated into the tarball
later).
How to use:
$ source init.sh
This will add the cross toolchain to PATH and also set LD_LIBRARY_PATH
to two directories:
- one with the binutils libraries
- a second with all the extra libraries which may be needed
Feel free to experiment with the second directory by removing files
from it and checking whether the system-provided libs are fine too.
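A quick smoke test could look something like this (the exact target
triplet is whatever ends up in /opt/linaro/11.07/bin, so adjust the
compiler name if needed):

$ source init.sh
$ echo 'int main (void) { return 0; }' > hello.c
$ arm-linux-gnueabi-gcc -O2 -o hello hello.c
$ file hello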
So far I have checked this toolchain under a few distributions:
- Ubuntu 10.04 'lucid' LTS
- Ubuntu 11.04 'natty'
- Fedora 14
- OpenSUSE 11.4
- CentOS 5.6
It failed only under CentOS (which was expected due to its age).
How did I check? So far I have tested compilation of 'gpm' and 'zlib'.
== GCC ==
=== Progress ===
* Continued to look at the test failure with -mvectorize-with-neon-quad.
Should be able to commit the backend workaround on Monday.
* Having some problems getting my Panda board to work reliably. I'm
not sure if it's the temperature or what, but when it gets hot in the
office, as it was on Tuesday, keeping it working reliably is hard. The
board locks up and then crashes quite often.
* Spent some more time looking at VFP moves.
* Committed tbh range change.
* Committed fixes for PR50022
=== Plans ===
* Finish off VFP moves patch.
* Look at BRANCH_COST results.
* Breakdown the T2 performance blueprints into smaller blueprints.
* Backport tbh range changes to Linaro 4.6
* Test the intrinsics patch with some more intrinsics tests and then
merge it into Linaro GCC 4.6.
Meetings:
* 1-1s
* TCWG calls
Absences.
* 29th Aug - Sept. 2 - Holiday booked and approved.
* 31st Oct - 4th Nov - Linaro Summit Orlando - Travel booked - hotel
to be booked.
Hi all,
I'm having real trouble here :(
I just can't seem to get bzr to work! I've tried to branch
gcc-linaro/4.6 again and again, and it just won't. My other machine
refuses to do the merge from lp:gcc/4.6, presumably because the bzr on
there is too old.
I'm stuck. Can anybody else do the merge from upstream?
I'm going to keep trying.
Andrew