== This week ==
* Got the -fsched-pressure code into a state where it's almost
presentable. Found a few more things to tweak on the way.
Fixed some FIXMEs, notably to honour MAX_SCHED_READY_INSNS.
* More testing on ARM. Tried to get some SPEC2000 results
as well as the usual EEMBC & DENbench, but I'm not sure
how noisy the SPEC ones are.
* More testing on powerpc. Decided that this really isn't a good target
to test on for 4.7 because of the poor choice of pressure classes.
SPEC CPU2006 INT results are reasonable-to-good, but the FP ones
suffer from the fact that we think there are twice as many registers
available for normal FP than there actually are. I'd like to fix this,
but all pressure-estimation bits of GCC suffer from the same problem,
and it's hard to justify as part of Linaro, because it doesn't
apply to ARM.
* Fixed upstream PR 50873 (ICE for NEON misaligned moves). Thanks to
Ramana for the heads-up and analysis.
* Retested and posted the patch for PR 48941 upstream (poor code generated
by the vzip*() and vunzp*() arm_neon.h functions).
Richard
== GDB ==
* Created and published Linaro GDB 7.3-2011.12 release.
* Updated Linaro GDB 7.3 to GDB 7.3.1 code base.
* Implemented support for single-stepping atomic operation
code sequences for ARM (and Thumb) (LP #892008). Checked
in to mainline and Linaro GDB.
* Ongoing work on remote support for "info proc" and core file
generation. Currently yet another solution for the remote
interface has been brought up in mailing list discussions
(support accessing arbitrary files on the remote side, not
just /proc). I'm working on prototyping this suggestion.
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter | Geschäftsführung: Dirk
Wittkopp
Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht
Stuttgart, HRB 243294
Hi Michael,
I have finally managed to complete the release process. It wasn't quite
as smooth as I would have liked, but we seem the have got there!
Notes:
- Ramana's VCVT patch caused an Android problem. This was reverted right
before the release.
- The initial release spin and test went without a hitch.
- There was an additional test failure in the GCC testsuite, but this
turns out to be because the snapshot date "20121201" happens to contain
the string "120". Interestingly, this will also be true for most of 2012.
- The ubutest runs seem to have a some problems: all of the glibc and
python builds have failed with a message about libgcc. Since this has
hit both 4.5 and 4.6 simultaneously I'm assuming it's environmental and
not caused by a new toolchain bug. The rest of the compilation appears fine.
- The benchmarking seems fine on A9, but I couldn't find results for the
others, although the scheduler lists the jobs.
- The upload to Launchpad was somewhat problematic. Uploading 4.5 took
two attempts. Uploading 4.6 failed about 6 times (at 20 minutes or so
each) before I tried from another machine with a faster uplink - that
went first time.
Andrew
The Linaro Toolchain Working Group is pleased to announce the 2011.12
release of both Linaro GCC 4.6 and Linaro GCC 4.5.
Linaro GCC 4.6 2011.12 is the tenth release in the 4.6 series. Based
off the latest GCC 4.6.2+svn181866, it contains a range of vectoriser
performance improvements and general bug fixes.
Interesting changes include:
* Updates to 4.6.2+svn181866
* Generic tuing support for Big-endian platforms.
* SLP support for operations with arbirary numbers of operands.
* SLP support for conditions.
* Pattern recognition support in basic-block SLP.
* Enhancements to mixed-size condition pattern recognition.
* Support for 64bit __sync* primitives on ARM.
* Unaligned block-move support for ARMv7.
* Added Cortex-A15 integer pipeline tuning.
Linaro GCC 4.5 2011.12 is the sixteenth release in the 4.5
series. Based off the latest GCC 4.5.3+svn181877, this is a
maintenance focused release.
Interesting changes in 4.5 include:
* Updates to 4.5.3+svn181877
The source tarballs are available from:
https://launchpad.net/gcc-linaro/+milestone/4.6-2011.12https://launchpad.net/gcc-linaro/+milestone/4.5-2011.12
Downloads are available from the Linaro GCC page on Launchpad:
https://launchpad.net/gcc-linaro
More information on the features and issues are available from the
release page:
https://launchpad.net/gcc-linaro/4.6/4.6-2011.12https://launchpad.net/gcc-linaro/4.5/4.5-2011.12
Mailing list: http://lists.linaro.org/mailman/listinfo/linaro-toolchain
Bugs: https://bugs.launchpad.net/gcc-linaro/
Questions? https://ask.linaro.org/
Interested in commercial support? inquire at support(a)linaro.org
Hi,
- fixed PR 51285
- continued looking at the alignment issue, ran Michael's script with
different options, tested Ramana's preliminary patch for vld1/vst1,
and my "don't peel for low loop bounds" patch
Ira
The Linaro Toolchain Working Group is pleased to announce the
release of Linaro QEMU 2011.12.
Linaro QEMU 2011.12 is the latest monthly release of
qemu-linaro. Based off upstream (trunk) QEMU, it includes a
number of ARM-focused bug fixes and enhancements.
New in this month's release:
- There are no Linaro-specific changes of note in this release
- This release is based on the upstream QEMU 1.0 release.
(Note that future qemu-linaro releases will continue to track
upstream trunk; the release dates for upstream and our
release just happened to be conveniently aligned in this case.)
Known issues:
- Graphics do not work for OMAP3 based models (beagle, overo)
with 11.10 Linaro images.
- This release of qemu-linaro is known not to work on ARM hosts.
(See bugs #883133, #883136)
The source tarball is available at:
https://launchpad.net/qemu-linaro/+milestone/2011.12
More information on Linaro QEMU is available at:
https://launchpad.net/qemu-linaro
The Linaro Toolchain Working Group is pleased to announce the release of
Linaro GDB 7.3.
Linaro GDB 7.3 2011.12 is the fourth release in the 7.3 series. Based off
the latest GDB 7.3.1, it includes a number of ARM-focused bug fixes and
enhancements.
This release contains:
* Update to GDB 7.3.1 code base
* Support single-stepping atomic operations (LDREX/STREX sequences)
The source tarball is available at:
https://launchpad.net/gdb-linaro/+milestone/7.3-2011.12
More information on Linaro GDB is available at:
https://launchpad.net/gdb-linaro
I had a play with the vecotiser to see how peeling, unrolling, and
alignment affected the performance of simple memory bound loops.
The short story is:
* For fixed length loops, don't peel
* Performance is the same for 8 byte aligned arrays and up
* Performance is very similar for unaliged arrays
* vld1 is as fast as vldmia
* vld1 with specified alignment is much faster than vld1
The loop is the rather ugly and artifical::
void op(struct ains * __restrict out, const struct aints * __restrict in)
{
for (int i = 0; i < COUNT; i++)
{
out->v[i] = (in->v[i] * 173) | in->v[i];
}
}
where `struct aints` is a aligned structure. I couldn't figure out how
to use an aligned typedef of ints without still introducing a runtime
check. I assume I was running into some type of runtime alias
checking.
This compiled into::
vmov.i32 q10, #173
add r3, r0, #5
0:
vldmia r1!, {d16-d17}
vmul.i32 q9, q8, q10
vorr q8, q9, q8
vstmia r0!, {d16-d17}
cmp r0, r3
bne 0b
I then lied to the compiler by changing the actual alignment at
runtime. See:
http://people.linaro.org/~michaelh/incoming/runtime-offset.png
The performance didn't change for actual alignments of 8,
16, or 32 bytes.
I then converted the loop into one using vld1 and fed it smaller
alignments. See:
http://people.linaro.org/~michaelh/incoming/small-offsets.png
The throughput falls into two camps: one of alignments
1, 2, or 4 and one of 8, 16, 32. The throughput is very similar for
both camps but has some stange dropoffs at 24 words, around 48 words,
and around 96 words. The terminal throughput at 300 words and above
is within 0.5 %
I then converted the vld1 and vst1 to specifiy an alignment of 64
bits. See:
http://people.linaro.org/~michaelh/incoming/set-alignment.png
This improved the throughput in all cases and in cases for more than 50
words by 14 %. This graph also shows the overhead of the runtime
peeling check. The blue line is the vectoriser version which is
slower to pick up due the greater per call overhead.
I then went back to the vectoriser and changed the alignment of the
struct to cause peeling to turn on and off. See:
http://people.linaro.org/~michaelh/incoming/unroll.png
At 200 words, the version without peeling is 2.9 % faster. This is
partly due to a fixed count loop turning into a runtime count due to
unknown alignment.
This run also showed the affect of loop unrolling. The loop seems to
be unrolled for loops of <= 64 words and drops off in performance past
around 8 words. When the unrolling finally drops out, performance
increases by 101 %.
Raw results and the test cases are available in
lp:~linaro-toolchain-dev/linaro-toolchain-benchmarks/private-runs
A graph of all results is at:
http://people.linaro.org/~michaelh/incoming/everything.png
The usual caveats apply: this test was all in L1, only on the A9, and
very artificial.
-- Michael