RAG:
Red:
Amber:
Green: 1105 work item status 99% complete with 2 weeks to go
Current Milestones:
Milestone | Planned | Estimate | Actual |
qemu-linaro 2011-05 | 2011-05-19 | 2011-05-19 | n/a |
close out 1105 blueprints | 2011-05-28 | 2011-05-28 | |
complete 1111 planning | 2011-05-28 | 2011-05-28 | |
Historical Milestones:
finish qemu-cont-integration | 2011-01-25 | 2011-01-25 | handed off |
first qemu-linaro release | 2011-02-08 | 2011-02-08 | 2011-02-08 |
qemu-linaro 2011-03 | 2011-03-08 | 2011-03-08 | 2011-03-08 |
qemu-linaro 2011-04 | 2011-04-21 | 2011-04-21 | 2011-04-21 |
== merge-correctness-fixes ==
* some of my pending patches have been applied; a number of others are
still under discussion or need further work/testing
== other ==
* We won't be making a qemu-linaro 2011-05 release, since there are no
changes since the 2011-04 release (due to a combination of the Easter
holiday and UDS week).
* Attended UDS
* almost all 1105 work items either complete or confirmed postponed
to next cycle
* Good progress on fleshing out blueprints for next cycle:
https://wiki.linaro.org/PeterMaydell/Qemu1111
Current qemu patch status is tracked here:
https://wiki.linaro.org/PeterMaydell/QemuPatchStatus
Absences:
(maybe) 15-16 August: QEMU/KVM strand at LinuxCon NA, Vancouver
[LinuxCon proper follows on 17-19th]
Last week, Ramana pointed me at an upstream bug report about the
inefficient code that GCC generates for vzip, vuzp and vtrn:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48941
It was filed not long after the Neon seminar at the summit;
I'm not sure whether that was a coincidence or not.
I attached a patch to the bug last week and will test it this week.
However, a cut-down version shows up another problem that isn't related
specifically to intrinsics. Given:
#include <arm_neon.h>
void foo (float32x4x2_t *__restrict dst, float32x4_t *__restrict src, int n)
{
while (n--)
{
dst[0] = vzipq_f32 (src[0], src[1]);
dst[1] = vzipq_f32 (src[2], src[3]);
dst += 2;
src += 4;
}
}
GCC produces:
cmp r2, #0
bxeq lr
.L3:
vldmia r1, {d16-d17}
vldr d18, [r1, #16]
vldr d19, [r1, #24]
vldr d20, [r1, #32]
vldr d21, [r1, #40]
vldr d22, [r1, #48]
vldr d23, [r1, #56]
add r3, r0, #32
vzip.32 q8, q9
vzip.32 q10, q11
subs r2, r2, #1
vstmia r0, {d16-d19}
add r1, r1, #64
vstmia r3, {d20-d23}
add r0, r0, #64
bne .L3
bx lr
We're missing many auto-increment opportunities here. I think this
is due to the limitations of GCC's auto-inc-dec pass rather than to
a problem in the ARM port itself. I think there are two main areas
for improvement:
- The pass only tries to use auto-incs in cases where there is a
separate addition and memory access. It doesn't try to handle
cases where there are two consecutive memory accesses of the
form *base and *(base + size), even if the address costs make
it clear that post-increments would be a win.
- The pass uses a backward scan rather than a forward scan,
which makes it harder to spot chains of more than two accesses.
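
As a rough C-level illustration of the two access shapes described in
the list above (the function names are made up for this sketch, and
what the auto-inc-dec pass actually sees depends on how earlier passes
have rewritten the addresses), the two loops below compute the same sum:

```c
#include <stddef.h>

/* Shape 1: every memory access is paired with a separate pointer
   addition.  This is the "addition + access" form that the pass is
   described as looking for.  */
long sum_bump(const int *p, size_t groups)
{
    long total = 0;
    while (groups--) {
        total += *p; p++;
        total += *p; p++;
        total += *p; p++;
        total += *p; p++;
    }
    return total;
}

/* Shape 2: consecutive accesses of the form *base, *(base + size), ...
   with a single addition at the end.  Turning each access into a
   post-increment here means noticing the whole chain, which is the
   case described as not being handled.  */
long sum_offsets(const int *p, size_t groups)
{
    long total = 0;
    while (groups--) {
        total += p[0];
        total += p[1];
        total += p[2];
        total += p[3];
        p += 4;
    }
    return total;
}
```

Both functions return the same value for the same input; only the form
of the address arithmetic differs. Whether either one ends up using
post-increment addressing on a given GCC version is not something this
sketch guarantees.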
FWIW, I've got fairly specific ideas about how to do this.
Unfortunately, the pass is in need of some TLC before it's
easy to make changes. So in terms of work items, how about:
1. Clean up the auto-inc pass so that it's easier to modify
2. Investigate improvements to the pass
3. Submit the changes upstream
4. Backport the changes to the Linaro branches
I wrote some patches for (1) last week.
I'd estimate it's about 2 weeks' work for (1) and (2). (3) and (4)
would hopefully be background tasks. The aim would be for something
like:
.L3:
vldmia r1!, {d16-d17}
vldmia r1!, {d18-d19}
vldmia r1!, {d20-d21}
vldmia r1!, {d22-d23}
vzip.32 q8, q9
vzip.32 q10, q11
subs r2, r2, #1
vstmia r0!, {d16-d19}
vstmia r0!, {d20-d23}
bne .L3
bx lr
This should help with auto-vectorised code, as well as normal core code.
(Combining the vldmias and vstmias is a different topic. The fact that
this particular example could be implemented using one load and one
store is to some extent coincidental.)
Richard
== String routines ==
* Gave up on perf on silverbell and redid it on ursa2; now have a
full set of perf figures and have updated the workload report to show
the spec binaries that use significant time in libc and the routines
they spend it in; a handful of tests spend very significant amounts of
time in libm.
* Have ltrace results from about 75% of spec - some of the others
are fighting a bit
* Optimised the non-neon memcpy; it's now quite respectable except
in one or two cases (2 byte misaligned, and for some odd reason source
offset by 8 bytes, destination by 12 is way down on any other
combination)
(Current result graphs here
https://wiki.linaro.org/Internal/People/DaveGilbert?action=AttachFile&do=ge…
)
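
A harness of roughly the following shape could be used to walk every
source/destination offset combination like the ones mentioned above.
This is only a sketch: `sweep_offsets` and `copy_fn` are hypothetical
names, it checks correctness only, and any timing code would have to be
wrapped around the `copy` call.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Any routine with memcpy's signature can be plugged in here.  */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Copy `len` bytes for every (source offset, destination offset) pair
   below `max_off` and count the pairs whose result is wrong.  Offsets
   are relative to the malloc'd base, which is assumed to be suitably
   aligned.  Returns -1 if allocation fails.  */
int sweep_offsets(copy_fn copy, size_t max_off, size_t len)
{
    size_t total = len + max_off;
    unsigned char *src = malloc(total);
    unsigned char *dst = malloc(total);
    int bad = 0;

    if (!src || !dst) {
        free(src);
        free(dst);
        return -1;
    }

    /* Fill the source with a simple non-repeating-looking pattern.  */
    for (size_t i = 0; i < total; i++)
        src[i] = (unsigned char)(i * 31u + 7u);

    for (size_t s = 0; s < max_off; s++) {
        for (size_t d = 0; d < max_off; d++) {
            memset(dst, 0xA5, total);
            copy(dst + d, src + s, len);
            if (memcmp(dst + d, src + s, len) != 0)
                bad++;
        }
    }

    free(src);
    free(dst);
    return bad;
}
```

For example, `sweep_offsets(memcpy, 16, 257)` exercises the "source
offset by 8, destination by 12" case among the other 255 combinations.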
Dave
Hi,
* continued looking into ffmpeg/libavcodec:
- dcadsp.c - the inner loop contains reverse accesses which are not
supported on Neon. I think we can handle them using vrev and vswp.
- a lot of loops have unknown memory stride. I am exploring a
possibility of a combination of scalar loads and vmov into a vector
register, but it is probably too expensive.
* looking into telecom/conven
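
For the two libavcodec cases above, the ideas can be sketched with NEON
intrinsics as follows. This is an illustration under assumptions, not
what the vectoriser generates: the function names are hypothetical, the
compiler is free to pick different instructions than vrev/vswp/vmov for
these intrinsics, and a scalar fallback is included so the code also
builds without NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Reverse access: load src[0..3] and produce src[3], src[2], src[1],
   src[0].  Reversing within each 64-bit half and then swapping the two
   halves gives a full reversal.  */
void load_reversed4(float out[4], const float *src)
{
#if HAVE_NEON
    float32x4_t v = vld1q_f32(src);
    v = vrev64q_f32(v);                                   /* 1 0 3 2 */
    v = vcombine_f32(vget_high_f32(v), vget_low_f32(v));  /* 3 2 1 0 */
    vst1q_f32(out, v);
#else
    for (int i = 0; i < 4; i++)
        out[i] = src[3 - i];
#endif
}

/* Unknown stride: four scalar loads, each moved into one lane of a
   vector register.  The stride is in elements, not bytes.  */
void gather4(float out[4], const float *src, ptrdiff_t stride)
{
#if HAVE_NEON
    float32x4_t v = vdupq_n_f32(0.0f);
    v = vsetq_lane_f32(src[0], v, 0);
    v = vsetq_lane_f32(src[stride], v, 1);
    v = vsetq_lane_f32(src[2 * stride], v, 2);
    v = vsetq_lane_f32(src[3 * stride], v, 3);
    vst1q_f32(out, v);
#else
    for (int i = 0; i < 4; i++)
        out[i] = src[i * stride];
#endif
}
```

The gather needs one load and one lane insert per element, which is why
it may well be too expensive compared with leaving the loop scalar.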
Ira
== Last week ==
* Launchpad #748138: "ICE in redirect_jump, at jump.c:1443". Related to
shrink-wrap, discussed a bit with Bernd off-list. Sent fix today (Mon.)
to gnu-internal; will need to merge to Linaro.
* CoreMark combine canonicalize compares patch set: bootstrapped and
tested with clean results on powerpc, added comments and updated
upstream submission. Machine independent parts okayed by Jeff Law, now
committed upstream. ARM parts still pending review.
* Compiled back-list of upstream patches, and sent to patches(a)linaro.org
* Traveled to Budapest, Hungary for Linaro Developer Summit on Saturday.
== This week ==
* Linaro Developer Summit at Budapest all week.
== GDB ==
* Committed support for NEON registers in core dumps (bug #615972)
to Linaro GDB (not yet in mainline).
* Investigated root cause of bug #615996 (gdb.cp/templates.exp) and
started exploring ways to fix it.
== GCC ==
* Committed fix for bug #759409 (Profiled bootstrap fails in GCC 4.5)
to FSF GCC 4.5 branch and Linaro GCC 4.5.
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
Worked on the ARM 16 -> 64-bit multiply-and-accumulate problem. Bernd
kindly provided a prototype patch to help. I've tried to understand what
needs to be done, but I didn't have enough time to get to the bottom of
it. So far, I think I know why the existing code doesn't work, and I
think I have a way forward. It does appear that the real problem ought
to be solved in the tree optimizers, though.
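
The test case itself isn't shown here, so as an assumption, the kind of
source pattern in question looks something like the sketch below: two
16-bit operands multiplied and the product accumulated into a 64-bit
value, which ARM can in principle do with a single instruction such as
SMLALBB. The names `mac16` and `dot16` are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* 16 x 16 -> 64-bit multiply-and-accumulate.  Both operands are
   promoted to int, so the product is a 32-bit value that cannot
   overflow; it is then widened and added to the 64-bit accumulator.  */
int64_t mac16(int64_t acc, int16_t a, int16_t b)
{
    return acc + a * b;
}

/* A typical use: a dot product over 16-bit samples.  */
int64_t dot16(const int16_t *x, const int16_t *y, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = mac16(acc, x[i], y[i]);
    return acc;
}
```

The widening from 16 to 32 to 64 bits happens in separate steps at the
source level, which may be part of why recognising the whole operation
is easier earlier in the compiler.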
Committed the FSF GCC 4.5.3 merge to the Linaro 4.5 branch. Testing did
not show any trouble.
Matthias requested an additional 4.5 merge to pick up a new bug fix, so
I've done the merge, and submitted the merge request for testing.
Committed Maxim's compound conditionals optimization patch - a merge
from Linaro GCC 4.5.
There was some confusion caused by the lp:gcc-linaro/4.6 branch history
accidentally getting re-written. After some discussion on #bzr I managed
to figure out what happened, posted a warning to linaro-toolchain
mailing list, and changed the branch configuration to prevent it
happening again.
Committed Mark Shinwell's BRANCH_COST patch to Linaro GCC 4.6 - another
merge from GCC 4.5.
Merged from FSF GCC 4.6 to Linaro 4.6 and submitted the patch for testing.
Richard Earnshaw approved my recent Thumb2 constants patch, but only if
I modify it slightly. I've begun work on the changes, but I still need
to test them. I won't be able to commit them until the ADDW/SUBW patch
has been approved.
Ramana has reviewed my EABI half-precision function names patch, and
discovered that the return types are wrong. I have no idea how this
happened - the changes are deliberate so they must have been based on
something, but I no longer have the same documents I had when I did the
work, and it clearly doesn't match my current ones. In any case, the
changes make no practical difference as function return values are
always as wide as a register anyway.
* Other
Public holiday on Monday.
* Next week
I will be attending UDS in Budapest from 8th - 14th May. I shall
continue to read my email, but will not be attending any calls.
----
Upstream patches requiring review:
* NEON scheduling patch
http://gcc.gnu.org/ml/gcc-patches/2011-02/msg01431.html
* ARM Thumb2 addw/subw support.
http://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg03783.html
== Bug fighting ==
* Tracked bug 774175 (apt segfault on armel on oneiric) down to the
cortex-a8 branch erratum bug that we found as part of the bug jam a
few weeks ago (affecting the more obscure vtk package) - Richard's
existing binutils fix should fix this.
== String routines ==
* Struggled to get 'perf' to give sane results from profiling spec;
some of the samples are obviously being associated with the wrong
process somewhere along the way (e.g. it's showing significant samples
in the sh process but in a library that's used by the actual
benchmark).
* latrace on spec still running on ursa2
* Wrote a non-neon memcpy; as expected its aligned performance is
very similar to libc/kernel - it's a bit faster in some places but
slower in some odd places (e.g. n*32+1 bytes is a lot slower for some
reason). It's also really bad on misaligned cases; I tried to take
advantage of the v7's ability to do misaligned loads, but they really
are quite slow.
Dave
== This week ==
* Committed interleaved load/store vectorisation changes upstream.
* Merged the vldN and vstN intrinsic improvements into Linaro 4.5 and 4.6.
(Thanks for the quick reviews here.)
* Backported the interleaved load/store vectorisation changes to Linaro
4.5 and 4.6. This took a while because the patch series touches
turbulent code. Submitted merge requests.
* Merged Sergey Grechanik's NEON reload improvement into Linaro 4.5
and 4.6.
* Got ready for summit.
Richard