For reference. We know that the NEON intrinsics in GCC have issues.
I came across this page:
http://hilbert-space.de/?p=22
which has a colour to greyscale conversion done using intrinsics.
gcc-linaro-4.5-2011.03-0 does poorly because it saves intermediate
values on the stack. The core of the loop is:
.L3:
mov ip, r4
vld3.8 {d16-d18}, [r6]
vstmia r4, {d16-d18}
ldmia ip!, {r0, r1, r2, r3}
mov sl, r9
adds r7, r7, #1
adds r6, r6, #24
stmia sl!, {r0, r1, r2, r3}
fldd d16, [sp, #24]
fldd d18, [sp, #32]
ldmia ip, {r0, r1}
vmull.u8 q8, d16, d19
stmia sl, {r0, r1}
vmlal.u8 q8, d18, d20
fldd d18, [sp, #40]
vmlal.u8 q8, d18, d21
vshrn.i16 d16, q8, #8
vst1.8 {d16}, [r5]
adds r5, r5, #8
cmp r8, r7
bgt .L3
llvm-2.9~svn128540 does much better:
vld3.8 {d20, d21, d22}, [r1]!
add r3, r3, #1
cmp r3, r2
vmull.u8 q12, d21, d16
vmlal.u8 q12, d20, d17
vmlal.u8 q12, d22, d18
vshrn.i16 d19, q12, #8
vst1.8 {d19}, [r0]!
blt .LBB0_1
and may actually be better than the hand-written assembler on Nils's
page due to scheduling the loop comparison earlier.
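For readers without the linked page to hand, here is a scalar C sketch of
what both loops above appear to compute: three 8-bit channels are
de-interleaved, each is multiplied by an 8-bit weight into a 16-bit
accumulator, and the high byte of the sum is stored. The weights 77/151/28,
the R,G,B channel order and the function name are assumptions for
illustration; they are not stated in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed weights (not given in the message). They sum to 256, so the
 * final shift by 8 normalises the result back to 0..255. */
enum { W_R = 77, W_G = 151, W_B = 28 };

/* Convert n interleaved 3-byte pixels to n grey bytes (reference only). */
void rgb_to_grey_ref(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        /* vld3.8 does this de-interleave for eight pixels at once. */
        uint8_t r = src[3 * i + 0];
        uint8_t g = src[3 * i + 1];
        uint8_t b = src[3 * i + 2];

        /* Worst case is 255 * 256 = 65280, which still fits in 16 bits. */
        uint16_t acc = (uint16_t)(r * W_R);      /* like vmull.u8 */
        acc = (uint16_t)(acc + g * W_G);         /* like vmlal.u8 */
        acc = (uint16_t)(acc + b * W_B);         /* like vmlal.u8 */

        dst[i] = (uint8_t)(acc >> 8);            /* like vshrn.i16 #8 */
    }
}
```

Nothing in this arithmetic needs a trip through memory, which is why the
stack traffic in the GCC 4.5 loop stands out.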
Richard S, were you looking into this?
-- Michael
Hi there. A reminder that today's call has shifted due to the
European daylight savings change. It's now at 0800 UTC which is 9 am
in the UK, 10 am in central Europe, and 10 am in Israel.
-- Michael
== Last week ==
* PR46934: Thumb-1 ICE, small fix in the "casesi" jump-table expand
code. Quickly approved and committed upstream.
* Enhance XOR patch for gcc/simplify-rtx.c. Updated comments and
committed upstream.
* PR48250 / CS Issue #9845 / Launchpad #723185. Unaligned DImode reload
under NEON. Submitted patch upstream, but still need to do some more
verification that older pre-ARMv5TE cases are safe. Should complete this
week.
* Working on a type of ICE seen currently on upstream trunk, a few
testcases failing under '-O3 -g'. It seems VTA related, but also might
have something to do with register elimination not fully done for
(var_location (entry_value ...)) expressions, leaving [afp+#num] memory
addresses existing in debug insns after reload. Still investigating.
* Launchpad #689887, ICE in get_arm_condition_code(). Pushed a merge
request to Linaro 4.5 for this patch. LP#742961 has also appeared as
another case of this ICE.
* Still working on (what I think should be) the last of the CoreMark
ARMv6 regressions. The problem is to combine uxtb+cmp into ands #255.
This could be done by adding (set (cc) (compare (zero_extend...)))
patterns, implemented by ands assembly, but still looking if this can be
done (probably more elegantly) by something like CANONICALIZE_COMPARISON
(replacing compare operands) in the ARM backend.
* Launchpad #736007, ICE immed_double_const under -mfpu=neon -g. Some
discussion on gcc-patches about this, still unclear on what should be
done...
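To illustrate the uxtb+cmp item above, here is a small C sketch of the
source-level pattern involved. The function names are hypothetical, and
which instructions a given compiler emits for each form is an assumption;
the point is only that the two forms are equivalent, so a single
flag-setting AND can stand in for a zero-extend followed by a compare.

```c
#include <assert.h>
#include <stdint.h>

/* Zero-extend the low byte, then compare against zero.
 * A naive translation is uxtb followed by cmp #0. */
int low_byte_is_zero_extend(uint32_t x)
{
    uint8_t b = (uint8_t)x;
    return b == 0;
}

/* Mask and test in one step. This is the form that maps onto a single
 * ands #255, which sets the flags as a side effect. */
int low_byte_is_zero_mask(uint32_t x)
{
    return (x & 255u) == 0;
}

/* Check that both forms agree for every value of the low 16 bits. */
void check_equivalence(void)
{
    for (uint32_t x = 0; x < 65536u; x++)
        assert(low_byte_is_zero_extend(x) == low_byte_is_zero_mask(x));
}
```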
== This week ==
* Push forward on above issues.
Committed Dan's RVCT interoperation patch, both upstream and to Linaro
GCC 4.6.
Adjusted Bernd's "Discourage NEON on Cortex-A8" patch following Richard
Earnshaw's comments, and reposted upstream. The new version was
approved, and committed. I've also submitted a merge proposal to Linaro
GCC 4.6.
Dropped Tom's patch for marking small strings read-only. This
optimization seems to have no visible effect for ARM in GCC 4.6. I'll
leave it to Tom to forward-port, if it's still meaningful for MIPS.
Julian has committed the patch for lp:675347, so I've submitted merge
requests to both Linaro GCC 4.5 and 4.6.
Bernd has posted the shrink wrapping patches upstream. I've posted this
info in all the relevant Linaro tracking tickets.
Talked Revital Eres through the Bazaar/Launchpad merge request system.
Tried to understand why GCC 4.6 does not use multiply-and-accumulate
efficiently, when used with 64-bit values. It seems that the compiler
sometimes uses (subreg:SI (reg:DI ...)) and sometimes just uses a plain
(reg:SI ..) and those don't combine to give useful patterns, but I
haven't got to the bottom of it yet.
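As a concrete reference for the kind of code in question, here is a minimal
C sketch of a 64-bit multiply-and-accumulate. The function name is
hypothetical, and this is an assumed reduced example rather than the actual
testcase under investigation. Ideally each loop iteration would map to a
single SMLAL (signed 32x32 multiply accumulated into a 64-bit register
pair).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of two int32_t arrays with a 64-bit accumulator. */
int64_t dot_s32(const int32_t *a, const int32_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* Widen one operand first so the multiply is done in 64 bits. */
        acc += (int64_t)a[i] * b[i];
    }
    return acc;
}
```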
Tested an FSF GCC 4.6 snapshot from the 23rd. All well, so I've merged
it to the Linaro GCC 4.6 branch.
* Future Absence
Away Monday 28th to Friday 1st April.
----
Upstream patches requiring review:
* Thumb2 constants:
http://gcc.gnu.org/ml/gcc-patches/2010-12/msg00652.html
* ARM EABI half-precision functions
http://gcc.gnu.org/ml/gcc-patches/2011-02/msg00874.html
* ARM Thumb2 Spill Likely tweak
http://gcc.gnu.org/ml/gcc-patches/2011-02/msg00880.html
* NEON scheduling patch
http://gcc.gnu.org/ml/gcc-patches/2011-02/msg01431.html
Hi,
== libunwind ==
* modified the extbl-parser to operate on the DWARF model directly
* this adds support for unwinding call stacks with mixed (DWARF and extbl)
frames on ARM
* did a few other fixes and cleanups
* posted the patches on the libunwind ml
* set up a tree on git.linaro.org
* attended a class on Friday
Regards
Ken
== GDB ==
* Completed glibc patch to add ARM unwind tables to system call stubs
(bug #684218), patch committed upstream and backported to Ubuntu glibc.
* Posted kernel patch to fix GDB inferior calls while stopped in a
restartable system call (bug #615974); waiting for review.
* Ongoing work to fix single-stepping over signal handlers (bug #615978).
* Implemented patch to fix single-stepping across bad ARM/Thumb boundary
(bug #667309); posted to mailing list for comments.
* Contributed two fixes for valgrind on ARM (to enable running GDB under
valgrind); both now accepted into mainline.
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter | Geschäftsführung: Dirk
Wittkopp
Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht
Stuttgart, HRB 243294
== This week ==
* Moved the discussion about the RTL and gimple representation of
strided loads/stores to the gcc@ list. Got some good feedback:
http://gcc.gnu.org/ml/gcc/2011-03/msg00322.html
* Started a subdiscussion about the handling of modes:
http://gcc.gnu.org/ml/gcc/2011-03/msg00342.html
This is a tricky one. I'll add more fuel to the fire next week.
* Committed two GCC patches to clean up the expand interface.
Dealt with the fallout (some expected, but unfortunately some not).
* Submitted two of the patches to improve code generation for
strided load/store intrinsics:
http://gcc.gnu.org/ml/gcc-patches/2011-03/msg01631.html
http://gcc.gnu.org/ml/gcc-patches/2011-03/msg01634.html
* Spent a lot of the week reworking the way the load/store intrinsics
are handled, to fix both correctness and performance bugs. The new
rtl patterns should have the right form for the vectoriser.
Made what feels like good progress, but it's not complete yet.
* Sent separate R_ARM_IRELATIVE patch to glibc, after feedback from
glibc-ports.
* Booked flight and hotel for Budapest summit.
* Pinged unreviewed patches.
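For context on the strided load/store items above, here is a minimal C
sketch of a stride-2 store loop. The function name and element type are
assumptions for illustration, and this is not taken from the patches. On
NEON, a vectoriser could in principle cover several iterations with one
interleaving store such as vst2.16, provided the compiler's internal
representation can express that access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interleave two planar arrays: out = l0 r0 l1 r1 ... */
void interleave2(int16_t *out, const int16_t *left,
                 const int16_t *right, size_t n)
{
    assert(out != NULL && left != NULL && right != NULL);
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = left[i];    /* stride-2 store, first lane  */
        out[2 * i + 1] = right[i];   /* stride-2 store, second lane */
    }
}
```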
== Next week ==
* More intrinsics improvements. I think these are necessary to get good
code out of the vectoriser too.
Richard
== String routines ==
* Wrote a Thumb-optimised strchr
- As expected it's got nice performance for longer runs but at
sizes <16 bytes it's slower, and a lot of the strchr
calls are very short, so it's probably not of benefit in most cases
( https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrchr?ac…
)
* Wrote a neon-memcpy
- As previously found with memset, it performs well on A8 but
poorly on A9 - it does however do the case where
the source/destination isn't aligned quite well even on A9; the vld1
unaligned case works with relatively little penalty.
(it performs comparably to the Bionic implementation - mine is a
bit faster on shorter calls, Bionic is better
on longer uses - I think that's because they've got some careful use
of preloads where I have so far got none).
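To make the strchr trade-off above easier to picture, here is a C sketch of
one common way to speed up strchr on longer strings: scanning a word at a
time. This is an assumption about the general technique and is not Dave's
Thumb implementation; the function names are hypothetical. The alignment
prologue and the final byte-wise pass are fixed costs of this approach,
and fixed costs weigh most heavily on very short strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Non-zero if and only if some byte of w is zero (classic bit trick). */
static uint32_t has_zero_byte(uint32_t w)
{
    return (w - 0x01010101u) & ~w & 0x80808080u;
}

/* Word-at-a-time strchr sketch. Like many optimised string routines it
 * may read up to 3 bytes past the terminator, but only within the same
 * aligned word; strict ISO C does not bless that over-read. */
char *strchr_wordwise(const char *s, int c)
{
    assert(s != NULL);
    const unsigned char ch = (unsigned char)c;
    const uint32_t pattern = 0x01010101u * ch;   /* ch in every byte */

    /* Prologue: byte steps until s is 4-byte aligned. */
    while (((uintptr_t)s & 3u) != 0) {
        if ((unsigned char)*s == ch) return (char *)s;
        if (*s == '\0') return NULL;
        s++;
    }

    /* Main loop: stop at the first word holding a NUL or a match. */
    for (;;) {
        uint32_t w;
        memcpy(&w, s, sizeof w);
        if (has_zero_byte(w) || has_zero_byte(w ^ pattern)) break;
        s += 4;
    }

    /* Epilogue: locate the exact byte inside that word. */
    for (;; s++) {
        if ((unsigned char)*s == ch) return (char *)s;
        if (*s == '\0') return NULL;
    }
}
```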
I'm on holiday up to and including 5th April.
Dave
== GCC ==
Progress:
* Investigated excessive VFP moves. Investigating ways forward.
* Went through some of the upstream test results for 4.6 RC2.
* Set up SPEC2k6 cross on my Linaro machine.
* Waiting for my new Panda board sometime next week.
* Some small bug fixes upstream. Need to rework a couple of
documentation patches after review.
Plans:
* Continue looking at excessive VFP moves.
* Continue to look at some patches upstream.
* Finish working through Thumb2 speed tickets.
* Set up new Panda board.
* Start looking at DENBench results and identify
potential speed up areas.
Meetings:
* 1-1s
* Linaro toolchain meeting
Absences:
* March 30th (maybe): WC Cricket Semi-final. (Ind v Pak)
* April 15 – 26 -> Booked Holiday.
* May 9-14 - LDS Budapest