== Last week ==
* Committed STT_GNU_IFUNC changes to binutils.
* Submitted the STT_GNU_IFUNC changes to GLIBC ports. Got feedback
on Friday, which I'll deal with this week.
* Worked on the expand and rtl-level parts of the load/store lane
representation, with new optabs for each operation. This seems
to be working pretty well, but I still need to make some changes
to the way the existing intrinsics work.
* Wrote a patch to clean up the way we handle optabs during expand,
so that the new optabs mentioned above will need a bit less
cut-&-paste. Submitted upstream. Got some positive feedback.
* Committed testcase for PR rtl-optimization/47166 upstream.
== This week ==
* Deal with GLIBC feedback.
* More load/store lanes.
Richard
* Linaro GCC
Tested and merged both the latest Linaro merge requests, and various bug
fixes to the Shrink Wrap optimization from CS, into Linaro GCC 4.5.
Merged and tested from FSF GCC 4.6.
Richard and Ramana have approved some of my upstream patches! I just
need to wait for stage one so I can commit them upstream. I'll commit
them internally when I get time to do the final integration test.
Continued benchmarking GCC 4.6 with the patches merged from GCC 4.5.
Decided to discard a couple of extra patches since they don't appear to
be of any value.
* Other
On leave Wednesday to Friday playing daddy. :)
* Future Absence
Away Monday 28th to Friday 1st April.
----
Upstream patches requiring review:
* Thumb2 constants:
http://gcc.gnu.org/ml/gcc-patches/2010-12/msg00652.html
* ARM EABI half-precision functions
http://gcc.gnu.org/ml/gcc-patches/2011-02/msg00874.html
* ARM Thumb2 Spill Likely tweak
http://gcc.gnu.org/ml/gcc-patches/2011-02/msg00880.html
* NEON scheduling patch
http://gcc.gnu.org/ml/gcc-patches/2011-02/msg01431.html
Hey
I'm trying to extend the *link: specs to pass a different
-dynamic-linker depending on the float ABI. But I didn't manage to
build a construct which would preserve the order of the flags; if I do
something like:
%{msoft-float:-dynamic-linker V1} %{mfloat-abi=softfp:-dynamic-linker V2}
Then I get V2 for "-mfloat-abi=softfp -msoft-float" instead of V1.
In gcc/gcc.c I found some docs on spec file syntax; I see one can use
%{S*&T*} and %{S*:X}, but apparently %{S*&T*:X} isn't allowed, so I
can't manipulate the value. I tried to use
%{msoft-float*:-dynamic-linker V1} %{mfloat-abi=softfp*:-dynamic-linker V2}
but that gives the same effect (the msoft-float flags are
grouped together in the original order and put first, then the
mfloat-abi=softfp are grouped together in the original order and put
second).
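
To make the ordering problem concrete, here is a minimal sketch in Python rather than spec syntax. It is only a model, not how gcc.c is implemented: the function names are made up, V1/V2 are the placeholders from above, and it assumes the linker honours the last -dynamic-linker it is given. It contrasts output that follows the order of the fragments in the spec with the behaviour wanted here, where the last float option on the command line decides.

```python
# Illustrative model only; names are hypothetical and this is not gcc.c logic.
SPEC_FRAGMENTS = [
    ("-msoft-float", "V1"),
    ("-mfloat-abi=softfp", "V2"),
]


def spec_order(argv):
    """Model the observed behaviour: fragments are emitted in spec order."""
    out = []
    for flag, linker in SPEC_FRAGMENTS:
        if flag in argv:
            out += ["-dynamic-linker", linker]
    return out


def command_line_order(argv):
    """Desired behaviour: the last matching flag on the command line wins."""
    table = dict(SPEC_FRAGMENTS)
    chosen = None
    for arg in argv:
        if arg in table:
            chosen = table[arg]
    return ["-dynamic-linker", chosen] if chosen else []


argv = ["-mfloat-abi=softfp", "-msoft-float"]
print(spec_order(argv))          # both linkers emitted, V2 last
print(command_line_order(argv))  # only V1, because -msoft-float came last
```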
I didn't manage to get %{msoft-float*:%<msoft-float -dynamic-linker V1}
to work; in fact I didn't get suppressions to work at all.
Any idea?
Thanks!
PS: float-abi=softfp/soft-float are just convenient examples; the
actual target is to use a different -dynamic-linker for hard vs soft
float-abi.
--
Loïc Minier
I went to the first QEMU Users Forum in Grenoble last week;
these are my impressions and a summary of what happened. Sorry if
it's a bit TLDR...
== Summary and general observations ==
This was a day-long set of talks tacked onto the end of the DATE
conference. There were about 40 attendees; the focus of the talks was
mostly industrial and academic research QEMU users/hackers (a set of
people who use and modify QEMU but who aren't very well represented on
the qemu-devel list).
A lot of the talks related to SystemC; at the moment people are
rolling their own SystemC<->QEMU bridges. In addition to the usual
problems when you try to put two simulation engines together (each of
which thinks it should be in control of the world) QEMU doesn't make
this easy because it is not very modular and makes the assumption that
only one QEMU exists in a process (lots of global variables, no
locking, etc).
There was a general perception from attendees that the QEMU "development
community" is biased towards KVM rather than TCG. I tend to agree with
this, but think this is simply because (a) that's where the bulk of
the contributors are and (b) people doing TCG related work don't
always appear on the mailing list. (The "quick throwaway prototype"
approach often used for research doesn't really mesh well with
upstream's desire for solid long-term maintainable code, I guess.)
QEMU could certainly be made more convenient for this group of users:
greater modularisation and provision of "just the instruction set
simulator" as a pluggable library, for instance. Also the work by
STMicroelectronics on tracing/instrumentation plugins looks like
it should be useful to reduce the need to hack extra instrumentation
directly into QEMU's frontends.
People generally seemed to think the forum was useful, but it hasn't
been decided yet whether to repeat it next year, or perhaps to have
some sort of joint event with the open-source qemu community.
More detailed notes on each of the talks are below;
the proceedings/slides should also appear at http://adt.cs.upb.de/quf
within a few weeks. Of particular Linaro/ARM interest are:
* the STMicroelectronics plugin framework so your DLL can get
callbacks on interesting events and/or insert tracing or
instrumentation into generated code
* Nokia's work on getting useful timing/power type estimates out of
QEMU by measuring key events (insn exec, cache miss, TLB miss, etc)
and calibrating against real hardware to see how to weight these
* a talk on parallelising QEMU, ie "multicore on multicore"
* speeding up Neon by adding SIMD IR ops and translating to SSE
The forum started with a brief introduction by the organiser, followed
by an informal Q&A session with Nathan Froyd from CodeSourcery
(...since his laptop with his presentation slides had died on the
journey over from the US...)
== Talk 1: QEMU and SystemC ==
M. Monton from GreenSocs presented a couple of approaches to using
QEMU with SystemC. "QEMU-SC" is for systems which are mostly QEMU
based with one or two SystemC devices -- QEMU is the master. Speed
penalty is 8-14% over implementing the device natively. "QBox" makes
the SystemC simulation the master, and QEMU is implemented as a TLM2
Initiator; this works for systems which are almost all SystemC and
which you just want to add a QEMU core to. Speed penalty 100% (!)
although they suspect this is an artifact of the current
implementation and could be reduced to more like 25-30%. They'd like
to see a unified effort to do SystemC and QEMU integration (you'll
note that there are several talks here where the presenters had rolled
their own integration). Source available from www.greensocs.com.
== Talk 2: Combined Use of Dynamic Binary Translation and
SystemC for Fast and Accurate MPSoC Simulation ==
Description of a system where QEMU is used as the core model in a
SystemC simulation of a multiprocessor ARM system. The SystemC side
includes models of caches, write buffers and so on; this looked like
quite a low level detailed (high overhead) simulation. They simulate
multiple clusters of multiple cores, which is tricky with QEMU because
it has a design assumption of only one QEMU per process address space
(lots of global variables, no locking, etc); they handle this by
saving and restoring globals at SystemC synchronisation points, which
sounded rather hacky to me. They get timing information out of their
model by annotating the TCG intermediate representation ops with new
ops indicating number of cycles used, whether to check for
Icache/Dcache hit/miss, and so on. Clearly they've put a lot of work
into this. They'd like a standalone, reentrant ISS, basically so it's
easier to plug into other frameworks like SystemC.
== Talk 3: QEMU/SystemC Cosimulation at Different Abstraction Levels ==
This talk was about modelling an RTOS in SystemC; I have to say I
didn't really understand the motivation for doing this. Rather than
running an RTOS under emulation, they have a SystemC component which
provides the scheduler/mutex type APIs an RTOS would, and then model
RTOS tasks as other SystemC components. Some of these SystemC
components embed user-mode QEMU, so you can have a combination of
native and target-binary RTOS tasks. They're estimating time usage by
annotating QEMU translation blocks (but not doing any accounting for
cache effects).
== Talk 4: Timing Aspects in QEMU/SystemC Synchronisation ==
Slightly academic-feeling talk about how to handle the problem of
trying to run several separate simulations in parallel and keep their
timing in sync. (In particular, QEMU and a SystemC world.) If you just
alternate running each simulation there is no problem but it's not
making best use of the host CPU. If you run them in parallel you can
have the problem that sim A wants to send an event to sim B at time T,
but sim B has already run past time T. He described a couple of
possible approaches, but they were all "if you do this you might still
hit the problem but there's a tunable parameter to reduce the
probability of something going wrong"; also they only actually
implemented the simplest one. In some sense this is really all
workarounds for the fact that SystemC is being retrofitted/bolted
onto the outside of a QEMU simulation.
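
The trade-off behind that tunable parameter can be sketched as follows. This models one common scheme, where the simulators only exchange events at fixed quantum boundaries; it is not necessarily any of the approaches the speaker described, and the event times and function names are invented for illustration. A smaller quantum bounds how late an event can arrive but costs more synchronisation points.

```python
# Hypothetical model: two simulators exchange events only at quantum boundaries.
# Times are integer ticks; all values are made up for illustration.

def delivery_time(event_time, quantum):
    """Next quantum boundary at or after event_time (integer ceiling)."""
    return -(-event_time // quantum) * quantum


def worst_lateness(event_times, quantum):
    """Largest gap between when an event was meant to fire and when it is seen."""
    return max(delivery_time(t, quantum) - t for t in event_times)


def sync_points(end_time, quantum):
    """Number of barriers needed to simulate up to end_time."""
    return -(-end_time // quantum)


events = [3, 17, 42, 99]
for quantum in (1, 10, 50):
    print(quantum, worst_lateness(events, quantum), sync_points(100, quantum))
```

In this model the lateness is always smaller than the quantum, so the quantum is exactly the kind of knob that trades accuracy against how much the two simulations can run in parallel.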
== Talk 5: Program Instrumentation with QEMU ==
Presentation by STMicroelectronics, about work they'd done adding
instrumentation to QEMU so you can use it for execution trace
generation, performance analysis, and profiling-driven optimisation
when compiling. It's basically a plugin architecture so you can
register hooks to be called at various interesting points (eg every
time a TB is executed); there are also translation time hooks so
plugins can insert extra code into the IR stream. Because it works at
the IR level it's CPU-agnostic. They've used this to do real work
like optimising/debugging of the Adobe Flash JIT for ARM. They're
hoping to be able to submit this upstream.
I liked this; I think it's a reasonably maintainable approach, and it
ought to alleviate the need for hacking extra ops directly into QEMU
for instrumentation (which is the approach you see in some of the
other presentations). In particular it ought to work well with the
Nokia work described in the next talk...
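
As a rough picture of the two kinds of hook, here is a toy model in Python. None of these names come from the STMicroelectronics code or from QEMU: the registry, the event names "tb_translate" and "tb_exec", and the tuple-based "IR" are all assumptions made for illustration.

```python
from collections import defaultdict


class PluginRegistry:
    """Hypothetical registry mapping event names to plugin callbacks."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, event, callback):
        self._hooks[event].append(callback)

    def fire(self, event, *args):
        return [cb(*args) for cb in self._hooks[event]]


class ToyTranslator:
    """Stand-in for a translator: turns guest insns into a list of IR ops."""

    def __init__(self, registry):
        self.registry = registry

    def translate(self, pc, guest_insns):
        ir = [("op", insn) for insn in guest_insns]
        # Translation-time hook: each plugin may return extra IR ops,
        # which are placed at the start of the block.
        for extra in self.registry.fire("tb_translate", pc, guest_insns):
            ir = list(extra) + ir
        return ir

    def execute(self, pc, ir):
        # Execution-time hook: called every time the block runs.
        self.registry.fire("tb_exec", pc)
        return len(ir)


registry = PluginRegistry()
exec_counts = defaultdict(int)


def count_exec(pc):
    exec_counts[pc] += 1


registry.register("tb_exec", count_exec)
registry.register("tb_translate", lambda pc, insns: [("trace", pc)])

translator = ToyTranslator(registry)
block = translator.translate(0x1000, ["add", "ldr"])
translator.execute(0x1000, block)
translator.execute(0x1000, block)
print(block, dict(exec_counts))
```

Because the plugin only sees generic IR ops and block addresses, nothing in it depends on the guest CPU, which is the property that makes the approach CPU-agnostic.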
== Talk 6: Using QEMU in Timing Estimation for Mobile Software
Development ==
Work by Nokia's research division and Aalto university. This was
about getting useful timing estimates out of a QEMU model by adding
some instrumentation (instructions executed, cache misses, etc) and
then calibrating against real hardware to identify what weightings to
apply to each of these (weightings differ for different cores/devices;
eg on A8 your estimates are very poor if you don't account for L2
cache misses, but for some other cores TLB misses are more important
and adding L2 cache miss instrumentation gives only a small
improvement in accuracy.) The cache model is not a proper functional
cache model, it's just enough to be able to give cache hit/miss stats.
They reckon that three or four key statistics (cache miss, TLB miss, a
basic classification of insns into slow or fast) give estimated
execution times with about 10% level of inaccuracy; the claim was that
this is "feasible for practical usage". Git tree available.
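
The calibration step can be pictured as a least-squares fit of per-event weights against measured run times. The sketch below is an assumption about how such a fit might look, not the Nokia/Aalto code: the four event classes follow the description above, but every number is synthetic (the "measured" times are generated from made-up weights so the fit can be checked).

```python
# Synthetic illustration only: all counts and weights below are invented.

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_weights(counts, measured):
    """Least squares via the normal equations (A^T A) w = A^T t."""
    k = len(counts[0])
    ata = [[sum(row[i] * row[j] for row in counts) for j in range(k)]
           for i in range(k)]
    atb = [sum(row[i] * t for row, t in zip(counts, measured))
           for i in range(k)]
    return solve(ata, atb)


def estimate(counts_row, weights):
    """Estimated cycles for one run: weighted sum of its event counts."""
    return sum(c * w for c, w in zip(counts_row, weights))


# Columns: fast insns, slow insns, cache misses, TLB misses (per run).
RUNS = [
    [1000, 200, 10, 2],
    [5000, 100, 80, 5],
    [2000, 900, 5, 40],
    [8000, 50, 200, 1],
    [300, 700, 60, 20],
]
TRUE_WEIGHTS = [1.0, 3.0, 50.0, 30.0]  # made-up "hardware" costs in cycles
MEASURED = [estimate(run, TRUE_WEIGHTS) for run in RUNS]

weights = fit_weights(RUNS, MEASURED)
print([round(w, 3) for w in weights])
```

With real measurements the fit would not be exact, and the residual would indicate whether another statistic (for example L2 misses on a given core) is worth instrumenting.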
This would be useful in conjunction with the STMicroelectronics
instrumentation plugin work; alternatively it might be interesting
to do this as a Valgrind plugin, since Valgrind has much more
mature support for arbitrary plugins. (Of course as a Valgrind
plugin you'd be restricted to running on an ARM host, and you're
only measuring one process, not whole-system effects.)
== Talk 7: QEMU in Digital Preservation Strategies ==
A less technical talk from a researcher who's working on the problems
of how museums should deal with preserving and conserving "digital
artifacts" (OSes, applications, games). There are a lot of reasons
why "just run natively" becomes infeasible: media decay, the connector
conspiracy, old and dying hardware, APIs and environments becoming
unsupported, proprietary file formats and on and on. If you emulate
hardware (with QEMU) then you only have to deal with emulating a few
(tens of) hardware platforms, rather than hundreds of operating
systems or thousands of file formats, so it's the most practical
approach. They're working on web interfaces for non-technical users.
Most interesting for the QEMU dev community is that they're
effectively building up a large set of regression tests (ie images of
old OSes and applications) which they are going to be able to run
automatic testing on.
== Talk 8: MARSS-x86: QEMU-based Micro-Architectural and Systems
Simulator for x86 Multicore Processors ==
This is about using QEMU for microarchitectural level modelling
(branch predictor, load/store unit, etc); their target audience is
academic researchers. There's an existing x86 pipeline level simulator
(PTLsim) but it has problems: it uses Xen for its system simulation so
it's hard to get installed (need a custom kernel on the host!), and it
doesn't cope with multicore. So they've basically taken PTLsim's
pipeline model and ported it into the QEMU system emulation
environment. When enabled it replaces the TCG dynamic translation
implementation; since the core state is stored in the same structures
it is possible to "fast forward" a simulation running under TCG and
then switch to "full microarchitecture simulation" for the interesting
parts of a benchmark. They get 200-400KIPS.
== Talk 9: Showing and Debugging Haiku with QEMU ==
Haiku is an x86 OS inspired by BeOS. The speaker talked about how they
use QEMU for demos and also for kernel and bootloader debugging.
== Talk 10: PQEMU : A parallel system emulator based on QEMU ==
This was a group from a Taiwan university who were essentially
claiming to have solved the "multicore on multicore" problem, so you
can run a simulated MPx4 ARM core on a quad-core x86 box and have it
actually use all the cores. They had some benchmarking graphs which
indicated that you do indeed get ~3.x times speedup over emulated
single-core, ie your scaling gain isn't swamped in locking overhead.
However, the presentation concentrated on the locking required for
code generation (which is in my opinion the easy part) and I wasn't really
convinced that they'd actually solved all the hard problems in getting
the whole system to be multithreaded. ("It only crashes once every
hundred runs...") Also their work is based on QEMU 0.12, which is now
quite old. We should definitely have a look at the source which they
hope to make available in a few months.
== Talk 11: PRoot: A Step Forward for QEMU User-Mode ==
STMicroelectronics again, presenting an alternative to the usual
"chroot plus binfmt_misc" approach for running target binaries
seamlessly under qemu's linux-user mode. It's a wrapper around qemu
which uses ptrace to intercept the syscalls qemu makes to the host; in
particular it can add the target-directory prefix to all filesystem
access syscalls, and can turn an attempt to exec "/bin/ls" into an
exec of "qemu-linux-arm /bin/ls". The advantage over chroot is that
it's more flexible and doesn't need root access to set up. They didn't
give figures for how much overhead the syscall interception adds,
though.
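
The two rewrites described above can be sketched in a few lines of Python. This is only a model of the idea, not PRoot's implementation: the function names and the /opt/armroot directory are invented, only absolute paths are handled, and symlinks are ignored.

```python
import posixpath


def translate_path(path, rootfs):
    """Map an absolute guest path to a host path under rootfs."""
    # Normalise first so that '..' components cannot climb out of rootfs.
    norm = posixpath.normpath(posixpath.join("/", path))
    return posixpath.join(rootfs, norm.lstrip("/"))


def translate_exec(argv, emulator="qemu-linux-arm"):
    """Turn an exec of a target binary into an exec of the emulator on it."""
    return [emulator] + list(argv)


print(translate_path("/bin/ls", "/opt/armroot"))
print(translate_exec(["/bin/ls", "-l"]))
```

In the real tool these rewrites would be applied to syscall arguments intercepted via ptrace, which is why no root access or chroot is needed.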
== Talk 12: QEMU TCG Enhancements for Speeding up Emulation of SIMD ==
Simple idea -- make emulation of Neon instructions faster by adding
some new SIMD IR ops and then implementing them with SSE instructions
in the x86 backend. Some basic benchmarking shows that they can be ten
times faster this way. Issues:
* what is the best set of "generic" SIMD ops to add to the QEMU IR?
* is making Neon faster the best use of resource for speeding up
QEMU overall, or should we be looking at parallelism or other
problems first?
* are there nasty edge cases (flags, corner case input values etc)
which would be a pain to handle?
Interesting, though, and I think it takes the right general approach
(ie not horrifically Neon specific). My feeling is that for this to go
upstream it would need uses in two different QEMU front ends (to
demonstrate that the ops are generic) and implementations in at least
the x86 backend, plus fallback code so backends need not implement the
ops; that's a fair bit of work beyond what they've currently
implemented.
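
To illustrate the "generic op plus fallback" shape, here is a toy model in Python. The op name vadd8 and the dispatch function are hypothetical and are not taken from the talk or from TCG; the wide version uses a standard packed-add bit trick as a stand-in for what a backend would do with a single SSE instruction.

```python
# Toy model: a 32-bit value holds four 8-bit lanes. Names are hypothetical.
LANES = 4
HIGH_BITS = 0x80808080
LOW_BITS = 0x7F7F7F7F


def vadd8_fallback(a, b):
    """Generic fallback: add lane by lane, wrapping each lane at 8 bits."""
    result = 0
    for lane in range(LANES):
        shift = 8 * lane
        lane_sum = ((a >> shift) + (b >> shift)) & 0xFF
        result |= lane_sum << shift
    return result


def vadd8_wide(a, b):
    """'Backend' version: all lanes in one add, with carries kept in-lane."""
    # Add the low 7 bits of every lane (cannot overflow into the next lane),
    # then fix up the top bit of each lane with XOR.
    return ((a & LOW_BITS) + (b & LOW_BITS)) ^ ((a ^ b) & HIGH_BITS)


def select_vadd8(backend_ops):
    """Use the backend's implementation if it has one, else the fallback."""
    return backend_ops.get("vadd8", vadd8_fallback)


a, b = 0xFF01807F, 0x01FF8001
print(hex(select_vadd8({})(a, b)))
print(hex(select_vadd8({"vadd8": vadd8_wide})(a, b)))
```

The point of the fallback is that a front end can emit the generic op unconditionally; backends that never implement it still produce correct results, just more slowly.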
== Talk 13: A SysML-based Framework with QEMU-SystemC Code Generation ==
This was the last talk, and the speaker ran through it very fast as we
were running out of time. They have a code generator for taking a UML
description of a device and turning it into SystemC (for VHDL) and C++
(for a QEMU device) and then cosimulating them for verification.
-- PMM
Hello list,
Recently, the Android team has been working on integrating the Linaro
toolchain for Android and the NDK. According to the initial benchmark
results[1], Linaro GCC is competitive with the Google toolchain. In the
meantime, we are trying to enable gcc-4.5 specific features such as
Graphite and LTO (Link Time Optimization) in order to make the best
choice for the Android build system and NDK usage. However, I have run
into a problem with LTO and would like to ask the toolchain WG for help.
Assuming Linaro Toolchain for Android is installed in directory
/tmp/android-toolchain-eabi, you can obtain Google's toolchain
benchmark suite by git:
# git clone git://android.git.kernel.org/toolchain/benchmark.git
You have to apply the attached patch to make the benchmark suite
work[2]. Then, change directory to skia:
# cd benchmark/skia
And build skia bench with LTO enabled:
# ../scripts/bench.py --action=build
--toolchain=/tmp/android-toolchain-eabi --add_cflags="-flto
-fuse-linker-plugin"
The build process is then interrupted by gcc:
make -j4 --warn-undefined-variables -f ../scripts/build/main.mk
TOOLCHAIN=/tmp/android-toolchain-eabi ADD_CFLAGS="-flto
-fuse-linker-plugin" build
CPP ARM obj/src/core/Sk64.o <= src/src/core/Sk64.cpp
CPP ARM obj/src/core/SkAlphaRuns.o <= src/src/core/SkAlphaRuns.cpp
CPP ARM obj/src/core/SkBitmap.o <= src/src/core/SkBitmap.cpp
CPP ARM obj/src/core/SkBitmapProcShader.o <= src/src/core/SkBitmapProcShader.cpp
CPP ARM obj/src/core/SkBitmapProcState.o <= src/src/core/SkBitmapProcState.cpp
CPP ARM obj/src/core/SkBitmapProcState_matrixProcs.o <=
src/src/core/SkBitmapProcState_matrixProcs.cpp
src/src/core/SkBitmapProcShader.cpp: In function
'SkShader::CreateBitmapShader(SkBitmap const&, SkShader::TileMode,
SkShader::TileMode, void*, unsigned int)':
src/src/core/SkBitmapProcShader.cpp:243:13: warning: 'color' may be
used uninitialized in this function
CPP ARM obj/src/core/SkBitmapSampler.o <= src/src/core/SkBitmapSampler.cpp
src/src/core/SkBitmapProcState_matrixProcs.cpp:530:1: sorry,
unimplemented: gimple bytecode streams do not support machine specific
builtin functions on this target
...
However, the other benchmark items, such as cximage, gcstone,
gnugo, mpeg4, webkit, and python, build successfully.
Can anyone give me some hints on how to resolve this LTO problem? Thanks in advance.
Sincerely,
-jserv
[1] https://wiki.linaro.org/Platform/Android/Toolchain#Reference%20Benchmark
We use the same toolchain benchmark suite that the Google compiler team used.
[2] https://wiki.linaro.org/Platform/Android/UpstreamToolchain
== Last week ==
* CoreMark ARMv6/v7 regressions: posted another combine patch upstream,
which was quickly approved and committed. The XOR simplification one is
now approved too, but needs a little more revising of comments before
committing.
* The above two patches now bring CoreMark under -march=armv7-a very
close to the performance of -march=armv5te. However, a regression where
uxtb+cmp cannot be combined into 'ands ... #255' still causes v7 to lose
slightly. This should be the final issue to solve...
* Launchpad #736007/GCC Bugzilla PR48183: NEON ICE in
emit-rtl.c:immed_double_const() under -g. Posted patch upstream, but
looks like more discussion is needed before we know if this is the
"right" way to do it.
* Launchpad #736661, armel FTBFS (G++ ICE in expand_expr_real_1()).
Looking at this.
* Pinged a few upstream patch submissions.
== This week ==
* Launchpad #723185/CS issue #9845 now assigned to me, start looking at
this.
* Get the XOR patch committed upstream, and solve the uxtb+cmp issue
described above.
* Work on other GCC issues.
Hi there. I have a custom report on top of the Launchpad tickets that
shows how old they are and if they need attention:
http://ex.seabright.co.nz/helpers/tickets/gcc-linaro?group_by=lint
I check this once a day to see how we're doing. It's useful when
deciding which bug to attack next.
-- Michael
== libunwind ==
* Had a few discussions with Uli about unwinding.
* Continued to learn about libunwind internals.
* The .ARM.exidx and .ARM.extab section parser is functional but the
integration into libunwind needs to be improved. Currently there are two
separate models that hold the information about the current frame. Since they
are not synchronized, the behavior of libunwind is quite unexpected to the
user.
* I started on eliminating the redundancy by removing the model that was
introduced for the extab support. My goal is to have the parser operate on the
DWARF model directly. In theory this should also allow mixing DWARF and
extab frames.
Regards
Ken
== GCC ==
* Started looking at performance regressions. Setting up builds with
EEMBC Denbench and other benchmarks.
* Looked at PR47719 in some detail this week.
* Set up environment on laptop. Fixed PR46788 in 4.6 branch and trunk.
* Discussions regarding armhf, how to maintain Linaro branches -
upstreaming patches etc.
* Looked at a case of performance improvements with VFP stores. I think
the cause is that we end up allowing PRE_INC and POST_DEC for
floating-point mode values, which results in more transfers to and
from the integer core registers.
* Off sick on Monday 14th March 2011.
== Misc ==
* Sorted out travel arrangements for LDS. Waiting for visa now.
== GDB ==
* Ongoing work on glibc patch to add ARM unwind tables to system
call stubs (bug #684218).
* Implemented initial version of a kernel patch that fixes GDB
inferior calls while stopped in a restartable system call
(bug #615974); started discussion with kernel folks.
* Implemented new version of patch to fix single-stepping over
signal handlers (bug #615978) that addresses review comments;
posted to mailing list.
* Verified Linaro GDB patch set can be applied to Ubuntu package.
Best Regards
Ulrich Weigand