I went to the first QEMU Users Forum in Grenoble last week; these are my impressions and a summary of what happened. Sorry if it's a bit TL;DR...
== Summary and general observations ==
This was a day-long set of talks tacked onto the end of the DATE conference. There were about 40 attendees; the talks were mostly aimed at industrial and academic research users/hackers of QEMU (a set of people who use and modify QEMU but who aren't very well represented on the qemu-devel list).
A lot of the talks related to SystemC; at the moment people are rolling their own SystemC<->QEMU bridges. In addition to the usual problems when you try to put two simulation engines together (each of which thinks it should be in control of the world), QEMU doesn't make this easy because it is not very modular and assumes that only one QEMU exists in a process (lots of global variables, no locking, etc).
There was a general perception among attendees that the QEMU "development community" is biased towards KVM rather than TCG. I tend to agree with this, but think this is simply because (a) that's where the bulk of the contributors are and (b) people doing TCG-related work don't always appear on the mailing list. (The "quick throwaway prototype" approach often used for research doesn't really mesh well with upstream's desire for solid long-term maintainable code, I guess.)
QEMU could certainly be made more convenient for this group of users: greater modularisation and provision of "just the instruction set simulator" as a pluggable library, for instance. Also the work by STMicroelectronics on tracing/instrumentation plugins looks like it should be useful to reduce the need to hack extra instrumentation directly into QEMU's frontends.
People generally seemed to think the forum was useful, but it hasn't been decided yet whether to repeat it next year, or perhaps to have some sort of joint event with the open-source QEMU community.
More detailed notes on each of the talks are below; the proceedings/slides should also appear at http://adt.cs.upb.de/quf within a few weeks. Of particular Linaro/ARM interest are:
 * the STMicroelectronics plugin framework, so your DLL can get callbacks on interesting events and/or insert tracing or instrumentation into generated code
 * Nokia's work on getting useful timing/power type estimates out of QEMU by measuring key events (insn exec, cache miss, TLB miss, etc) and calibrating against real hardware to see how to weight these
 * a talk on parallelising QEMU, ie "multicore on multicore"
 * speeding up Neon by adding SIMD IR ops and translating to SSE
The forum started with a brief introduction by the organiser, followed by an informal Q&A session with Nathan Froyd from CodeSourcery (since his laptop, with his presentation slides on it, had died on the journey over from the US).
== Talk 1: QEMU and SystemC ==
M. Monton from GreenSocs presented a couple of approaches to using QEMU with SystemC. "QEMU-SC" is for systems which are mostly QEMU-based with one or two SystemC devices -- QEMU is the master. The speed penalty is 8-14% over implementing the device natively. "QBox" makes the SystemC simulation the master, with QEMU implemented as a TLM2 initiator; this works for systems which are almost all SystemC and to which you just want to add a QEMU core. The speed penalty is 100% (!), although they suspect this is an artifact of the current implementation and could be reduced to more like 25-30%. They'd like to see a unified effort to do SystemC and QEMU integration (you'll note that there are several talks here where the presenters had rolled their own integration). Source available from www.greensocs.com.
== Talk 2: Combined Use of Dynamic Binary Translation and SystemC for Fast and Accurate MPSoC Simulation ==
Description of a system where QEMU is used as the core model in a SystemC simulation of a multiprocessor ARM system. The SystemC side includes models of caches, write buffers and so on; this looked like quite a low-level, detailed (high-overhead) simulation. They simulate multiple clusters of multiple cores, which is tricky with QEMU because it has a design assumption of only one QEMU per process address space (lots of global variables, no locking, etc); they handle this by saving and restoring the globals at SystemC synchronisation points, which sounded rather hacky to me. They get timing information out of their model by annotating the TCG intermediate representation with new ops indicating the number of cycles used, whether to check for an Icache/Dcache hit/miss, and so on. Clearly they've put a lot of work into this. They'd like a standalone, reentrant ISS, basically so it's easier to plug into other frameworks like SystemC.
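To give a flavour of the annotation idea, here's a toy reconstruction in C (my own invention -- the op names and costs bear no relation to their actual code): during translation you interleave extra "cost" ops into the IR stream, and the execution side charges a cycle counter as they go past.

#include <stdio.h>

enum op { OP_MOV, OP_ADD, OP_LOAD, OP_COST };

struct ir_op { enum op kind; int arg; };

static unsigned long long total_cycles;

/* invented cost table: loads are "slow", ALU ops "fast" */
static int cost_of(enum op k) { return k == OP_LOAD ? 3 : 1; }

/* "translation": copy the guest ops, inserting an OP_COST before each */
static int annotate(const struct ir_op *in, int n, struct ir_op *out)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        out[m++] = (struct ir_op){ OP_COST, cost_of(in[i].kind) };
        out[m++] = in[i];
    }
    return m;
}

/* "execution": OP_COST ops charge the global cycle counter */
static void run(const struct ir_op *ops, int n)
{
    for (int i = 0; i < n; i++) {
        if (ops[i].kind == OP_COST) {
            total_cycles += ops[i].arg;
        }
        /* the real work of the other ops is elided here */
    }
}

int main(void)
{
    struct ir_op tb[] = { { OP_LOAD, 0 }, { OP_ADD, 0 }, { OP_MOV, 0 } };
    struct ir_op annotated[8];
    int n = annotate(tb, 3, annotated);
    run(annotated, n);
    printf("estimated cycles: %llu\n", total_cycles); /* 3 + 1 + 1 = 5 */
    return 0;
}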
== Talk 3: QEMU/SystemC Cosimulation at Different Abstraction Levels ==
This talk was about modelling an RTOS in SystemC; I have to say I didn't really understand the motivation for doing this. Rather than running an RTOS under emulation, they have a SystemC component which provides the scheduler/mutex type APIs an RTOS would, and they then model RTOS tasks as other SystemC components. Some of these SystemC components embed user-mode QEMU, so you can have a combination of native and target-binary RTOS tasks. They're estimating time usage by annotating QEMU translation blocks (but not doing any accounting for cache effects).
== Talk 4: Timing Aspects in QEMU/SystemC Synchronisation ==
Slightly academic-feeling talk about how to handle the problem of running several separate simulations in parallel and keeping their timing in sync (in particular, QEMU and a SystemC world). If you just alternate running each simulation there is no problem, but you're not making the best use of the host CPU. If you run them in parallel you can have the problem that sim A wants to send an event to sim B at time T, but sim B has already run past time T. He described a couple of possible approaches, but they were all of the form "if you do this you might still hit the problem, but there's a tunable parameter to reduce the probability of something going wrong"; also they only actually implemented the simplest one. In some sense these are all workarounds for the fact that SystemC is being retrofitted/bolted onto the outside of a QEMU simulation.
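For what it's worth, schemes of this kind typically boil down to a time quantum: neither simulation may cross a quantum boundary until the other has caught up, which bounds how stale a cross-sim event can be. Here's a toy sequential sketch in C of just that invariant (everything invented; a real implementation would run the two sides in separate threads):

#include <stdio.h>

#define QUANTUM 100 /* the tunable: bigger = faster but staler events */

struct sim { const char *name; long now; };

/* stand-in for "run this simulator until it reaches the deadline" */
static void run_until(struct sim *s, long deadline)
{
    while (s->now < deadline) {
        s->now += 7; /* pretend to execute some work */
    }
    printf("%s reached t=%ld\n", s->name, s->now);
}

int main(void)
{
    struct sim qemu = { "qemu", 0 };
    struct sim sysc = { "systemc", 0 };
    for (long t = QUANTUM; t <= 300; t += QUANTUM) {
        /* neither sim may cross the quantum boundary before the other
           has caught up, so a cross-sim event is at most QUANTUM late */
        run_until(&qemu, t);
        run_until(&sysc, t);
    }
    return 0;
}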
== Talk 5: Program Instrumentation with QEMU ==
Presentation by STMicroelectronics, about work they'd done adding instrumentation to QEMU so you can use it for execution trace generation, performance analysis, and profiling-driven optimisation when compiling. It's basically a plugin architecture so you can register hooks to be called at various interesting points (eg every time a TB is executed); there are also translation time hooks so plugins can insert extra code into the IR stream. Because it works at the IR level it's CPU-agnostic. They've used this to do real work like optimising/debugging of the Adobe Flash JIT for ARM. They're hoping to be able to submit this upstream.
I liked this; I think it's a reasonably maintainable approach, and it ought to alleviate the need for hacking extra ops directly into QEMU for instrumentation (which is the approach you see in some of the other presentations). In particular it ought to work well with the Nokia work described in the next talk...
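To make the shape of such an API concrete, here's an invented miniature in C (none of these names are ST's actual interface): the core exposes a registration function, and plugins get called back whenever a TB executes.

#include <stdio.h>
#include <stdint.h>

#define MAX_HOOKS 8

typedef void (*tb_exec_hook)(uint64_t pc, void *opaque);

static struct { tb_exec_hook fn; void *opaque; } hooks[MAX_HOOKS];
static int nhooks;

/* a plugin calls this at load time to ask for callbacks */
static void register_tb_exec_hook(tb_exec_hook fn, void *opaque)
{
    if (nhooks < MAX_HOOKS) {
        hooks[nhooks].fn = fn;
        hooks[nhooks].opaque = opaque;
        nhooks++;
    }
}

/* the core calls this every time a translation block runs */
static void notify_tb_exec(uint64_t pc)
{
    for (int i = 0; i < nhooks; i++) {
        hooks[i].fn(pc, hooks[i].opaque);
    }
}

/* example "plugin": count executed TBs */
static void count_tb(uint64_t pc, void *opaque)
{
    (void)pc;
    (*(unsigned long *)opaque)++;
}

int main(void)
{
    unsigned long tbs = 0;
    register_tb_exec_hook(count_tb, &tbs);
    for (uint64_t pc = 0x8000; pc < 0x8030; pc += 0x10) {
        notify_tb_exec(pc); /* stand-in for the execution loop */
    }
    printf("TBs executed: %lu\n", tbs);
    return 0;
}

The translation-time hooks would look similar, except the callback gets to emit extra ops into the IR stream rather than just observe.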
== Talk 6: Using QEMU in Timing Estimation for Mobile Software Development ==
Work by Nokia's research division and Aalto University. This was about getting useful timing estimates out of a QEMU model by adding some instrumentation (instructions executed, cache misses, etc) and then calibrating against real hardware to identify what weightings to apply to each of these. (The weightings differ for different cores/devices; eg on A8 your estimates are very poor if you don't account for L2 cache misses, but for some other cores TLB misses are more important and adding L2 cache miss instrumentation gives only a small improvement in accuracy.) The cache model is not a proper functional cache model; it's just enough to be able to give cache hit/miss stats. They reckon that three or four key statistics (cache miss, TLB miss, a basic classification of insns into slow or fast) give estimated execution times accurate to within about 10%; the claim was that this is "feasible for practical usage". Git tree available.
This would be useful in conjunction with the STMicroelectronics instrumentation plugin work; alternatively it might be interesting to do this as a Valgrind plugin, since Valgrind has much more mature support for arbitrary plugins. (Of course as a Valgrind plugin you'd be restricted to running on an ARM host, and you're only measuring one process, not whole-system effects.)
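The estimation model itself, as I understood it, is just a weighted sum of event counts; a trivial sketch (the weights below are made up -- real ones would come out of the per-core calibration they described):

#include <stdio.h>

/* event counts gathered by the instrumentation */
struct counts {
    unsigned long fast_insns;
    unsigned long slow_insns;
    unsigned long dcache_miss;
    unsigned long tlb_miss;
};

/* hypothetical per-event weights in ns; calibration against real
   hardware would produce a different set for each core/device */
static const double W_FAST = 0.5, W_SLOW = 2.0, W_DMISS = 20.0, W_TLB = 60.0;

static double estimate_ns(const struct counts *c)
{
    return W_FAST * c->fast_insns + W_SLOW * c->slow_insns
         + W_DMISS * c->dcache_miss + W_TLB * c->tlb_miss;
}

int main(void)
{
    struct counts c = { 900000, 100000, 5000, 200 };
    printf("estimated runtime: %.0f ns\n", estimate_ns(&c));
    return 0;
}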
== Talk 7: QEMU in Digital Preservation Strategies ==
A less technical talk from a researcher who's working on the problems of how museums should deal with preserving and conserving "digital artifacts" (OSes, applications, games). There are a lot of reasons why "just run natively" becomes infeasible: media decay, the connector conspiracy, old and dying hardware, APIs and environments becoming unsupported, proprietary file formats and on and on. If you emulate hardware (with QEMU) then you only have to deal with emulating a few (tens of) hardware platforms, rather than hundreds of operating systems or thousands of file formats, so it's the most practical approach. They're working on web interfaces for non-technical users. Most interesting for the QEMU dev community is that they're effectively building up a large set of regression tests (ie images of old OSes and applications) which they are going to be able to run automatic testing on.
== Talk 8: MARSS-x86: QEMU-based Micro-Architectural and Systems Simulator for x86 Multicore Processors ==
This is about using QEMU for microarchitectural-level modelling (branch predictor, load/store unit, etc); their target audience is academic researchers. There's an existing x86 pipeline-level simulator (PTLsim) but it has problems: it uses Xen for its system simulation so it's hard to get installed (you need a custom kernel on the host!), and it doesn't cope with multicore. So they've basically taken PTLsim's pipeline model and ported it into the QEMU system emulation environment. When enabled it replaces the TCG dynamic translation implementation; since the core state is stored in the same structures, it is possible to "fast forward" a simulation running under TCG and then switch to "full microarchitecture simulation" for the interesting parts of a benchmark. They get 200-400 KIPS.
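The switch trick relies on both engines operating on a single architectural state, so swapping is just a change of execution function; a toy illustration (names invented, nothing to do with MARSS's actual code):

#include <stdio.h>

struct cpu { unsigned long pc; unsigned long icount; };

/* fast functional execution (TCG stands in here) */
static void step_fast(struct cpu *c) { c->pc += 4; c->icount++; }

/* detailed execution; a real version would model the whole pipeline */
static void step_detailed(struct cpu *c) { c->pc += 4; c->icount++; }

int main(void)
{
    struct cpu c = { 0x8000, 0 };
    void (*step)(struct cpu *) = step_fast;
    for (int i = 0; i < 2000000; i++) {
        if (c.icount == 1000000) { /* reached the interesting region */
            printf("switching to detailed mode at pc=0x%lx\n", c.pc);
            step = step_detailed; /* same state, different engine */
        }
        step(&c);
    }
    printf("executed %lu insns\n", c.icount);
    return 0;
}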
== Talk 9: Showing and Debugging Haiku with QEMU ==
Haiku is an x86 OS inspired by BeOS. The speaker talked about how they use QEMU for demos and also for kernel and bootloader debugging.
== Talk 10: PQEMU: A Parallel System Emulator Based on QEMU ==
This was a group from a Taiwanese university who were essentially claiming to have solved the "multicore on multicore" problem, so you can run a simulated MPx4 ARM core on a quad-core x86 box and have it actually use all the cores. They had some benchmarking graphs which indicated that you do indeed get a ~3x speedup over emulated single-core, ie your scaling gain isn't swamped in locking overhead. However, the presentation concentrated on the locking required for code generation (which is in my opinion the easy part) and I wasn't really convinced that they'd actually solved all the hard problems in getting the whole system to be multithreaded. ("It only crashes once every hundred runs...") Also their work is based on QEMU 0.12, which is now quite old. We should definitely have a look at the source, which they hope to make available in a few months.
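For context on why I call the codegen locking the easy part: it's essentially just serialising access to the shared translation cache, something like the toy below (all invented); the hard problems are the guest-visible shared state -- memory ordering, atomic guest operations, interrupts and device emulation.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t tb_lock = PTHREAD_MUTEX_INITIALIZER;
static int tb_count; /* stand-in for shared translation cache state */

static void *vcpu_thread(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        /* generating code touches the shared cache: take the lock */
        pthread_mutex_lock(&tb_lock);
        tb_count++;
        pthread_mutex_unlock(&tb_lock);
        /* running already-translated code needs no lock; the hard
           part is the guest-visible shared state, not this */
    }
    return arg;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) {
        pthread_create(&t[i], NULL, vcpu_thread, NULL);
    }
    for (int i = 0; i < 4; i++) {
        pthread_join(t[i], NULL);
    }
    printf("translated %d blocks\n", tb_count); /* always 400000 */
    return 0;
}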
== Talk 11: PRoot: A Step Forward for QEMU User-Mode ==
STMicroelectronics again, presenting an alternative to the usual "chroot plus binfmt_misc" approach for running target binaries seamlessly under QEMU's linux-user mode. It's a wrapper around QEMU which uses ptrace to intercept the syscalls QEMU makes to the host; in particular it can add the target-directory prefix to all filesystem access syscalls, and can turn an attempt to exec "/bin/ls" into an exec of "qemu-linux-arm /bin/ls". The advantage over chroot is that it's more flexible and doesn't need root access to set up. They didn't give figures for how much overhead the syscall interception adds, though.
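The underlying mechanism is plain ptrace syscall tracing; here's a minimal Linux/x86-64 demonstration of just the interception part (this is not PRoot -- PRoot additionally rewrites the intercepted syscall arguments, eg path strings, before letting the call proceed):

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execlp("true", "true", (char *)NULL); /* the traced program */
        return 1;
    }
    int status;
    int entering = 1;
    waitpid(child, &status, 0); /* child stops at exec */
    for (;;) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL); /* run to next stop */
        waitpid(child, &status, 0);
        if (WIFEXITED(status)) {
            break;
        }
        if (entering) { /* stops alternate between entry and exit */
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            printf("syscall %llu\n", (unsigned long long)regs.orig_rax);
        }
        entering = !entering;
    }
    return 0;
}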
== Talk 12: QEMU TCG Enhancements for Speeding up Emulation of SIMD ==
Simple idea -- make emulation of Neon instructions faster by adding some new SIMD IR ops and then implementing them with SSE instructions in the x86 backend. Some basic benchmarking shows that they can be ten times faster this way (see the sketch of the idea after this list). Issues:
 * what is the best set of "generic" SIMD ops to add to the QEMU IR?
 * is making Neon faster the best use of resource for speeding up QEMU overall, or should we be looking at parallelism or other problems first?
 * are there nasty edge cases (flags, corner-case input values, etc) which would be a pain to handle?
Interesting, though, and I think it takes the right general approach (ie not horrifically Neon-specific). My feeling is that for this to go upstream it would need uses in two different QEMU front ends (to demonstrate that the ops are generic) and an implementation in at least the x86 backend, plus fallback code so backends need not implement the ops; that's a fair bit of work beyond what they've currently implemented.
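The sketch: today a Neon vector add is emulated lane by lane, and the win comes from mapping a generic "vector add" IR op to a single host SSE instruction instead. This is just the concept, in C with intrinsics -- not their patches and not QEMU's IR (compile with -msse2 on x86):

#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>

/* what lane-by-lane emulation of a 4x32-bit vadd effectively does */
static void vadd_scalar(uint32_t *d, const uint32_t *a, const uint32_t *b)
{
    for (int i = 0; i < 4; i++) {
        d[i] = a[i] + b[i];
    }
}

/* what a generic "vector add" IR op could become on an SSE2 host */
static void vadd_sse(uint32_t *d, const uint32_t *a, const uint32_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)d, _mm_add_epi32(va, vb));
}

int main(void)
{
    uint32_t a[4] = { 1, 2, 3, 4 }, b[4] = { 10, 20, 30, 40 }, d[4];
    vadd_scalar(d, a, b);
    printf("scalar: %u %u %u %u\n", d[0], d[1], d[2], d[3]);
    vadd_sse(d, a, b);
    printf("sse:    %u %u %u %u\n", d[0], d[1], d[2], d[3]);
    return 0;
}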
== Talk 13: A SysML-based Framework with QEMU-SystemC Code Generation ==
This was the last talk, and the speaker ran through it very fast as we were running out of time. They have a code generator which takes a UML description of a device and turns it into both SystemC (for the VHDL side) and C++ (for a QEMU device model), and then cosimulates the two for verification.
-- PMM