I had a play with the vectoriser to see how peeling, unrolling, and
alignment affected the performance of simple memory bound loops.
The short story is:
* For fixed length loops, don't peel
* Performance is the same for 8 byte aligned arrays and up
* Performance is very similar for unaligned arrays
* vld1 is as fast as vldmia
* vld1 with specified alignment is much faster than vld1
The loop is the rather ugly and artificial::

    void op(struct aints * __restrict out, const struct aints * __restrict in)
    {
        for (int i = 0; i < COUNT; i++)
        {
            out->v[i] = (in->v[i] * 173) | in->v[i];
        }
    }

where `struct aints` is an aligned structure. I couldn't figure out how
to use an aligned typedef of ints without still introducing a runtime
check. I assume I was running into some type of runtime alias
checking.
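
For reference, one way `struct aints` might be declared is sketched below. The actual definition is not shown in this mail, so the value of COUNT and the 16-byte alignment are assumptions; the attribute syntax is the GCC one.

```c
#include <stddef.h>

/* Hypothetical element count; the mail does not give the real value. */
#define COUNT 256

/* Assumed definition: the struct, and therefore v[0], starts on a
   16-byte boundary. Changing the number changes the alignment the
   compiler is told about. */
struct aints {
    int v[COUNT];
} __attribute__((aligned(16)));

/* Compile-time check that the attribute took effect. */
_Static_assert(_Alignof(struct aints) == 16, "struct aints is 16-byte aligned");
```

Wrapping the array in a struct keeps the alignment attached to the type, so a pointer to it carries the promise into `op()`.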
This compiled into::

        vmov.i32 q10, #173
        add r3, r0, #5
    0:
        vldmia r1!, {d16-d17}
        vmul.i32 q9, q8, q10
        vorr q8, q9, q8
        vstmia r0!, {d16-d17}
        cmp r0, r3
        bne 0b

I then lied to the compiler by changing the actual alignment at
runtime. See:
http://people.linaro.org/~michaelh/incoming/runtime-offset.png
The performance didn't change for actual alignments of 8,
16, or 32 bytes.
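
The harness that changes the actual alignment is not shown here, so the following is only a sketch of one way it could be done. `with_alignment` is a hypothetical helper that returns an address inside a larger buffer that is a multiple of `align` bytes but not of `2 * align`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the benchmark sources.
   `align` must be a power of two and `buf` needs at least
   3 * align bytes of slack in front of the data. */
void *with_alignment(void *buf, size_t align)
{
    uintptr_t p = (uintptr_t)buf;

    /* Round up to a multiple of 2 * align ... */
    p = (p + 2 * align - 1) & ~(uintptr_t)(2 * align - 1);

    /* ... then step forward by align, so the result is aligned to
       `align` bytes and no more. */
    return (void *)(p + align);
}
```

Casting the result to a type that declares a larger alignment is undefined behaviour in ISO C, which is exactly the lie being told to the compiler.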
I then converted the loop into one using vld1 and fed it smaller
alignments. See:
http://people.linaro.org/~michaelh/incoming/small-offsets.png
The throughput falls into two camps: one of alignments
1, 2, or 4 and one of 8, 16, 32. The throughput is very similar for
both camps but has some strange dropoffs at 24 words, around 48 words,
and around 96 words. The terminal throughput at 300 words and above
is within 0.5 %.
I then converted the vld1 and vst1 to specify an alignment of 64
bits. See:
http://people.linaro.org/~michaelh/incoming/set-alignment.png
This improved the throughput in all cases and, for more than 50
words, by 14 %. This graph also shows the overhead of the runtime
peeling check. The blue line is the vectoriser version, which is
slower to pick up due to the greater per-call overhead.
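
The hand-written vld1/vst1 version is not reproduced in this mail. A rough C-level analogue of promising 64-bit alignment is sketched below using GCC's `__builtin_assume_aligned`; `op_aligned64` is a hypothetical name, and whether a given compiler version then emits the alignment-qualified forms of vld1/vst1 is not something this sketch guarantees.

```c
#include <stddef.h>

/* Hypothetical variant of the test loop: both pointers are promised
   to be 8-byte (64-bit) aligned. Passing a pointer that is not is
   undefined behaviour. */
void op_aligned64(int *__restrict out, const int *__restrict in, size_t count)
{
    int *o = __builtin_assume_aligned(out, 8);
    const int *s = __builtin_assume_aligned(in, 8);

    for (size_t i = 0; i < count; i++)
        o[i] = (s[i] * 173) | s[i];
}
```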
I then went back to the vectoriser and changed the alignment of the
struct to cause peeling to turn on and off. See:
http://people.linaro.org/~michaelh/incoming/unroll.png
At 200 words, the version without peeling is 2.9 % faster. This is
partly due to a fixed count loop turning into a runtime count due to
unknown alignment.
This run also showed the effect of loop unrolling. The loop seems to
be unrolled for loops of <= 64 words and drops off in performance past
around 8 words. When the unrolling finally drops out, performance
increases by 101 %.
Raw results and the test cases are available in
lp:~linaro-toolchain-dev/linaro-toolchain-benchmarks/private-runs
A graph of all results is at:
http://people.linaro.org/~michaelh/incoming/everything.png
The usual caveats apply: this test was all in L1, only on the A9, and
very artificial.
-- Michael
This email is just a quick summary of what we (Linaro) are
planning in the way of QEMU work to support KVM on ARM Cortex-A15.
The idea is to let people know what's coming up, find out if we've
forgotten anything, and avoid people duplicating work unnecessarily.
Most of this is based on a useful session at the recent 'ARM server
mini-summit' in Orlando (UDS/Linaro Connect) at the beginning of
this month.
The work we're currently proposing to do falls into three parts:
* refactor QEMU's cp15 register handling
At the moment QEMU handles cp15 accesses by calling out to a single
helper function which is an enormous set of nested switch statements
to handle the different coprocessor registers. Access permissions are
checked separately at translate time. This design makes specifying
board-dependent or cpu-dependent registers somewhat painful; it's also
easy for the access permission checks to be out of sync. There is no
support for banked cp15 registers either (needed for TrustZone and
virtualisation). We need a better design which lets a board or core
register handler routines for cp15 registers. This will make the code
cleaner and more maintainable as a base for new features.
This isn't strictly a requirement for KVM, but we're going to want
KVM to be able to hand off cp15 accesses to QEMU, and I don't think
that's going to be maintainable or reliable without this refactoring.
(https://blueprints.launchpad.net/qemu-linaro/+spec/cp15-rework)
* A15 system model
Basically a QEMU model of a Versatile-Express with a Cortex-A15
minus the virtualization and LPAE extensions. This needs the
A15 private peripherals (just the GIC in the right place in
the memory map, really; generic timer not required) and the
new memory map version of the vexpress board model, plus some
new cp15 registers. (Bill Carson has already done some patches
in this area but they need a little rework and may have minor
missing pieces.)
https://blueprints.launchpad.net/qemu-linaro/+spec/initial-a15-system-model
* miscellaneous integration work
We're aiming for a reasonable working prototype of A15 guest on
an A15 Fast Model host here; we need to fix at least some of
the bugs which currently mean upstream QEMU doesn't work on ARM hosts,
sort out which kernel and qemu trees we are developing from, and
get things running in our validation lab's continuous integration
setup.
https://blueprints.launchpad.net/qemu-linaro/+spec/qemu-kvm-getting-started
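
As an illustration of the cp15 rework described in the first item above, a table-driven design might look roughly like the sketch below. Every name in it (`Cp15Reg`, `Cp15Table`, `cp15_register`, `cp15_read`, `cp15_read_const`) is hypothetical and is not existing QEMU API; it only shows the idea of a board or core registering handlers, with the access permission stored next to the handler so the two cannot drift apart.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical read handler type. */
typedef uint32_t (*Cp15ReadFn)(void *opaque);

typedef struct {
    uint8_t crn, crm, opc1, opc2;  /* encoding fields of the MRC/MCR access */
    bool user_ok;                  /* may unprivileged code access it? */
    Cp15ReadFn read;               /* handler supplied by the board or core */
    void *opaque;                  /* passed back to the handler */
} Cp15Reg;

#define CP15_MAX_REGS 64

typedef struct {
    Cp15Reg regs[CP15_MAX_REGS];
    size_t count;
} Cp15Table;

/* Called by a board or CPU model to add one register. */
bool cp15_register(Cp15Table *t, const Cp15Reg *r)
{
    if (t->count == CP15_MAX_REGS)
        return false;
    t->regs[t->count++] = *r;
    return true;
}

/* Look up the register, check permission, then call the handler.
   Returns false for an unknown register or a denied access. */
bool cp15_read(const Cp15Table *t, bool privileged,
               uint8_t crn, uint8_t crm, uint8_t opc1, uint8_t opc2,
               uint32_t *value)
{
    for (size_t i = 0; i < t->count; i++) {
        const Cp15Reg *r = &t->regs[i];
        if (r->crn == crn && r->crm == crm &&
            r->opc1 == opc1 && r->opc2 == opc2) {
            if (!privileged && !r->user_ok)
                return false;
            *value = r->read(r->opaque);
            return true;
        }
    }
    return false;
}

/* Example handler: a read-only register backed by a constant. */
uint32_t cp15_read_const(void *opaque)
{
    return *(const uint32_t *)opaque;
}
```

A linear search keeps the sketch short; a real implementation would more likely index by the encoding, and would need write handlers and banking as well.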
Also on the radar is a fourth piece of work:
* QEMU virtio-mmio support
This is adding support for the 'mmio' virtio transport, which will
allow virtio support in a versatile-express model. We're going to
need this at some point but the current thought is that we want
to do the above listed more important bits of work first...
(The exception would probably be if it turned out that this was
sufficiently useful for making early KVM development easier)
https://blueprints.launchpad.net/qemu-linaro/+spec/add-amba-virtio-support
So, questions:
(1) did we forget something important?
(2) is anybody else already planning to do any of this (or would
like to start)? if so we should coordinate...
(3) is there anything that the kernel folk need/want earlier
rather than later?
thanks
-- PMM
Hi,
Now that upstream trunk is in stage3 and we have a few patches that
won't really make it upstream until stage1 reopens, is it
worthwhile having a new status for merge requests that moves them
into a to_upstream state? The other option is to have a common
spreadsheet that we keep updating with links to merge requests that
need to be upstreamed.
Thoughts?
Ramana
PS - Any clue on what's happening with the branch diff bug that's been
open in Launchpad forever now?
Hi,
* Worked on peeling problem in eon (#831094). Wrote a patch that
checks if the number of vector iterations is going to be more than 2,
and disables peeling otherwise. With this patch I see about 1.5%
regression with vectorization (and about 7% without it).
* I am thinking of extending the patch to an unknown number of iterations
by creating a run-time check. The threshold could be set by a param.
Another option could be doing it through the cost model, but it's
hard to evaluate costs when misalignments are unknown (and, I think,
the cost model handles known misalignment properly).
* Disabling peeling for low loop bounds also helps with one of EEMBC
benchmarks, for which vectorization with double-words is more
beneficial than with quad-words. It turns out that we are able to
force the alignment for double-words (and, therefore, avoid peeling),
because we check that the required alignment (64 in this case) is less
than or equal to BIGGEST_ALIGNMENT, where
arm.h:#define BIGGEST_ALIGNMENT (ARM_DOUBLEWORD_ALIGN ?
DOUBLEWORD_ALIGNMENT : 32)
and
arm.h:#define DOUBLEWORD_ALIGNMENT 64
So, we can never force alignment for 128 bits on ARM. I wonder if
that's a real limitation.
* Proposed three SLP patches to gcc-linaro, and merged two of them.
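
The two checks discussed in the items above can be sketched in plain C as follows. This is not the actual patch or GCC source: `peeling_worthwhile` and `can_force_alignment` are hypothetical names, and `ARM_DOUBLEWORD_ALIGN` is assumed to be 1 here purely for illustration.

```c
#include <stdbool.h>

/* Stand-ins for the arm.h definitions quoted above. The value of
   ARM_DOUBLEWORD_ALIGN is an assumption made for this sketch. */
#define ARM_DOUBLEWORD_ALIGN 1
#define DOUBLEWORD_ALIGNMENT 64
#define BIGGEST_ALIGNMENT (ARM_DOUBLEWORD_ALIGN ? DOUBLEWORD_ALIGNMENT : 32)

/* Peel only if the loop will run more than `min_vector_iters` vector
   iterations (2 in the patch described above). For an unknown
   iteration count the same test would be evaluated at run time. */
bool peeling_worthwhile(unsigned niters, unsigned vf, unsigned min_vector_iters)
{
    return vf != 0 && niters / vf > min_vector_iters;
}

/* Alignment can be forced only up to BIGGEST_ALIGNMENT bits. */
bool can_force_alignment(unsigned required_bits)
{
    return required_bits <= BIGGEST_ALIGNMENT;
}
```

With these definitions a 64-bit (double-word) requirement can be forced but a 128-bit (quad-word) one cannot, which is the limitation in question.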
Ira
Addressing the comments received from Richard and Ayal regarding the
patch to estimate register pressure.
Testing the patch on eembc and libav micro benchmarks.
Looking at the regressions seen with SMS.
== GDB ==
* Ongoing work on support for cross-platform core file generation.
Posted a new design proposal to the mailing list to include not
only "info proc mappings", but *all* "info proc" commands. This
would involve a remote protocol command to read arbitrary proc
files, instead of a specific command to retrieve the memory map.
* Investigated Launchpad bug:
#891970 msp430-gdb segmentation fault with target remote
== GCC ==
* Patch review week.
Best Regards
Ulrich Weigand
Worked on adding support for 64-bit NEON integer shifts. I have this
working now, although I'm still not very happy about how the register
allocator chooses which mode to use - it prefers core-registers if the
values start or end in core-regs, even though moving the values to NEON
registers might be more efficient (general 64-bit shifts in core
registers require several instructions). I've also had to mark the CC
register clobbered in all cases, even though it only gets clobbered in
some of them, which might be necessary, but isn't very satisfactory.
The NEON shifts work showed that 32->64 bit extends could be done better
also. This hasn't been a great problem up to now, but the shift amount
(in particular) is typically a 32-bit value and yet needs to be
zero-extended to 64-bit for NEON's purposes. Right now, GCC prefers to
extend the value in core-registers, and then copy it to NEON. This
works, but burns another core-register - a scarce commodity - so I think
it would be better to copy it first, and then extend it after. NEON has
instructions for this, so I'm investigating how to get the compiler to
do it (this is all strictly post-combine, so the usual options are out,
and the register allocator has to be allowed to do it the old way in the
case where core-regs really are the best option, so it's tricky).
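
For context, the kind of source that exercises this is a 64-bit shift whose amount is a 32-bit value. The functions below are illustrative only (the names are hypothetical, and the mask is there just to keep the C well-defined); what code the compiler generates for them is exactly the open question above.

```c
#include <stdint.h>

/* 64-bit left shift by a 32-bit amount. Shifting by 64 or more is
   undefined in C, so the amount is masked to 0..63. */
uint64_t shl64(uint64_t value, uint32_t amount)
{
    return value << (amount & 63);
}

/* Logical right shift, same shape. */
uint64_t lshr64(uint64_t value, uint32_t amount)
{
    return value >> (amount & 63);
}
```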
Summary:
* Upstream crosstool-ng patches.
* Create windows install package from installjammer.
* Investigate link issues.
Details:
* crosstool-ng patches.
* Patches for newlib extra config, gdb extra config, pch, nls option
are committed to crosstool-NG upstream.
* The dependent library patches are in discussion.
* Learn installjammer and integrate it into the scripts to create a Windows
install package.
* Investigate warning messages from the linker when linking the prebuilt zlib
for the mingw32 host.
It might be OK with a static link, but might fail with a dynamic link on Windows.
For i586-mingw32[msvc] host, lots of messages like
libtool: link: Could not determine host path corresponding to ...
For i386-mingw32 host: In addition to the message in i586-mingw32
build, output the following message
*** Warning: linker path does not have real file for library -lz. ...
Plans:
* Build and test.
Absences:
* Nov 29, 30: Trainings.
Thanks!
-Zhenqiang