linaro-toolchain November 2011

linaro-toolchain@lists.linaro.org

31 participants
53 discussions

Effect of alignment and peeling on vectorised loops

by Michael Hope

I had a play with the vectoriser to see how peeling, unrolling, and alignment affected the performance of simple memory-bound loops. The short story is:

* For fixed-length loops, don't peel
* Performance is the same for 8-byte-aligned arrays and up
* Performance is very similar for unaligned arrays
* vld1 is as fast as vldmia
* vld1 with specified alignment is much faster than plain vld1

The loop is the rather ugly and artificial::

    void op(struct aints * __restrict out, const struct aints * __restrict in)
    {
      for (int i = 0; i < COUNT; i++) {
        out->v[i] = (in->v[i] * 173) | in->v[i];
      }
    }

where `struct aints` is an aligned structure. I couldn't figure out how to use an aligned typedef of ints without still introducing a runtime check. I assume I was running into some type of runtime alias checking. This compiled into::

        vmov.i32 q10, #173
        add r3, r0, #5
    0:
        vldmia r1!, {d16-d17}
        vmul.i32 q9, q8, q10
        vorr q8, q9, q8
        vstmia r0!, {d16-d17}
        cmp r0, r3
        bne 0b

I then lied to the compiler by changing the actual alignment at runtime. See:
http://people.linaro.org/~michaelh/incoming/runtime-offset.png

The performance didn't change for actual alignments of 8, 16, or 32 bytes.

I then converted the loop into one using vld1 and fed it smaller alignments. See:
http://people.linaro.org/~michaelh/incoming/small-offsets.png

The throughput falls into two camps: one of alignments 1, 2, or 4 and one of 8, 16, 32. The throughput is very similar for both camps but has some strange dropoffs at 24 words, around 48 words, and around 96 words. The terminal throughput at 300 words and above is within 0.5 %.

I then converted the vld1 and vst1 to specify an alignment of 64 bits. See:
http://people.linaro.org/~michaelh/incoming/set-alignment.png

This improved the throughput in all cases, and by 14 % for cases of more than 50 words.

This graph also shows the overhead of the runtime peeling check. The blue line is the vectoriser version, which is slower to pick up due to the greater per-call overhead.
I then went back to the vectoriser and changed the alignment of the struct to cause peeling to turn on and off. See:
http://people.linaro.org/~michaelh/incoming/unroll.png

At 200 words, the version without peeling is 2.9 % faster. This is partly due to a fixed count loop turning into a runtime count due to unknown alignment.

This run also showed the effect of loop unrolling. The loop seems to be unrolled for loops of <= 64 words and drops off in performance past around 8 words. When the unrolling finally drops out, performance increases by 101 %.

Raw results and the test cases are available in lp:~linaro-toolchain-dev/linaro-toolchain-benchmarks/private-runs

A graph of all results is at:
http://people.linaro.org/~michaelh/incoming/everything.png

The usual caveats apply: this test was all in L1, only on the A9, and very artificial.

-- Michael
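For readers who want to reproduce the kernel from this post, here is a minimal sketch of how the aligned structure and a correctness check might look. The values of COUNT and ALIGN_BYTES, and the helpers op_one() and check_op(), are assumptions made for illustration; the post does not give them. The `aligned` attribute is a GCC/Clang extension.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdalign.h> /* alignof (C11) */

#define COUNT 64        /* assumed loop length; not stated in the post */
#define ALIGN_BYTES 16  /* assumed; one of the alignments the post tried */

/* Wrapping the array in an aligned struct lets the compiler see the
 * alignment statically, which is what the post relies on. */
struct aints {
    int v[COUNT];
} __attribute__((aligned(ALIGN_BYTES)));

static_assert(alignof(struct aints) == ALIGN_BYTES,
              "struct aints should carry the requested alignment");

/* The kernel from the post: fixed trip count, no aliasing. */
void op(struct aints *restrict out, const struct aints *restrict in)
{
    for (int i = 0; i < COUNT; i++) {
        out->v[i] = (in->v[i] * 173) | in->v[i];
    }
}

/* Hypothetical scalar reference for a single element. */
int op_one(int x)
{
    return (x * 173) | x;
}

/* Hypothetical check: runs op() on 0..COUNT-1 and returns the number
 * of elements that differ from the scalar reference. */
int check_op(void)
{
    static struct aints in, out;
    int bad = 0;

    for (int i = 0; i < COUNT; i++) {
        in.v[i] = i;
    }
    op(&out, &in);
    for (int i = 0; i < COUNT; i++) {
        bad += (out.v[i] != op_one(i));
    }
    return bad;
}
```

check_op() should return 0 whether or not the loop was vectorised. Note that this sketch only covers the statically aligned case; feeding op() a pointer whose real alignment is lower than declared, as the runtime-offset experiment did, is undefined behaviour in C and is deliberately left out.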

13 years, 8 months

gcc4.6,how to remove werror

by tknv

Hello,

I am compiling an armel kernel with gcc 4.6 on:

    Linux du 3.0.0-12-generic-pae #20-Ubuntu SMP Fri Oct 7 16:37:17 UTC 2011 i686 i686 i386 GNU/Linux

The compiler is:

    tknv@du:~$ arm-linux-gnueabi-gcc -v
    Using built-in specs.
    COLLECT_GCC=arm-linux-gnueabi-gcc
    COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabi/4.6.1/lto-wrapper
    Target: arm-linux-gnueabi
    Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.6.1-9ubuntu3' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/arm-linux-gnueabi/include/c++/4.6.1 --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-plugin --enable-objc-gc --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-float=softfp --with-fpu=vfpv3-d16 --with-mode=thumb --disable-werror --enable-checking=release --build=i686-linux-gnu --host=i686-linux-gnu --target=arm-linux-gnueabi --program-prefix=arm-linux-gnueabi- --includedir=/usr/arm-linux-gnueabi/include --with-headers=/usr/arm-linux-gnueabi/include --with-libs=/usr/arm-linux-gnueabi/lib
    Thread model: posix
    gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)

The Makefile has:

    KBUILD_CFLAGS := -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
                     -fno-strict-aliasing -fno-common \
                     -Wno-unused-but-set-variable \
                     -Wno-unused-parameter \
                     -Wno-array-bounds \
                     -Wno-format-security \
                     -fno-delete-null-pointer-checks

and I run:

    make WERROR=0

but I still get:

    error: array subscript is above array bounds [-Werror=array-bounds]
    cc1: all warnings being treated as errors

How can I remove -Werror entirely?

Thanks.

-- w.tknv/

13 years, 8 months

plans for QEMU support for KVM on ARM

by Peter Maydell

This email is just a quick summary of what we (Linaro) are planning in the way of QEMU work to support KVM on ARM Cortex-A15. The idea is to let people know what's coming up, find out if we've forgotten anything, and avoid people duplicating work unnecessarily.

Most of this is based on a useful session at the recent 'ARM server mini-summit' in Orlando (UDS/Linaro Connect) at the beginning of this month.

The work we're currently proposing to do falls into three parts:

* refactor QEMU's cp15 register handling

  At the moment QEMU handles cp15 accesses by calling out to a single helper function which is an enormous set of nested switch statements to handle the different coprocessor registers. Access permissions are checked separately at translate time. This design makes specifying board-dependent or cpu-dependent registers somewhat painful; it's also easy for the access permission checks to be out of sync. There is no support for banked cp15 registers either (needed for trustzone and virtualisation). We need a better design which lets a board or core register handler routines for cp15 registers. This will make the code cleaner and more maintainable as a base for new features.

  This isn't strictly a requirement for KVM, but we're going to want KVM to be able to hand off cp15 accesses to QEMU, and I don't think that's going to be maintainable or reliable without this refactoring.

  (https://blueprints.launchpad.net/qemu-linaro/+spec/cp15-rework)

* A15 system model

  Basically a QEMU model of a Versatile-Express with a Cortex-A15 minus the virtualization and LPAE extensions. This needs the A15 private peripherals (just the GIC in the right place in the memory map, really; generic timer not required) and the new memory map version of the vexpress board model, plus some new cp15 registers. (Bill Carson has already done some patches in this area but they need a little rework and may have minor missing pieces.)

  https://blueprints.launchpad.net/qemu-linaro/+spec/initial-a15-system-model

* miscellaneous integration work

  We're aiming for a reasonable working prototype of A15 guest on an A15 Fast Model host here; we need to fix at least some of the bugs which currently mean upstream QEMU doesn't work on ARM hosts, sort out which kernel and qemu trees we are developing from, and get things running in our validation lab's continuous integration setup.

  https://blueprints.launchpad.net/qemu-linaro/+spec/qemu-kvm-getting-started

Also on the radar is a fourth piece of work:

* QEMU virtio-mmio support

  This is adding support for the 'mmio' virtio transport, which will allow virtio support in a versatile-express model. We're going to need this at some point but the current thought is that we want to do the above listed more important bits of work first... (The exception would probably be if it turned out that this was sufficiently useful for making early KVM development easier)

  https://blueprints.launchpad.net/qemu-linaro/+spec/add-amba-virtio-support

So, questions:

(1) did we forget something important?
(2) is anybody else already planning to do any of this (or would like to start)? if so we should coordinate...
(3) is there anything that the kernel folk need/want earlier rather than later?

thanks
-- PMM
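To make the cp15 refactoring idea concrete, here is a rough sketch of what "registering handler routines" could look like in place of nested switch statements. Everything in it (struct cp_key, struct cp_reg, cp_register(), cp_read(), the privilege-level field) is hypothetical and invented for illustration; it is not QEMU's actual API or the design that was eventually adopted.

```c
#include <stddef.h>
#include <stdint.h>

/* A cp15 register is selected by the (crn, opc1, crm, opc2) encoding. */
struct cp_key {
    uint8_t crn, opc1, crm, opc2;
};

typedef uint32_t (*cp_read_fn)(void *opaque);

/* One table entry. Keeping the permission next to the handler means
 * the two cannot drift out of sync. */
struct cp_reg {
    struct cp_key key;
    int min_pl;        /* hypothetical: lowest privilege level allowed */
    cp_read_fn read;
    void *opaque;      /* per-register state handed to the handler */
};

#define CP_MAX_REGS 32

struct cp_table {
    struct cp_reg regs[CP_MAX_REGS];
    size_t n;
};

/* A board or core model adds its own registers here.
 * Returns 0 on success, -1 if the table is full. */
int cp_register(struct cp_table *t, struct cp_reg r)
{
    if (t->n == CP_MAX_REGS) {
        return -1;
    }
    t->regs[t->n++] = r;
    return 0;
}

/* Look up a register and read it.
 * Returns 0 on success, -1 if undefined or not permitted. */
int cp_read(const struct cp_table *t, struct cp_key k, int pl, uint32_t *val)
{
    for (size_t i = 0; i < t->n; i++) {
        const struct cp_reg *r = &t->regs[i];
        if (r->key.crn == k.crn && r->key.opc1 == k.opc1 &&
            r->key.crm == k.crm && r->key.opc2 == k.opc2) {
            if (pl < r->min_pl) {
                return -1;
            }
            *val = r->read(r->opaque);
            return 0;
        }
    }
    return -1;
}

/* Example handler: a read-only register whose value lives in opaque. */
uint32_t cp_read_const(void *opaque)
{
    return *(const uint32_t *)opaque;
}
```

With a table like this, a board-specific or CPU-specific register is just one more cp_register() call, and an unknown encoding falls through to a single "undefined" path rather than a missing switch case.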

13 years, 8 months

launchpad / merge requests and upstreaming patches.

by Ramana Radhakrishnan

Hi,

Now that upstream trunk is in stage3 and we have a few patches that won't really make it upstream until stage1 reopens, is it worthwhile having a new status for merge requests that moves them into a to_upstream status? The other option is to have a common spreadsheet that we keep updating with links to merge requests that need to be upstreamed.

Thoughts?

Ramana

PS - Any clue on what's happening with the branch diff bug that's been open in launchpad forever now?

13 years, 8 months

[ACTIVITY] November 20-24

by Ira Rosen

Hi,

* Worked on peeling problem in eon (#831094). Wrote a patch that checks if the number of vector iterations is going to be more than 2, and disables peeling otherwise. With this patch I see about 1.5% regression with vectorization (and about 7% without it).

* I am thinking of extending the patch to an unknown number of iterations by creating a run-time check. The threshold could be set by a param. Another option could be doing it through the cost model, but it's hard to evaluate costs when misalignments are unknown (and, I think, the cost model handles known misalignment properly).

* Disabling peeling for low loop bounds also helps with one of the EEMBC benchmarks, for which vectorization with double-words is more beneficial than with quad-words. It turns out that we are able to force the alignment for double-words (and, therefore, avoid peeling), because we check that the required alignment (64 in this case) is less than or equal to BIGGEST_ALIGNMENT, where

    arm.h:#define BIGGEST_ALIGNMENT (ARM_DOUBLEWORD_ALIGN ? DOUBLEWORD_ALIGNMENT : 32)

  and

    arm.h:#define DOUBLEWORD_ALIGNMENT 64

  So, we can never force alignment for 128 bits on ARM. I wonder if that's a real limitation.

* Proposed three SLP patches to gcc-linaro, and merged two of them.

Ira
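The two checks described in this report can be sketched in a few lines of C. This is only an illustration of the logic as worded above, not the actual GCC patch: the function names, the use of a plain bool in place of ARM_DOUBLEWORD_ALIGN, and the way vector iterations are counted (scalar iterations divided by the vectorization factor) are all assumptions.

```c
#include <stdbool.h>

/* Value quoted from arm.h in the report. */
#define DOUBLEWORD_ALIGNMENT 64

/* Hypothetical stand-in for BIGGEST_ALIGNMENT, with the target flag
 * ARM_DOUBLEWORD_ALIGN passed in as a plain bool. Result is in bits. */
unsigned biggest_alignment(bool arm_doubleword_align)
{
    return arm_doubleword_align ? DOUBLEWORD_ALIGNMENT : 32;
}

/* Alignment can be forced only if the required alignment (in bits)
 * does not exceed the biggest alignment the target supports. */
bool can_force_alignment(unsigned required_bits, bool arm_doubleword_align)
{
    return required_bits <= biggest_alignment(arm_doubleword_align);
}

/* Hypothetical peeling heuristic: allow peeling only when the loop
 * will run more than 2 vector iterations. niters is the scalar
 * iteration count, vf the vectorization factor (must be non-zero). */
bool peeling_allowed(unsigned long niters, unsigned vf)
{
    return niters / vf > 2;
}
```

Under these assumptions, can_force_alignment(64, true) holds but can_force_alignment(128, true) does not, which matches the observation that 128-bit alignment can never be forced on ARM with the current definitions.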

13 years, 8 months

[ACTIVITY] weekly status

by Revital Eres

Addressing the comments received from Richard and Ayal regarding the patch to estimate register pressure. Testing the patch on eembc and libav micro benchmarks. Looking at the regressions seen with SMS.

13 years, 8 months

[ACTIVITY] report week 47

by Peter Maydell

RAG:
Red:
Amber:
Green: KVM/QEMU work blueprints set up

Current Milestones:
||                          || Planned     || Estimate   || Actual     ||
||upstream-omap3-cleanup    || 2011-11-10  || 2011-12-15 ||            ||
||cp15-rework               || 2012-01-06  ||            ||            ||
||initial-a15-system-model  || 2012-01-27  ||            ||            ||
||qemu-kvm-getting-started  || 2012-03-04? ||            ||            ||

(for blueprint definitions: https://wiki.linaro.org/PeterMaydell/QemuKVM)

Historical Milestones:
||add-omap3-networking      || 2011-10-13  || 2011-10-13 || 2011-10-13 ||
||a15-systemmode-planning   || 2011-10-13  || 2011-10-13 || 2011-09-22 ||
||a15-usermode-support      || 2011-11-10  || 2011-11-10 || 2011-10-27 ||

== qemu-kvm-getting-started ==
* sorted out how to cross compile QEMU (involved an upgrade to Oneiric)
* documented this and how to put together other required pieces at https://wiki.linaro.org/PeterMaydell/A15OnFastModels
* started porting Christoffer's KVM patch forward to current QEMU (compiles, not yet tested)

== other ==
* A15 blueprints etc now sorted -- summary at https://wiki.linaro.org/PeterMaydell/QemuKVM (includes definition of what the above blueprint/milestones are)
* upstream patch review (imx.31 board patches, sp804 timer cleanup)

13 years, 8 months

[ACTIVITY] Nov 22 - Nov 25

by Ulrich Weigand

== GDB ==

* Ongoing work on support for cross-platform core file generation. Posted a new design proposal to the mailing list to include not only "info proc mappings", but *all* "info proc" commands. This would involve a remote protocol command to read arbitrary proc files, instead of a specific command to retrieve the memory map.

* Investigated Launchpad bug: #891970 msp430-gdb segmentation fault with target remote

== GCC ==

* Patch review week.

Mit freundlichen Gruessen / Best Regards

Ulrich Weigand

--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter | Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht Stuttgart, HRB 243294

13 years, 8 months

[ACTIVITY] 21st - 25th November

by Andrew Stubbs

Worked on adding support for 64-bit NEON integer shifts. I have this working now, although I'm still not very happy about how the register allocator chooses which mode to use - it prefers core-registers if the values start or end in core-regs, even though moving the values to NEON registers might be more efficient (general 64-bit shifts in core registers require several instructions). I've also had to mark the CC register clobbered in all cases, even though it only gets clobbered in some of them, which might be necessary, but isn't very satisfactory.

The NEON shifts work showed that 32->64 bit extends could also be done better. This hasn't been a great problem up to now, but the shift amount (in particular) is typically a 32-bit value and yet needs to be zero-extended to 64-bit for NEON's purposes. Right now, GCC prefers to extend the value in core-registers, and then copy it to NEON. This works, but burns another core-register - a scarce commodity - so I think it would be better to copy it first, and then extend it after. NEON has instructions for this, so I'm investigating how to get the compiler to do it (this is all strictly post-combine, so the usual options are out, and the register allocator has to be allowed to do it the old way in the case where core-regs really are the best option, so it's tricky).
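As a rough illustration of why a general 64-bit shift costs several instructions in 32-bit core registers, here is a C sketch of a left shift built from two 32-bit halves. The function name and the way the cases are split are illustrative only; this is not the instruction sequence GCC emits.

```c
#include <stdint.h>

/* Left-shift a 64-bit value held as two 32-bit halves (lo, hi) by n,
 * where n is in the range 0..63. Each result half can depend on both
 * input halves, so a variable shift needs several operations when
 * only 32-bit registers are available. */
uint64_t shl64_by_halves(uint32_t lo, uint32_t hi, unsigned n)
{
    uint32_t rlo, rhi;

    if (n == 0) {
        /* Handled separately: lo >> 32 would be undefined in C. */
        rlo = lo;
        rhi = hi;
    } else if (n < 32) {
        /* Bits shifted out of the low half carry into the high half. */
        rhi = (hi << n) | (lo >> (32 - n));
        rlo = lo << n;
    } else {
        /* The low half moves entirely into the high half. */
        rhi = lo << (n - 32);
        rlo = 0;
    }
    return ((uint64_t)rhi << 32) | rlo;
}
```

For any n from 0 to 63 the result should equal the native `x << n` for the same 64-bit x, which is a convenient way to check the sketch.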

13 years, 8 months

[ACTIVITY] WW47

by Zhenqiang Chen

Summary:
* Upstream crosstool-ng patches.
* Create windows install package from installjammer.
* Investigate link issues.

Details:
* crosstool-ng patches.
  * Patches for newlib extra config, gdb extra config, pch, nls option are committed to crosstool-NG upstream.
  * The dependent library patches are in discussion.
* Learn installjammer and integrate it into scripts to create a windows install package.
* Investigate warning messages from the link step when linking the prebuilt zlib for mingw32 host. It might be OK with static link, but might fail with dynamic link on windows.
  For i586-mingw32[msvc] host, lots of messages like
    libtool: link: Could not determine host path corresponding to ...
  For i386-mingw32 host: in addition to the messages in the i586-mingw32 build, the following message is output
    *** Warning: linker path does not have real file for library -lz. ...

Plans:
* Build and test.

Absences:
* Nov 29, 30: Trainings.

Thanks!
-Zhenqiang

13 years, 8 months
