== This week ==
* Wrote some patches to make SMS schedule register moves. They made a
significant difference to some libav loops. I'm running a regression
test on powerpc-ibm-aix5.3.0 and will submit upstream next week if
all goes OK.
* Looked at why mjpegenc was so much worse with SMS. Turned out to be
a register spilling problem. Found that -fira-algorithm=priority
avoids the regression and makes several other tests better too.
(I just tested that to see whether there was a feasible register
allocation for these cases; -fira-algorithm=priority isn't the
way to go.)
* Saw that the register allocator seemed to be tripping over the
XImode "structure" values, and that we still had one vector move
per structure element by the time we got to the scheduling passes.
Eliminated those with a combination of one fix and one hack.
That seemed to avoid the allocation problems.
* Patch review (Linaro and upstream).
* Backported libgcc visibility fix to 4.6 and 4.5.
== Next week ==
* Submit register-scheduling patch.
* Submit memory cost patch (from auto-inc-dec changes)
* Possibly submit the auto-inc-dec changes themselves, depending on
how the rtx cost discussion goes.
Richard
== GCC ==
=== Progress ===
* Looked at the vectorize_with_neon_quad failure again and decided
that I had to handle another case, but I wasn't convinced that the
extra stall we'd get in this case was worth it. In any case it would
only have been a workaround; Richard Sandiford fixed this by getting
df to do the right thing, which is the proper fix.
* Backported tbh patch.
* Backported conditional execution improvements patch from Jiangning
to Linaro 4.6 branch.
* Committed the LTO + Neon / Android intrinsics patch.
* Panda seems more reliable this week, but I suspect that's because
the room is cooler.
* Broke up a few blueprints and marked some as done.
* BRANCH_COST results don't show a huge variation in SPEC, and some
of the results are inconsistent. Need to run a few benchmarks
again. Sigh :(
* Finished the A9 scheduler patch for smull and friends and committed
upstream and into Linaro 4.6.
* Briefly reviewed the shrink-wrapping patch and the widening
multiplies patch.
* Looked at the failures in the "popular embedded benchmark" for
some time with Åsa.
* Tried one of the ICE patches and that seemed to work just fine with
bootstrap on FSF trunk. Need to figure out why this was breaking in
the Linaro 4.6 tree. https://bugs.launchpad.net/gcc-linaro/+bug/689887
=== Plans ===
Next Week - Holiday :) Feet not up but walking in what looks like
typical bank holiday weather ... Might check email later in the week.
Meetings:
* 1-1s
* TCWG calls
* Thumb2 performance call.
Absences.
* 29th Aug - Sept. 2 - Holiday booked and approved.
* 31st Oct - 4th Nov - Linaro Summit Orlando - Travel booked - hotel
to be booked.
* Investigated the errors in the automotive test and concluded that they are
CRC errors, but they do not depend on the test case result (non-intrusive
CRC check). We decided these errors need to be cleared out once and for all.
Michael and Ramana are helping out with the continued investigation.
* Ran EEMBC on both Panda and Snowball with GCC 4.5.2. The results look
reasonable, but Michael will also have a look. I will spend a little more
time comparing the results from the two boards.
* Started to run SPEC2K on the Panda board.
Best Regards
Åsa
Following on from yesterday's call about what it would take to enable
SMS by default: one of the problems I was seeing with the SMS+IV patch
was that we ended up with excessive moves. E.g. a loop such as:
void
foo (int *__restrict a, int n)
{
  int i;
  for (i = 0; i < n; i += 2)
    a[i] = a[i] * a[i + 1];
}
would end up being scheduled with an ii of 3, which means that in the
ideal case, each loop iteration would take 3 cycles. However, we then
added ~8 register moves to the loop in order to satisfy dependencies.
Obviously those 8 moves add considerably to the iteration time.
I played around with a heuristic to see whether there were enough
free slots in the original schedule to accommodate the moves.
That avoided the problem, but it was a hack: the moves weren't
actually scheduled in those slots. (In current trunk, the moves
generated for an instruction are inserted immediately before that
instruction.)
I mentioned this to Revital, who told me that Mustafa Hagog had
tried a more complete approach that really did schedule the moves.
That patch was quite old, so I ended up reimplementing the same kind
of idea in a slightly different way. (The main functional changes
from Mustafa's version were to schedule from the end of the window
rather than the start, and to use a cyclic window. E.g. moves for
an instruction in row 0 column 0 should be scheduled starting at
row ii-1 downwards.)
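To make that concrete, here is a minimal standalone sketch of the
placement idea (this is not the GCC implementation, and the ii and the
number of issue slots per row are made-up values):

/* Sketch only: place the moves generated for an instruction by walking
   the rows of the modulo schedule cyclically, starting one row "above"
   the defining instruction (so row ii-1 for a definition in row 0) and
   moving downwards until a row with a free issue slot is found.  */

#include <stdio.h>

#define II 4            /* initiation interval (assumed value)     */
#define SLOTS_PER_ROW 2 /* issue slots per row (assumed value)     */

static int used[II];    /* slots already taken in each row         */

/* Return the row chosen for a move whose definition is in DEF_ROW,
   or -1 if every row in the window is already full.  */
static int
place_move (int def_row)
{
  int d, row;

  for (d = 1; d <= II; d++)
    {
      row = (def_row - d + II) % II;  /* cyclic, end of window first */
      if (used[row] < SLOTS_PER_ROW)
        {
          used[row]++;
          return row;
        }
    }
  return -1;
}

int
main (void)
{
  int r;

  /* Pretend the original schedule already uses one slot per row.  */
  for (r = 0; r < II; r++)
    used[r] = 1;

  /* Moves for a definition in row 0 go into rows ii-1, ii-2, ...  */
  printf ("first move placed in row %d\n", place_move (0));
  printf ("second move placed in row %d\n", place_move (0));
  return 0;
}

With these made-up numbers, the two moves for a definition in row 0 end
up in rows ii-1 and ii-2, which is the behaviour described above.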
The effect on my flawed libav microbenchmarks was much greater
than I imagined. I used the options:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
-fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
The "before" code was from trunk, the "after" code was trunk + the
register scheduling patch alone (not the IV patch). Only the tests
that have different "before" and "after" code are run. The results were:
test                  runs      before      after       speedup
a3dec                 500000    4.68384s    4.61395s    x1.02
aes                   500000    20.0523s    16.9722s    x1.18
avs                   1000000   15.4698s    2.23676s    x6.92
dxa                   2000000   18.5848s    4.40607s    x4.22
mjpegenc              500000    28.6987s    7.31342s    x3.92
resample              1000000   10.418s     1.91016s    x5.45
rgb2rgb-rgb24tobgr16  1000000   1.60513s    1.15643s    x1.39
rgb2rgb-yv12touyvy    1500000   3.50122s    3.49887s    x1
twinvq                500000    0.452423s   0.452454s   x1
Taking resample as an example: before the patch we had an ii of 27,
stage count of 6, and 12 vector moves. Vector moves can't be dual
issued, and there was only one free slot, so even in theory, this loop
takes 27 + 12 - 1 = 38 cycles. Unfortunately, there were so many new
registers that we spilled quite a few.
After the patch we have an ii of 28, a stage count of 3, and no moves,
so in theory, one iteration should take 28 cycles. We also don't spill.
So I think the difference really is genuine. (The large difference
in moves between ii=27 and ii=28 is because in the ii=27 schedule,
a lot of A--(T,N,0)-->B (intra-cycle true) dependencies were scheduled
with time(B) == time(A) + ii + 1.)
I also saw benefits in one test in a "real" benchmark, which I can't
post here.
Richard
Hello,
Following today's performance call
(https://wiki.linaro.org/WorkingGroups/ToolChain/Meetings/2011-08-23)
here are some points raised regarding the steps towards enabling SMS by default:
* Benchmarks testing:
-- Running benchmarks such as EEMBC and SPEC2006 with SMS enabled is
crucial to expose loops where SMS degrades performance. Those
loops need to be analysed to construct a cost model.
-- SMS increases code size by introducing a prologue and an epilogue
around the loop kernel. This should also be measured.
-- Measure the increase in compile time: on a native or a cross build?
Currently SMS fails to bootstrap trunk on an ARM machine; this should
also be taken into account when considering enabling it by default.
Should it be turned on with -O2 or -O3?
SMS flags to use for testing:
-O3 -fmodulo-sched-allow-regmoves -fmodulo-sched
-funsafe-loop-optimizations -fno-auto-inc-dec
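For concreteness, a full command line might look something like the
following (the target triplet and the -mcpu value are only examples and
would need adjusting to the compiler and board being tested):

$ arm-linux-gnueabi-gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp \
    -fmodulo-sched -fmodulo-sched-allow-regmoves \
    -funsafe-loop-optimizations -fno-auto-inc-dec -c loop.c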
Thanks,
Revital
Hi
Some time ago we agreed that not everyone here uses the Ubuntu
distribution and decided to provide a so-called 'generic linux' cross
toolchain. Recently I managed to get it done, and now I need brave
testers to tell me whether it works or not.
Get it here: http://people.linaro.org/~hrw/generic-linux/ (64bit only)
The needed files are toolchain-11.07.tar.xz and the init.sh script.
Unpack the tarball from / so that /opt/linaro/11.07/ gets populated, and
put init.sh anywhere you want (it will be integrated into the tarball
later).
How to use:
$ source init.sh
This will add the cross toolchain to PATH and also set LD_LIBRARY_PATH
to two directories:
- one with the binutils libraries
- a second with all the extra libraries which may be needed
Feel free to experiment with the second directory by removing files
from it and checking whether the system-provided libs are fine too.
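A quick smoke test could look something like this (the exact target
triplet is whatever ends up in /opt/linaro/11.07/bin, so adjust the
compiler name if needed):

$ source init.sh
$ echo 'int main (void) { return 0; }' > hello.c
$ arm-linux-gnueabi-gcc -O2 -o hello hello.c
$ file hello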
So far I have checked this toolchain under a few distributions:
- Ubuntu 10.04 'lucid' LTS
- Ubuntu 11.04 'natty'
- Fedora 14
- OpenSUSE 11.4
- CentOS 5.6
It failed only under CentOS (which was expected due to its age).
How did I check? So far I have tested compilation of 'gpm' and 'zlib'.
== GCC ==
=== Progress ===
* Continued to look at the test failure with -mvectorize-with-neon-quad.
Should be able to commit the backend workaround on Monday.
* Having some problems getting my Panda board to work reliably. I'm
not sure if it's the temperature or what, but when it gets hot in the
office, as it was on Tuesday, keeping it working reliably is hard. The
board locks up and then crashes quite often.
* Spent some more time looking at VFP moves.
* Committed tbh range change.
* Committed fixes for PR50022
=== Plans ===
* Finish off VFP moves patch.
* Look at BRANCH_COST results.
* Breakdown the T2 performance blueprints into smaller blueprints.
* Backport tbh range changes to Linaro 4.6
* Test the intrinsics patch with some more intrinsics tests and then
merge it into Linaro GCC 4.6.
Meetings:
* 1-1s
* TCWG calls
Absences.
* 29th Aug - Sept. 2 - Holiday booked and approved.
* 31st Oct - 4th Nov - Linaro Summit Orlando - Travel booked - hotel
to be booked.
Hi all,
I'm having real trouble here :(
I just can't seem to get bzr to work! I've tried to branch
gcc-linaro/4.6 again and again, and it just won't. My other machine
refuses to do the merge from lp:gcc/4.6, presumably because the bzr on
there is too old.
I'm stuck. Can anybody else do the merge from upstream?
I'm going to keep trying.
Andrew