Hi Richard,
Recapping this earlier conversation:
http://lists.linaro.org/pipermail/linaro-toolchain/2010-July/000030.html
http://lists.linaro.org/pipermail/linaro-toolchain/2010-July/000035.html
Is it worth another attempt to make a case to upstream for supporting
passing -mimplicit-it=thumb by default to gas?
According to my understanding of this issue, my argument would go as follows:
* gcc currently estimates the size of asm blocks, rather than
determining the size accurately.
* gcc cannot guarantee the right answer for asm block size when asm
blocks contain directives etc.; however, the use of directives in asm
blocks is widespread
* gcc cannot guarantee the right answer for asm block size in
Thumb-2. gcc conservatively overestimates the size by assuming that
each statement of the asm block expands to 4 bytes.
* All of Ubuntu lucid and maverick have been built with
-mimplicit-it=thumb passed by default, with no known build or runtime
failures arising from this (so, size issues aside, we have confidence
that the resulting code generation is sound)
* -mimplicit-it=thumb -mthumb makes the asm block size estimation
unsafe: the asm block can exceed the estimated size even in the
absence of directives, which may lead to fixup range errors during
assembly.
* Following the principles already established for Thumb-2 in
general, the estimation can be made safe (or, as safe as the
established Thumb-2 behaviour) by raising the assumed maximum
statement expansion size for asm blocks to 6 bytes, since
-mimplicit-it will add at most a single (16-bit) IT instruction to
each statement (see the example after this list).
* The vast majority of all asm blocks are small (< 20 instructions,
say), so the overall overestimate in sizes will generally be modest
for any given compilation unit.
* -mimplicit-it is already _required_ by the Linux kernel and
possibly other projects.
...so...
* With -mimplicit-it=thumb and a 6-byte asm block statement
expansion size estimate, we have toolchain behaviour which is as
reliable, and as correct, as it is in upstream at present.
* Layout of data in the compiler output will be more optimal in some
cases, and less optimal in other cases, compared with the current
Thumb-2 behaviour, due to differing asm block size estimates. The
exact behaviour will depend on the distribution of conditional
instructions within asm blocks.
* Taken over a whole compilation unit, the total code size
overestimate (and therefore the impact on object layout) will normally
be modest, due to the small typical size of asm blocks.
* Behaviour for -marm will not be impacted at all.
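To make the 6-byte point concrete, here is a contrived example of my
own (not taken from the earlier thread), assuming -mthumb on an
ARMv7/Thumb-2 target with gas invoked with -mimplicit-it=thumb:

/* gas silently inserts an "it ne" before the conditional add, so the
   second statement assembles to 6 bytes (2-byte IT plus a 32-bit add,
   since #256 is not encodable in a 16-bit add) -- more than the 4
   bytes per statement that gcc currently assumes.  */
int
add_256_if_nonzero (int flag, int x)
{
  __asm__ ("cmp   %1, #0\n\t"
           "addne %0, %0, #256"
           : "+r" (x)
           : "r" (flag)
           : "cc");
  return x;
}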
If gcc currently estimated asm block code size accurately, then I
could understand upstream's objection; but as it stands, it seems to me
we wouldn't be making anything worse in practice with the proposed
change, and the only compatibility impact is a positive one.
Of course, I may have some wrong assumptions here, or there may be
some background I'm not aware of...
Comments?
Cheers
---Dave
Hi All,
As we discussed on Monday, I think it might be helpful to get a number
of knowledgeable people together on a call to discuss GCC optimization
opportunities.
So, I'd like to get some idea of who would like to attend, and we'll try
to find a slot we can all make. I'm on vacation next week, so I expect
it'll be in two or three weeks' time.
Before we get there, I'd like to have a list of ideas to discuss. Partly
so that we don't forget anything, and partly so that people can have a
think about them before the day.
I'm really looking for bigger picture stuff, rather than individual poor
code generation bugs.
So here's a few to kick off:
* Costs tuning.
- GCC 4.6 has a new costs model, but are we making full use of it?
- What about optimizing for size?
- Do the optimizers take any notice? [1]
* Instruction set coverage.
- Are there any ARM/Thumb2 instructions that we are not taking
advantage of? [2]
- Do we test that we use the instructions we do have? [3]
* Constant pools
- It might be a very handy space optimization to have small
functions share one constant pool, but the fact that the passes work
one function at a time makes this hard. (LP:625233)
* NEON
- There's already a lot of work going on here, and I don't want it
to hog all our time, but it might be worth touching on.
What else? I'm not the most experienced person with GCC internals, and
I'm relatively new to the ARM specific parts of those, so somebody else
must be able to come up with something far more exciting!
So, please, get brain-storming!
Andrew
[1] We discovered recently that combine is happy to take two insns and
combine them into a pattern that matches a splitter that then explodes
into three insns (partly due to no longer being able to generate
pseudo-registers).
[2] For example, I just wrote a patch to add addw and subw support (not
yet submitted).
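    For instance (my own example, not from that patch): 0xabc is not a
    valid Thumb-2 modified immediate, but it fits addw's 12-bit range:

    int
    add_const (int x)
    {
      /* With addw support this can become a single
         "addw r0, r0, #2748"; without it, gcc needs a movw plus an
         add (or two adds).  */
      return x + 0xabc;
    }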
[3] LP:643479 is an example of a case where we don't.
Hi,
Here's a work-in-progress patch which fixes many execution failures
seen in big-endian mode when -mvectorize-with-neon-quad is in effect
(which is soon to be the default, if we stick to the current plan).
But, it's pretty hairy, and I'm not at all convinced it's not working
"for the wrong reason" in a couple of places.
I'm mainly posting to gauge opinions on what we should do in big-endian
mode. This patch works with the assumption that quad-word vectors in
big-endian mode are in "vldm" order (i.e. with constituent double-words
in little-endian order: see previous discussions). But, that's pretty
confusing, leads to less than optimal code, and is bound to cause more
problems in the future. So I'm not sure how much effort to expend on
making it work right, given that we might be throwing that vector
ordering away in the future (at least in some cases: see below).
The "problem" patterns are as follows.
* Full-vector shifts: these don't work with big-endian vldm-order quad
vectors. For now, I've disabled them, although they could
potentially be implemented using vtbl (at some cost).
* Widening moves (unpacks) & widening multiplies: when widening from
D-reg to Q-reg size, we must swap double-words in the result (I've
done this with vext). This seems to work fine, but what "hi" and "lo"
refer to is rather muddled (in my head!). Also they should be
expanders instead of emitting multiple assembler insns.
* Narrowing moves: implemented by "open-coded" permute & vmovn (for 2x
D-reg -> D-reg), or 2x vmovn and vrev64.32 for Q-regs (as
suggested by Paul). These seem to work fine.
* Reduction operations: when reducing Q-reg values, GCC currently
tries to extract the result from the "wrong half" of the reduced
vector. The fix in the attached patch is rather dubious, but seems
to work (I'd like to understand why better).
We can sort those bits out, but the question is, do we want to go that
route? Vectors are used in three quite distinct ways by GCC:
1. By the vectorizer.
2. By the NEON intrinsics.
3. By the "generic vector" support.
For the first of these, I think we can get away with changing the
vectorizer to use explicit "array" loads and stores (i.e. vldN/vstN), so
that vector registers will hold elements in memory order -- so, all the
contortions in the attached patch will be unnecessary. ABI issues are
irrelevant, since vectors are "invisible" at the source code layer
generally, including at ABI boundaries.
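As an illustration of case 1 (my own sketch, not part of the patch), a
loop like:

#include <stdint.h>

void
vadd (uint32_t *__restrict a, const uint32_t *__restrict b,
      const uint32_t *__restrict c, int n)
{
  int i;

  /* Candidate loop for the vectorizer.  */
  for (i = 0; i < n; i++)
    a[i] = b[i] + c[i];
}

could then be vectorized using vld1.32/vst1.32 for the accesses to a, b
and c; lane i would hold element i in both endiannesses, so none of the
reordering fixups described above would be needed.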
For the second, intrinsics, we should do exactly what the user
requests: so, vectors are essentially treated as opaque objects. This
isn't a problem as such, but might mean that instruction patterns
written using "canonical" RTL for the vectorizer can't be shared with
intrinsics when the order of elements matters. (I'm not sure how many
patterns this would refer to at present; possibly none.)
The third case would continue to use "vldm" ordering, so if users
inadvertently write code such as:
res = vaddq_u32 (*foo, bar);
instead of writing an explicit vld* intrinsic (for the load of *foo),
the result might be different from what they expect. It'd be nice to
diagnose such code as erroneous, but that's another issue.
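To illustrate (my own sketch, not from the patch), under this scheme the
two functions below could disagree in big-endian mode: the dereference
uses the generic-vector ("vldm") ordering, while vld1q_u32 loads
elements in memory order.

#include <arm_neon.h>

/* Generic-vector load of *foo: "vldm" element ordering.  */
uint32x4_t
add_deref (uint32x4_t *foo, uint32x4_t bar)
{
  return vaddq_u32 (*foo, bar);
}

/* Explicit intrinsic load: elements in memory order.  */
uint32x4_t
add_vld1 (const uint32_t *foo, uint32x4_t bar)
{
  return vaddq_u32 (vld1q_u32 (foo), bar);
}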
The important observation is that vectors from case 1 and from cases 2/3
never interact: it's quite safe for them to use different element
orderings, without extensive changes to GCC infrastructure (i.e.,
multiple internal representations). I don't think I quite realised this
previously.
So, anyway, back to the patch in question. The choices are, I think:
1. Apply as-is (after I've ironed out the wrinkles), and then remove
the "ugly" bits at a later point when vectorizer "array load/store"
support is implemented.
2. Apply a version which simply disables all the troublesome
patterns until the same support appears.
Apologies if I'm retreading old ground ;-).
(The CANNOT_CHANGE_MODE_CLASS fragment is necessary to generate good
code for the quad-word vec_pack_trunc_<mode> pattern. It would
eventually be applied as a separate patch.)
Thoughts?
Julian
ChangeLog
gcc/
* config/arm/arm.h (CANNOT_CHANGE_MODE_CLASS): Allow changing mode
of vector registers.
* config/arm/neon.md (vec_shr_<mode>, vec_shl_<mode>): Disable in
big-endian mode.
(reduc_splus_<mode>, reduc_smin_<mode>, reduc_smax_<mode>)
(reduc_umin_<mode>, reduc_umax_<mode>)
(neon_vec_unpack<US>_lo_<mode>, neon_vec_unpack<US>_hi_<mode>)
(neon_vec_<US>mult_lo_<mode>, neon_vec_<US>mult_hi_<mode>)
(vec_pack_trunc_<mode>, neon_vec_pack_trunc_<mode>): Handle
big-endian mode for quad-word vectors.
Hi,
- the struggle with the board took a lot of time
- continued to investigate special loads/stores
- looked for benchmarks:
EEMBC Consumer filters rgbcmy and rgbyiq should be vectorizable
once vld3, vst3/4 are supported
EEMBC Telecom viterbi is supposed to give 4x on NEON once
vectorized (according to
http://www.jp.arm.com/event/pdf/forum2007/t1-5.pdf slide 29). My old
version of viterbi is not vectorizable because of if-conversion
problems. I'd be really happy to check the new version (it is supposed
to be slightly different).
Looking into other EEMBC benchmarks.
FFMPEG http://www.ffmpeg.org/ (got this from Rony Nandy from
User Platforms). It contains hand-vectorized code for NEON.
Investigating.
I am probably taking a day off on Sunday.
Ira
This wiki page came up during the toolchain call:
https://wiki.linaro.org/Internal/People/KenWerner/AtomicMemoryOperations/
It shows the code generated for __sync_val_compare_and_swap
as including a push {r4} / pop {r4} pair, because the sequence needs
more temporaries than will fit in the call-clobbered registers. I think
you can tweak it a bit to get rid of that:
# int __sync_val_compare_and_swap (int *mem, int old, int new);
# if the current value of *mem is old, then write new into *mem
# r0: mem, r1 old, r2 new
        mov     r3, r0          # move r0 into r3
        dmb     sy              # full memory barrier
.LSYT7:
        ldrex   r0, [r3]        # load (exclusive) from memory pointed to by r3 into r0
        cmp     r0, r1          # compare contents of r0 (mem) with r1 (old) -> updates the condition flag
        bne     .LSYB7          # branch to LSYB7 if mem != old
        # This strex trashes the r0 we just loaded, but since we didn't take
        # the branch we know that r0 == r1
        strex   r0, r2, [r3]    # store r2 (new) into memory pointed to by r3 (mem);
                                # r0 contains 0 if the store was successful, otherwise 1
        teq     r0, #0          # compares contents of r0 with zero -> updates the condition flag
        bne     .LSYT7          # branch to LSYT7 if r0 != 0 (if the store wasn't successful)
        # Move the value that was in memory into the right register to return it
        mov     r0, r1
        dmb     sy              # full memory barrier
.LSYB7:
        bx      lr              # return
I think you can do a similar trick with __sync_fetch_and_add
(although you have to use a subtract to regenerate r0 from
r1 and r2).
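For instance, something like this (untested sketch, same register
conventions as the sequence above):

# int __sync_fetch_and_add (int *mem, int val);
# r0: mem, r1: val
        mov     r3, r0          # free up r0 for use as the strex status register
        dmb     sy              # full memory barrier
.LSYT8:
        ldrex   r0, [r3]        # r0 = old value of *mem
        add     r2, r0, r1      # r2 = old + val
        strex   r0, r2, [r3]    # try to store r2; r0 = 0 if the store succeeded
        teq     r0, #0
        bne     .LSYT8          # retry if the store wasn't successful
        sub     r0, r2, r1      # regenerate the old value: (old + val) - val
        dmb     sy              # full memory barrier
        bx      lr              # return the old value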
On the other hand I just looked at the gcc code that does this
and it's not simply dumping canned sequences out to the
assembler, so maybe it's not worth the effort just to drop a
stack push/pop.
-- PMM
== Linaro GCC ==
* Finished testing the patch for lp675347 (QT inline-asm atomics), and
sent it upstream for comments (no response yet). Suggested in the bug
log reverting a patch (which enabled -fstrict-volatile-bitfields by
default on ARM) locally for Ubuntu.
* Continued working on NEON quad-word vectors/big-endian patch. This
turned out to be slightly fiddlier than I expected: I think I now have
semantics which make sense, though my patch requires (a) slight
middle-end changes, and (b) workarounds for unexpected combiner
behaviour re: subregs & sign/zero-extend ops. I will send a new version
of the patch to linaro-toolchain fairly soon for comments.
Hello,
I have a question about the cbnz/cbz Thumb-2 instruction implementation
in the thumb2.md file.
I have a case where we jump to a label which appears before the branch,
for example:
.L4:
...
cmp r3, 0
bne .L4
It seems that the cbnz instruction should be applicable in this loop,
replacing the cmp+bne; however, cbnz fails to be applied because
diff = ADDRESS (.L4) - ADDRESS (bne .L4) is negative, whereas according
to thumb2_cbnz in thumb2.md it should satisfy 2 <= diff <= 128 (please
see the snippet below, taken from thumb2_cbnz).
So I want to double-check whether the current implementation of
thumb2_cbnz in thumb2.md needs to be changed to enable this.
The following is from thumb2_cbnz in thumb2.md:
  [(set (attr "length")
        (if_then_else
          (and (ge (minus (match_dup 1) (pc)) (const_int 2))
               (le (minus (match_dup 1) (pc)) (const_int 128))
               (eq (symbol_ref ("which_alternative")) (const_int 0)))
          (const_int 2)
          (const_int 8)))]
Thanks,
Revital
== This week ==
* More ARM testing of binutils support for STT_GNU_IFUNC.
* Implemented the GLIBC support for STT_GNU_IFUNC. Simple ARM testcases
seem to run correctly.
* Ran the GLIBC testsuite -- which includes some ifunc coverage --
but haven't analysed the results yet.
* Started looking at Thumb for STT_GNU_IFUNC. The problem is that
BFD internally represents Thumb symbols with an even value and
a special st_type (STT_ARM_TFUNC); this is also the old, pre-EABI
external representation. We need something different for STT_GNU_IFUNC.
* Tried making BFD represent Thumb symbols as odd-value functions
internally. I got it to "work", but I wasn't really happy with
the results.
* Looked at alternatives, and in the end decided that it would be
better to have extra internal-only information in Elf_Internal_Sym.
This "works" too, and seems cleaner to me. Sent an RFC upstream:
http://sourceware.org/ml/binutils/2010-11/msg00475.html
* Started writing some Thumb tests for STT_GNU_IFUNC.
* Investigated #618684. Turned out to be something that Bernd
had already fixed upstream. Tested a backport.
== Next week ==
* More IFUNC tests (ARM and Thumb, EGLIBC and binutils).
Richard
== Linaro and upstream GCC ==
* LP #674146, dpkg segfaults during debootstrap on natty armel: analyzed
and found this should be a case of PR44768; backported mainline revision
161947 to fix it.
* LP #641379, bitfields poorly optimized. Discussed some with Andrew Stubbs.
* GCC bugzilla PR45416: Code generation regression on ARM. Been looking
at this regression, which started with the expand-from-SSA changes in
4.5-experimental. The problem seems to be that TER substitution is not
properly applied during expand (compared with the prior "convert to
GENERIC then expand" approach). I now have a patch for this, which fixes
the PR's testcase, but when testing current upstream trunk it hits an
assert-failure ICE in the alias oracle on several testcases; it does,
however, test without regressions on a 4.5-based compiler. I am still
looking at the upstream failures.
== This week ==
* Continue with GCC issues and PRs.
* Think about GCC performance opportunities (Linaro)
== Last Week ==
* Started writing libunwind support for ARM-specific unwinding, but
realized that the native ARM toolchain may be causing problems. Spent time
trying to isolate the issues, but haven't found the culprit yet.
* Began to consider the possibility of developing an ARM-specific
unwinding library which would integrate into libunwind (and be reusable
elsewhere). Basically, the .ARM.ex{idx,tab} sections are unique to ARM.
* Looked at other applications where unwinding functionality already
exists to see how ARM unwinding is done (e.g. GDB).
Could/should that functionality be replaced with calls to libunwind (or
to the aforementioned ARM-specific helper library)?
== This Week ==
* Continue to implement ARM-specific unwinding in libunwind.
--
Zach Welch
CodeSourcery
zwelch(a)codesourcery.com
(650) 331-3385 x743