Hi, We are looking for possible improvements and optimizations to Thumb-2 code size. Currently, I am running some benchmarks with the compilation flags "-Os -march=armv7-a -mthumb", and hope to find something interesting that we can improve. Besides that, do you have any ideas on this topic? Or have you observed anything in Thumb-2 code where we could improve the size?
Any thoughts on this are appreciated.
Yao
On Fri, Sep 03, 2010, Yao Qi wrote:
We are looking for possible improvements and optimizations to Thumb-2 code size. Currently, I am running some benchmarks with the compilation flags "-Os -march=armv7-a -mthumb", and hope to find something interesting that we can improve. Besides that, do you have any ideas on this topic? Or have you observed anything in Thumb-2 code where we could improve the size?
One of the largest self-contained pieces of code which we'll find on all systems which Linaro targets is ... the kernel. It's not always trivial to build it in Thumb mode, but it would seem like a good test case nevertheless. At least if it builds, looking at the resulting size would be a good test, even if it doesn't boot :-)
Loïc Minier wrote:
On Fri, Sep 03, 2010, Yao Qi wrote:
We are looking for possible improvements and optimizations to Thumb-2 code size. Currently, I am running some benchmarks with the compilation flags "-Os -march=armv7-a -mthumb", and hope to find something interesting that we can improve. Besides that, do you have any ideas on this topic? Or have you observed anything in Thumb-2 code where we could improve the size?
One of the largest self-contained pieces of code which we'll find on all systems which Linaro targets is ... the kernel. It's not always trivial to build it in Thumb mode, but it would seem like a good test case nevertheless. At least if it builds, looking at the resulting size would be a good test, even if it doesn't boot :-)
Yeah, the kernel is a good test case here, and it is not so hard to compile the kernel as Thumb-2. The Linux kernel will be good evidence if we manage to reduce its size when this work item is done. :)
However, we'd like to probe any possible improvements on code size that we can do. This part is hard, and I have few ideas so far. Thoughts?
Yao Qi wrote:
Hi, We are looking for possible improvements and optimizations to Thumb-2 code size. Currently, I am running some benchmarks with the compilation flags "-Os -march=armv7-a -mthumb", and hope to find something interesting that we can improve. Besides that, do you have any ideas on this topic? Or have you observed anything in Thumb-2 code where we could improve the size?
Any thoughts on this are appreciated.
I've put some ideas in this wiki page, https://wiki.linaro.org/Internal/People/YaoQi/Thumb2Optimize
On Mon, 6 Sep 2010, Yao Qi wrote:
Yao Qi wrote:
Hi, We are looking for possible improvements and optimizations to Thumb-2 code size. Currently, I am running some benchmarks with the compilation flags "-Os -march=armv7-a -mthumb", and hope to find something interesting that we can improve. Besides that, do you have any ideas on this topic? Or have you observed anything in Thumb-2 code where we could improve the size?
Any thoughts on this are appreciated.
I've put some ideas in this wiki page, https://wiki.linaro.org/Internal/People/YaoQi/Thumb2Optimize
Your remark for the first example is wrong. GCC has to store r8 (or any other register for that matter) in order to keep the stack pointer 64-bit aligned, as required by EABI.
Nicolas
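The alignment argument comes down to simple arithmetic: each pushed core register takes 4 bytes, so an odd number of registers leaves sp only 4-byte aligned. Below is a minimal sketch in C. The helper names and the bit-mask encoding are invented for illustration and are not GCC's code. It assumes sp is 8-byte aligned on entry and that the function allocates no further stack.

```c
#include <stdint.h>

/* Bit n of a register mask stands for rn; bit 14 is lr, bit 15 is pc.
   This encoding is chosen for the sketch only. */

/* Count how many registers a push/pop mask names. */
unsigned count_regs(uint16_t mask)
{
    unsigned n = 0;
    unsigned m = mask;
    while (m != 0) {
        n += m & 1u;
        m >>= 1;
    }
    return n;
}

/* Return 1 if pushing exactly the registers in MASK would move sp by a
   number of bytes that is not a multiple of 8, so that one more register
   has to be pushed as padding.  Assumes sp was 8-byte aligned on entry
   and that no other stack space is allocated. */
int needs_padding_reg(uint16_t mask)
{
    return (count_regs(mask) * 4u) % 8u != 0;
}
```

With this encoding, {r4, r5, r6, r7, lr} is the mask 0x40F0 and needs a padding register, while {r4, r5, r6, r7, r8, lr} is 0x41F0 and does not.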
Nicolas Pitre wrote:
Your remark for the first example is wrong. GCC has to store r8 (or any other register for that matter) in order to keep the stack pointer 64-bit aligned, as required by EABI.
Nicolas, Thanks for letting me know this. I've marked this example as wrong on the wiki.
On 06/09/10 07:16, Yao Qi wrote:
I've put some ideas in this wiki page, https://wiki.linaro.org/Internal/People/YaoQi/Thumb2Optimize
We probably shouldn't post Internal links to this public list. Is there any reason this can't be done in the open?
Now for the page content ....
I think you should make clear that we're after _size_ optimizations in this case, if just for readability's sake.
1. This example (regardless of correctness) gains no size improvement.
2. This code is clearly an inlined memset. It might be that a branch instruction with constants and such is not (much) smaller. We should investigate what GCC does for different size writes.
3. This sounds like a nightmare for register allocation, but if you could make it happen then great :)
....
6. Is that an EEMBC function? We can't change those in the source. Are you proposing a -fwhole-program optimization? (Of course, enabling inlining at -Os for trivial functions like this might work without -fwhole-program or LTO, if it's in the same TU.)
Other ideas:
* https://bugs.launchpad.net/gcc-linaro/+bug/625233
* Investigate reduced alignment constraints?
Andrew
Andrew Stubbs wrote:
On 06/09/10 07:16, Yao Qi wrote:
I've put some ideas in this wiki page, https://wiki.linaro.org/Internal/People/YaoQi/Thumb2Optimize
We probably shouldn't post Internal links to this public list. Is there any reason this can't be done in the open?
I've moved this page to a public place https://wiki.linaro.org/YaoQi/Sandbox/Thumb2SizeOptimize
Now for the page content ....
I think you should make clear that we're after _size_ optimizations in this case, if just for readability's sake.
- This example (regardless of correctness) gains no size improvement.
OK, I should remove this one.
- This code is clearly an inlined memset. It might be that a branch
instruction with constants and such is not (much) smaller. We should investigate what GCC does for different size writes.
Yeah, I agree that we should investigate what GCC does for different sizes.
- This sounds like a nightmare for register allocation, but if you
could make it happen then great :)
....
- Is that an EEMBC function? We can't change those in the source. Are
you proposing a -fwhole-program optimization? (Of course, enabling inlining at -Os for trivial functions like this might work without -fwhole-program or LTO, if it's in the same TU.)
Yes, that is an EEMBC function. Of course, we can't change source code. It is not related to thumb2 code size optimization. I've moved it to another section.
Other ideas:
Add it in this wiki page.
- Investigate reduced alignment constraints?
Any details on this?
On 07/09/10 13:01, Yao Qi wrote:
- Investigate reduced alignment constraints?
Any details on this?
No, I just know that some targets like to align functions to cache-lines. This is a useful speed optimization, but does lead to lots of "blank" gaps in the code. I have no real idea if ARM does this kind of thing, or if the ABI has anything to say about it.
I just suggest that we should check it out - or at least ask an ARM expert if I'm talking nonsense. :)
Andrew
On Tue, 2010-09-07 at 13:09 +0100, Andrew Stubbs wrote:
On 07/09/10 13:01, Yao Qi wrote:
- Investigate reduced alignment constraints?
Any details on this?
No, I just know that some targets like to align functions to cache-lines. This is a useful speed optimization, but does lead to lots of "blank" gaps in the code. I have no real idea if ARM does this kind of thing, or if the ABI has anything to say about it.
I just suggest that we should check it out - or at least ask an ARM expert if I'm talking nonsense. :)
I'm pretty certain we don't do this gratuitously with -Os on ARM. We may, however, align some Thumb functions to a 32-bit boundary unnecessarily (still needed if there's a literal pool).
R.
On Mon, 06 Sep 2010 14:16:25 +0800 Yao Qi yao.qi@linaro.org wrote:
Yao Qi wrote:
Hi, We are looking for possible improvements and optimizations to Thumb-2 code size. Currently, I am running some benchmarks with the compilation flags "-Os -march=armv7-a -mthumb", and hope to find something interesting that we can improve. Besides that, do you have any ideas on this topic? Or have you observed anything in Thumb-2 code where we could improve the size?
Any thoughts on this are appreciated.
I've put some ideas in this wiki page, https://wiki.linaro.org/Internal/People/YaoQi/Thumb2Optimize
People have pointed out problems with your first example already, but there might actually have been something possible to do there (I see you removed it already though!): the problem is that r8 is saved just to maintain 8-byte stack alignment, but that changes the prologue and epilogue push & pop instructions from 2-byte to 4-byte instructions.
I thought this was just an unfortunate corner case which we couldn't do anything about, but maybe it is... could we have pushed an extra low register instead (e.g. r3 instead of r8) to maintain stack alignment? Do you still have the code fragment handy (I don't remember exactly how it went)?
Julian
On Tue, Sep 07, 2010, Julian Brown wrote:
Do you still have the code fragment handy (I don't remember exactly how it went)?
You can extract it from the wiki history with the "Info" action on the page and then diffing revisions:
1. stmdb/ldmia registers that are not used
 * Observations
{{{
Dump of assembler code for function history_expand_line_internal:
0x00001c1c <+0>: stmdb sp!, {r4, r5, r6, r7, r8, lr}
0x00001c20 <+4>: movs r1, #0
0x00001c22 <+6>: ldr r5, [pc, #52] ; (0x1c58 <history_expand_line_internal+60>)
0x00001c24 <+8>: mov r2, r1
0x00001c26 <+10>: mov r6, r0
0x00001c28 <+12>: ldr r7, [r5, #0]
0x00001c2a <+14>: str r1, [r5, #0]
0x00001c2c <+16>: bl 0x1c2c <history_expand_line_internal+16>
0x00001c30 <+20>: str r7, [r5, #0]
0x00001c32 <+22>: cmp r0, r6
0x00001c34 <+24>: mov r4, r0
0x00001c36 <+26>: bne.n 0x1c52 <history_expand_line_internal+54>
0x00001c38 <+28>: bl 0x1c38 <history_expand_line_internal+28>
0x00001c3c <+32>: ldr r1, [pc, #28] ; (0x1c5c <history_expand_line_internal+64>)
0x00001c3e <+34>: movw r2, #1850 ; 0x73a
0x00001c42 <+38>: adds r0, #1
0x00001c44 <+40>: bl 0x1c44 <history_expand_line_internal+40>
0x00001c48 <+44>: mov r1, r4
0x00001c4a <+46>: ldmia.w sp!, {r4, r5, r6, r7, r8, lr}
0x00001c4e <+50>: b.w 0x1c4e <history_expand_line_internal+50>
0x00001c52 <+54>: ldmia.w sp!, {r4, r5, r6, r7, r8, pc}
0x00001c56 <+58>: nop
0x00001c58 <+60>: andeq r0, r0, r0
0x00001c5c <+64>: andeq r0, r0, r0
}}}
Register r8 is not used in this function, so no need to save/restore r8.
 * Possible improvements
On Tue, 7 Sep 2010 12:55:59 +0200 Loïc Minier loic.minier@linaro.org wrote:
On Tue, Sep 07, 2010, Julian Brown wrote:
Do you still have the code fragment handy (I don't remember exactly how it went)?
You can extract it from the wiki history with the "Info" action on the page and then diffing revisions:
Oh right, I should have realised that :-).
- stmdb/ldmia registers that are not used
- Observations
{{{
Dump of assembler code for function history_expand_line_internal:
0x00001c1c <+0>: stmdb sp!, {r4, r5, r6, r7, r8, lr}
This could be:
push {r3, r4, r5, r6, r7, lr}
0x00001c20 <+4>: movs r1, #0
0x00001c22 <+6>: ldr r5, [pc, #52] ; (0x1c58 <history_expand_line_internal+60>)
0x00001c24 <+8>: mov r2, r1
0x00001c26 <+10>: mov r6, r0
0x00001c28 <+12>: ldr r7, [r5, #0]
0x00001c2a <+14>: str r1, [r5, #0]
0x00001c2c <+16>: bl 0x1c2c <history_expand_line_internal+16>
0x00001c30 <+20>: str r7, [r5, #0]
0x00001c32 <+22>: cmp r0, r6
0x00001c34 <+24>: mov r4, r0
0x00001c36 <+26>: bne.n 0x1c52 <history_expand_line_internal+54>
0x00001c38 <+28>: bl 0x1c38 <history_expand_line_internal+28>
0x00001c3c <+32>: ldr r1, [pc, #28] ; (0x1c5c <history_expand_line_internal+64>)
0x00001c3e <+34>: movw r2, #1850 ; 0x73a
0x00001c42 <+38>: adds r0, #1
0x00001c44 <+40>: bl 0x1c44 <history_expand_line_internal+40>
0x00001c48 <+44>: mov r1, r4
0x00001c4a <+46>: ldmia.w sp!, {r4, r5, r6, r7, r8, lr}
This must remain a wide instruction...
ldmia.w sp!, {r3, r4, r5, r6, r7, lr}
0x00001c4e <+50>: b.w 0x1c4e <history_expand_line_internal+50>
0x00001c52 <+54>: ldmia.w sp!, {r4, r5, r6, r7, r8, pc}
But this could be:
pop {r3, r4, r5, r6, r7, pc}
0x00001c56 <+58>: nop
0x00001c58 <+60>: andeq r0, r0, r0
0x00001c5c <+64>: andeq r0, r0, r0
}}}
Register r8 is not used in this function, so no need to save/restore r8.
- Possible improvements
So yeah, I think there is indeed a possible improvement here (and we don't even need to break the EABI, I don't think). Unless I've overlooked something, anyway...
Julian
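The encoding constraint behind these suggestions can be written down as a small check. The 16-bit Thumb PUSH can name only r0-r7 plus lr, and the 16-bit POP only r0-r7 plus pc; any other register forces the 32-bit form. The C sketch below models just that rule. The function names and the mask encoding (bit n for rn) are invented for illustration, and it assumes a non-empty register list.

```c
#define REG_BIT(n)  (1u << (n))
#define LOW_REGS    0x00FFu        /* r0-r7 */
#define LR_BIT      REG_BIT(14)
#define PC_BIT      REG_BIT(15)

/* Size in bytes of "push {mask}" in Thumb-2: the narrow encoding is
   available only if every register is in r0-r7 or is lr. */
unsigned push_size(unsigned mask)
{
    return (mask & ~(LOW_REGS | LR_BIT)) == 0 ? 2u : 4u;
}

/* Size in bytes of "pop {mask}" in Thumb-2: the narrow encoding is
   available only if every register is in r0-r7 or is pc. */
unsigned pop_size(unsigned mask)
{
    return (mask & ~(LOW_REGS | PC_BIT)) == 0 ? 2u : 4u;
}
```

For the masks discussed here, push {r4-r8, lr} (0x41F0) is 4 bytes while push {r3-r7, lr} (0x40F8) is 2. pop {r3-r7, pc} (0x80F8) is 2 bytes, but pop {r3-r7, lr} (0x40F8) stays at 4, which matches the remark that the load before the tail call must remain wide.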
Julian Brown wrote:
On Tue, 7 Sep 2010 12:55:59 +0200 Loïc Minier loic.minier@linaro.org wrote:
On Tue, Sep 07, 2010, Julian Brown wrote:
Do you still have the code fragment handy (I don't remember exactly how it went)?
You can extract it from the wiki history with the "Info" action on the page and then diffing revisions:
Oh right, I should have realised that :-).
So yeah, I think there is indeed a possible improvement here (and we don't even need to break the EABI, I don't think). Unless I've overlooked something, anyway...
Julian, I have restored the first example and added your comments to it. https://wiki.linaro.org/YaoQi/Sandbox/Thumb2SizeOptimize
To teach GCC to choose a low register when padding for stack alignment, which part of GCC should I look at? Is it the register allocator (RA) or regrename?
On Tue, 07 Sep 2010 21:06:10 +0800 Yao Qi yao.qi@linaro.org wrote:
Julian Brown wrote:
So yeah, I think there is indeed a possible improvement here (and we don't even need to break the EABI, I don't think). Unless I've overlooked something, anyway...
Julian, I have restored the first example and added your comments to it. https://wiki.linaro.org/YaoQi/Sandbox/Thumb2SizeOptimize
To teach GCC to choose a low register when padding for stack alignment, which part of GCC should I look at? Is it the register allocator (RA) or regrename?
No, all the code to generate prologues & epilogues is target-specific, and happens after register allocation. Take a look at e.g. arm.c:arm_expand_prologue and friends. (Beware though, they can be quite fiddly!)
Julian
This reminds me of a PR that Bernd did: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40657
It also adds support for including the r0-r3 registers in the prologue/epilogue push/pop for the sake of reducing code size, though in a sense it is even more aggressive; it tries to merge the local stack allocation (SP sub/add) with the stm/ldm.
Bernd's patch was for Thumb-1, though I don't see why it can't be implemented for ARM/Thumb-2 too.
Chung-Lin
On 2010/9/7 21:36, Julian Brown wrote:
On Tue, 07 Sep 2010 21:06:10 +0800 Yao Qi yao.qi@linaro.org wrote:
Julian Brown wrote:
So yeah, I think there is indeed a possible improvement here (and we don't even need to break the EABI, I don't think). Unless I've overlooked something, anyway...
Julian, I have restored the first example and added your comments to it. https://wiki.linaro.org/YaoQi/Sandbox/Thumb2SizeOptimize
To teach GCC to choose a low register when padding for stack alignment, which part of GCC should I look at? Is it the register allocator (RA) or regrename?
No, all the code to generate prologues & epilogues is target-specific, and happens after register allocation. Take a look at e.g. arm.c:arm_expand_prologue and friends. (Beware though, they can be quite fiddly!)
Julian
Chung-Lin Tang wrote:
This reminds me of a PR that Bernd did: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40657
It also adds support for including the r0-r3 registers in the prologue/epilogue push/pop for the sake of reducing code size, though in a sense it is even more aggressive; it tries to merge the local stack allocation (SP sub/add) with the stm/ldm.
Bernd's patch was for Thumb-1, though I don't see why it can't be implemented for ARM/Thumb-2 too.
Chung-Lin, 'Unfortunately', FSF GCC trunk can do this for Thumb2. 1.c is from GCC PR40657.
./fsf-mainline/install/bin/arm-none-linux-gnueabi-gcc -mthumb -mcpu=cortex-a9 -Os 1.c -c -o 1.o
00000000 <foo>:
   0: b507      push {r0, r1, r2, lr}
   2: a801      add r0, sp, #4
   4: f7ff fffe bl 0 <bar>
   8: 9801      ldr r0, [sp, #4]
   a: bd0e      pop {r1, r2, r3, pc}
On Tue, 2010-09-07 at 12:24 +0100, Julian Brown wrote:
On Tue, 7 Sep 2010 12:55:59 +0200 Loïc Minier loic.minier@linaro.org wrote:
On Tue, Sep 07, 2010, Julian Brown wrote:
Do you still have the code fragment handy (I don't remember exactly how it went)?
You can extract it from the wiki history with the "Info" action on the page and then diffing revisions:
Oh right, I should have realised that :-).
- stmdb/ldmia registers that are not used
- Observations
{{{
Dump of assembler code for function history_expand_line_internal:
0x00001c1c <+0>: stmdb sp!, {r4, r5, r6, r7, r8, lr}
This could be:
push {r3, r4, r5, r6, r7, lr}
0x00001c20 <+4>: movs r1, #0
0x00001c22 <+6>: ldr r5, [pc, #52] ; (0x1c58 <history_expand_line_internal+60>)
0x00001c24 <+8>: mov r2, r1
0x00001c26 <+10>: mov r6, r0
0x00001c28 <+12>: ldr r7, [r5, #0]
0x00001c2a <+14>: str r1, [r5, #0]
0x00001c2c <+16>: bl 0x1c2c <history_expand_line_internal+16>
0x00001c30 <+20>: str r7, [r5, #0]
0x00001c32 <+22>: cmp r0, r6
0x00001c34 <+24>: mov r4, r0
0x00001c36 <+26>: bne.n 0x1c52 <history_expand_line_internal+54>
0x00001c38 <+28>: bl 0x1c38 <history_expand_line_internal+28>
0x00001c3c <+32>: ldr r1, [pc, #28] ; (0x1c5c <history_expand_line_internal+64>)
0x00001c3e <+34>: movw r2, #1850 ; 0x73a
0x00001c42 <+38>: adds r0, #1
0x00001c44 <+40>: bl 0x1c44 <history_expand_line_internal+40>
0x00001c48 <+44>: mov r1, r4
0x00001c4a <+46>: ldmia.w sp!, {r4, r5, r6, r7, r8, lr}
This must remain a wide instruction...
ldmia.w sp!, {r3, r4, r5, r6, r7, lr}
0x00001c4e <+50>: b.w 0x1c4e <history_expand_line_internal+50>
0x00001c52 <+54>: ldmia.w sp!, {r4, r5, r6, r7, r8, pc}
But this could be:
pop {r3, r4, r5, r6, r7, pc}
0x00001c56 <+58>: nop
0x00001c58 <+60>: andeq r0, r0, r0
0x00001c5c <+64>: andeq r0, r0, r0
}}}
Register r8 is not used in this function, so no need to save/restore r8.
- Possible improvements
So yeah, I think there is indeed a possible improvement here (and we don't even need to break the EABI, I don't think). Unless I've overlooked something, anyway...
GCC 4.5 should already do this:
2009-06-02 Richard Earnshaw rearnsha@arm.com
* arm.c (arm_get_frame_offsets): Prefer using r3 for padding a push/pop multiple to 8-byte alignment.
R.
It would be interesting if we could get a good, representative set of comparative benchmarks for the size and performance impact of -Os.
I did a bit of investigation here:
https://wiki.linaro.org/Platform/Foundations/OptimiseForSize
(though with just a few packages and only one benchmark, it's not very comprehensive)
Cheers ---Dave
It's only part of the puzzle, but I run speed benchmarks as part of the continuous build: http://ex.seabright.co.nz/helpers/buildlog http://ex.seabright.co.nz/helpers/benchcompare http://ex.seabright.co.nz/build/gcc-linaro-4.5-2010.09-1/logs/armv7l-maveric...
I've just modified this to build different variants as well. ffmpeg now builds as supplied (-O2 and others), with -Os, with hand-written assembler turned off, and with -mfpu=neon. corebench builds in -O2 and -Os.
This might be one way to approach things. It's simple to add other programs into the mix.
-- Michael
On Fri, Sep 17, 2010 at 4:05 AM, Dave Martin dave.martin@linaro.org wrote:
It would be interesting if we could get a good, representative set of comparative benchmarks for the size and performance impact of -Os.
I did a bit of investigation here:
https://wiki.linaro.org/Platform/Foundations/OptimiseForSize
(though with just a few packages and only one benchmark, it's not very comprehensive)
Cheers ---Dave
linaro-toolchain mailing list linaro-toolchain@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-toolchain
Michael Hope wrote:
It's only part of the puzzle, but I run speed benchmarks as part of the continuous build: http://ex.seabright.co.nz/helpers/buildlog http://ex.seabright.co.nz/helpers/benchcompare http://ex.seabright.co.nz/build/gcc-linaro-4.5-2010.09-1/logs/armv7l-maveric...
I've just modified this to build different variants as well. ffmpeg now builds as supplied (-O2 and others), with -Os, with hand-written assembler turned off, and with -mfpu=neon. corebench builds in -O2 and -Os.
Here are some options we may have to use in our benchmarks: {-Os,-O2} -fno-common -mthumb -mfloat-abi={hard,soft} -mfpu=neon
IIRC, hardfp will increase the code size to some extent.
This might be one way to approach things. It's simple to add other programs into the mix.
-- Michael
On Fri, Sep 17, 2010 at 4:05 AM, Dave Martin dave.martin@linaro.org wrote
On Fri, 2010-09-17 at 18:21 +0800, Yao Qi wrote:
Michael Hope wrote:
It's only part of the puzzle, but I run speed benchmarks as part of the continuous build: http://ex.seabright.co.nz/helpers/buildlog http://ex.seabright.co.nz/helpers/benchcompare http://ex.seabright.co.nz/build/gcc-linaro-4.5-2010.09-1/logs/armv7l-maveric...
I've just modified this to build different variants as well. ffmpeg now builds as supplied (-O2 and others), with -Os, with hand-written assembler turned off, and with -mfpu=neon. corebench builds in -O2 and -Os.
Here are some options we may have to use in our benchmarks: {-Os,-O2} -fno-common -mthumb -mfloat-abi={hard,soft} -mfpu=neon
IIRC, hardfp will increase the code size to some extent.
hard-float should show a significant code size saving over pure soft-float for anything with floating point code as the compiler will be able to use single instructions for many operations rather than library calls.
R.
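As a small illustration of that point, consider a one-line floating-point function. The function name is made up for this sketch. The comment describes what GCC is generally expected to emit under each float ABI rather than output captured from a real build, so it is worth confirming by compiling with each -mfloat-abi setting and comparing with 'size'.

```c
/* Multiply two doubles.
   With -mfloat-abi=soft the multiplication is typically lowered to a
   call to the EABI helper __aeabi_dmul, plus the moves needed to set up
   the call.  With a VFP-capable float ABI it can be a single
   vmul.f64 instruction. */
double scale(double x, double factor)
{
    return x * factor;
}
```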
Richard Earnshaw wrote:
I've just modified this to build different variants as well. ffmpeg now builds as supplied (-O2 and others), with -Os, with hand-written assembler turned off, and with -mfpu=neon. corebench builds in -O2 and -Os.
Here are some options we may have to use in our benchmarks: {-Os,-O2} -fno-common -mthumb -mfloat-abi={hard,soft} -mfpu=neon
IIRC, hardfp will increase the code size to some extent.
hard-float should show a significant code size saving over pure soft-float for anything with floating point code as the compiler will be able to use single instructions for many operations rather than library calls.
Richard, You are right. I ran EEMBC again with softfp and hardfp, and the results show that hardfp reduces code size by 2% to 35%. Thanks for your clarification.
Hi,
On Fri, Sep 17, 2010 at 3:50 AM, Michael Hope michael.hope@linaro.org wrote:
It's only part of the puzzle, but I run speed benchmarks as part of the continuous build: http://ex.seabright.co.nz/helpers/buildlog http://ex.seabright.co.nz/helpers/benchcompare http://ex.seabright.co.nz/build/gcc-linaro-4.5-2010.09-1/logs/armv7l-maveric...
I've just modified this to build different variants as well. ffmpeg now builds as supplied (-O2 and others), with -Os, with hand-written assembler turned off, and with -mfpu=neon. corebench builds in -O2 and -Os.
This might be one way to approach things. It's simple to add other programs into the mix.
Could you easily add code size metrics?
It would be useful to watch those for regressions also, especially if there's an ongoing effort to make -Os better.
It would be good to have more system-oriented metrics as well, such as boot, login and app launch times, and cache and TLB performance. Results of microbenchmarks can be quite misleading when it comes to the performance of the system as a whole. I'm not sure of the best way to approach that: many variables affect performance, and you'd need to build many packages to get a system to benchmark. It might be overkill; the toolchain can definitely influence such metrics, but it may become a less dominant factor once you're studying a large enough blob of software.
Cheers ---Dave
On Sat, Sep 18, 2010 at 3:00 AM, Dave Martin dave.martin@linaro.org wrote:
Hi,
On Fri, Sep 17, 2010 at 3:50 AM, Michael Hope michael.hope@linaro.org wrote:
It's only part of the puzzle, but I run speed benchmarks as part of the continuous build: http://ex.seabright.co.nz/helpers/buildlog http://ex.seabright.co.nz/helpers/benchcompare http://ex.seabright.co.nz/build/gcc-linaro-4.5-2010.09-1/logs/armv7l-maveric...
I've just modified this to build different variants as well. ffmpeg now builds as supplied (-O2 and others), with -Os, with hand-written assembler turned off, and with -mfpu=neon. corebench builds in -O2 and -Os.
This might be one way to approach things. It's simple to add other programs into the mix.
Could you easily add code size metrics?
The build currently runs 'size' on every executable file it can find. See: http://ex.seabright.co.nz/build/gcc-linaro-4.5-2010.09-1/logs/armv7l-maveric...
for an example.
It would be useful to watch those for regressions also, especially if there's an ongoing effort to make -Os better.
Yip. I'm recording at the moment and hope to hand the reporting side off to the Infrastructure team.
It would be good to have more system-oriented metrics as well, such as boot, login and app launch times, and cache and TLB performance. Results of microbenchmarks can be quite misleading when it comes to the performance of the system as a whole. I'm not sure of the best way to approach that: many variables affect performance, and you'd need to build many packages to get a system to benchmark. It might be overkill; the toolchain can definitely influence such metrics, but it may become a less dominant factor once you're studying a large enough blob of software.
It's not wholly a toolchain issue but one I'm interested in. Something we should talk about at Linaro@UDS...
-- Michael
Hi,
On Sun, Sep 19, 2010 at 9:40 PM, Michael Hope michael.hope@linaro.org wrote:
[...]
I've just modified this to build different variants as well. ffmpeg now builds as supplied (-O2 and others), with -Os, with hand-written assembler turned off, and with -mfpu=neon. corebench builds in -O2 and -Os.
Sounds good.
Could you easily add code size metrics?
The build currently runs 'size' on every executable file it can find. See: http://ex.seabright.co.nz/build/gcc-linaro-4.5-2010.09-1/logs/armv7l-maveric...
OK, that looks like enough to get some useful summary information.
It's not wholly a toolchain issue but one I'm interested in. Something we should talk about at Linaro@UDS...
Yep, that sounds like a good idea.
Cheers ---Dave
Yao Qi wrote:
Hi, We are looking for possible improvements and optimizations to Thumb-2 code size. Currently, I am running some benchmarks with the compilation flags "-Os -march=armv7-a -mthumb", and hope to find something interesting that we can improve. Besides that, do you have any ideas on this topic? Or have you observed anything in Thumb-2 code where we could improve the size?
Any thoughts on this are appreciated.
I found some new possible improvements. Your comments on them are welcome. See more details in https://wiki.linaro.org/YaoQi/Sandbox/Thumb2SizeOptimize
10. Replace multiple vldr by vldm
Observed in bezier01float/bez.o,
   8: f100 0438 add.w r4, r0, #56 ; 0x38
   c: b085      sub sp, #20
   e: 2600      movs r6, #0
  10: e03d      b.n 8e <interpolatePoints+0x8e>
  12: e954 2302 ldrd r2, r3, [r4, #-8]
  16: 2500      movs r5, #0
  18: ed14 ab0e vldr d10, [r4, #-56] ; 0xffffffc8 // <--
  1c: ed14 bb0c vldr d11, [r4, #-48] ; 0xffffffd0 // <--
  20: ed14 cb0a vldr d12, [r4, #-40] ; 0xffffffd8 // <--
  24: ed14 db08 vldr d13, [r4, #-32] ; 0xffffffe0 // <--
  28: e9cd 2300 strd r2, r3, [sp]
  2c: ed14 eb06 vldr d14, [r4, #-24] ; 0xffffffe8 // <--
These vldr instructions can be replaced by one vldm.
11. Replace str/ldr by memcpy
Observed in bezier01fixed/pointio.o:outputPoints()
00000000 <outputPoints>:
   0: e92d 4ff0 stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
   4: 4604      mov r4, r0
   6: b089      sub sp, #36 ; 0x24
   8: 2600      movs r6, #0
   a: 460f      mov r7, r1
   c: e025      b.n 5a <outputPoints+0x5a>
   e: 68e3      ldr r3, [r4, #12]
  10: 2500      movs r5, #0
  12: e894 0e00 ldmia.w r4, {r9, sl, fp}
  16: 9303      str r3, [sp, #12]
  18: 6923      ldr r3, [r4, #16]
  1a: 9304      str r3, [sp, #16]
  1c: 6963      ldr r3, [r4, #20]
  1e: 9305      str r3, [sp, #20]
  20: 69a3      ldr r3, [r4, #24]
  22: 9306      str r3, [sp, #24]
  24: 69e3      ldr r3, [r4, #28]
  26: 9307      str r3, [sp, #28]
Code size will be smaller if we replace ldr/str by memcpy().
12. uxth/sxth
Observed in automotive/idctrn01/bmark.c

short unPack( unsigned char c )
{
    /* Only want lower four bit nibble */
    c = c & (unsigned char)0x0F ;

    if( c > 7 ) {
        /* Negative nibble */
        return( ( short )( c - 16 ) ) ;
    }
    else {
        /* positive nibble */
        return( ( short )c ) ;
    }
}

GCC produces code like this,
00000024 <unPack>:
  24: f000 000f and.w r0, r0, #15
  28: 2807      cmp r0, #7
  2a: d901      bls.n 30 <unPack+0xc>
  2c: 3810      subs r0, #16
  2e: b280      uxth r0, r0   <--[1]
  30: b200      sxth r0, r0   <--[2]
  32: 4770      bx lr
Are instructions [1] and [2] redundant? If they are, we can remove both of them safely.
On 09/09/10 16:22, Yao Qi wrote:
GCC produces code like this,
00000024 <unPack>:
  24: f000 000f and.w r0, r0, #15
  28: 2807      cmp r0, #7
  2a: d901      bls.n 30 <unPack+0xc>
  2c: 3810      subs r0, #16
  2e: b280      uxth r0, r0   <--[1]
  30: b200      sxth r0, r0   <--[2]
  32: 4770      bx lr
Are instructions [1] and [2] redundant? If they are, we can remove both of them safely.
Yes, I'd say they were redundant.
In one code path, the result is always positive, and strictly <16, so the sign extend is a NOP.
In the other code path, UXTH followed by SXTH is always equivalent to SXTH alone, regardless of input.
I wondered for a while whether the extension or rotation did anything cunning to the status register, or something, but it seems not.
Andrew
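The second claim is easy to check by brute force. The C sketch below models the two instructions without rotation, using the names uxth and sxth for the models. These are plain C functions written for this illustration, not compiler intrinsics. The sign extension uses unsigned wraparound so that nothing implementation-defined is involved.

```c
#include <stdint.h>

/* Model of UXTH: zero-extend the low 16 bits. */
uint32_t uxth(uint32_t x)
{
    return x & 0xFFFFu;
}

/* Model of SXTH: sign-extend the low 16 bits.  Flipping bit 15 and then
   subtracting 0x8000 propagates the sign through unsigned wraparound. */
uint32_t sxth(uint32_t x)
{
    uint32_t low = x & 0xFFFFu;
    return (low ^ 0x8000u) - 0x8000u;
}

/* Return 1 if sxth(uxth(x)) == sxth(x) for every low halfword, tried
   with several different upper-halfword patterns; 0 otherwise. */
int uxth_then_sxth_is_sxth(void)
{
    static const uint32_t high[] = {
        0x00000000u, 0xFFFF0000u, 0x12340000u, 0x80000000u
    };
    for (unsigned h = 0; h < sizeof high / sizeof high[0]; h++) {
        for (uint32_t low = 0; low <= 0xFFFFu; low++) {
            uint32_t x = high[h] | low;
            if (sxth(uxth(x)) != sxth(x))
                return 0;
        }
    }
    return 1;
}
```

The check passes because SXTH looks only at the low 16 bits, and those are exactly the bits UXTH leaves untouched.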
On Thu, 9 Sep 2010, Andrew Stubbs wrote:
On 09/09/10 16:22, Yao Qi wrote:
GCC produces code like this,
00000024 <unPack>:
  24: f000 000f and.w r0, r0, #15
  28: 2807      cmp r0, #7
  2a: d901      bls.n 30 <unPack+0xc>
  2c: 3810      subs r0, #16
  2e: b280      uxth r0, r0   <--[1]
  30: b200      sxth r0, r0   <--[2]
  32: 4770      bx lr
Are instructions [1] and [2] redundant? If they are, we can remove both of them safely.
Yes, I'd say they were redundant.
In one code path, the result is always positive, and strictly <16, so the sign extend is a NOP.
In the other code path, UXTH followed by SXTH is always equivalent to SXTH alone, regardless of input.
I wondered for a while whether the extension or rotation did anything cunning to the status register, or something, but it seems not.
Of course, the optimal code sequence for this function would be:
    lsl r0, r0, #28
    asr r0, r0, #28
    bx lr
But I doubt gcc could ever become that smart.
Nicolas
On Fri, Sep 10, 2010 at 2:14 AM, Nicolas Pitre nicolas.pitre@linaro.org wrote:
[...]
    lsl r0, r0, #28
    asr r0, r0, #28
    bx lr
But I doubt gcc could ever become that smart.
Someone pointed out to me that the tempting C equivalent
(int)((unsigned)c << 28) >> 28
is invalid C, because the result of the unsigned->signed cast (needed to get arithmetic right shift) is undefined if the argument is > INT_MAX. Maybe that's why the EEMBC code is so verbose.
Of course, that C snippet is often used in practice, and works on common architectures using a sane integer representation.
Cheers ---Dave
On 9/10/2010 2:17 AM, Dave Martin wrote:
(int)((unsigned)c << 28) >> 28
is invalid C, because the result of the unsigned->signed cast (needed to get arithmetic right shift) is undefined if the argument is > INT_MAX.
True, but undefined-ness (or, in this case, implementation-defined-ness, IIRC) helps the compiler; it's free to do as it wants. So, this is no excuse for the compiler generating slow code. (In particular, the compiler needn't generate a test to see whether the argument is greater than INT_MAX.)
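For comparison, the nibble sign extension can also be written without the cast or the right shift of a negative value, using the well-known xor-and-subtract idiom. The sketch below is only an illustration: unpack_xor_sub and variants_agree are names invented here, unPack is reproduced from the thread, and no claim is made about the code GCC generates for either form.

```c
/* The function quoted earlier in the thread, reproduced for comparison. */
short unPack(unsigned char c)
{
    c = c & (unsigned char)0x0F;
    if (c > 7)
        return (short)(c - 16);   /* negative nibble */
    else
        return (short)c;          /* positive nibble */
}

/* Branch-free variant: flipping bit 3 and subtracting 8 maps 0..7 to
   0..7 and 8..15 to -8..-1, using only well-defined int arithmetic. */
short unpack_xor_sub(unsigned char c)
{
    int nibble = c & 0x0F;
    return (short)((nibble ^ 8) - 8);
}

/* Return 1 if both versions agree for all 256 input values. */
int variants_agree(void)
{
    for (int c = 0; c <= 255; c++) {
        if (unPack((unsigned char)c) != unpack_xor_sub((unsigned char)c))
            return 0;
    }
    return 1;
}
```

Both functions ignore the upper nibble of the input, so checking all 256 values of an unsigned char covers every case.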