On Wed, 2010-11-03 at 17:39 +0800, Yao Qi wrote:
Hi, I am backporting some patches from FSF mainline, which may improve the speed of Linaro 4.5 gcc on Thumb-2.
The first one was done by Richard E.: "Improve optimization to transform TST into LSLS" http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02518.html After it is applied to the Linaro 4.5 tree, the EEMBC speed number degrades, while code size is reduced to some extent. The code difference looks like this:
 6801        ldr     r1, [r0, #0]
 f831 3013   ldrh.w  r3, [r1, r3, lsl #1]
-f413 6f00   tst.w   r3, #2048 ; 0x800
-f43f af41   beq.w   cc <t_run_test+0xcc>
+0518        lsls    r0, r3, #20
+f57f af44   bpl.w   cc <t_run_test+0xcc>
 4610        mov     r0, r2
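To make the transformation concrete: tst.w r3, #0x800 followed by beq branches when bit 11 of r3 is clear, while lsls r0, r3, #20 moves bit 11 into bit 31 (the sign bit), so bpl branches under the same condition. The lsls form uses a 16-bit encoding (0518) instead of a 32-bit one (f413 6f00), which accounts for the size reduction, but it also writes a destination register (r0 here, which is overwritten by the following mov). Below is a minimal Python sketch of that equivalence. It models only the flags relevant to the two branches, and the function names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit ARM registers


def tst_beq_taken(r3: int, mask: int = 0x800) -> bool:
    """tst.w r3, #mask ; beq  -> branch taken when (r3 & mask) == 0 (Z set)."""
    return (r3 & mask & MASK32) == 0


def lsls_bpl_taken(r3: int, shift: int = 20) -> bool:
    """lsls rd, r3, #shift ; bpl -> branch taken when bit 31 of the result is clear (N clear)."""
    result = (r3 << shift) & MASK32  # truncate to 32 bits like the hardware
    return (result >> 31) == 0


# r3 comes from ldrh, so it is a 16-bit value; spot-check a few of them.
for value in (0x0000, 0x0800, 0x07FF, 0xF7FF, 0xFFFF):
    assert tst_beq_taken(value) == lsls_bpl_taken(value)
```

The sketch says nothing about cycle counts. It only shows that both sequences take the branch under the same condition, so any speed difference has to come from something other than the logic.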
After reading the Cortex-A8 TRM, I can't find the exact cycle timing of lsls. With Chung-Lin's help, we feel that lsls should be slower than tst, but we don't have any evidence to prove it. If anyone is familiar with the ARM microarchitecture, help is welcome. If our assumption is correct, we could change this patch into an optimization that applies only when optimizing for size.
The second patch is Bernd's "Fix an if statement in arm_rtx_costs_1" http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02096.html With this patch applied, the EEMBC benchmark numbers are unchanged. Shall we merge this patch into the Linaro 4.5 tree? I am inclined to merge it, but if you have concerns about this patch, let us discuss them here.
So I have no reason to expect lsls to ever take longer to execute than tst. I suspect what you are seeing here is some unfortunate side effect that can't be explained from the small code snippet. An example would include BTAC aliasing, but there could be other reasons for this happening.
So overall, I'd expect the change to be a Good Thing (tm), but there's always the chance that individual blocks of code may run more slowly.
R.