Hi, I am backporting some patches from FSF mainline which may improve Thumb-2 speed in Linaro GCC 4.5.
The first one is by Richard E., "Improve optimization to transform TST into LSLS": http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02518.html After applying it to the Linaro 4.5 tree, the EEMBC speed numbers get worse, while code size is reduced to some extent. The code difference looks like this:
 6801        ldr     r1, [r0, #0]
 f831 3013   ldrh.w  r3, [r1, r3, lsl #1]
-f413 6f00   tst.w   r3, #2048 ; 0x800
-f43f af41   beq.w   cc <t_run_test+0xcc>
+0518        lsls    r0, r3, #20
+f57f af44   bpl.w   cc <t_run_test+0xcc>
 4610        mov     r0, r2
After reading the Cortex-A8 TRM, I can't find the exact cycle timing of lsls. With Chung-Lin's help, we suspect that lsls is slower than tst, but we have no evidence to prove it. If anyone is familiar with the ARM microarchitecture, help is welcome. If our assumption is correct, we could change this patch into a size-only optimization.
The second patch is Bernd's "Fix an if statement in arm_rtx_costs_1": http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02096.html After applying this patch, the EEMBC benchmark numbers are unchanged. Shall we merge this patch into the Linaro 4.5 tree? I am inclined to merge it, but if you have concerns about this patch, let us discuss them here.
On 11/03/2010 05:39 PM, Yao Qi wrote:
> Hi, I am backporting some patches from FSF mainline which may improve Thumb-2 speed in Linaro GCC 4.5.
> The first one is by Richard E., "Improve optimization to transform TST into LSLS": http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02518.html After applying it to the Linaro 4.5 tree, the EEMBC speed numbers get worse, while code size is reduced to some extent. The code difference looks like this:
>  6801        ldr     r1, [r0, #0]
>  f831 3013   ldrh.w  r3, [r1, r3, lsl #1]
> -f413 6f00   tst.w   r3, #2048 ; 0x800
> -f43f af41   beq.w   cc <t_run_test+0xcc>
> +0518        lsls    r0, r3, #20
> +f57f af44   bpl.w   cc <t_run_test+0xcc>
>  4610        mov     r0, r2
Someone suggested that the slowdown might be caused by the use of r0 in the first instruction. Since the first insn (ldr) reads r0, the third insn (lsls) can't overwrite r0 until the ldr is done.
> The second patch is Bernd's "Fix an if statement in arm_rtx_costs_1": http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02096.html After applying this patch, the EEMBC benchmark numbers are unchanged. Shall we merge this patch into the Linaro 4.5 tree? I am inclined to merge it, but if you have concerns about this patch, let us discuss them here.
As we discussed in the meeting yesterday, our criterion for picking up upstream patches is that they neither slow down the code nor increase code size.
Code size is not reduced on A8 either. I'll re-test this patch on A9. If there is still no benefit in either size or speed, we won't backport it to Linaro 4.5.
Yao Qi wrote:
>>  6801        ldr     r1, [r0, #0]
>>  f831 3013   ldrh.w  r3, [r1, r3, lsl #1]
>> -f413 6f00   tst.w   r3, #2048 ; 0x800
>> -f43f af41   beq.w   cc <t_run_test+0xcc>
>> +0518        lsls    r0, r3, #20
>> +f57f af44   bpl.w   cc <t_run_test+0xcc>
>>  4610        mov     r0, r2
> Someone suggested that the slowdown might be caused by the use of r0 in the first instruction. Since the first insn (ldr) reads r0, the third insn (lsls) can't overwrite r0 until the ldr is done.
It depends on whether the Cortex-A8 implements any form of register renaming. If it does, this should not be the problem.
>> The second patch is Bernd's "Fix an if statement in arm_rtx_costs_1": http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02096.html After applying this patch, the EEMBC benchmark numbers are unchanged. Shall we merge this patch into the Linaro 4.5 tree? I am inclined to merge it, but if you have concerns about this patch, let us discuss them here.
> As we discussed in the meeting yesterday, our criterion for picking up upstream patches is that they neither slow down the code nor increase code size.
> Code size is not reduced on A8 either. I'll re-test this patch on A9. If there is still no benefit in either size or speed, we won't backport it to Linaro 4.5.
Well, you might try some other benchmarks? Maybe build ffmpeg or the Linux kernel to see whether there are any code generation differences? Even slight improvements may warrant the backport, as it seems quite harmless.
Chung-Lin
On Wed, 2010-11-03 at 17:39 +0800, Yao Qi wrote:
> Hi, I am backporting some patches from FSF mainline which may improve Thumb-2 speed in Linaro GCC 4.5.
> The first one is by Richard E., "Improve optimization to transform TST into LSLS": http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02518.html After applying it to the Linaro 4.5 tree, the EEMBC speed numbers get worse, while code size is reduced to some extent. The code difference looks like this:
>  6801        ldr     r1, [r0, #0]
>  f831 3013   ldrh.w  r3, [r1, r3, lsl #1]
> -f413 6f00   tst.w   r3, #2048 ; 0x800
> -f43f af41   beq.w   cc <t_run_test+0xcc>
> +0518        lsls    r0, r3, #20
> +f57f af44   bpl.w   cc <t_run_test+0xcc>
>  4610        mov     r0, r2
> After reading the Cortex-A8 TRM, I can't find the exact cycle timing of lsls. With Chung-Lin's help, we suspect that lsls is slower than tst, but we have no evidence to prove it. If anyone is familiar with the ARM microarchitecture, help is welcome. If our assumption is correct, we could change this patch into a size-only optimization.
> The second patch is Bernd's "Fix an if statement in arm_rtx_costs_1": http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02096.html After applying this patch, the EEMBC benchmark numbers are unchanged. Shall we merge this patch into the Linaro 4.5 tree? I am inclined to merge it, but if you have concerns about this patch, let us discuss them here.
So I have no reason to expect lsls to ever take longer to execute than tst. I suspect what you are seeing here is some unfortunate side effect that can't be explained from the small code snippet. An example would include BTAC aliasing, but there could be other reasons for this happening.
So overall, I'd expect the change to be a Good Thing (tm), but there's always the chance that individual blocks of code may run more slowly.
R.
On 11/06/2010 01:38 AM, Richard Earnshaw wrote:
> On Wed, 2010-11-03 at 17:39 +0800, Yao Qi wrote:
>> Hi, I am backporting some patches from FSF mainline which may improve Thumb-2 speed in Linaro GCC 4.5.
>> The first one is by Richard E., "Improve optimization to transform TST into LSLS": http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02518.html After applying it to the Linaro 4.5 tree, the EEMBC speed numbers get worse, while code size is reduced to some extent. The code difference looks like this:
>>  6801        ldr     r1, [r0, #0]
>>  f831 3013   ldrh.w  r3, [r1, r3, lsl #1]
>> -f413 6f00   tst.w   r3, #2048 ; 0x800
>> -f43f af41   beq.w   cc <t_run_test+0xcc>
>> +0518        lsls    r0, r3, #20
>> +f57f af44   bpl.w   cc <t_run_test+0xcc>
>>  4610        mov     r0, r2
>> After reading the Cortex-A8 TRM, I can't find the exact cycle timing of lsls. With Chung-Lin's help, we suspect that lsls is slower than tst, but we have no evidence to prove it. If anyone is familiar with the ARM microarchitecture, help is welcome. If our assumption is correct, we could change this patch into a size-only optimization.
>> The second patch is Bernd's "Fix an if statement in arm_rtx_costs_1": http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02096.html After applying this patch, the EEMBC benchmark numbers are unchanged. Shall we merge this patch into the Linaro 4.5 tree? I am inclined to merge it, but if you have concerns about this patch, let us discuss them here.
> So I have no reason to expect lsls to ever take longer to execute than tst. I suspect what you are seeing here is some unfortunate side effect that can't be explained from the small code snippet. An example would include BTAC aliasing, but there could be other reasons for this happening.
> So overall, I'd expect the change to be a Good Thing (tm), but there's always the chance that individual blocks of code may run more slowly.
Richard, we feel that the "TST to LSLS transformation" is a kind of size optimization. Can we make it a -Os specific optimization? If you agree, we can submit a patch for this change.
linaro-toolchain@lists.linaro.org