The goal of this work is to investigate speed improvements for Linaro GCC 4.5. Its output will be a list of all possible recommendations/actions to improve speed on Linaro 4.5. Comments on this plan are welcome.

So far, we can improve speed in three ways:
1. Backport patches from FSF GCC 4.6. Note that we don't want to backport the whole of 4.6.
2. Benchmark against FSF GCC 4.5.0 and fix any performance regressions found in Linaro GCC 4.5. The output is the cause of each regression and, where possible, recommendations on how to fix it.
3. Study the code generated by other ARM compilers, and give recommendations on how to improve GCC so that it does a better job.
I'll describe these three approaches in detail in the following sections.
- Backport patches from FSF GCC 4.6
I went through the gcc-patches archive and selected several patches that help improve the generated code.
1. ifcvt optimization. Target independent.
   http://gcc.gnu.org/ml/gcc-patches/2010-04/msg00832.html
2. Redundant register move for sign extending. Thumb2.
   http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43137
3. PR 45335 Use ldrd and strd to access two consecutive words. Not yet approved.
   http://gcc.gnu.org/ml/gcc-patches/2010-09/msg00059.html
4. Fix an if statement in arm_rtx_costs_1.
   http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02096.html
5. Reduce code duplication for Thumb2 move patterns.
   http://gcc.gnu.org/ml/gcc-patches/2010-07/msg00624.html
6. ARM ldm/stm peepholes.
   http://gcc.gnu.org/ml/gcc-patches/2010-07/msg00512.html
7. PR44999 Replace "and r0, r0, #255" with uxtb in thumb2.
   http://gcc.gnu.org/ml/gcc-patches/2010-07/msg01700.html
8. Improve optimization to transform TST into LSLS.
   http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02518.html
9. Fix bswap patterns for ARM / Thumb and Thumb2.
   http://gcc.gnu.org/ml/gcc-patches/2010-01/msg01238.html
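To make a few of the items in this list concrete, here is a minimal C sketch of the kind of source patterns that items 3, 7, 8 and 9 are concerned with. The function names are hypothetical, and the instructions mentioned in the comments are what one would hope to see on Thumb-2, not verified compiler output; the actual code depends on the compiler revision and options.

```c
#include <stdint.h>

/* Item 7 (PR44999): masking with 255 is a zero-extension of the low
   byte, so "and r0, r0, #255" could be emitted as "uxtb r0, r0". */
uint32_t low_byte(uint32_t x)
{
    return x & 255u;
}

/* Item 8: a test against a mask of contiguous low bits only needs the
   flags; this is the kind of TST that may become a flag-setting LSLS
   (assumption: shifting left by 8 discards exactly the unmasked bits). */
int low_bits_set(uint32_t x)
{
    return (x & 0x00ffffffu) != 0;
}

/* Item 9: a 32-bit byte swap, the operation the bswap patterns describe. */
uint32_t swap_bytes(uint32_t x)
{
    return __builtin_bswap32(x);
}

/* Item 3 (PR 45335): two consecutive words, a candidate for a single
   LDRD instead of two separate loads. */
int32_t sum_pair(const int32_t *p)
{
    return p[0] + p[1];
}
```

Compiling such a file with -S at O2 under both compilers and comparing the assembly is one way to check whether a given backport is already in effect.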
- Fix speed regression
I found speed regressions on EEMBC with Linaro GCC 4.5, compared with FSF GCC 4.5.0, and I'll investigate why they happen in these cases. The table below shows the speed regression of Linaro GCC 4.5 (revno:99398) compared with FSF GCC 4.5.0:

                    O2      O3
puwmod01           -5.5    -3.5
bitmnp01           -7.9    -0.7
routelookup        -6.4    -8.2
conven00data_1     -7.2    -5.8
conven00data_2     -8.1    -7.3
conven00data_3     -6.6    -5.5
viterb00data_1     -1.7    +5.9
viterb00data_2     -4.3    +2.6
viterb00data_3     -2.3    +1.8
viterb00data_4     -5.3    -0.3
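As a rough way to summarize this table, the following C sketch counts how many benchmarks regress at each optimization level and averages each column. It assumes that the figures are relative changes in which a negative value means Linaro GCC 4.5 is slower; the struct and function names are hypothetical and only serve this illustration.

```c
#include <stddef.h>

/* One row of the EEMBC table: change for Linaro GCC 4.5 (revno:99398)
   relative to FSF GCC 4.5.0. Assumption: negative means slower. */
struct row {
    const char *name;
    double o2;
    double o3;
};

const struct row eembc_rows[] = {
    { "puwmod01",       -5.5, -3.5 },
    { "bitmnp01",       -7.9, -0.7 },
    { "routelookup",    -6.4, -8.2 },
    { "conven00data_1", -7.2, -5.8 },
    { "conven00data_2", -8.1, -7.3 },
    { "conven00data_3", -6.6, -5.5 },
    { "viterb00data_1", -1.7,  5.9 },
    { "viterb00data_2", -4.3,  2.6 },
    { "viterb00data_3", -2.3,  1.8 },
    { "viterb00data_4", -5.3, -0.3 },
};
const size_t eembc_count = sizeof eembc_rows / sizeof eembc_rows[0];

/* Count the rows with a negative delta in one column:
   use_o3 == 0 selects O2, any other value selects O3. */
size_t count_regressions(const struct row *rows, size_t n, int use_o3)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        double delta = use_o3 ? rows[i].o3 : rows[i].o2;
        if (delta < 0.0)
            count++;
    }
    return count;
}

/* Arithmetic mean of one column (same column selector as above). */
double mean_delta(const struct row *rows, size_t n, int use_o3)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += use_o3 ? rows[i].o3 : rows[i].o2;
    return n ? sum / (double)n : 0.0;
}
```

With the figures from the table, all ten benchmarks are negative at O2 and seven of them are negative at O3.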
- Study the code generated by other ARM compilers
In this part, I'll study the binaries generated by other ARM compilers, and try to make GCC smart enough to do the same thing. This piece of work is quite open-ended, and it is hard to estimate how much we will get out of it.
> - Fix bswap patterns for ARM / Thumb and Thumb2.
This should already be in 4.5 branch. It was done before the 4.5 branch was cut IIRC.
cheers Ramana
On 10/05/2010 10:39 PM, Ramana Radhakrishnan wrote:
>> - Fix bswap patterns for ARM / Thumb and Thumb2.
> This should already be in 4.5 branch. It was done before the 4.5 branch was cut IIRC.
Ramana,
Thanks, it is in the 4.5 branch.
> cheers
> Ramana
On 10/05/2010 10:01 PM, Yao Qi wrote:
> - Fix speed regression
> I found speed regressions on EEMBC with Linaro GCC 4.5, compared with FSF GCC 4.5.0, and I'll investigate why they happen in these cases. The table below shows the speed regression of Linaro GCC 4.5 (revno:99398) compared with FSF GCC 4.5.0:
>
>                     O2      O3
> puwmod01           -5.5    -3.5
> bitmnp01           -7.9    -0.7
> routelookup        -6.4    -8.2
> conven00data_1     -7.2    -5.8
> conven00data_2     -8.1    -7.3
> conven00data_3     -6.6    -5.5
> viterb00data_1     -1.7    +5.9
> viterb00data_2     -4.3    +2.6
> viterb00data_3     -2.3    +1.8
> viterb00data_4     -5.3    -0.3
I updated my tree to the latest Linaro 4.5 (revno:99399); the results are similar to those for revno:99398.
On 05/10/10 15:01, Yao Qi wrote:
> The goal of this work is to investigate speed improvements for Linaro GCC 4.5. Its output will be a list of all possible recommendations/actions to improve speed on Linaro 4.5. Comments on this plan are welcome.
I believe we also want to consider implementing speed improvements to the upstream 4.6 (even if those changes cannot be backported to 4.5).
Andrew
On 10/06/2010 05:00 PM, Andrew Stubbs wrote:
> On 05/10/10 15:01, Yao Qi wrote:
>> The goal of this work is to investigate speed improvements for Linaro GCC 4.5. Its output will be a list of all possible recommendations/actions to improve speed on Linaro 4.5. Comments on this plan are welcome.
> I believe we also want to consider implementing speed improvements to the upstream 4.6 (even if those changes cannot be backported to 4.5).
Yeah, the investigation in part 3 ("study other ARM compilers") can give us some ideas for 4.6 improvements as well.
On 10/05/2010 10:01 PM, Yao Qi wrote:
> - Fix speed regression
> I found speed regressions on EEMBC with Linaro GCC 4.5, compared with FSF GCC 4.5.0, and I'll investigate why they happen in these cases. The table below shows the speed regression of Linaro GCC 4.5 (revno:99398) compared with FSF GCC 4.5.0:
>
>                     O2      O3
> puwmod01           -5.5    -3.5
> bitmnp01           -7.9    -0.7
> routelookup        -6.4    -8.2
> conven00data_1     -7.2    -5.8
> conven00data_2     -8.1    -7.3
> conven00data_3     -6.6    -5.5
> viterb00data_1     -1.7    +5.9
> viterb00data_2     -4.3    +2.6
> viterb00data_3     -2.3    +1.8
> viterb00data_4     -5.3    -0.3
[First of all, this is what I should have sent out after last Friday's meeting. Sorry for being late.]
The speed regression at O2 is caused by the four patches listed in [1].
In short, two of them (r99380 and r99330) include many changes, so it is hard to say which specific change causes the speed regression on the Linaro 4.5 tree. The other two patches (r99324 and r99369) are more interesting to look at.
r99324: "Restrict base registers to low regs for Thumb-2".
r99369: "Cortex-A8 optimisation: split symbol refs using MOVT/MOVW later."
Details can be found in [1].
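For readers unfamiliar with the instruction pair named in r99369: MOVW loads a 16-bit immediate into the low half of a register and clears the high half, and MOVT then writes the high half while leaving the low half untouched, so together they can build a 32-bit symbol address. The C model below is only an illustration of that behaviour with hypothetical function names; it says nothing about how the patch itself is implemented.

```c
#include <stdint.h>

/* Model of MOVW: the register becomes the zero-extended 16-bit immediate. */
uint32_t movw(uint16_t imm16)
{
    return (uint32_t)imm16;
}

/* Model of MOVT: replace the top 16 bits, keep the bottom 16 bits. */
uint32_t movt(uint32_t reg, uint16_t imm16)
{
    return (reg & 0x0000ffffu) | ((uint32_t)imm16 << 16);
}

/* Build a full 32-bit address from its two halves, the way a
   MOVW/MOVT pair with :lower16: and :upper16: relocations would. */
uint32_t load_address(uint32_t addr)
{
    uint32_t reg = movw((uint16_t)(addr & 0xffffu));
    return movt(reg, (uint16_t)(addr >> 16));
}
```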
The EEMBC run at O3 is blocked by some problems with the board. I'll post the O3 data once it is done.
[1] https://wiki.linaro.org/YaoQi/Sandbox/Thumb2SpeedOptimization
linaro-toolchain@lists.linaro.org