Plan of CS304: Thumb2 tuning investigation - linaro-toolchain

5 Oct 2010


      The goal of this work is to investigate how to improve the speed of
code generated by Linaro GCC 4.5.  The output will be a list of
recommendations and actions for improving speed on Linaro GCC 4.5.
Comments on this plan are welcome.
So far, I see three ways to improve speed:
 1. Backport patches from FSF GCC 4.6.  Note that we don't want to
backport the whole of 4.6.
 2. Benchmark against FSF GCC 4.5.0 and fix any performance regressions
in Linaro GCC 4.5.  The output is the cause of each regression and,
where possible, a recommendation on how to fix it.
 3. Study the code generated by other ARM compilers, and give
recommendations on how to improve GCC to do a better job.
I'll describe these three approaches in detail in the following sections.
- Backport patches from FSF GCC 4.6
I went through the gcc-patches archive and selected several patches that
should help improve the generated code.
1. ifcvt optimization.  Target independent.
http://gcc.gnu.org/ml/gcc-patches/2010-04/msg00832.html
2. Redundant register move for sign extending.  Thumb2.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43137
3. PR 45335: Use ldrd and strd to access two consecutive words.
Not yet approved.
http://gcc.gnu.org/ml/gcc-patches/2010-09/msg00059.html
4. Fix an if statement in arm_rtx_costs_1.
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02096.html
5. Reduce code duplication for Thumb2 move patterns
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg00624.html
6. ARM ldm/stm peepholes
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg00512.html
7. PR 44999: Replace "and r0, r0, #255" with uxtb in Thumb2
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg01700.html
8. Improve optimization to transform TST into LSLS
http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02518.html
9. Fix bswap patterns for ARM / Thumb and Thumb2.
http://gcc.gnu.org/ml/gcc-patches/2010-01/msg01238.html
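
To make a few of the items above concrete, here is a small C sketch of
the kind of source that items 3, 7, 8 and 9 are aimed at.  The function
and struct names are hypothetical and are not taken from the patches or
their test cases; whether a given compiler actually emits the
instruction named in each comment depends on its version and options,
so treat the comments as what one would look for, not as guaranteed
output.

```c
#include <stdint.h>

/* Hypothetical kernels for inspecting generated Thumb2 code. */

/* Item 7 (PR 44999): masking with 255.  The patch aims to use uxtb
   here in Thumb2 instead of "and r0, r0, #255". */
uint32_t low_byte(uint32_t x)
{
    return x & 255u;
}

/* Item 8: testing a mask of low-order bits.  Assumed to be the sort of
   test that a TST-to-LSLS transformation could apply to. */
int low_bits_set(uint32_t x)
{
    return (x & 0x7u) != 0;
}

/* Item 3 (PR 45335): two consecutive 32-bit words, a candidate for a
   single ldrd instead of two ldr instructions. */
struct pair {
    uint32_t lo;
    uint32_t hi;
};

uint32_t sum_pair(const struct pair *p)
{
    return p->lo + p->hi;
}

/* Item 9: 32-bit byte swap through the GCC builtin, which exercises
   the bswap patterns. */
uint32_t swap_bytes(uint32_t x)
{
    return __builtin_bswap32(x);
}
```

One way to use this is to compile the file to assembly with each
compiler, for example with -O2 -mthumb -march=armv7-a -S, and compare
the output function by function.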
- Fix speed regression
I found speed regressions on EEMBC with Linaro GCC 4.5 compared with FSF
GCC 4.5.0, and I'll investigate why they happen in these cases.  The
table below shows the speed difference between FSF GCC 4.5.0 and Linaro
GCC 4.5 (revno:99398):
                    O2      O3
puwmod01          -5.5    -3.5
bitmnp01          -7.9    -0.7
routelookup       -6.4    -8.2
conven00data_1    -7.2    -5.8
conven00data_2    -8.1    -7.3
conven00data_3    -6.6    -5.5
viterb00data_1    -1.7    +5.9
viterb00data_2    -4.3    +2.6
viterb00data_3    -2.3    +1.8
viterb00data_4    -5.3    -0.3
- Study the code generated by other ARM compilers
In this part, I'll study the binaries generated by other ARM compilers
and try to teach GCC to generate equally good code.  This piece of work
is quite open-ended, and it is hard to estimate how much we will get out
of it.
-- 
Yao Qi
CodeSourcery
yao@codesourcery.com
(650) 331-3385 x739