Zhenqiang's been working on the later split2 patch, which causes more constants to be built with a movw/movt pair instead of a constant pool load. There was an unexpected ~10% regression in one benchmark which seems to be due to function alignment. I think we've tracked down the cause, but not yet the fix.
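For context, the difference looks roughly like this (the function and constant below are made up for illustration):

    /* Illustrative only -- not from the benchmark. */
    unsigned int get_const(void)
    {
        return 0x12345678u;
    }

    /* With the split2 change this can be compiled to an inline pair:
     *     movw  r0, #0x5678   @ low halfword
     *     movt  r0, #0x1234   @ high halfword
     *     bx    lr
     * whereas the baseline emits a PC-relative constant pool load:
     *     ldr   r0, .LCP0
     *     bx    lr
     * .LCP0:
     *     .word 0x12345678
     */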
Compared to the baseline, the split2 branch took 113% of the time to run, i.e. 13% longer. Adding an explicit 16-byte alignment to the function changed this to 97% of the time, i.e. 3% faster. The reason Zhenqiang and I got different results was the build ID: he used the binary build scripts to make the cross compiler, which turn on the build ID, and the build ID note added an extra 20 bytes ahead of .text, which happened to align the function to 16 bytes. cbuild doesn't use the build ID (although it should), which happened to leave the function on an 8-byte boundary.
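For reference, one way to force that kind of alignment from the source is GCC's function alignment attribute (a sketch, not necessarily how the benchmark was modified; the function name is made up):

    /* Sketch: pin a function to a 16-byte boundary regardless of what
     * the build ID note (or anything else) places ahead of it in .text. */
    __attribute__((aligned(16)))
    void hot_function(void)
    {
        /* ... benchmark kernel ... */
    }

Whether a binary carries the build ID note can be checked with readelf -n.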
The disassembly is identical, so I assume the regression is cache- or fast-loop-related. I'm not sure what to do, so let's talk about this at the next performance call.
-- Michael
A small case is attached to reproduce it.
Here are logs for different loop header alignments (default is 64):

linaro@Linaro-test:~$ gcc test1.c -o t.exe && time ./t.exe

real    0m3.206s
user    0m3.203s
sys     0m0.000s

linaro@Linaro-test:~$ gcc test1.c -DALIGNED_2 -o t.exe && time ./t.exe

real    0m2.898s
user    0m2.875s
sys     0m0.016s

linaro@Linaro-test:~$ gcc test1.c -DALIGNED_4 -o t.exe && time ./t.exe

real    0m2.851s
user    0m2.844s
sys     0m0.008s

linaro@Linaro-test:~$ gcc test1.c -DALIGNED_8 -o t.exe && time ./t.exe

real    0m3.167s
user    0m3.156s
sys     0m0.000s
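(The attached test1.c doesn't survive in the archive. Below is a minimal sketch of what such a reproducer could look like; the loop body, the iteration count, and the reading of ALIGNED_N as a 2^N-byte loop header alignment are all assumptions.)

    /* Hypothetical reconstruction -- not the actual attachment. */
    #include <stdio.h>

    int main(void)
    {
        volatile unsigned int sum = 0;
        unsigned int i;

    #if defined(ALIGNED_8)
        asm volatile (".p2align 8");   /* pad the following code to 256 bytes */
    #elif defined(ALIGNED_4)
        asm volatile (".p2align 4");   /* ...to 16 bytes */
    #elif defined(ALIGNED_2)
        asm volatile (".p2align 2");   /* ...to 4 bytes */
    #endif
        /* Hot loop; its header alignment shifts with the directive above. */
        for (i = 0; i < 500000000u; i++)
            sum += i;

        printf("%u\n", sum);
        return 0;
    }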
Thanks! -Zhenqiang
On 23 August 2012 10:09, Michael Hope <michael.hope@linaro.org> wrote:
> Zhenqiang's been working on the later split2 patch, which causes more constants to be built with a movw/movt pair instead of a constant pool load. There was an unexpected ~10% regression in one benchmark which seems to be due to function alignment. I think we've tracked down the cause, but not yet the fix.
> Compared to the baseline, the split2 branch took 113% of the time to run, i.e. 13% longer. Adding an explicit 16-byte alignment to the function changed this to 97% of the time, i.e. 3% faster. The reason Zhenqiang and I got different results was the build ID: he used the binary build scripts to make the cross compiler, which turn on the build ID, and the build ID note added an extra 20 bytes ahead of .text, which happened to align the function to 16 bytes. cbuild doesn't use the build ID (although it should), which happened to leave the function on an 8-byte boundary.
> The disassembly is identical, so I assume the regression is cache- or fast-loop-related. I'm not sure what to do, so let's talk about this at the next performance call.
> -- Michael
Michael,
On 23 August 2012 03:09, Michael Hope <michael.hope@linaro.org> wrote:
> Zhenqiang's been working on the later split2 patch, which causes more constants to be built with a movw/movt pair instead of a constant pool load. There was an unexpected ~10% regression in one benchmark which seems to be due to function alignment. I think we've tracked down the cause, but not yet the fix.
> Compared to the baseline, the split2 branch took 113% of the time to run, i.e. 13% longer. Adding an explicit 16-byte alignment to the function changed this to 97% of the time, i.e. 3% faster. The reason Zhenqiang and I got different results was the build ID: he used the binary build scripts to make the cross compiler, which turn on the build ID, and the build ID note added an extra 20 bytes ahead of .text, which happened to align the function to 16 bytes. cbuild doesn't use the build ID (although it should), which happened to leave the function on an 8-byte boundary.
> The disassembly is identical, so I assume the regression is cache- or fast-loop-related. I'm not sure what to do, so let's talk about this at the next performance call.
I've made a note in the agenda for the performance call, but here are some quick notes/questions that come to mind:
My guesses would include cache-line alignment, and wide Thumb-2 instructions straddling cache-line boundaries and thereby changing core performance.
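One way to check that would be to dump the hot function and look at the byte offsets of the 32-bit Thumb-2 encodings, e.g. (illustrative command; substitute the real binary and symbol):

linaro@Linaro-test:~$ objdump -d t.exe | sed -n '/<main>:/,/^$/p'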
My thoughts on further investigation would be: does the function itself need to be aligned, or is it a hot loop within the function? Can we manually alter the code to choose different instructions, so that none straddle a cache-line boundary? And if so, what happens as we change the alignment?
If the issue is code alignment (either of the function or of the loop) and not instruction sizes, then we can probably do something about it. If it is instruction sizes, then we need to work out a way to mitigate the effects, as GCC doesn't have precise knowledge of instruction sizes.
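If alignment is the issue, one cheap experiment is GCC's existing alignment flags, e.g. (an illustrative invocation against Zhenqiang's test case):

linaro@Linaro-test:~$ gcc -O2 -falign-functions=16 -falign-loops=16 test1.c -o t.exe && time ./t.exe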
Thanks,
Matt