Mans Rullgard <mans.rullgard@linaro.org> writes:
On 22 May 2013 05:13, Michael Hudson-Doyle <michael.hudson@canonical.com> wrote:
Hi all,
I've spent a little while porting an optimization from Python 3 to Python 2.7 (http://bugs.python.org/issue4753). The idea of the patch is to improve performance by dispatching opcodes on computed labels rather than a big switch -- and so confusing the branch predictor less.
The problem with this is that the final dispatch code for each opcode ends up being identical, so common subexpression elimination wants to coalesce all of those copies back into a single indirect jump, which neatly and completely nullifies the point of the optimization.
The branches added by this would be unconditional and should thus not add any load on the branch predictor.
Playing around just building from source directly, it seems that -fno-gcse prevents gcc from doing this, and the resulting interpreter shows a small performance improvement over a build that does not include the patch.
However, when I build a Debian package containing the patch, I see no improvement at all. My theory, and I'd like you guys to tell me if this makes sense, is that the Debian package uses link time optimization, and so even though I carefully compile ceval.c with -fno-gcse, the common subexpression elimination happens anyway at link time. I've tried staring at the disassembly to confirm or deny this, but I don't know ARM assembly very well and the compiled function is roughly 10k instructions long, so I didn't get very far with this (I can supply the disassembly if someone wants to see it!).
Is there some way I can tell GCC not to perform CSE on a section of code? I guess I can make sure that the whole program, linker step and all, is compiled with -fno-gcse, but that seems a bit of a blunt hammer.
When using LTO, most of the optimisations happen, as the name implies, during linking. The optimisation flags provided there, whether explicit or default, are used for everything.
OK. I wasn't sure initially whether the optimizations that were performed at link time were the same as the ones that are traditionally performed at compile time, but reading the docs again makes it clear (ish) that they are.
If you need to disable CSE for part of the code, you might want to try your luck with __attribute__((optimize("no-gcse"))) on the relevant functions.
I'd also be interested if you think this class of optimization makes little sense on ARM and then I'll stop and find something else to do :-)
I suggest running some benchmarks under perf and counting branch prediction misses. Maybe it's not as much of a problem as you think.
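Something along these lines would show the relevant counters (the binary and benchmark names are placeholders, and this needs hardware performance counters, so it won't run everywhere):

```shell
# Compare branch counts and miss rates between the two builds.
perf stat -e branches,branch-misses,instructions,cycles \
    ./python-patched benchmark.py
perf stat -e branches,branch-misses,instructions,cycles \
    ./python-unpatched benchmark.py
```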
Well, I recompiled with -fno-gcse globally and the change now does result in a reasonable performance increase, in the 3-7% range. perf stat suggests that this is because it reduces the overall number of branches rather than the rate of branch misses, though...
Cheers, mwh