Hi all,
I've spent a little while porting an optimization from Python 3 to Python 2.7 (http://bugs.python.org/issue4753). The idea of the patch is to improve performance by dispatching opcodes on computed labels rather than a big switch -- and so confusing the branch predictor less.
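For anyone not familiar with the technique, here is a minimal sketch of computed-label dispatch using GCC's "labels as values" extension. It is a toy stack machine, not the actual patch: the opcode names, the run() function, and the DISPATCH() macro are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine (not CPython's). */
enum { OP_PUSH1, OP_ADD, OP_DOUBLE, OP_HALT };

/* Run a bytecode program; returns the top of the stack at OP_HALT. */
int run(const unsigned char *code)
{
    /* One label address per opcode, indexed by opcode number
       (GCC "labels as values" extension). */
    static void *const targets[] = {
        &&op_push1, &&op_add, &&op_double, &&op_halt
    };
    int stack[64];
    size_t sp = 0;

    /* Every handler ends with its own indirect jump, so there is one
       jump site per opcode instead of a single shared one. */
#define DISPATCH() goto *targets[*code++]

    DISPATCH();

op_push1:
    assert(sp < 64);
    stack[sp++] = 1;
    DISPATCH();
op_add:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] += stack[sp];
    DISPATCH();
op_double:
    assert(sp >= 1);
    stack[sp - 1] *= 2;
    DISPATCH();
op_halt:
    assert(sp >= 1);
    return stack[sp - 1];
#undef DISPATCH
}
```

The repeated DISPATCH() at the end of each handler is the identical trailing code that the compiler is tempted to merge back into a single jump.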
The problem with this is that the last bit of code for each opcode ends up being the same, so common subexpression elimination wants to coalesce all these bits, which neatly and completely nullifies the point of the optimization. When I build directly from source, it seems that -fno-gcse prevents GCC from doing this, and the resulting interpreter shows a small performance improvement over a build that does not include the patch.
However, when I build a Debian package containing the patch, I see no improvement at all. My theory, and I'd like you guys to tell me if this makes sense, is that this is because the Debian package uses link-time optimization, and so even though I carefully compile ceval.c with -fno-gcse, the common subexpression elimination happens anyway at link time. I've tried staring at the disassembly to confirm or deny this, but I don't know ARM assembly very well and the compiled function is roughly 10k instructions long, so I didn't get very far (I can supply the disassembly if someone wants to see it!).
Is there some way I can tell GCC not to perform CSE on a section of code? I guess I can make sure that the whole program, linker step and all, is compiled with -fno-gcse, but that seems a bit of a blunt hammer.
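For concreteness, the kind of per-function control being asked about might look like GCC's optimize function attribute, which takes -f options without the "-f" prefix. This is only a sketch under assumptions: eval_loop(), the NO_GCSE macro, and the opcodes are hypothetical, the GCC manual describes this attribute as intended mainly for debugging, and whether the setting is still honoured at link time is exactly the open question here.

```c
#include <assert.h>
#include <stddef.h>

/* GCC only: request -fno-gcse for a single function.
   Other compilers get an empty macro and are unaffected. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_GCSE __attribute__((optimize("no-gcse")))
#else
#define NO_GCSE
#endif

/* Hypothetical opcodes for a toy accumulator machine. */
enum { OP_INC, OP_DEC, OP_STOP };

/* Toy stand-in for the interpreter loop; returns the accumulator. */
NO_GCSE
int eval_loop(const unsigned char *code)
{
    static void *const targets[] = { &&op_inc, &&op_dec, &&op_stop };
    int acc = 0;

    assert(code != NULL);
    goto *targets[*code++];

op_inc:
    acc++;
    goto *targets[*code++];   /* per-opcode jump that should stay separate */
op_dec:
    acc--;
    goto *targets[*code++];
op_stop:
    return acc;
}
```

Checking whether the per-opcode jumps survive would still mean inspecting the generated code for the final linked binary, not just the object file.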
I'd also be interested if you think this class of optimization makes little sense on ARM and then I'll stop and find something else to do :-)
Cheers, mwh