Zhenqiang's been working on the later split 2 patch, which causes more constants to be built using a movw/movt pair instead of a constant pool load. There was an unexpected ~10% regression in one benchmark which seems to be due to function alignment. I think we've tracked down the cause, but not what to do about it.
Compared to the baseline, the split2 branch took 113% of the time to run, i.e. 13% longer. Adding an explicit 16-byte alignment to the function changed this to 97% of the time, i.e. 3% faster. The reason Zhenqiang and I got different results was the build ID. He used the binary build scripts to make the cross compiler, which turn on the build ID, which added an extra 20 bytes ahead of .text and happened to align the function to a 16-byte boundary. cbuild doesn't use the build ID (although it should), which happened to align the function to an 8-byte boundary.
The disassembly is identical, so I assume the regression is cache- or fast-loop-related. I'm not sure what to do, so let's talk about this at the next performance call.
-- Michael