On 4 August 2012 00:53, Richard Earnshaw <rearnsha@arm.com> wrote:
> On 03/08/12 13:49, Mans Rullgard wrote:
>> I have noticed gcc has a preference for generating UXTB instructions when an AND with #255 would do the same thing. This is bad, because on A9 UXTB has two cycles latency compared to one cycle for AND. On A8 both instructions have one cycle latency.
>
> UXTB on the other hand is a 16-bit instruction, whereas AND is a 32-bit one.
> Of the cores I'm aware of, only A9 has this performance anomaly.

The CoreMark regression between 4.4 and 4.5 that Chung-Lin fixed was due to an AND being replaced with a UXTB. UXTB is the slower instruction on A9, and the flag-setting form of AND (ANDS) gives you the compare with zero for free, whereas UXTB needs a separate compare.
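
For anyone following along, here is a minimal C sketch of the source-level forms under discussion. The function names are made up for illustration, and the instruction named in each comment is only what a compiler might pick. The actual choice depends on the gcc version, flags and target, so check the -S output rather than trusting the comments.

```c
#include <assert.h>
#include <stdint.h>

/* Zero-extend the low byte via a cast; a compiler may select UXTB here. */
uint32_t low_byte_cast(uint32_t x)
{
    return (uint8_t)x;
}

/* Same value via an explicit mask; this may become AND with #255. */
uint32_t low_byte_mask(uint32_t x)
{
    return x & 0xffu;
}

/* Mask followed by a test against zero. A flag-setting AND could cover
   both steps, whereas a UXTB would need a separate compare. */
int low_byte_is_zero(uint32_t x)
{
    return (x & 0xffu) == 0;
}

/* Sanity check that the cast and the mask agree on a few values. */
void check_low_byte_helpers(void)
{
    const uint32_t samples[] = { 0u, 0xffu, 0x100u, 0x1234u, 0xffffffffu };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(low_byte_cast(samples[i]) == low_byte_mask(samples[i]));
}
```

The cast and the mask are interchangeable as far as the result goes, which is why the choice between them comes down to latency and code size on the target core.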
-- Michael