AND vs UXTB


Mans Rullgard

3 Aug 2012

12:49 p.m.

I have noticed gcc has a preference for generating UXTB instructions when an AND with #255 would do the same thing. This is bad, because on A9 UXTB has two cycles latency compared to one cycle for AND. On A8 both instructions have one cycle latency.

-- Mans Rullgard / mru
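
To make the equivalence concrete, here is a minimal C sketch of the two source forms involved. The function names are hypothetical, and which instruction a given gcc version emits for either form depends on the target and tuning flags; the comments only describe what a compiler may choose.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low 8 bits with an explicit mask.
   On ARM a compiler may emit: AND r0, r0, #255 */
uint32_t low_byte_mask(uint32_t x) {
    return x & 0xFFu;
}

/* Keep only the low 8 bits via a narrowing cast and zero extension.
   On ARM a compiler may emit: UXTB r0, r0 */
uint32_t low_byte_cast(uint32_t x) {
    return (uint32_t)(uint8_t)x;
}

/* Both forms compute the same value, so either instruction is a valid choice. */
void check_low_byte_equivalence(void) {
    static const uint32_t samples[] = {
        0u, 0xFFu, 0x100u, 0x12345678u, 0xFFFFFFFFu
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(low_byte_mask(samples[i]) == low_byte_cast(samples[i]));
    }
}
```

Because the results are identical, the choice between the two instructions is purely a cost decision (latency against encoding size), which is what the rest of the thread discusses.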


Richard Earnshaw

3 Aug

12:53 p.m.

On 03/08/12 13:49, Mans Rullgard wrote:

> I have noticed gcc has a preference for generating UXTB instructions when an AND with #255 would do the same thing. This is bad, because on A9 UXTB has two cycles latency compared to one cycle for AND. On A8 both instructions have one cycle latency.

UXTB on the other hand is a 16-bit instruction, whereas AND is a 32-bit one.

Of the cores I'm aware of, only A9 has this performance anomaly.

Mans Rullgard

1:08 p.m.

On 3 August 2012 13:53, Richard Earnshaw <rearnsha@arm.com> wrote:

> On 03/08/12 13:49, Mans Rullgard wrote:
>
>> I have noticed gcc has a preference for generating UXTB instructions when an AND with #255 would do the same thing. This is bad, because on A9 UXTB has two cycles latency compared to one cycle for AND. On A8 both instructions have one cycle latency.
>
> UXTB on the other hand is a 16-bit instruction, whereas AND is a 32-bit one.

Only in Thumb.

> Of the cores I'm aware of, only A9 has this performance anomaly.

It is also a very widely used core.

-- Mans Rullgard / mru

Siarhei Siamashka

1:23 p.m.

On Fri, Aug 3, 2012 at 3:53 PM, Richard Earnshaw <rearnsha@arm.com> wrote:

> On 03/08/12 13:49, Mans Rullgard wrote:
>
>> I have noticed gcc has a preference for generating UXTB instructions when an AND with #255 would do the same thing. This is bad, because on A9 UXTB has two cycles latency compared to one cycle for AND. On A8 both instructions have one cycle latency.
>
> UXTB on the other hand is a 16-bit instruction, whereas AND is a 32-bit one.
>
> Of the cores I'm aware of, only A9 has this performance anomaly.

While you are at it, please also consider blacklisting UXTAB instruction variants when tuning for Cortex-A9 unless optimizing for size. I was fairly confident that I had a feature request in gcc bugzilla about this, but apparently this is not the case. My bad.

-- Best regards, Siarhei Siamashka
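
For readers unfamiliar with UXTAB: it adds the zero-extended low byte of one register to another register in a single instruction. A minimal C sketch of source patterns that can map to it follows; the function names are hypothetical, and whether a compiler emits UXTAB or a separate AND followed by ADD depends on its version and tuning options.

```c
#include <stdint.h>

/* acc plus the zero-extended low byte of x, written with an explicit mask.
   A compiler may emit either:
     UXTAB r0, r0, r1            (one instruction)
   or
     AND   r1, r1, #255
     ADD   r0, r0, r1            (two instructions) */
uint32_t add_low_byte(uint32_t acc, uint32_t x) {
    return acc + (x & 0xFFu);
}

/* The same operation written with a narrowing cast instead of a mask. */
uint32_t add_low_byte_cast(uint32_t acc, uint32_t x) {
    return acc + (uint32_t)(uint8_t)x;
}
```

The request above amounts to preferring the two-instruction form when tuning for Cortex-A9 for speed, and keeping the single-instruction form when optimizing for size.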

Michael Hope

5 Aug

9:26 p.m.

On 4 August 2012 00:53, Richard Earnshaw <rearnsha@arm.com> wrote:

> On 03/08/12 13:49, Mans Rullgard wrote:
>
>> I have noticed gcc has a preference for generating UXTB instructions when an AND with #255 would do the same thing. This is bad, because on A9 UXTB has two cycles latency compared to one cycle for AND. On A8 both instructions have one cycle latency.
>
> UXTB on the other hand is a 16-bit instruction, whereas AND is a 32-bit one.
>
> Of the cores I'm aware of, only A9 has this performance anomaly.

The CoreMark regression between 4.4 and 4.5 that Chung-Lin fixed was due to an AND being replaced with a UXTB. The instruction is slower, and the AND does a compare with zero for free.

-- Michael
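
The "compare with zero for free" point refers to the flag-setting form ANDS, which updates the condition flags from its result, whereas UXTB has no flag-setting form and needs a separate compare. A minimal C sketch of a pattern where this matters is below; the function name is hypothetical and the instruction sequences in the comment are illustrative, not taken from the CoreMark case mentioned above.

```c
#include <stdint.h>

/* Uses both the masked value and a test of that value against zero.
   With AND, a compiler may fold the test into the mask:
     ANDS r0, r0, #255           (result and Z flag in one instruction)
   With UXTB, an extra compare is needed:
     UXTB r0, r0
     CMP  r0, #0 */
uint32_t low_byte_or_default(uint32_t x, uint32_t dflt) {
    uint32_t b = x & 0xFFu;      /* the masked value itself is needed ... */
    return b != 0 ? b : dflt;    /* ... and so is its comparison with zero */
}
```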

linaro-toolchain@lists.linaro.org

4 comments

participants

tags (0)

participants (4)

Mans Rullgard
Michael Hope
Richard Earnshaw
Siarhei Siamashka