Hi,
One of the things Michael pointed out in today's call was that the ARM backend doesn't generate vcvt.f32.s<type> for the idiomatic conversion from fixed point to floating point, as in the example below. I've implemented this in the backend using the interfaces from real.c. I've chosen not to allow this transformation when flag_rounding_math is true because this instruction always rounds using round-to-nearest rather than obeying what's in the FPSCR, and thus is not safe for programs that want to set their rounding modes dynamically.
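For anyone who wants to play with the idea outside the compiler, here is a rough standalone sketch of the check involved: deciding whether a constant is an exact reciprocal of a power of two and, if so, how many fraction bits it corresponds to. It uses frexp from the C library rather than the real.c interfaces, the function name is hypothetical, and the accepted range of 1 to 32 is an assumption about the instruction's immediate. It is not the code in the patch.

```c
#include <math.h>

/* Hypothetical helper, NOT the patch's vfp3_const_double_for_fract_bits.
   Returns n if x == 2^-n with 1 <= n <= 32 (assumed immediate range),
   otherwise returns 0. */
int fract_bits_for_reciprocal(double x)
{
    int exp;
    double m;
    int n;

    /* Reject zero, negatives, NaN and infinity. */
    if (!(x > 0.0) || isinf(x))
        return 0;

    /* frexp splits x into m * 2^exp with 0.5 <= m < 1. */
    m = frexp(x, &exp);

    /* An exact power of two always has m == 0.5. */
    if (m != 0.5)
        return 0;

    /* x == 0.5 * 2^exp == 2^(exp - 1), so x == 2^-n gives n == 1 - exp. */
    n = 1 - exp;
    return (n >= 1 && n <= 32) ? n : 0;
}
```

With this sketch, 1.0 / (1 << 11) yields 11 and 1.0 / (1 << 5) yields 5, while a constant such as 0.3 is rejected.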
The benefits are quite obvious: we eliminate a load from the constant pool and a floating point multiply, shaving a floating point multiply plus load latency off these sequences. This instruction can only write its output into the same register as its input, which is why I've modelled it as below by tying op1 into op0.
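As a sanity check on why the two sequences agree under the default round-to-nearest mode, here is a small sketch. Scaling by a power of two is exact as long as nothing underflows, so converting and then multiplying by 2^-n gives the same float as rounding the integer once and adjusting the exponent. The function names are hypothetical and the second function is only a software model of the single-instruction conversion, not a claim about the hardware's internals.

```c
#include <math.h>
#include <stdint.h>

/* What the unpatched sequence computes: int -> float, then a multiply
   by the constant 2^-n loaded from the constant pool. */
float convert_then_multiply(int32_t i, int n)
{
    return (float)i * ldexpf(1.0f, -n);
}

/* Software model (an assumption) of the combined conversion:
   round i to float once, then scale the exponent exactly by -n. */
float fixed_to_float(int32_t i, int n)
{
    return ldexpf((float)i, -n);
}
```

For values of n in the range used here, both functions return identical results for every 32-bit input, including inputs above 2^24 where the int to float conversion itself has to round.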
If there's a simpler way of using the interfaces into real.c, then I'm all ears.
Thoughts? I believe such idioms are used in libav, where the original report appears to have come from, so it's a worthwhile gain where we can have it. Are there any other places where folks have noticed this?
I will post upstream as well once I finish testing this patch. I'm posting it here to get some feedback, and to let anyone who is really keen on trying this out have a go, given that I'm out tomorrow.
(I took a quick look at the short -> f32 case as well, but loads either zero or sign extend anyway, so there's probably not much gain in modelling that right away; the win really is in getting rid of the fp mul and the constant pool load. There's probably some gain in going from i64 -> f64 as well, so those patterns need to be written up at some point for completeness.)
cheers
Ramana
2011-10-04  Ramana Radhakrishnan  <ramana.radhakrishnan@linaro.org>

        * config/arm/arm.c (vfp3_const_double_for_fract_bits): Define.
        * config/arm/arm-protos.h (vfp3_const_double_for_fract_bits): Declare.
        * config/arm/constraints.md ("Dt"): New constraint.
        * config/arm/predicates.md (const_double_vcvt_power_of_two_reciprocal): New.
        * config/arm/vfp.md (*arm_combine_vcvt_f32_s32): New.
        (*arm_combine_vcvt_f32_u32): New.
For the following testcases, compiled with -mfloat-abi=hard -mfpu=vfpv3 and -mcpu=cortex-a9, I see the code below.
float foo (int i)
{
  float v = (float)i / (1 << 11);
  return v;
}

float foa_unsigned (unsigned int i)
{
  float v = (float)i / (1 << 5);
  return v;
}
After the patch:
foo:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        fmsr    s0, r0  @ int
        vcvt.f32.s32    s0, s0, #11
        bx      lr
        .size   foo, .-foo
        .align  2
        .global foa_unsigned
        .type   foa_unsigned, %function
foa_unsigned:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        fmsr    s0, r0  @ int
        vcvt.f32.u32    s0, s0, #5
        bx      lr
        .size   foa_unsigned, .-foa_unsigned
        .align  2
        .global foo1
        .type   foo1, %function
rather than

        .type   foo, %function
foo:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        fmsr    s15, r0 @ int
        fsitos  s0, s15
        flds    s15, .L2
        fmuls   s0, s0, s15
        bx      lr
.L3:
        .align  2
.L2:
        .word   973078528
        .size   foo, .-foo
        .align  2
        .global foa_unsigned
        .type   foa_unsigned, %function
foa_unsigned:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        fmsr    s15, r0 @ int
        fuitos  s0, s15
        flds    s15, .L5
        fmuls   s0, s0, s15
        bx      lr
.L6:
        .align  2
.L5:
        .word   1023410176
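To confirm that the two constant pool words in the unpatched output really are the reciprocals from the testcases, a small decoding sketch may help. It assumes float is IEEE 754 binary32, which holds on the targets discussed here; the helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit word as a float (assumes IEEE 754 binary32). */
float word_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Extract the unbiased exponent from the 8-bit exponent field. */
int unbiased_exponent(uint32_t bits)
{
    return (int)((bits >> 23) & 0xFF) - 127;
}
```

The word 973078528 is 0x3A000000, which decodes to 2^-11, and 1023410176 is 0x3D000000, which decodes to 2^-5. These match the #11 and #5 immediates in the patched output.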
linaro-toolchain@lists.linaro.org