[RFC ARM] Use vcvt.f32.s32 with immediate bits to do fixed to floating point conversions. - linaro-toolchain

4 Oct 2011


Hi,
So one of the things Michael pointed out in today's call was that the
ARM backend doesn't generate vcvt.f32.s<type> for the idiom that
converts from fixed point to floating point, as in the example below.
I've implemented this in the backend using the interfaces from
real.c. I've chosen not to allow this transformation when
flag_rounding_math is true, because this instruction always rounds
using round-to-nearest rather than obeying what's in the FPSCR, and is
thus not safe for programs that want to set their rounding modes
dynamically.
The benefits are quite obvious: we eliminate a load from the constant
pool and a floating point multiply, shaving the latency of a floating
point multiply plus a load off these sequences. This instruction can
only write its output into the same register as its input, which is
why I've modelled it as below by tying op1 into op0.
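
On why folding the multiply into the conversion preserves the result in round-to-nearest mode: the multiply is by an exact power of two, so it never rounds on its own, and the only rounding happens when the integer is narrowed to 24 significant bits. The sketch below is only a host-side model of that argument, not the patch and not a model of the hardware. The function names are made up for illustration, and it assumes the default round-to-nearest mode and 1 <= n <= 31 so that the shift is well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Pre-patch shape: convert to float (may round), then multiply by 2^-n.
   Requires 1 <= n <= 31 so that the shift below is well defined.  */
static float
convert_then_scale (int32_t i, int n)
{
  float reciprocal = 1.0f / (float) (UINT32_C (1) << n);
  return (float) i * reciprocal;
}

/* Model of a single-rounding conversion: the double holds i / 2^n
   exactly, so the cast to float is the only rounding step.  */
static float
scale_then_round (int32_t i, int n)
{
  double exact = (double) i / (double) (UINT32_C (1) << n);
  return (float) exact;
}

/* Count disagreements between the two forms over a sample of the
   int32_t range (a prime stride keeps the run short).  */
static long
count_mismatches (int n)
{
  long mismatches = 0;
  for (int64_t v = INT32_MIN; v <= INT32_MAX; v += 65521)
    if (convert_then_scale ((int32_t) v, n)
        != scale_then_round ((int32_t) v, n))
      mismatches++;
  return mismatches;
}

/* The two shift counts used by the testcases in this mail.  */
static void
check_equivalence (void)
{
  assert (count_mismatches (11) == 0);
  assert (count_mismatches (5) == 0);
}
```

Calling check_equivalence () from a small main is enough to try it. Under a directed rounding mode the two forms need not agree with what the instruction produces, which is the reason for the flag_rounding_math guard.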
If there's a simpler way of using the interfaces into real.c then I'm
all ears.
Thoughts? I believe such idioms are used in libav, from where the
original report appears to have come, and thus it's a worthwhile gain
where we can have it. Are there any other places where folks have
noticed this?
I will post upstream as well once I finish testing this patch. I'm
posting this here to get some feedback, and to let anyone who is
really keen on trying this out have a go, given I'm out tomorrow.
(I took a quick look at the short -> f32 case as well, but loads
either zero- or sign-extend anyway, so there's probably not much gain
in modelling that right away; the win really is in getting rid of that
fp mul and the constant pool load. There's probably some gain in going
from i64 -> f64 as well, so those patterns need to be written up at
some point for completeness.)
cheers
Ramana
2011-10-04  Ramana Radhakrishnan  <ramana.radhakrishnan@linaro.org>
    * config/arm/arm.c (vfp3_const_double_for_fract_bits): Define.
    * config/arm/arm-protos.h (vfp3_const_double_for_fract_bits): Declare.
    * config/arm/constraints.md ("Dt"): New constraint.
    * config/arm/predicates.md (const_double_vcvt_power_of_two_reciprocal):
    New.
    * config/arm/vfp.md (*arm_combine_vcvt_f32_s32): New.
    (*arm_combine_vcvt_f32_u32): New.
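
Going by the names in the ChangeLog, the predicate accepts a constant that is the reciprocal of a power of two and the helper yields the number of fraction bits for the immediate. The patch does this through real.c; the following is not that code, only a standalone sketch of the same check done on the IEEE-754 bits of a single-precision constant. The name fract_bits_for_reciprocal is hypothetical, and the accepted range of 1 to 32 is an assumption based on the immediate range of the 32-bit forms of the instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not the GCC implementation: if x is exactly
   1 / 2^n with 1 <= n <= 32, return n; otherwise return 0.
   Assumes float is IEEE-754 binary32.  */
static int
fract_bits_for_reciprocal (float x)
{
  uint32_t bits;
  memcpy (&bits, &x, sizeof bits);   /* Reinterpret without aliasing issues.  */

  uint32_t sign = bits >> 31;
  int exponent = (int) ((bits >> 23) & 0xff);
  uint32_t mantissa = bits & 0x7fffff;

  /* A positive power of two has a clear sign bit and a zero mantissa
     field (only the implicit leading one is set).  */
  if (sign != 0 || mantissa != 0)
    return 0;

  /* Biased exponent 127 encodes 2^0, so 2^-n has exponent 127 - n.
     Zero and infinity fall outside the accepted range below.  */
  int n = 127 - exponent;
  return (n >= 1 && n <= 32) ? n : 0;
}

/* A few accepted and rejected constants.  */
static void
fract_bits_examples (void)
{
  assert (fract_bits_for_reciprocal (1.0f / (1 << 11)) == 11);
  assert (fract_bits_for_reciprocal (1.0f / (1 << 5)) == 5);
  assert (fract_bits_for_reciprocal (0.3f) == 0);    /* Not a power of two.  */
  assert (fract_bits_for_reciprocal (1.0f) == 0);    /* n would be 0.  */
  assert (fract_bits_for_reciprocal (-0.5f) == 0);   /* Negative.  */
}
```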
For the following testcases I see the code below with
-mfloat-abi=hard -mfpu=vfpv3 and -mcpu=cortex-a9:
float foo (int i)
{
 float v = (float)i / (1 << 11);
 return v;
}
float foa_unsigned (unsigned int i)
{
 float v = (float)i / (1 << 5);
 return v;
}
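
As a quick sanity check on what these two functions should return whichever sequence the compiler picks, here is a small host-side harness. The check_testcases routine and the chosen inputs are illustrative additions, not part of the patch or its testsuite. The last two checks pass the same bit pattern through both functions, which shows why the signed and unsigned forms need separate patterns.

```c
#include <assert.h>
#include <limits.h>

/* The two testcases from this mail.  */
float foo (int i)
{
  float v = (float)i / (1 << 11);
  return v;
}

float foa_unsigned (unsigned int i)
{
  float v = (float)i / (1 << 5);
  return v;
}

/* Illustrative checks; every expected value is exactly representable.
   The last two assume a 32-bit int, as on ARM.  */
static void
check_testcases (void)
{
  assert (foo (2048) == 1.0f);
  assert (foo (-4096) == -2.0f);
  assert (foo (1) == 0.00048828125f);            /* 2^-11 */
  assert (foa_unsigned (32u) == 1.0f);

  /* Bit pattern 0x80000000 read as signed and as unsigned.  */
  assert (foo (INT_MIN) == -1048576.0f);                 /* -2^31 / 2^11 */
  assert (foa_unsigned (0x80000000u) == 67108864.0f);    /*  2^31 / 2^5  */
}
```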
After the patch:
foo:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    fmsr	s0, r0	@ int
    vcvt.f32.s32	s0, s0, #11
    bx	lr
    .size	foo, .-foo
    .align	2
    .global	foa_unsigned
    .type	foa_unsigned, %function
foa_unsigned:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    fmsr	s0, r0	@ int
    vcvt.f32.u32	s0, s0, #5
    bx	lr
    .size	foa_unsigned, .-foa_unsigned
    .align	2
    .global	foo1
    .type	foo1, %function
rather than the following before the patch:
    .type	foo, %function
foo:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    fmsr	s15, r0	@ int
    fsitos	s0, s15
    flds	s15, .L2
    fmuls	s0, s0, s15
    bx	lr
.L3:
    .align	2
.L2:
    .word	973078528
    .size	foo, .-foo
    .align	2
    .global	foa_unsigned
    .type	foa_unsigned, %function
foa_unsigned:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    fmsr	s15, r0	@ int
    fuitos	s0, s15
    flds	s15, .L5
    fmuls	s0, s0, s15
    bx	lr
.L6:
    .align	2
.L5:
    .word	1023410176
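
For reference, the two constant-pool words in the pre-patch output are the single-precision encodings of the reciprocals 2^-11 and 2^-5, which is what the fmuls multiplies by. A short host-side check, assuming float is IEEE-754 binary32; float_from_bits and check_pool_constants are illustrative names:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit word as a float.  */
static float
float_from_bits (uint32_t bits)
{
  float f;
  memcpy (&f, &bits, sizeof f);
  return f;
}

static void
check_pool_constants (void)
{
  /* .L2: 973078528 == 0x3A000000, biased exponent 116, i.e. 2^-11.  */
  assert (float_from_bits (UINT32_C (973078528)) == 1.0f / (1 << 11));
  /* .L5: 1023410176 == 0x3D000000, biased exponent 122, i.e. 2^-5.  */
  assert (float_from_bits (UINT32_C (1023410176)) == 1.0f / (1 << 5));
}
```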