Hi Ramana,
As you pointed out, in the gcc.dg/vect/vect-double-reduc-6.c test case, built with the compiler options described in PR 51819, we see the following inefficient code generation:
vmov.32 r2, d28[0]        @ 57  vec_extractv4si      [length = 4]
vmov.32 r1, d22[0]        @ 84  vec_extractv4si      [length = 4]
str r2, [r0, #4]          @ 58  *thumb2_movsi_vfp/7  [length = 4]
vmov.32 r3, d0[0]         @ 111 vec_extractv4si      [length = 4]
str r1, [r0, #8]          @ 85  *thumb2_movsi_vfp/7  [length = 4]
vst1.32 {d2[0]}, [r0:64]  @ 31  neon_vst1_lanev4si   [length = 4]
str r3, [r0, #12]         @ 112 *thumb2_movsi_vfp/7  [length = 4]
bx lr                     @ 120 *thumb2_return       [length = 12]
(The :64 alignment in vst1.32 is incorrect; that is the actual problem in PR 51819, which is now fixed.)
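For reference, the shape of the testcase is roughly the following (a hand-written sketch, not the actual file; the kernel and bounds are made up). Each out[k] is a double reduction over the loop nest, and after vectorization the final scalar results are stored lane by lane, which is where the stores above come from:

    #define N 16

    int in[N][N];
    int out[4];

    void double_reduc (void)
    {
      for (int k = 0; k < 4; k++)
        {
          int sum = 0;
          for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
              sum += in[i][j] >> k;   /* made-up reduction kernel */
          out[k] = sum;
        }
    }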
The reason for this particular code sequence turns out to be as follows: The middle end tries to store the LSB vector lane to memory, and uses the vec_extract named pattern to do so. This pattern currently only supports an "s_register_operand" destination, and is implemented via vmov to a core register. The contents of that register are then stored to memory. So why does a vst1 instruction show up at all? Because combine is able to merge the vec_extract back into the store pattern, and ends up with a pattern that matches neon_vst1_lanev4si. Note that the latter pattern is actually intended to implement NEON built-ins (vst1_lane_... etc.).
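In intrinsics terms, the two shapes that combine ends up unifying look roughly like this (purely illustrative; the testcase itself uses no intrinsics):

    #include <arm_neon.h>

    /* What the vec_extract expansion does today: move the lane to a
       core register, then store the core register.  */
    void store_lane_via_core_reg (int32_t *p, int32x4_t v)
    {
      int32_t tmp = vgetq_lane_s32 (v, 0);   /* vmov.32 rN, dM[0] */
      *p = tmp;                              /* str rN, [p]       */
    }

    /* What combine recreates from the above: a single lane store.  */
    void store_lane_direct (int32_t *p, int32x4_t v)
    {
      vst1q_lane_s32 (p, v, 0);              /* vst1.32 {dM[0]}, [p] */
    }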
Now there seem to be two problems with this scenario:
First of all, the neon_vst1_lane<mode> patterns seem to be actually incorrect on big-endian systems due to lane-numbering problems. As I understand it, all NEON intrinsics are supposed to take lane numbers according to the NEON ordering scheme, while the vec_select RTX pattern is defined to take lane numbers according to the in-memory order. Those disagree in the big-endian case. All other patterns implementing NEON intrinsics therefore avoid using vec_select, and instead resort to using special UNSPEC codes -- the sole exception to this happens to be neon_vst1_lane<mode>. It would appear that this is actually incorrect, and the pattern ought to use a UNSPEC_VST1_LANE unspec instead (that UNSPEC code is already defined, but nowhere used).
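A minimal check of the intended semantics (hypothetical test code, and it assumes the vset_lane path itself is handled correctly): ACLE lane numbers are NEON lane numbers, so setting and then storing the same lane must round-trip on either endianness. A vst1_lane pattern that feeds its lane operand through vec_select's in-memory numbering would get this wrong on big-endian:

    #include <arm_neon.h>
    #include <assert.h>

    void check_lane_numbering (void)
    {
      int32x4_t v = vdupq_n_s32 (0);
      v = vsetq_lane_s32 (42, v, 1);   /* set NEON lane 1 */
      int32_t out = 0;
      vst1q_lane_s32 (&out, v, 1);     /* store NEON lane 1 */
      assert (out == 42);              /* must hold on both endiannesses */
    }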
Now if we make that change, then the above code sequence will contain no vst1 any more. But in any case, expanding first into a vec_extract followed by a store pattern, only to rely on combine to merge them back together, is a suboptimal approach. One obvious drawback is that the auto-inc-dec pass runs before reload, and therefore only sees plain stores -- with no reason whatsoever to attempt to introduce post-inc operations. Also, just in general it normally works out best to allow the final choice between register and memory operands to be made in reload ...
Therefore, I think the vec_extract patterns ought to support *both* register and memory destination operands, and implement those via vmov or vst1 in final code generation, as appropriate. This means that we can make the choice in reload, guided as usual by alternative ordering and/or penalties -- for example, we can choose to reload the address and still use vst1 over reloading the contents to a core register and then using an offsetted store.
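As a sketch of that trade-off (the instruction sequences in the comment are illustrative; vst1's single-lane addressing takes no immediate offset, so the address has to be materialized first):

    #include <arm_neon.h>

    void store_lane_at_offset (int32_t *p, int32x4_t v)
    {
      /* Either:  vmov.32 r3, dM[0]; str r3, [r0, #4]
         or:      add r3, r0, #4; vst1.32 {dM[0]}, [r3]  */
      p[1] = vgetq_lane_s32 (v, 0);
    }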
Finally, this sequence will also allow the auto-inc-dec pass to do a better job. The current in-tree pass unfortunately doesn't manage it, but with Richard's proposed replacement, assisted by a tweak to the cost function to make sure the (future) address reload is "priced in" correctly, I'm now seeing what appears to be the optimal code sequence:
vst1.32 {d6[0]}, [r0:64]!   @ 30  vec_extractv4si/1  [length = 4]
vst1.32 {d22[0]}, [r0]!     @ 56  vec_extractv4si/1  [length = 4]
vst1.32 {d2[0]}, [r0:64]!   @ 82  vec_extractv4si/1  [length = 4]
vst1.32 {d4[0]}, [r0]       @ 108 vec_extractv4si/1  [length = 4]
bx lr                       @ 116 *thumb2_return     [length = 12]
(Again the :64 is wrong; it's already fixed on mainline but I haven't pulled that change in yet.)
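In source terms, a hypothetical function of this shape should now compile to exactly that kind of sequence:

    #include <arm_neon.h>

    void store_four_lanes (int32_t *p, int32x4_t a, int32x4_t b,
                           int32x4_t c, int32x4_t d)
    {
      *p++ = vgetq_lane_s32 (a, 0);   /* vst1.32 {dM[0]}, [r0]! */
      *p++ = vgetq_lane_s32 (b, 0);
      *p++ = vgetq_lane_s32 (c, 0);
      *p   = vgetq_lane_s32 (d, 0);   /* vst1.32 {dM[0]}, [r0]  */
    }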
The attached patch implements the above-mentioned changes. Any comments? I'll try to get some performance numbers as well before moving forward with the patch ...
(As an aside, it might likewise be helpful to update the vec_set patterns to allow for memory operands, implemented via vld1.)
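In intrinsics terms, the analogous case would look like this (again just an illustration):

    #include <arm_neon.h>

    /* vec_set with a memory source: load one element directly into a
       lane via vld1, instead of ldr followed by vmov.  */
    int32x4_t load_into_lane (const int32_t *p, int32x4_t v)
    {
      return vld1q_lane_s32 (p, v, 1);   /* vld1.32 {dM[1]}, [p] */
    }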
(See attached file: diff-gcc-arm-vecextractmem)
B.t.w. I'm wondering how I can properly test:
- that the NEON intrinsics still work
- that everything works on big-endian
Any suggestions would be appreciated!
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz | Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht Stuttgart, HRB 243294
Hi Uli,
Thanks for the detailed analysis.
Now there seem to be two problems with this scenario:
First of all, the neon_vst1_lane<mode> patterns seem to be actually incorrect on big-endian systems due to lane-numbering problems. As I understand it, all NEON intrinsics are supposed to take lane numbers according to the NEON ordering scheme, while the vec_select RTX pattern is defined to take lane numbers according to the in-memory order. Those disagree in the big-endian case. All other patterns implementing NEON intrinsics therefore avoid using vec_select, and instead resort to using special UNSPEC codes -- the sole exception to this happens to be neon_vst1_lane<mode>. It would appear that this is actually incorrect, and the pattern ought to use a UNSPEC_VST1_LANE unspec instead (that UNSPEC code is already defined, but nowhere used).
This looks like it has been broken from Day 1 for big-endian.
Now if we make that change, then the above code sequence will contain no vst1 any more. But in any case, expanding first into a vec_extract followed by a store pattern, only to rely on combine to merge them back together, is a suboptimal approach. One obvious drawback is that the auto-inc-dec pass runs before reload, and therefore only sees plain stores -- with no reason whatsoever to attempt to introduce post-inc operations. Also, just in general it normally works out best to allow the final choice between register and memory operands to be made in reload ...

Therefore, I think the vec_extract patterns ought to support *both* register and memory destination operands, and implement those via vmov or vst1 in final code generation, as appropriate. This means that we can make the choice in reload, guided as usual by alternative ordering and/or penalties -- for example, we can choose to reload the address and still use vst1 over reloading the contents to a core register and then using an offsetted store.
This patch should be queued for 4.8. Sounds sensible to me.
(As an aside, it might likewise be helpful to update the vec_set patterns to allow for memory operands, implemented via vld1.)
Agreed.
B.t.w. I'm wondering how I can properly test:
- that the NEON intrinsics still work
The intrinsics tests in gcc.target/arm and gcc.target/arm/neon are the only ones we have today that we can run reliably. The other option is the tests from ST, but as you've discovered they need the ARM tools.
- that everything works on big-endian
We'll have to put together a big-endian toolchain and test on hardware. qemu doesn't support BE8 mode today, so this is best tested bare-metal on a Fast Model or on actual hardware, both of which will need bring-up in terms of DejaGnu glue. We're not set up for either at the moment and will probably have to spend some effort on this.
I hope this helps.
cheers
Ramana
Ramana Radhakrishnan ramana.radhakrishnan@linaro.org wrote on 01.02.2012 16:28:04:
This patch should be queued for 4.8. Sounds sensible to me.
OK, thanks for the review!
(As an aside, it might likewise be helpful to update the vec_set patterns to allow for memory operands, implemented via vld1.)
Agreed.
The attached patch adds support for vld1 in vec_set as well.
(See attached file: diff-gcc-arm-vecsetextractmem)
As a side note, I noticed that the vmov instructions output via vec_set and vec_extract use %? to support conditional execution in ARM mode; most of the rest of neon.md doesn't use %?. In particular, the patterns I copied the vst1/vld1 output from don't ...
What's the right approach here? Should there be %? everywhere, or should it at least be used with the vst1/vld1 instructions in vec_set/vec_extract?
B.t.w. I'm wondering how I can properly test:
- that the NEON intrinsics still work
The intrinsics tests in gcc.target/arm and gcc.target/arm/neon are the only ones we have today that we can run reliably. The other option is the tests from ST, but as you've discovered they need the ARM tools.
Those show no regressions with (either version of) the patch.
- that everything works on big-endian
We'll have to put together a big-endian toolchain and test on hardware. qemu doesn't support BE8 mode today, so this is best tested bare-metal on a Fast Model or on actual hardware, both of which will need bring-up in terms of DejaGnu glue. We're not set up for either at the moment and will probably have to spend some effort on this.
Ah, OK. Maybe another topic for discussion next week ...
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz | Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht Stuttgart, HRB 243294
On 1 February 2012 19:33, Ulrich Weigand Ulrich.Weigand@de.ibm.com wrote:
The attached patch adds support for vld1 in vec_set as well. (See attached file: diff-gcc-arm-vecsetextractmem)
As a side note, I noticed that the vmov instructions output via vec_set and vec_extract use %? to support conditional execution in ARM mode; most of the rest of neon.md doesn't use %?. In particular, the patterns I copied the vst1/vld1 output from don't ...
Thanks for following up with that - yes, you make a useful small point here. NEON doesn't support conditional execution, so you are right that the %? is superfluous there. In theory you can put NEON instructions in IT blocks, but that is deprecated by the architecture and in general not recommended. In fact, on certain implementations these instructions will be supported "architecturally", but performance will be terrible. However, because the pattern has a non-"none" neon_type attribute and is not marked predicable, conditional execution will never pick this up.
So yes there's no need to perpetuate that.
Those show no regressions with (either version of) the patch.
I assume by this you mean the intrinsics tests in the testsuite.
Ah, OK. Maybe another topic for discussion next week ...
Indeed, yes. :) It would be good to include Matt in this discussion when he's there next week as well.
cheers
Ramana
On 1 February 2012 23:08, Ramana Radhakrishnan ramana.radhakrishnan@linaro.org wrote:
On 1 February 2012 19:33, Ulrich Weigand Ulrich.Weigand@de.ibm.com wrote:
As a side note, I noticed that the vmov instructions output via vec_set and vec_extract use %? to support conditional execution in ARM mode; most of the rest of neon.md doesn't use %?. In particular, the patterns I copied the vst1/vld1 output from don't ...
Thanks for following up with that - yes, you make a useful small point here. NEON doesn't support conditional execution, so you are right that the %? is superfluous there. In theory you can put NEON instructions in IT blocks, but that is deprecated by the architecture and in general not recommended. In fact, on certain implementations these instructions will be supported "architecturally", but performance will be terrible. However, because the pattern has a non-"none" neon_type attribute and is not marked predicable, conditional execution will never pick this up.
I realize I was wrong in this case - the original pattern was marked predicable. I want to go back and think about this one.
Ramana
On 1 February 2012 19:33, Ulrich Weigand Ulrich.Weigand@de.ibm.com wrote:
As a side note, I noticed that the vmov instructions output via vec_set and vec_extract use %? to support conditional execution in ARM mode; most of the rest of neon.md doesn't use %?. In particular, the patterns I copied the vst1/vld1 output from don't ...
To follow up from last night: please remove the "predicable" attribute on these patterns. Conditional NEON is deprecated.
The patterns affected, from a quick grep, are:
- vec_set<mode>_internal
- vec_setv2di_internal
- vec_set<mode>
- vec_extract<mode>
- vec_extractv2di
- vec_init<mode>
- neon_vget_lane<mode>_zext_internal
- neon_vget_lane<mode>_sext_internal
- neon_vget_lane<mode>
- neon_vdup_n<mode>
- neon_vdup_ndi
There should never be a chance to generate conditional NEON instructions.
I'd like to take this up further on gcc-patches for any further review / comments.
Ramana