Re: Improved use of vst1 via vec_extract (PR 51819 test case)

1 Feb 2012

      Hi Uli,
Thanks for the detailed analysis.
...
The reason for this particular code sequence turns out to be as follows:
The middle end tries to store the LSB vector lane to memory, and uses the
vec_extract named pattern to do so.  This pattern currently only supports
an "s_register_operand" destination, and is implemented via vmov to a core
register.  The contents of that register are then stored to memory.   Now
why does any vst1 instruction show up?  This is because combine is able to
merge the vec_extract back into the store pattern and ends up with a
pattern that matches neon_vst1_lanev4si.   Note that the latter pattern is
actually intended to implement NEON built-ins (vst1_lane_... etc).
Now there seem to be two problems with this scenario:
First of all, the neon_vst1_lane<mode> patterns seem to be actually
incorrect on big-endian systems due to lane-numbering problems.  As I
understand it, all NEON intrinsics are supposed to take lane numbers
according to the NEON ordering scheme, while the vec_select RTX pattern is
defined to take lane numbers according to the in-memory order.  Those
disagree in the big-endian case.  All other patterns implementing NEON
intrinsics therefore avoid using vec_select, and instead resort to using
special UNSPEC codes -- the sole exception to this happens to be
neon_vst1_lane<mode>.  It would appear that this is actually incorrect, and
the pattern ought to use a UNSPEC_VST1_LANE unspec instead (that UNSPEC
code is already defined, but nowhere used)
This looks like it is broken from Day1 for big-endian.
...
Now if we make that change, then the above code sequence will contain no
vst1 any more.  But in any case, expanding first into a vec_extract
followed by a store pattern, only to rely on combine to merge them back
together, is a suboptimal approach.  One obvious drawback is that the
auto-inc-dec pass runs before reload, and therefore only sees plain stores
-- with no reason whatsoever to attempt to introduce post-inc operations.
Also, just in general it normally works out best to allow the final choice
between register and memory operands to be make in reload ...
Therefore, I think the vec_extract patterns ought to support *both*
register and memory destination operands, and implement those via vmov or
vst1 in final code generation, as appropriate.  This means that we can make
the choice in reload, guided as usual by alternative ordering and/or
penalties -- for example, we can choose to reload the address and still use
vst1 over reloading the contents to a core register and then using an
offsetted store.
This patch should be queued for 4.8 .Sounds sensible to me.
...
Finally, this sequence will also allow the auto-inc-dec pass to do a better
job.  The current in-tree pass doesn't manage unfortunately, but with
Richard's proposed replacement, assisted by a tweak to the cost function to
make sure the (future) address reload is "priced in" correctly, I'm now
seeing what appears to be the optimal code sequence:
vst1.32 {d6[0]}, [r0:64]!       @ 30    vec_extractv4si/1
[length = 4]
       vst1.32 {d22[0]}, [r0]! @ 56    vec_extractv4si/1       [length =
4]
       vst1.32 {d2[0]}, [r0:64]!       @ 82    vec_extractv4si/1
[length = 4]
       vst1.32 {d4[0]}, [r0]   @ 108   vec_extractv4si/1       [length =
4]
       bx      lr      @ 116   *thumb2_return  [length = 12]
(Again the :64 is wrong; it's already fixed on mainline but I haven't
pulled that change in yet.)
The attached patch implements the above-mentioned changes.   Any comments?
I'll try to get some performance numbers as well before moving forward with
the patch ...
(As an aside, it might likewise be helpful to update the vec_set patterns
to allow for memory operands, implemented via vld1.)
Agreed.
...
(See attached file: diff-gcc-arm-vecextractmem)
B.t.w. I'm wondering how I can properly test:

that the NEON intrinsics still work

The intrinsics tests in gcc.target/arm and gcc.target/arm/neon are the
only that we have today that we can run reliably. The other option is
the tests from ST but as you've discovered they need the ARM tools.
...

that everything works on big-endian

We'll have to put together a big-endian toolchain and test on
hardware. qemu doesn't support BE8 mode today and in today's world
this is best tested bare-metal on a Fast-Model or on actual hardware
both of which will need bring up in terms of dejagnu glue. We're not
set up for either today and will probably have to spend some effort in
doing this.
I hope this helps.
cheers
Ramana

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

Re: Improved use of vst1 via vec_extract (PR 51819 test case)