On 30 November 2010 14:51, Julian Brown <julian@codesourcery.com> wrote:
I think we need to somehow enhance MEM_REF, or maybe generate a MEM_REF for the first vector and a builtin after it.
Yeah, keeping these things looking like memory references to most of the compiler seems like a good plan.
Is it possible to have a list of MEM_REFs and a builtin after them:
  v0 = MEM_REF (addr)
  v1 = MEM_REF (addr + 8B)
  v2 = MEM_REF (addr + 16B)
  builtin (v0, v1, v2, stride=3, reg_stride=1, ...)
Would the builtin be changing the semantics of the preceding MEM_REF codes? If so I don't like this much (the potential for the builtin getting "separated" from the MEM_REFS by optimisation passes and causing subtle breakage seems too high). But if eliding the builtin would simply cause the code to degrade into separate loads/stores, I guess that would be OK.
The meaning of the builtin (or maybe a new tree code would be better?) is that the elements of v0, v1 and v2 are deinterleaved. I wanted the MEM_REFs, since we actually have three data accesses here, and something (builtin or tree code) to indicate the deinterleaving. Since the vectors are passed to the builtin, I don't think it's a problem if the statements get separated. When the expander sees the builtin, it has to remove the loads it created for the MEM_REFs and create a new "vector load multiple and deinterleave". Is that possible?
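In scalar terms, the deinterleaving I have in mind is equivalent to the following (just an illustration of the semantics for stride 3; the element type, the names, and the standalone function are made up, and reg_stride is a register-level detail that isn't visible here):

  #include <stdint.h>

  /* Scalar equivalent of "load three vectors and deinterleave":
     one interleaved stream at ADDR is split into three streams.  */
  void
  deinterleave3 (const int32_t *addr, int32_t *v0, int32_t *v1,
                 int32_t *v2, int n)
  {
    for (int i = 0; i < n; i++)
      {
        v0[i] = addr[3 * i];      /* elements 0, 3, 6, ...  */
        v1[i] = addr[3 * i + 1];  /* elements 1, 4, 7, ...  */
        v2[i] = addr[3 * i + 2];  /* elements 2, 5, 8, ...  */
      }
  }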
If there are other uses of the accesses, i.e., MEM_REF (addr) is used somewhere else in the loop, the vectorizer will have to create another v0' = MEM_REF (addr) for it.
to be expanded into:
  <regular RTL mem refs> (addr)
  NOTE (...)
I guess we can do something similar to load_multiple here (but it probably requires changes in neon.md as well).
Yeah, I like that idea. So we might have something like:
  (parallel [(set (reg) (mem addr))
             (set (reg+2) (mem (plus addr 8)))
             (set (reg+4) (mem (plus addr 16)))])
That should work fine, I think -- but how to do register allocation on these values remains an open problem (since GCC has no direct way of saying "allocate this set of registers contiguously"). ARM load & store multiple are only used in a couple of places, where hard regnos are already known, so they aren't directly comparable.
PowerPC also has load/store multiple, but I guess those are generated in the same phase as on ARM. Maybe there are other architectures that allocate contiguous registers, but at an earlier stage?
Also, as you probably know, we have to keep in mind that the registers do not have to be contiguous: d, d+2, d+4 is fine as well - and this case is very important for us, since this is the way to work with quadword vectors (a quadword register qN overlaps d2N and d2N+1, so e.g. a vld3 into q0-q2 is performed with the D-register lists {d0, d2, d4} and {d1, d3, d5}).
Choices I can think of are:
- Use "big integer" modes (TImode, OImode, XImode...), as in the
present patterns, but with (post-reload?) splitters to create the parallel RTX as above. So, prior to reload, the RTL would look different (like the existing patterns, with an UNSPEC?), so as to allocate the "big" integer to consecutive vector registers. This doesn't really gain anything, and I think it'd really be best if those types could be removed from the NEON backend anyway.
- Use "big vector" modes (representing multiple vector registers --
up to e.g. V16SImode). We'd have to make sure these *never* end up in core registers somehow, since that would certainly lead to reload failure. Then the parallel might be written something like this (for vld1 with four D registers):
  (parallel [(use (reg:V8SI 0))
             (set (subreg:V2SI (match_dup 0) 0) (mem addr))
             (set (subreg:V2SI (match_dup 0) 8) (mem (plus addr 8)))
             (set (subreg:V2SI (match_dup 0) 16) (mem (plus addr 16)))
             (set (subreg:V2SI (match_dup 0) 24) (mem (plus addr 24)))])
Or perhaps the same but with vec_select instead of subreg (the patch I'm working on suggests that subreg on vector types works fine, most of the time). This would require altering the vld*/vst* intrinsics -- but that's something I'm planning to do anyway -- and probably also tweaking the way "foo.val[X]" accesses (again for intrinsics) are expanded in the front ends, as a NEON-specific hack. The main worry is that I'm not sure how well the register allocator & reload will handle these large vectors.
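For reference, the intrinsics-level code whose expansion would change under this scheme looks something like the following (a minimal sketch compiled for a NEON target; vld3_s32, int32x2x3_t and the .val[X] accesses are the existing arm_neon.h interface, the function itself is made up):

  #include <arm_neon.h>
  #include <stdint.h>

  int32x2_t
  f (const int32_t *ptr)
  {
    /* vld3 fills three D registers with deinterleaved data; under
       this scheme the result would live in one "big vector", with
       each .val[X] access mapping to a subreg or vec_select.  */
    int32x2x3_t v = vld3_s32 (ptr);
    return vadd_s32 (v.val[0], v.val[2]);
  }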
The vectorizer would need a way of extracting elements or vectors from these extra-wide vectors: in terms of RTL, subreg or vec_select should suffice for that.
Why does the vectorizer have to know about this?
- Treat vld* & vst* like (or even as?) libcalls. Regular function
calls have kind-of similar constraints on register usage to these multi-register operations (i.e. arguments must be in consecutive registers), so as a hack we could reuse some of that mechanism (or create a similar mechanism), provided that we can live with vld* & vst* always working on a fixed list of registers. E.g. we'd end up with RTL:
Store:

  (set (reg:V2SI d0) (...))
  (set (reg:V2SI d1) (...))
  (call_insn (fake_vst1) (use (reg:V2SI d0))
                         (use (reg:V2SI d1)))

Load:

  (parallel [(set (reg:V2SI d0) (call_insn (fake_vld1)))
             (set (reg:V2SI d1) (...))])
  (set (...) (reg:V2SI d0))
  (set (...) (reg:V2SI d1))
(Not necessarily with actual call_insn RTL: I just wrote it like that to illustrate the general idea.)
One could envisage further hacks to lift the restriction on the fixed registers used, e.g. by allocating them in a round-robin fashion per function. Doing things this way would also require intrinsic-expansion changes, so isn't necessarily any easier than (2).
Is having a scan for special instructions before/after/during register allocation not an option?
Thanks, Ira
I think I like the second choice best: I might experiment (from the intrinsics side), to see how feasible it looks.
Julian