Hi,
Here's a work-in-progress patch which fixes many execution failures seen in big-endian mode when -mvectorize-with-neon-quad is in effect (which is soon to be the default, if we stick to the current plan). But, it's pretty hairy, and I'm not at all convinced it's not working "for the wrong reason" in a couple of places.
I'm mainly posting to gauge opinions on what we should do in big-endian mode. This patch works with the assumption that quad-word vectors in big-endian mode are in "vldm" order (i.e. with constituent double-words in little-endian order: see previous discussions). But, that's pretty confusing, leads to less than optimal code, and is bound to cause more problems in the future. So I'm not sure how much effort to expend on making it work right, given that we might be throwing that vector ordering away in the future (at least in some cases: see below).
The "problem" patterns are as follows.
* Full-vector shifts: these don't work with big-endian vldm-order quad vectors. For now, I've disabled them, although they could potentially be implemented using vtbl (at some cost).
* Widening moves (unpacks) & widening multiplies: when widening from D-reg to Q-reg size, we must swap double-words in the result (I've done this with vext). This seems to work fine, but which halves "hi" and "lo" refer to is rather muddled (in my head, at least!). These patterns should also be expanders rather than emitting multiple assembler insns. (See the example loops after this list.)
* Narrowing moves: implemented by "open-coded" permute & vmovn (for 2x D-reg -> D-reg), or 2x vmovn and vrev64.32 for Q-regs (as suggested by Paul). These seem to work fine.
* Reduction operations: when reducing Q-reg values, GCC currently tries to extract the result from the "wrong half" of the reduced vector. The fix in the attached patch is rather dubious, but seems to work (I'd like to understand better why it does).
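To make the widening-multiply and reduction items above concrete, here are two illustrative C loops (not taken from the patch; the function names are mine) of the sort that should exercise those patterns when vectorized:

    /* Widening multiply: should end up using the vec_widen_smult_lo/hi
       patterns when vectorized.  */
    void widen_mult (int *out, const short *a, const short *b, int n)
    {
      int i;
      for (i = 0; i < n; i++)
        out[i] = (int) a[i] * b[i];
    }

    /* Sum reduction: the final step should use reduc_splus_<mode>.  */
    int sum (const int *a, int n)
    {
      int i, s = 0;
      for (i = 0; i < n; i++)
        s += a[i];
      return s;
    }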
We can sort those bits out, but the question is, do we want to go that route? Vectors are used in three quite distinct ways by GCC:
1. By the vectorizer.
2. By the NEON intrinsics.
3. By the "generic vector" support.
For the first of these, I think we can get away with changing the vectorizer to use explicit "array" loads and stores (i.e. vldN/vstN), so that vector registers will hold elements in memory order -- so, all the contortions in the attached patch will be unnecessary. ABI issues are irrelevant, since vectors are "invisible" at the source code layer generally, including at ABI boundaries.
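(As an intrinsics-level illustration of what "array" loads/stores do -- this sketch is mine, not part of the patch -- vld3/vst3 load and store three D registers with the elements deinterleaved, so each register holds its elements in memory order:)

    #include <arm_neon.h>

    /* Halve the red channel of packed RGB data, 8 pixels at a time.
       Illustrative only: assumes n is a multiple of 8.  */
    void halve_red (uint8_t *rgb, int n)
    {
      int i;
      for (i = 0; i < n; i += 8)
        {
          uint8x8x3_t pix = vld3_u8 (rgb + 3 * i);  /* deinterleaving load */
          pix.val[0] = vshr_n_u8 (pix.val[0], 1);   /* red plane, memory order */
          vst3_u8 (rgb + 3 * i, pix);               /* interleaving store */
        }
    }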
For the second, intrinsics, we should do exactly what the user requests: so, vectors are essentially treated as opaque objects. This isn't a problem as such, but it might mean that instruction patterns written using "canonical" RTL for the vectorizer can't be shared with intrinsics when the order of elements matters. (I'm not sure how many patterns this would affect at present; possibly none.)
The third case would continue to use "vldm" ordering, so if users inadvertently write code such as:
res = vaddq_u32 (*foo, bar);
instead of writing an explicit vld* intrinsic (for the load of *foo), the result might be different from what they expect. It'd be nice to diagnose such code as erroneous, but that's another issue.
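(To make the distinction concrete -- this is a sketch of mine, not from the patch -- compare the generic-vector-style dereference with an explicit vld1 load:)

    #include <arm_neon.h>

    /* The load of *foo goes through the normal vector move patterns,
       i.e. "vldm" order.  */
    uint32x4_t add_deref (uint32x4_t *foo, uint32x4_t bar)
    {
      return vaddq_u32 (*foo, bar);
    }

    /* The explicit vld1 intrinsic defines the element ordering itself.  */
    uint32x4_t add_vld1 (const uint32_t *foo, uint32x4_t bar)
    {
      return vaddq_u32 (vld1q_u32 (foo), bar);
    }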
The important observation is that vectors from case 1 and from cases 2/3 never interact: it's quite safe for them to use different element orderings, without extensive changes to GCC infrastructure (i.e., multiple internal representations). I don't think I quite realised this previously.
So, anyway, back to the patch in question. The choices are, I think:
1. Apply as-is (after I've ironed out the wrinkles), and then remove the "ugly" bits at a later point when vectorizer "array load/store" support is implemented.
2. Apply a version which simply disables all the troublesome patterns until the same support appears.
Apologies if I'm retreading old ground ;-).
(The CANNOT_CHANGE_MODE_CLASS fragment is necessary to generate good code for the quad-word vec_pack_trunc_<mode> pattern. It would eventually be applied as a separate patch.)
Thoughts?
Julian
ChangeLog
gcc/
        * config/arm/arm.h (CANNOT_CHANGE_MODE_CLASS): Allow changing mode
        of vector registers.
        * config/arm/neon.md (vec_shr_<mode>, vec_shl_<mode>): Disable in
        big-endian mode.
        (reduc_splus_<mode>, reduc_smin_<mode>, reduc_smax_<mode>)
        (reduc_umin_<mode>, reduc_umax_<mode>)
        (neon_vec_unpack<US>_lo_<mode>, neon_vec_unpack<US>_hi_<mode>)
        (neon_vec_<US>mult_lo_<mode>, neon_vec_<US>mult_hi_<mode>)
        (vec_pack_trunc_<mode>, neon_vec_pack_trunc_<mode>): Handle
        big-endian mode for quad-word vectors.
Hi Julian,
On 12 November 2010 17:49, Julian Brown julian@codesourcery.com wrote:
For the first of these, I think we can get away with changing the vectorizer to use explicit "array" loads and stores (i.e. vldN/vstN), so that vector registers will hold elements in memory order -- so, all the contortions in the attached patch will be unnecessary. ABI issues are irrelevant, since vectors are "invisible" at the source code layer generally, including at ABI boundaries.
... The important observation is that vectors from case 1 and from cases 2/3 never interact: it's quite safe for them to use different element orderings, without extensive changes to GCC infrastructure (i.e., multiple internal representations). I don't think I quite realised this previously.
Do you think now that the changes in GIMPLE and RTL (a function attached to each vector) are unnecessary?
From the vectorizer's point of view, target hooks look like the easiest solution (though an ugly one). I am trying to think of something else, but nothing really makes sense.
So, anyway, back to the patch in question. The choices are, I think:
- Apply as-is (after I've ironed out the wrinkles), and then remove the "ugly" bits at a later point when vectorizer "array load/store" support is implemented.
- Apply a version which simply disables all the troublesome patterns until the same support appears.
I slightly prefer the first one, it's kind of an incremental solution.
Ira
On Mon, 15 Nov 2010 10:12:26 +0200 Ira Rosen ira.rosen@linaro.org wrote:
Hi Julian,
On 12 November 2010 17:49, Julian Brown julian@codesourcery.com wrote:
... The important observation is that vectors from case 1 and from cases 2/3 never interact: it's quite safe for them to use different element orderings, without extensive changes to GCC infrastructure (i.e., multiple internal representations). I don't think I quite realised this previously.
Do you think now that the changes in GIMPLE and RTL (a function attached to each vector) are unnecessary?
Yes, I now think they're unnecessary, at least in theory. In practice if patterns are shared between the vectorizer and generic vector machinery then they may need some attention -- e.g. we wouldn't want to use vec_shl_<mode>/vec_shr_<mode> still for big-endian generic vectors, since they'd permute the result. Maybe we could use a target macro named something like *_GENERIC_VECTORS_IN_ARRAY_ORDER to control whether such (element-ordering dependent) patterns could be used for generic vectors.
From the vectorizer's point of view, target hooks look like the easiest solution (though an ugly one). I am trying to think of something else, but nothing really makes sense.
Yeah, I was thinking something along those lines. So we might have TARGET_VECTORIZER_ARRAY_LOADS or something to use new "standard names" for loading/storing vectors in array order, rather than using plain (mem (foo)).
We'd need to figure out what the RTL for such loads/stores should look like, and whether it can represent alignment constraints, or strides, or loads/stores of multiple vector registers simultaneously. Getting it right might be a bit awkward, especially if we want to consider a scope wider than just NEON, i.e. other vector architectures also.
So, anyway, back to the patch in question. The choices are, I think:
- Apply as-is (after I've ironed out the wrinkles), and then remove the "ugly" bits at a later point when vectorizer "array load/store" support is implemented.
- Apply a version which simply disables all the troublesome patterns until the same support appears.
I slightly prefer the first one, it's kind of an incremental solution.
OK. I'll try to figure out the wrinkles.
Thanks,
Julian
On 15 November 2010 17:33, Julian Brown julian@codesourcery.com wrote:
On Mon, 15 Nov 2010 10:12:26 +0200 Ira Rosen ira.rosen@linaro.org wrote:
Hi Julian,
On 12 November 2010 17:49, Julian Brown julian@codesourcery.com wrote:
... The important observation is that vectors from case 1 and from cases 2/3 never interact: it's quite safe for them to use different element orderings, without extensive changes to GCC infrastructure (i.e., multiple internal representations). I don't think I quite realised this previously.
Do you think now that the changes in GIMPLE and RTL (a function attached to each vector) are unnecessary?
Yes, I now think they're unnecessary, at least in theory. In practice if patterns are shared between the vectorizer and generic vector machinery then they may need some attention -- e.g. we wouldn't want to use vec_shl_<mode>/vec_shr_<mode> still for big-endian generic vectors, since they'd permute the result. Maybe we could use a target macro named something like *_GENERIC_VECTORS_IN_ARRAY_ORDER to control whether such (element-ordering dependent) patterns could be used for generic vectors.
What do you mean by generic vectors? I usually use "generic vectors" to mean, for example, a vector of 4 chars emulated in a 32-bit int when there is no vector unit, but I guess you are talking about intrinsics.
From the vectorizer's point of view, target hooks look like the easiest solution (though an ugly one). I am trying to think of something else, but nothing really makes sense.
Yeah, I was thinking something along those lines. So we might have TARGET_VECTORIZER_ARRAY_LOADS or something to use new "standard names" for loading/storing vectors in array order, rather than using plain (mem (foo)).
We'd need to figure out what the RTL for such loads/stores should look like, and whether it can represent alignment constraints, or strides, or loads/stores of multiple vector registers simultaneously. Getting it right might be a bit awkward, especially if we want to consider a scope wider than just NEON, i.e. other vector architectures also.
I think we need to somehow enhance MEM_REF, or maybe generate a MEM_REF for the first vector and a builtin after it.
So, anyway, back to the patch in question. The choices are, I think:
- Apply as-is (after I've ironed out the wrinkles), and then remove the "ugly" bits at a later point when vectorizer "array load/store" support is implemented.
- Apply a version which simply disables all the troublesome patterns until the same support appears.
I slightly prefer the first one, it's kind of an incremental solution.
OK. I'll try to figure out the wrinkles.
I hope it's not too much work, because otherwise there is no point, since we are going to remove it later anyway.
Thanks, Ira
On Wed, 17 Nov 2010 12:16:34 +0200 Ira Rosen ira.rosen@linaro.org wrote:
On 15 November 2010 17:33, Julian Brown julian@codesourcery.com wrote:
On Mon, 15 Nov 2010 10:12:26 +0200 Ira Rosen ira.rosen@linaro.org wrote:
Do you think now that the changes in GIMPLE and RTL (a function attached to each vector) are unnecessary?
Yes, I now think they're unnecessary, at least in theory. In practice if patterns are shared between the vectorizer and generic vector machinery then they may need some attention -- e.g. we wouldn't want to use vec_shl_<mode>/vec_shr_<mode> still for big-endian generic vectors, since they'd permute the result. Maybe we could use a target macro named something like *_GENERIC_VECTORS_IN_ARRAY_ORDER to control whether such (element-ordering dependent) patterns could be used for generic vectors.
What do you mean by generic vectors? I usually use "generic vectors" to mean, for example, a vector of 4 chars emulated in a 32-bit int when there is no vector unit, but I guess you are talking about intrinsics.
I mean the user-visible support for directly writing cross-platform vector code, as described in:
http://gcc.gnu.org/onlinedocs/gcc-4.5.1/gcc/Vector-Extensions.html
I.e., declaring a type using the vector_size attribute, then using regular arithmetic operators on it (some work is being done at present to extend the number of available operations a little). AFAIK use of this feature isn't especially widespread, nor particularly well-specified: I mainly brought it up to mention the possibility of inadvertently mixing "generic vector" dereferences with NEON intrinsics.
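(For reference, a minimal example of the generic vector extension -- the type and function names are mine, just to show the sort of code meant here:)

    typedef int v4si __attribute__ ((vector_size (16)));

    v4si add_v4si (v4si a, v4si b)
    {
      return a + b;   /* element-wise addition, no intrinsics involved */
    }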
We'd need to figure out what the RTL for such loads/stores should look like, and whether it can represent alignment constraints, or strides, or loads/stores of multiple vector registers simultaneously. Getting it right might be a bit awkward, especially if we want to consider a scope wider than just NEON, i.e. other vector architectures also.
I think we need to somehow enhance MEM_REF, or maybe generate a MEM_REF for the first vector and a builtin after it.
Yeah, keeping these things looking like memory references to most of the compiler seems like a good plan.
I slightly prefer the first one, it's kind of an incremental solution.
OK. I'll try to figure out the wrinkles.
I hope it's not too much work, because otherwise there is no point, since we are going to remove it later anyway.
Shouldn't be more than 1-2 days, but I've been distracted by other bugs...
Julian
On 17 November 2010 13:21, Julian Brown julian@codesourcery.com wrote:
We'd need to figure out what the RTL for such loads/stores should look like, and whether it can represent alignment constraints, or strides, or loads/stores of multiple vector registers simultaneously.
Alignment info is kept in struct ptr_info_def. Is it necessary to represent the stride? Multiple loads/stores seem the most complicated part to me. In neon.md, vld is implemented with output_asm_insn. Is that going to change? Does it ensure consecutive (or stride-two) registers?
Getting it right might be a bit awkward, especially if we want to consider a scope wider than just NEON, i.e. other vector architectures also.
I think we need to somehow enhance MEM_REF, or maybe generate a MEM_REF for the first vector and a builtin after it.
Yeah, keeping these things looking like memory references to most of the compiler seems like a good plan.
Is it possible to have a list of MEM_REFs and a builtin after them:
v0 = MEM_REF (addr)
v1 = MEM_REF (addr + 8B)
v2 = MEM_REF (addr + 16B)
builtin (v0, v1, v2, stride=3, reg_stride=1, ...)
to be expanded into:
<regular RTL mem refs> (addr)
NOTE (...)
and then combined into vld3?
Ira
On 22 November 2010 13:46, Ira Rosen ira.rosen@linaro.org wrote:
On 17 November 2010 13:21, Julian Brown julian@codesourcery.com wrote:
We'd need to figure out what the RTL for such loads/stores should look like, and whether it can represent alignment constraints, or strides, or loads/stores of multiple vector registers simultaneously.
Alignment info is kept in struct ptr_info_def. Is it necessary to represent the stride? Multiple loads/stores seem the most complicated part to me. In neon.md, vld is implemented with output_asm_insn. Is that going to change? Does it ensure consecutive (or stride-two) registers?
Getting it right might be a bit awkward, especially if we want to consider a scope wider than just NEON, i.e. other vector architectures also.
I think we need to somehow enhance MEM_REF, or maybe generate a MEM_REF for the first vector and a builtin after it.
Yeah, keeping these things looking like memory references to most of the compiler seems like a good plan.
Is it possible to have a list of MEM_REFs and a builtin after them:
v0 = MEM_REF (addr)
v1 = MEM_REF (addr + 8B)
v2 = MEM_REF (addr + 16B)
builtin (v0, v1, v2, stride=3, reg_stride=1, ...)
to be expanded into:
<regular RTL mem refs> (addr)
NOTE (...)
I guess we can do something similar to load_multiple here (but it probably requires changes in neon.md as well).
Ira
On Tue, 30 Nov 2010 12:25:18 +0200 Ira Rosen ira.rosen@linaro.org wrote:
On 22 November 2010 13:46, Ira Rosen ira.rosen@linaro.org wrote:
On 17 November 2010 13:21, Julian Brown julian@codesourcery.com wrote:
We'd need to figure out what the RTL for such loads/stores should look like, and whether it can represent alignment constraints, or strides, or loads/stores of multiple vector registers simultaneously.
Alignment info is kept in struct ptr_info_def. Is it necessary to represent the stride? Multiple loads/stores seem the most complicated part to me. In neon.md, vld is implemented with output_asm_insn. Is that going to change? Does it ensure consecutive (or stride-two) registers?
Getting it right might be a bit awkward, especially if we want to consider a scope wider than just NEON, i.e. other vector architectures also.
I think we need to somehow enhance MEM_REF, or maybe generate a MEM_REF for the first vector and a builtin after it.
Yeah, keeping these things looking like memory references to most of the compiler seems like a good plan.
Is it possible to have a list of MEM_REFs and a builtin after them:
v0 = MEM_REF (addr)
v1 = MEM_REF (addr + 8B)
v2 = MEM_REF (addr + 16B)
builtin (v0, v1, v2, stride=3, reg_stride=1, ...)
Would the builtin be changing the semantics of the preceding MEM_REF codes? If so I don't like this much (the potential for the builtin getting "separated" from the MEM_REFS by optimisation passes and causing subtle breakage seems too high). But if eliding the builtin would simply cause the code to degrade into separate loads/stores, I guess that would be OK.
to be expanded into:
<regular RTL mem refs> (addr)
NOTE (...)
I guess we can do something similar to load_multiple here (but it probably requires changes in neon.md as well).
Yeah, I like that idea. So we might have something like:
(parallel [(set (reg) (mem addr))
           (set (reg+2) (mem (plus addr 8)))
           (set (reg+4) (mem (plus addr 16)))])
That should work fine I think -- but how to do register-allocation on these values remains an open problem (since GCC has no direct way of saying "allocate this set of registers contiguously"). ARM load & store multiple are only used in a couple of places, where hard regnos are already known, so aren't directly comparable.
Choices I can think of are:
1. Use "big integer" modes (TImode, OImode, XImode...), as in the present patterns, but with (post-reload?) splitters to create the parallel RTX as above. So, prior to reload, the RTL would look different (like the existing patterns, with an UNSPEC?), so as to allocate the "big" integer to consecutive vector registers. This doesn't really gain anything, and I think it'd really be best if those types could be removed from the NEON backend anyway.
2. Use "big vector" modes (representing multiple vector registers -- up to e.g. V16SImode). We'd have to make sure these *never* end up in core registers somehow, since that would certainly lead to reload failure. Then the parallel might be written something like this (for vld1 with four D registers):
(parallel [(use (reg:V8SI 0))
           (set (subreg:V2SI (match_dup 0) 0) (mem))
           (set (subreg:V2SI (match_dup 0) 8) (plus (mem) 8))
           (set (subreg:V2SI (match_dup 0) 16) (plus (mem) 16))
           (set (subreg:V2SI (match_dup 0) 24) (plus (mem) 24))])
Or perhaps the same but with vec_select instead of subreg (the patch I'm working on suggests that subreg on vector types works fine, most of the time). This would require altering vld*/vst* intrinsics -- but that's something I'm planning to do anyway, and probably also tweaking the way "foo.val[X]" access (again for intrinsics) is expanded in the front-ends, as a NEON-specific hack. The main worry is that I'm not sure how well the register allocator & reload will handle these large vectors.
The vectorizer would need a way of extracting elements or vectors from these extra-wide vectors: in terms of RTL, subreg or vec_select should suffice for that.
3. Treat vld* & vst* like (or even as?) libcalls. Regular function calls have kind-of similar constraints on register usage to these multi-register operations (i.e. arguments must be in consecutive registers), so as a hack we could reuse some of that mechanism (or create a similar mechanism), provided that we can live with vld* & vst* always working on a fixed list of registers. E.g. we'd end up with RTL:
Store,
(set (reg:V2SI d0) (...))
(set (reg:V2SI d1) (...))
(call_insn (fake_vst1) (use (reg:V2SI d0)) (use (reg:V2SI d1)))
Load,
(parallel (set (reg:V2SI d0) (call_insn (fake_vld1)))
          (set (reg:V2SI d1) (...)))
(set (...) (reg:V2SI d0))
(set (...) (reg:V2SI d1))
(Not necessarily with actual call_insn RTL: I just wrote it like that to illustrate the general idea.)
One could envisage further hacks to lift the restriction on the fixed registers used, e.g. by allocating them in a round-robin fashion per function. Doing things this way would also require intrinsic-expansion changes, so isn't necessarily any easier than (2).
I think I like the second choice best: I might experiment (from the intrinsics side), to see how feasible it looks.
Julian
On 30 November 2010 14:51, Julian Brown julian@codesourcery.com wrote:
I think we need to somehow enhance MEM_REF, or maybe generate a MEM_REF for the first vector and a builtin after it.
Yeah, keeping these things looking like memory references to most of the compiler seems like a good plan.
Is it possible to have a list of MEM_REFs and a builtin after them:
v0 = MEM_REF (addr)
v1 = MEM_REF (addr + 8B)
v2 = MEM_REF (addr + 16B)
builtin (v0, v1, v2, stride=3, reg_stride=1, ...)
Would the builtin be changing the semantics of the preceding MEM_REF codes? If so I don't like this much (the potential for the builtin getting "separated" from the MEM_REFS by optimisation passes and causing subtle breakage seems too high). But if eliding the builtin would simply cause the code to degrade into separate loads/stores, I guess that would be OK.
The meaning of the builtin (or maybe a new tree code would be better?) is that the elements of v0, v1 and v2 are deinterleaved. I wanted the MEM_REFs, since we actually have three data accesses here, and something (builtin or tree code) to indicate the deinterleaving. Since the vectors are passed to the builtin, I don't think it's a problem if the statements get separated. When the expander sees the builtin, it has to remove the loads it created for the MEM_REFs and create a new "vector load multiple and deinterleave". Is that possible?
If there are other uses of the accesses, i.e., MEM_REF (addr) is used somewhere else in the loop, the vectorizer will have to create another v0' = MEM_REF (addr) for it.
to be expanded into:
<regular RTL mem refs> (addr)
NOTE (...)
I guess we can do something similar to load_multiple here (but it probably requires changes in neon.md as well).
Yeah, I like that idea. So we might have something like:
(parallel [(set (reg) (mem addr))
           (set (reg+2) (mem (plus addr 8)))
           (set (reg+4) (mem (plus addr 16)))])
That should work fine I think -- but how to do register-allocation on these values remains an open problem (since GCC has no direct way of saying "allocate this set of registers contiguously"). ARM load & store multiple are only used in a couple of places, where hard regnos are already known, so aren't directly comparable.
PowerPC also has load/store multiple, but I guess they are generated in the same phase as on ARM. Maybe there are other architectures that allocate contiguous registers, but do so earlier?
Also, as you probably know, we have to keep in mind that the registers do not have to be contiguous: d, d+2, d+4 is fine as well - and this case is very important for us since this is the way to work with quadword vectors.
Choices I can think of are:
- Use "big integer" modes (TImode, OImode, XImode...), as in the
present patterns, but with (post-reload?) splitters to create the parallel RTX as above. So, prior to reload, the RTL would look different (like the existing patterns, with an UNSPEC?), so as to allocate the "big" integer to consecutive vector registers. This doesn't really gain anything, and I think it'd really be best if those types could be removed from the NEON backend anyway.
- Use "big vector" modes (representing multiple vector registers --
up to e.g. V16SImode). We'd have to make sure these *never* end up in core registers somehow, since that would certainly lead to reload failure. Then the parallel might be written something like this (for vld1 with four D registers):
(parallel [(use (reg:V8SI 0))
           (set (subreg:V2SI (match_dup 0) 0) (mem))
           (set (subreg:V2SI (match_dup 0) 8) (plus (mem) 8))
           (set (subreg:V2SI (match_dup 0) 16) (plus (mem) 16))
           (set (subreg:V2SI (match_dup 0) 24) (plus (mem) 24))])
Or perhaps the same but with vec_select instead of subreg (the patch I'm working on suggests that subreg on vector types works fine, most of the time). This would require altering vld*/vst* intrinsics -- but that's something I'm planning to do anyway, and probably also tweaking the way "foo.val[X]" access (again for intrinsics) is expanded in the front-ends, as a NEON-specific hack. The main worry is that I'm not sure how well the register allocator & reload will handle these large vectors.
The vectorizer would need a way of extracting elements or vectors from these extra-wide vectors: in terms of RTL, subreg or vec_select should suffice for that.
Why does the vectorizer have to know about this?
- Treat vld* & vst* like (or even as?) libcalls. Regular function
calls have kind-of similar constraints on register usage to these multi-register operations (i.e. arguments must be in consecutive registers), so as a hack we could reuse some of that mechanism (or create a similar mechanism), provided that we can live with vld* & vst* always working on a fixed list of registers. E.g. we'd end up with RTL:
Store,
(set (reg:V2SI d0) (...))
(set (reg:V2SI d1) (...))
(call_insn (fake_vst1) (use (reg:V2SI d0)) (use (reg:V2SI d1)))
Load,
(parallel (set (reg:V2SI d0) (call_insn (fake_vld1)))
          (set (reg:V2SI d1) (...)))
(set (...) (reg:V2SI d0))
(set (...) (reg:V2SI d1))
(Not necessarily with actual call_insn RTL: I just wrote it like that to illustrate the general idea.)
One could envisage further hacks to lift the restriction on the fixed registers used, e.g. by allocating them in a round-robin fashion per function. Doing things this way would also require intrinsic-expansion changes, so isn't necessarily any easier than (2).
Is having a scan for special instructions before/after/during register allocation not an option?
Thanks, Ira
On Wed, 1 Dec 2010 11:16:16 +0200 Ira Rosen ira.rosen@linaro.org wrote:
v0 = MEM_REF (addr)
v1 = MEM_REF (addr + 8B)
v2 = MEM_REF (addr + 16B)
builtin (v0, v1, v2, stride=3, reg_stride=1, ...)
Would the builtin be changing the semantics of the preceding MEM_REF codes? If so I don't like this much (the potential for the builtin getting "separated" from the MEM_REFS by optimisation passes and causing subtle breakage seems too high). But if eliding the builtin would simply cause the code to degrade into separate loads/stores, I guess that would be OK.
The meaning of the builtin (or maybe a new tree code would be better?) is that the elements of v0, v1 and v2 are deinterleaved. I wanted the MEM_REFs, since we actually have three data accesses here, and something (builtin or tree code) to indicate the deinterleaving. Since the vectors are passed to the builtin, I don't think it's a problem if the statements get separated. When the expander sees the builtin, it has to remove the loads it created for the MEM_REFs and create a new "vector load multiple and deinterleave". Is that possible?
I can imagine it might get awkward if the builtin ends up in a different basic block from the MEMs -- but I don't know enough about the tree optimization passes to say if that's likely to ever happen.
If there are other uses of the accesses, i.e., MEM_REF (addr) is used somewhere else in the loop, the vectorizer will have to create another v0' = MEM_REF (addr) for it.
OK, I see. I think this representation sounds quite reasonable to me, FWIW.
to be expanded into:
<regular RTL mem refs> (addr)
NOTE (...)
I guess we can do something similar to load_multiple here (but it probably requires changes in neon.md as well).
Yeah, I like that idea. So we might have something like:
(parallel [(set (reg) (mem addr))
           (set (reg+2) (mem (plus addr 8)))
           (set (reg+4) (mem (plus addr 16)))])
That should work fine I think -- but how to do register-allocation on these values remains an open problem (since GCC has no direct way of saying "allocate this set of registers contiguously"). ARM load & store multiple are only used in a couple of places, where hard regnos are already known, so aren't directly comparable.
PowerPC also has load/store multiple, but I guess they are generated in the same phase as on ARM. Maybe there are other architectures that allocate contiguous registers, but do so earlier?
I don't know about other architectures which do that.
Also, as you probably know, we have to keep in mind that the registers do not have to be contiguous: d, d+2, d+4 is fine as well - and this case is very important for us since this is the way to work with quadword vectors.
Right...
Choices I can think of are:
- Use "big integer" modes (TImode, OImode, XImode...), as in the
present patterns, but with (post-reload?) splitters to create the parallel RTX as above. So, prior to reload, the RTL would look different (like the existing patterns, with an UNSPEC?), so as to allocate the "big" integer to consecutive vector registers. This doesn't really gain anything, and I think it'd really be best if those types could be removed from the NEON backend anyway.
- Use "big vector" modes (representing multiple vector registers --
up to e.g. V16SImode). We'd have to make sure these *never* end up in core registers somehow, since that would certainly lead to reload failure. Then the parallel might be written something like this (for vld1 with four D registers):
(parallel [(use (reg:V8SI 0))
           (set (subreg:V2SI (match_dup 0) 0) (mem))
           (set (subreg:V2SI (match_dup 0) 8) (plus (mem) 8))
           (set (subreg:V2SI (match_dup 0) 16) (plus (mem) 16))
           (set (subreg:V2SI (match_dup 0) 24) (plus (mem) 24))])
Or perhaps the same but with vec_select instead of subreg (the patch I'm working on suggests that subreg on vector types works fine, most of the time). This would require altering vld*/vst* intrinsics -- but that's something I'm planning to do anyway, and probably also tweaking the way "foo.val[X]" access (again for intrinsics) is expanded in the front-ends, as a NEON-specific hack. The main worry is that I'm not sure how well the register allocator & reload will handle these large vectors.
The vectorizer would need a way of extracting elements or vectors from these extra-wide vectors: in terms of RTL, subreg or vec_select should suffice for that.
Why does the vectorizer have to know about this?
- Treat vld* & vst* like (or even as?) libcalls. Regular function
calls have kind-of similar constraints on register usage to these multi-register operations (i.e. arguments must be in consecutive registers), so as a hack we could reuse some of that mechanism (or create a similar mechanism), provided that we can live with vld* & vst* always working on a fixed list of registers. E.g. we'd end up with RTL:
Store,
(set (reg:V2SI d0) (...))
(set (reg:V2SI d1) (...))
(call_insn (fake_vst1) (use (reg:V2SI d0)) (use (reg:V2SI d1)))
Load,
(parallel (set (reg:V2SI d0) (call_insn (fake_vld1)))
          (set (reg:V2SI d1) (...)))
(set (...) (reg:V2SI d0))
(set (...) (reg:V2SI d1))
(Not necessarily with actual call_insn RTL: I just wrote it like that to illustrate the general idea.)
One could envisage further hacks to lift the restriction on the fixed registers used, e.g. by allocating them in a round-robin fashion per function. Doing things this way would also require intrinsic-expansion changes, so isn't necessarily any easier than (2).
Is having a scan for special instructions before/after/during register allocation not an option?
I guess it is -- I think a pass before register allocation is broadly equivalent to choice (3) above -- knowing that the register allocator isn't smart enough to deal with contiguous register sets, so replacing those registers with hard regs (and copying values in/out as appropriate) before allocation proper.
The ideal solution would be to deal with those registers during the register allocation phase, i.e. by adding generalised support for allocating contiguous (or stride-2, ...) registers to the register allocator. I'm kind of assuming this is intractable in practice though, useful as it'd be. I'm far from an expert in the "new" RA though -- maybe others have input on how feasible it might be to add such support?
Julian
On Wed, Dec 01, 2010 at 11:24:01AM +0000, Julian Brown wrote:
PowerPC also has load/store multiple, but I guess they are generated in the same phase as for ARM. Maybe there are other architectures that do that allocate contiguous register but earlier?
I don't know about other architectures which do that.
PowerPC essentially restricts load/store multiple instruction use to the prologue/epilogue. There are some tricks it can play with load-string instructions, but those are restricted to hard registers (see movmemsi_$Nreg patterns in rs6000.md). There's also the *ldmsi$N patterns in rs6000.md, but it's not clear to me what conditions those get generated under (combine?).
-Nathan
On Wed, Dec 01, 2010 at 11:16:16AM +0200, Ira Rosen wrote:
The meaning of the builtin (or maybe a new tree code would be better?) is that the elements of v0, v1 and v2 are deinterleaved. I wanted the MEM_REFs, since we actually have three data accesses here, and something (builtin or tree code) to indicate the deinterleaving. Since the vectors are passed to the builtin, I don't think it's a problem if the statements get separated. When the expander sees the builtin, it has to remove the loads it created for the MEM_REFs and create a new "vector load multiple and deinterleave". Is that possible?
This is a problem I've struggled with before. My only caution is that representing the MEM_REF's separately from the deinterleaving in the IR allows all sorts of ways (many we haven't thought of yet) for them to get separated, and there's no instruction to efficiently implement the deinterleaving from registers. For instance, suppose a pseudo gets propagated into the builtin and we can't find the MEM_REFs any more. The resulting code could easily be worse than pre-vectorization.
On 1 December 2010 17:57, Daniel Jacobowitz dan@codesourcery.com wrote:
On Wed, Dec 01, 2010 at 11:16:16AM +0200, Ira Rosen wrote:
The meaning of the builtin (or maybe a new tree code would be better?) is that the elements of v0, v1 and v2 are deinterleaved. I wanted the MEM_REFs, since we actually have three data accesses here, and something (builtin or tree code) to indicate the deinterleaving. Since the vectors are passed to the builtin, I don't think it's a problem if the statements get separated. When the expander sees the builtin, it has to remove the loads it created for the MEM_REFs and create a new "vector load multiple and deinterleave". Is that possible?
This is a problem I've struggled with before. My only caution is that representing the MEM_REF's separately from the deinterleaving in the IR allows all sorts of ways (many we haven't thought of yet) for them to get separated, and there's no instruction to efficiently implement the deinterleaving from registers. For instance, suppose a pseudo gets propagated into the builtin and we can't find the MEM_REFs any more. The resulting code could easily be worse than pre-vectorization.
I see. So one builtin for everything, like
vector_load_deinterleave (v0, v1, v2,..., stride,...)
is our only option?
Thanks, Ira
On Thu, Dec 02, 2010 at 10:54:32AM +0200, Ira Rosen wrote:
On 1 December 2010 17:57, Daniel Jacobowitz dan@codesourcery.com wrote:
On Wed, Dec 01, 2010 at 11:16:16AM +0200, Ira Rosen wrote:
The meaning of the builtin (or maybe a new tree code would be better?) is that the elements of v0, v1 and v2 are deinterleaved. I wanted the MEM_REFs, since we actually have three data accesses here, and something (builtin or tree code) to indicate the deinterleaving. Since the vectors are passed to the builtin, I don't think it's a problem if the statements get separated. When the expander sees the builtin, it has to remove the loads it created for the MEM_REFs and create a new "vector load multiple and deinterleave". Is that possible?
This is a problem I've struggled with before. My only caution is that representing the MEM_REF's separately from the deinterleaving in the IR allows all sorts of ways (many we haven't thought of yet) for them to get separated, and there's no instruction to efficiently implement the deinterleaving from registers. For instance, suppose a pseudo gets propagated into the builtin and we can't find the MEM_REFs any more. The resulting code could easily be worse than pre-vectorization.
I see. So one builtin for everything, like
vector_load_deinterleave (v0, v1, v2,..., stride,...)
is our only option?
It's not the only option; the way you've described might work, too.
But yes, it's my opinion that a single builtin is less likely to generate something the compiler can't recover from.
On Wed, 17 Nov 2010 11:21:03 +0000 Julian Brown julian@codesourcery.com wrote:
Shouldn't be more than 1-2 days, but I've been distracted by other bugs...
Well, this took a little longer than I thought. I think it's a good incremental improvement though: it should improve code in a few cases in little-endian mode, as well as fixing some of the bugs in big-endian mode.
I wrote down a few rules regarding element numberings and subregs, which might make the patch easier to understand:
1. For D-sized subregs of Q registers:
  Offset 0: least-significant part (always)
  Offset 8: most-significant part (always)
By my (possibly incorrect) reading of:
http://gcc.gnu.org/onlinedocs/gccint/Regs-and-Memory.html#index-subreg-2013
This is true because the byte offsets follow memory ordering: in either little-endian or big-endian mode, the less-significant D register in a pair comprising a Q register corresponds to a lower memory address.
2. For lane access we should use:
Big-endian mode,
  least significant  ....  most significant
  Dn[halfelts-1]..Dn[0]   D(n+1)[halfelts-1]..D(n+1)[0]
Little-endian mode,
  least significant  ....  most significant
  Dn[0]..Dn[halfelts-1]   D(n+1)[0]..D(n+1)[halfelts-1]
3. GCC's expectation for lane-access is that:
Big-endian mode,
  least significant  ....  most significant
  D[elts-1] or Q[elts-1]  ....  D[0] or Q[0]
Little-endian mode,
  least significant  ....  most significant
  D[0] or Q[0]  ....  D[elts-1] or Q[elts-1]
4. "Lo" refers to the least-significant part of a vector register, and "hi" refers to the most-significant part.
The patch touches quite a number of patterns, updating them to the "new" interpretation of lane numbers outlined above. A couple of things remain a little awkward:
* Two places in generic code depend on the endianness of the target in ways which don't seem to match (4) above (supportable_widening_operation and vect_permute_store_chain): "lo" and "hi" have their meanings exchanged in big-endian mode. I've worked around this by introducing a VECTOR_ELEMENTS_BIG_ENDIAN macro, which is always false on ARM: I'm not very happy with it though. The effect is simply to stop the "swapping" happening for ARM. I vaguely think the middle-end code which does this swapping is probably incorrect to start with.
* I've used subregs to avoid emitting useless "vmov" instructions in many cases, but this causes problems (in combine) with unpack instructions which use vector sign_extend/zero_extend operations: basically I think that something in combine doesn't understand that the zero/sign extensions are happening element-wise, thus considers them to be somehow equivalent to the subreg op. I'm not entirely sure what's causing this, so I've just worked around it using UNSPEC for now (marked with FIXME).
Tested very lightly (vect.exp only, little/big-endian, with and without -mvectorize-with-neon-quad). Currently against CS's trunk branch. Any comments or suggestions? I'm not sure where I should apply this if it's OK and passes more thorough testing...
Cheers,
Julian
ChangeLog
gcc/
        * defaults.h (VECTOR_ELEMENTS_BIG_ENDIAN): Define.
        * tree-vect-data-refs.c (vect_permute_store_chain): Use
        VECTOR_ELEMENTS_BIG_ENDIAN.
        * tree-vect-stmts.c (supportable_widening_operation): Likewise.
        * config/arm/arm.c (arm_can_change_mode_class): New.
        * config/arm/arm.h (VECTOR_ELEMENTS_BIG_ENDIAN): New.
        (CANNOT_CHANGE_MODE_CLASS): Use arm_can_change_mode_class.
        * config/arm/arm-protos.h (arm_can_change_mode_class): Add
        prototype.
        * config/arm/neon.md (SE_magic): New code attribute.
        (vec_extract<mode>): Alter element numbering used for extract
        operations in big-endian mode.
        (vec_shr_<mode>, vec_shl_<mode>): Disable in big-endian mode.
        (neon_move_lo_quad_<mode>, neon_move_hi_quad_<mode>): Remove.
        (move_hi_quad_<mode>, move_lo_quad_<mode>): Use subregs.
        (neon_vec_unpack<US>_lo_move, neon_vec_unpack<US>_hi_mode): Use
        s_register_operand, fix output formatting.
        (vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>): Fix for
        big-endian mode.
        (neon_vec_<US>mult_lo_<mode>, neon_vec_<US>mult_hi_<mode>): Use
        s_register_operand, fix output formatting.
        (vec_widen_<US>mult_lo_<mode>, vec_widen_<US>mult_hi_<mode>): Fix
        for big-endian mode.
        (neon_unpack_<US>_mode): Use s_register_operand.
        (vec_unpack<US>_lo_<mode>, vec_unpack<US>_hi_<mode>): Use subregs
        instead of neon_vget_low/high. Work around combiner breakage.
        (neon_vec_<US>mult_<mode>): (D reg version) use s_register_operand.
        (vec_widen_<US>mult_hi_<mode>, vec_widen_<US>mult_lo_<mode>):
        Similar (D reg versions).
        (vec_pack_trunc_<mode>): (D reg version) Change to expander. Use
        s_register_operand. Use vector subregs.
        (neon_vec_pack_trunc_<mode>): Use s_register_operand.
        (vec_pack_trunc_<mode>): (Q reg version) Use s_register_operand.
        Fix for big-endian mode.
On Tue, 30 Nov 2010, Julian Brown wrote:
* defaults.h (VECTOR_ELEMENTS_BIG_ENDIAN): Define.
Apart from the point that new target macros should be hooks, the *very first* thing to do with any new macro or hook is to write the .texi documentation, which appears to be missing from this patch. This is best done before making any consequent code changes. That documentation needs to make clear:
* What the effects are on the semantics of GNU C code (that uses generic vector extensions - which in 4.6 include subscripting), if any.
* What the effects are on the semantics of GENERIC and GIMPLE using vector types, if any. If the semantics of memory references to such types are affected, or the semantics of any other existing GENERIC or GIMPLE operation (most likely any operations that may exist involving lane numbers) then you need to make clear how, and ensure that any documentation / comments for that operation are updated as well; the patch submission needs to make clear how you have audited all existing code for correctness given such a change - any code mentioning a GENERIC or GIMPLE code may do transformations assuming particular semantics.
* Likewise, for all machine-independent RTL operations.
* For UNSPECs the documentation issue doesn't arise, but the audit is still relevant.
On Tue, 30 Nov 2010 16:06:32 +0000 (UTC) "Joseph S. Myers" joseph@codesourcery.com wrote:
On Tue, 30 Nov 2010, Julian Brown wrote:
* defaults.h (VECTOR_ELEMENTS_BIG_ENDIAN): Define.
Apart from the point that new target macros should be hooks, the *very first* thing to do with any new macro or hook is to write the .texi documentation, which appears to be missing from this patch. This is best done before making any consequent code changes. That documentation needs to make clear:
I did say I didn't like that bit much :-). With this particular patch (intended merely as a stop-gap measure until we get "array order" loads/stores working with the vectorizer), I don't think it's a good idea to introduce changes to GENERIC/GIMPLE semantics, and you're quite right that adding that macro (or an equivalent target hook) implies doing exactly that. That wasn't my intention at all at this stage, though we may eventually end up doing something similar.
I'll try to think of an alternative way of getting the vectorizer to do what I want, for now...
Thanks,
Julian