Sorry for the delay in my response, I was sick last week.
I've been spending this week playing around with various representations of the v{ld,st}{1,2,3,4}{,_lane} operations. I agree with Ira that the best representation would be to use built-in functions.
One concern in the original discussion was that the optimisers might move the original MEM_REFs away from the call. I don't think that's a problem though. For loads, we can simply treat the whole of the accessed memory as an array, and pass the array by value. If we do that, then the call would just look like:
__builtin_load_lanes (MEM_REF[(elem[N] *)ADDR])
(where, despite the C notation, the MEM_REF accesses the whole of elem[N]).
It is of course possible in principle for the tree optimisers to replace this MEM_REF with another, equivalent, one, but that's OK semantically. It isn't possible for the optimisers to replace it with something like an SSA name, because arrays can't be stored in gimple registers.
__builtin_load_lanes would then be used like this:
    combined_vectors = __builtin_load_lanes (...);
    vector1 = ...extract first vector from combined_vectors...
    vector2 = ...extract second vector from combined_vectors...
    ....
This looks good from the vectorizer point of view.
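To make the load-then-extract pattern concrete, here is a minimal scalar sketch in plain C of what a vld2-style load does. The names model_load_lanes and combined_vectors_t, and the sizes N = 2 and M = 4, are assumptions made up for illustration; they are not part of the prototype or of GCC. The struct wrapper is only there because a C function cannot return a bare array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { N = 2, M = 4 };  /* hypothetical sizes: a vld2 of four-lane vectors */

/* Stand-in for "array N of vector M of X". */
typedef struct { uint32_t vec[N][M]; } combined_vectors_t;

/* Scalar model of the de-interleaving load: memory element j*N + i
   becomes lane j of vector i.  */
static combined_vectors_t model_load_lanes(const uint32_t mem[N * M])
{
    combined_vectors_t result;
    assert(mem != NULL);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            result.vec[i][j] = mem[j * N + i];
    return result;
}
```

With memory holding { 0, 100, 1, 101, 2, 102, 3, 103 }, the first extracted vector (result.vec[0]) would be { 0, 1, 2, 3 } and the second (result.vec[1]) would be { 100, 101, 102, 103 }.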
So combined_vectors only exists for load and extract operations. The question then is: what type should it have? (At this point I'm just talking about types, not modes.) The main possibilities seemed to be:
an integer type

  Pros
    * Gimple registers can store integers.

  Cons
    * As Julian points out, GCC doesn't really support integer types that are wider than 2 HOST_WIDE_INTs. It would be good to remove that restriction, but it might be a lot of work, and it isn't something we'd want to take on as part of this project.
    * We're not really using the type as an integer.
    * The combination of the integer type and the __builtin_load_lanes array argument wouldn't be enough to determine the correct load operation. __builtin_load_lanes would need something like a vector count (N => vldN) argument as well.
a combined vector type

  Pros
    * Gimple registers can store vectors.

  Cons
    * For vld3, this would mean creating vector types with non-power-of-two vectors. GCC doesn't support those yet, and you get ICEs as soon as you try to use them. (Remember that this is all about types, not modes.)
      It _might_ be interesting to implement this support, but as above, it would be a lot of work. It also raises some semantic questions, such as: what is the alignment of the new vectors? Which leads to...
    * The alignment of the type would be strange. E.g. suppose we're loading N*2 uint32_ts into N vectors of 2 elements each. The types and alignments would be:

        N=2  uint32x4_t, alignment 16
        N=3  uint32x6_t, alignment 8 (if we follow the convention for modes)
        N=4  uint32x8_t, alignment 32

      We don't need alignments greater than 8 in our intended use; 16 and 32 are overkill.
    * We're not really using the type as a single vector, but as a collection of vectors.
    * The combination of the vector type and the __builtin_load_lanes array argument wouldn't be enough to determine the correct load operation. __builtin_load_lanes would need something like a vector count (N => vldN) argument as well.
an array of vectors type

  Pros
    * No support for new GCC features (large integers or non-power-of-two vectors) is needed.
    * The alignment of the type would be taken from the alignment of the individual vectors, which is correct.
    * It accurately reflects how the loaded value is going to be used.
    * The type uniquely identifies the correct load operation, without need for additional arguments. (This is minor.)

  Cons
    * Gimple registers can't store array values.
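The alignment point for the array-of-vectors option can be checked at the C level. The sketch below uses the GCC/Clang vector_size extension; the typedef names are hypothetical, and this only illustrates the source-level rule that an array inherits its element's alignment. It says nothing about how the middle end would lay the type out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang vector extension: a vector of two uint32_t (8 bytes). */
typedef uint32_t u32x2 __attribute__((vector_size(8)));

/* Hypothetical "array of vectors" types for N = 2, 3 and 4. */
typedef u32x2 u32x2x2[2];
typedef u32x2 u32x2x3[3];
typedef u32x2 u32x2x4[4];

/* An array type has the alignment of its element type, whatever N is. */
static_assert(_Alignof(u32x2x2) == _Alignof(u32x2), "N=2 keeps the vector alignment");
static_assert(_Alignof(u32x2x3) == _Alignof(u32x2), "N=3 keeps the vector alignment");
static_assert(_Alignof(u32x2x4) == _Alignof(u32x2), "N=4 keeps the vector alignment");

/* N=3 needs no power-of-two rounding: the array is exactly three vectors. */
static_assert(sizeof(u32x2x3) == 3 * sizeof(u32x2), "no padding for N=3");

/* Run-time accessor, so the value can also be inspected in a test. */
static size_t array_of_vectors_align(void)
{
    return _Alignof(u32x2x3);
}
```

The actual alignment of u32x2 is target-dependent, which is why the checks compare against _Alignof(u32x2) rather than a literal.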
So I think the only disadvantage of using an array of vectors is that the result can never be a gimple register. But that isn't much of a disadvantage really; the things we care about are the individual vectors, which can of course be treated as gimple registers. I think our tracking of memory values is good enough for combined_vectors to be treated as such (even though, with the back-end changes we talked about earlier, they will actually be stored in RTL registers).
I agree that an array of vectors seems to be the best option here.
So how about the following functions? (Forgive the pascally syntax.)
    __builtin_load_lanes (REF : array N*M of X)
      returns array N of vector M of X
      maps to vldN
      in practice, the result would be used in assignments of the form:
        vectorX = ARRAY_REF <result, X>

    __builtin_store_lanes (VECTORS : array N of vector M of X)
      returns array N*M of X
      maps to vstN
      in practice, the argument would be populated by assignments of the form:
        ARRAY_REF <VECTORS, X> = vectorX

    __builtin_load_lane (REF : array N of X,
                         VECTORS : array N of vector M of X,
                         LANE : integer)
      returns array N of vector M of X
      maps to vldN_lane

    __builtin_store_lane (VECTORS : array N of vector M of X,
                          LANE : integer)
      returns array N of X
      maps to vstN_lane
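As a rough illustration of the store and single-lane signatures above, here is a scalar C model. All the names (model_store_lanes, model_load_lane, model_store_lane and the struct types) and the sizes N = 3, M = 2 are hypothetical, chosen only to mirror the pascally signatures; the structs stand in for arrays passed and returned by value.

```c
#include <assert.h>
#include <stdint.h>

enum { N = 3, M = 2 };  /* hypothetical sizes: vst3 / vld3_lane on two-lane vectors */

typedef struct { uint32_t vec[N][M]; } vectors_t;    /* array N of vector M of X */
typedef struct { uint32_t elem[N * M]; } memory_t;   /* array N*M of X */
typedef struct { uint32_t elem[N]; } structure_t;    /* array N of X */

/* vstN model: lane j of vector i is stored to element j*N + i (interleave). */
static memory_t model_store_lanes(vectors_t v)
{
    memory_t m;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            m.elem[j * N + i] = v.vec[i][j];
    return m;
}

/* vldN_lane model: element i of the structure replaces lane LANE of
   vector i; every other lane is passed through unchanged.  */
static vectors_t model_load_lane(structure_t s, vectors_t v, int lane)
{
    assert(lane >= 0 && lane < M);
    for (int i = 0; i < N; i++)
        v.vec[i][lane] = s.elem[i];
    return v;
}

/* vstN_lane model: lane LANE of vector i becomes element i of the result. */
static structure_t model_store_lane(vectors_t v, int lane)
{
    assert(lane >= 0 && lane < M);
    structure_t s;
    for (int i = 0; i < N; i++)
        s.elem[i] = v.vec[i][lane];
    return s;
}
```

Each function depends only on its arguments, which matches the point that every operation can be expanded on its own.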
How do you distinguish between "multiple structures" and "single structure to all lanes"?
Note that each operation can be expanded independently. The expansion doesn't rely on preceding or following statements.
I've hacked up the prototype below as a proof of concept. It includes changes to the C parser to allow these functions to be created in the original source code. This is throw-away code though; it would never be submitted.
I've also included a simple test case and the output I get from it. The output looks pretty good; there's not even the stray VMOV that I saw with the intrinsics earlier in the week.
(Note that if you'd like to try this yourself, you'll need the patch I posted on Monday as well.)
What do you think? Obviously this discussion needs to move to gcc@ at some point,
Good idea.
Ira
but I wanted to make sure this was vaguely sane first.
Richard
[attachment "lane-functions.patch" deleted by Ira Rosen/Haifa/IBM]
[attachment "test.c" deleted by Ira Rosen/Haifa/IBM]
[attachment "test.s" deleted by Ira Rosen/Haifa/IBM]