Hi Julian,
Here are some thoughts about your report.
Automatic vector size selection/mixed-size vectors
I think we (I) need to cooperate with Richard Guenther: ask him about committing his patch to 4.6 (they are probably planning to merge vect256 into 4.7?), offer help, etc. Looks like the last patch was committed to vect256 in May... What do you think?
I can try to apply his patch and see how it behaves on ARM, once I have access to an ARM board.
Unimplemented GCC vector pattern names
movmisalign<mode>
Implemented by: http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00214.html
Are you waiting for approval from ARM maintainers? Can I help somehow? I think this patch is very important. Without it only aligned accesses can be vectorized.
vec_extract_even
(and interleave)
We can add, as a quick solution, those VZIP and VUZP mappings. However, in the long term, I think we need to exploit NEON's strided loads and stores.
sdot_prod<mode>, udot_prod<mode>
dot_prod (va, vb, acc) = { va.V0 * vb.V0 + acc.V0, va.V1 * vb.V1 + acc.V1, ... }, meaning it's a multiply-add where the elements of acc and of the result are twice the width of the elements of va and vb. And yes, it is a kind of several parallel dot-product operations, as you wrote. At the end of a vector loop we have a vector of partial results, which we have to reduce to a scalar result in a reduction epilogue.
ssum_widen<mode>3, usum_widen<mode>3
Implemented, but called widen_[us]sum<mode>3. Doc or code bug? (Doc, I think.)
This is how it is mapped in genopinit.c:
"set_optab_handler (ssum_widen_optab, $A, CODE_FOR_$(widen_ssum$I$a3$))",
"set_optab_handler (usum_widen_optab, $A, CODE_FOR_$(widen_usum$I$a3$))",
So, it is implemented.
vec_pack_trunc_<mode>
Not implemented. ARM have a patch:
This is implemented in neon.md.
vec_pack_ssat_<mode>, vec_pack_usat_<mode>
Not implemented (probably easy). VQMOVN. (VQMOVUN wouldn't be needed).
The only target that implements these is MIPS, so I am not sure they are used/needed.
vec_widen_[us]mult_{hi,lo}_<mode>
This is used for widening multiplication:
int a[N]; short b[N], c[N];
for i a[i] = b[i] * c[i]
which gets vectorized along the following lines:
vector int v0, v1;
for i
  v0 = vec_widen_smult_hi (b[8i:8i+7], c[8i:8i+7]);
  v1 = vec_widen_smult_lo (b[8i:8i+7], c[8i:8i+7]);
  a[8i:8i+3] = v0;
  a[8i+4:8i+7] = v1;
I think on NEON we can just use VMULL (and one store) to do this. But, of course, it requires support on the vectorizer side, probably including multiple vector size support, unless it can be abstracted out somehow...
(After writing that, I checked neon.md ;) and these two are actually there, implemented with two instructions each, if I read it correctly. So with the current implementation we need 6 instructions instead of, hopefully, only 2.)
vec_unpack[su]_{hi,lo}_<mode>
Not implemented. (Do ARM have a patch for this one?)
I see them in neon.md.
NEON capabilities not covered by the vectorizer
I would start from a typical benchmark and see what features are required.
The goal is a 15% speed improvement in EEMBC relative to FSF GCC 4.5.0
Does this mean improving a single benchmark from EEMBC by 15%? Do you have EEMBC? I have a very old version, without DENBench, which looks interesting according to EEMBC's site. Other than that, TeleBench and Consumer might have vectorization potential.
We have holidays until October 3, so I will probably not be able to respond until then.
Thanks, Ira