Hi,
In case this is useful in its current (unfinished!) form: here are some notes I made whilst looking at a couple of the items listed for CS308 here:
https://wiki.linaro.org/Internal/Contractors/CodeSourcery
Namely:
* automatic vector size selection (it's currently selected by command line switch)
* also consider ARMv6 SIMD vectors (see CS309)
* mixed size vectors (using the most appropriate size in each case)
* ensure that all gcc vectorizer pattern names are implemented in the machine description (those that can be).
I've not even started on looking at:
* loops with more than two basic blocks (caused by if statements (anything else?))
* use of specialized load instructions
* Conversely, perhaps identify NEON capabilities not covered by GCC patterns, and add them to gcc (e.g. vld2/vld3/vld4 insns)
* any other missed opportunities (identify common idioms and teach the compiler to deal with them)
I'm not likely to have time to restart work on the vectorization study for at least a couple of days, because of other CodeSourcery work. But perhaps the attached will still be useful in the meantime.
Do you (Ira) have access to the ARM ISA docs detailing the NEON instructions?
Cheers,
Julian
On 15/09/10 10:37, Julian Brown wrote:
The "vect256" branch now has a vectorization factor argument for UNITS_PER_SIMD_WORD (allowing selection of different vector sizes). Patches to support that would need backporting to 4.5 if that looks useful. Could investigate the feasibility of doing that.
Backports to 4.5 would indeed be nice, but the target here is to improve vectorization upstream.
Also, the list in the task was just ideas to get started on; there's no reason to limit investigations to that list if it turns out to be incomplete - it's not like it was written with any real effort.
Andrew
Hi,
I need to learn much more about ARM architecture, but I have some initial comments.
Julian Brown julian@codesourcery.com wrote on 15/09/2010 11:37:21 AM:
- automatic vector size selection (it's currently selected by command line switch)
Generally (check this assumption) I think that wider vectors may make inner loops more efficient, but may increase the size of setup/teardown code (setup: increased versioning; teardown: more insns for reduction ops). More importantly, sometimes larger vectors may inhibit vectorization.
We ideally want to calculate costs per vector size, per loop (or per other vectorization opportunity).
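To make the "larger vectors may inhibit vectorization" point concrete, here is a sketch of my own (not from the thread): a loop whose loop-carried dependence distance is 5 can safely be vectorized with 4-element vectors, but not with 8-element ones.

```c
#define N 16

/* Loop-carried dependence with distance 5: each iteration reads a
   value written 5 iterations earlier.  Vectorizing with 4 lanes is
   safe (4 < 5); with 8 lanes a vector store would overlap a later
   vector load, so the wider vector size inhibits vectorization of
   this particular loop. */
void shift_add(int *a)
{
    for (int i = 5; i < N; i++)
        a[i] = a[i - 5] + 1;
}
```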
There is a patch http://gcc.gnu.org/ml/gcc-patches/2010-03/msg00167.html that was not committed to mainline (and I think not to vect256, but I am not sure about that). This patch tries to vectorize for the wider option unless it is impossible because of data dependence constraints.
I agree with that cost model approach.
- ensure that all gcc vectorizer pattern names are implemented in the machine description (those that can be).
In my opinion we had better concentrate on:
- Conversely, perhaps identify NEON capabilities not covered by GCC patterns, and add them to gcc (e.g. vld2/vld3/vld4 insns)
Most of the existing vectorizer patterns were inspired by Altivec's capabilities. I think our approach should originate from the architecture and not the other way around. For example, I don't think we should spend time on implementation of vect_extract_even/odd and vect_interleave_high/low (even though they seem to match VUZP and VZIP), when we have those amazing VLD2/3/4 and VST2/3/4 instructions.
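As an illustration of the kind of kernel the structure loads cover (my example, assumed representative): deinterleaving an RGB byte stream into three planes. NEON's VLD3 performs this load-and-split in a single structure-load instruction per vector, whereas modelling it with generic extract_even/odd-style permutes takes many more operations.

```c
/* Hypothetical kernel: split an interleaved RGB stream into three
   separate planes.  The access pattern (stride 3) maps directly
   onto VLD3 for loads and VST3 for the inverse store. */
void split_rgb(const unsigned char *rgb,
               unsigned char *r, unsigned char *g, unsigned char *b,
               int n)
{
    for (int i = 0; i < n; i++) {
        r[i] = rgb[3 * i];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
}
```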
I've not even started on looking at:
- loops with more than two basic blocks (caused by if statements (anything else?))
What do you mean by that? If-conversion improvements?
Do you (Ira) have access to the ARM ISA docs detailing the NEON instructions?
I have "ARM® Architecture Reference Manual ARM®v7-A and ARM®v7-R edition".
Ira
Cheers,
Julian[attachment "CS308-vectorization-improvements.txt" deleted by Ira Rosen/Haifa/IBM]
Hi Julian,
Here are some thoughts about your report.
Automatic vector size selection/mixed-size vectors
I think we (I) need to cooperate with Richard Guenther: ask him about committing his patch to 4.6 (they are probably planning to merge vect256 into 4.7?), offer help, etc. Looks like the last patch was committed to vect256 in May... What do you think?
I can try to apply his patch and see how it behaves on ARM, once I have access to an ARM board.
Unimplemented GCC vector pattern names
movmisalign<mode>
Implemented by: http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00214.html
Are you waiting for approval from ARM maintainers? Can I help somehow? I think this patch is very important. Without it only aligned accesses can be vectorized.
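For context, a minimal loop of the kind that needs movmisalign<mode> (my sketch, not from the patch itself): nothing is known about the alignment of p, so the vectorizer cannot assume vector-aligned accesses.

```c
/* p may point anywhere inside a larger buffer, so the compiler
   cannot assume vector alignment.  Without movmisalign<mode> the
   vectorizer must peel or version the loop for alignment, or give
   up; with it, unaligned vector loads/stores can be emitted
   directly. */
void add_one(int *p, int n)
{
    for (int i = 0; i < n; i++)
        p[i] += 1;
}
```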
vec_extract_even
(and interleave)
We can add, as a quick solution, those VZIP and VUZP mappings. However, in the long term, I think we need to exploit NEON's strided loads and stores.
sdot_prod<mode>, udot_prod<mode>
dot_prod (va, vb, acc) = { va.V0 * vb.V0 + acc.V0, va.V1 * vb.V1 + acc.V1, ... }, meaning it's a multiply-add, where the elements of acc and of the result are twice the width of those of va and vb. And yes, it is a kind of several parallel dot-product operations, as you wrote. At the end of a vector loop we have a vector of partial results, which we have to reduce to a scalar result in a reduction epilogue.
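A scalar model of the semantics described above (my sketch, not the optab definition; assumed 16-bit inputs with a 32-bit accumulator):

```c
/* Scalar model of [su]dot_prod<mode>: inputs are narrow (16-bit),
   the accumulator and result are twice as wide (32-bit).  A vector
   loop keeps one partial sum per lane and reduces them to a single
   scalar in the reduction epilogue. */
int dot_prod_s16(const short *a, const short *b, int acc, int n)
{
    for (int i = 0; i < n; i++)
        acc += (int)a[i] * (int)b[i];
    return acc;
}
```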
ssum_widen<mode>3, usum_widen<mode>3
Implemented, but called widen_[us]sum<mode>3. Doc or code bug? (Doc, I think.)
This is how it is mapped in genopinit.c:
"set_optab_handler (ssum_widen_optab, $A, CODE_FOR_$(widen_ssum$I$a3$))",
"set_optab_handler (usum_widen_optab, $A, CODE_FOR_$(widen_usum$I$a3$))",
So, it is implemented.
vec_pack_trunc_<mode>
Not implemented. ARM have a patch:
This is implemented in neon.md.
vec_pack_ssat_<mode>, vec_pack_usat_<mode>
Not implemented (probably easy). VQMOVN. (VQMOVUN wouldn't be needed).
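A scalar model of what vec_pack_ssat would do per element (my sketch): narrow a 32-bit value to 16 bits with signed saturation, which is exactly the lane operation of VQMOVN.

```c
/* Per-lane operation of a signed saturating pack (VQMOVN.S32):
   clamp a 32-bit value into the 16-bit signed range, then narrow. */
short sat_narrow_s16(int x)
{
    if (x > 32767)
        return 32767;
    if (x < -32768)
        return -32768;
    return (short)x;
}
```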
The only target that implements that is mips, so I am not sure it is used/needed.
vec_widen_[us]mult_{hi,lo}_<mode>
This is used for widening multiplication:
int a[N]; short b[N], c[N];
for (i = 0; i < N; i++)
  a[i] = b[i] * c[i];
which gets vectorized as follows:
vector int v0, v1;
for each group of 8 elements:
  v0 = vec_widen_smult_hi (b[8i:8i+7], c[8i:8i+7]);
  v1 = vec_widen_smult_lo (b[8i:8i+7], c[8i:8i+7]);
  a[8i:8i+3] = v0;
  a[8i+4:8i+7] = v1;
I think, on NEON we can just use VMULL (and one store) to do this. But, of course, it requires support on the vectorizer side, including probably multiple vector size support, unless it can be abstracted out somehow...
(After writing that, I checked neon.md ;) and these two are actually there, implemented with two instructions each, if I read it correctly. So with the current implementation we need 6 instructions instead of, hopefully, only 2.)
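For reference, a runnable scalar version of the widening-multiply loop discussed above (assumed types: 16-bit inputs, 32-bit products), which VMULL computes for a whole vector of lanes at once:

```c
/* Widening multiply: products are computed at twice the input
   width, so values like 300 * 300 that overflow a short are kept
   exactly in the int result. */
void widen_mult(int *a, const short *b, const short *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = (int)b[i] * (int)c[i];
}
```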
vec_unpack[su]_{hi,lo}_<mode>
Not implemented. (Do ARM have a patch for this one?)
I see them in neon.md.
NEON capabilities not covered by the vectorizer
I would start from a typical benchmark and see what features are required.
The goal is a 15% speed improvement in EEMBC relative to FSF GCC 4.5.0
Does this mean to improve a single benchmark from EEMBC by 15%? Do you have an EEMBC? I have a very old version, without DENBench, which looks interesting according to EEMBC's site. Other than that TeleBench and Consumer might have vectorization potential.
We have holidays till October 3, so probably I will not be able to respond until then.
Thanks, Ira
On 22/09/10 12:23, Ira Rosen wrote:
The goal is a 15% speed improvement in EEMBC relative to FSF GCC 4.5.0
Does this mean to improve a single benchmark from EEMBC by 15%? Do you have an EEMBC? I have a very old version, without DENBench, which looks interesting according to EEMBC's site. Other than that TeleBench and Consumer might have vectorization potential.
I don't think this is well defined, so, I guess that means we get to twist it any way we like. ;)
Seriously though, I would argue that it should be 15% on average, across all the EEMBC tests, or at least across all the tests where vectorization makes sense.
The exact "15%" figure is not based on any hard facts, so we don't need to get too precise about it.
Andrew
linaro-toolchain@lists.linaro.org