"Singh, Ravi Kumar (Ravi)" Ravi.Singh@lsi.com wrote:
None of the generated code contains the NEON instructions. Code generated with case 1 is taking 3000 cycles, and code generated by option 2 is taking 2500 cycles.
Even if vectorization failed in case 1, it should not generate less efficient code than case 2. My belief was that the executables from both would take the same number of cycles, and that anything done for an unsuccessful vectorization attempt would be reverted if it did not succeed.
I suspect the reason vectorization fails is the direct reference to the loop counter in the inner loop: index = j;
After vectorization, the loop counter is no longer available, so code that accesses it as in your example usually cannot be automatically vectorized.
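To make the pattern concrete, here is a minimal sketch of the kind of loop I mean. It is not your actual test case: the function name row_argmax, the row-major data layout, and the parameter names are all assumptions made for illustration. The point is only the assignment of the inner loop counter j to index.

```c
#include <stddef.h>

/* Hypothetical stand-in for the loop under discussion (not the original
 * test case): for each row, record the column of the largest element.
 * Assumes cols >= 1 and row-major storage. */
void row_argmax(const int *data, size_t rows, size_t cols, size_t *out)
{
    for (size_t i = 0; i < rows; i++) {
        int best = data[i * cols];
        size_t index = 0;

        for (size_t j = 1; j < cols; j++) {
            if (data[i * cols + j] > best) {
                best = data[i * cols + j];
                index = j;   /* direct reference to the loop counter */
            }
        }
        out[i] = index;
    }
}
```

Once several iterations are processed per vector operation, there is no single scalar j left to copy into index, which is why a loop of this shape tends to be rejected.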
As to why -ftree-vectorize still generates different code, that is probably because the flag actually enables two other optimizations that are distinct from the vectorizer, but usually enable it to do a better job: if-conversion and store-sinking.
I suspect in your test case, if-conversion actually transforms the if in the inner loop. However, if the result is then still not vectorizable, that transformation might happen to be a net loss ...
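As a rough source-level illustration of what these two transformations aim for (the compiler works on its internal representation, not on C source, and the function names here are made up for the example), compare a loop that stores on both sides of a branch with a hand-converted version that performs a single unconditional store:

```c
#include <stddef.h>

/* Branchy form: each iteration stores through one of two paths. */
void clamp_branchy(const int *in, int *out, size_t n, int limit)
{
    for (size_t j = 0; j < n; j++) {
        if (in[j] > limit)
            out[j] = limit;
        else
            out[j] = in[j];
    }
}

/* Hand-written sketch of the intended result: the two stores are sunk
 * into one, and the branch becomes a select of the value to store. */
void clamp_converted(const int *in, int *out, size_t n, int limit)
{
    for (size_t j = 0; j < n; j++) {
        int v = in[j];
        out[j] = (v > limit) ? limit : v;
    }
}
```

The second form is what the vectorizer wants to see. If the loop still cannot be vectorized afterwards, though, the branch-free scalar code may or may not be faster than the original branch on a given core.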
You can switch off those extra transformations while still enabling vectorization using something like: -ftree-vectorize -fno-tree-loop-if-convert --param max-stores-to-sink=0
(Note that this might cause some loops that would otherwise have been vectorized to become non-vectorizable ...)
Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Chair of the Supervisory Board: Martina Koederitz | Management: Dirk Wittkopp
Registered office: Böblingen | Court of registration: Amtsgericht Stuttgart, HRB 243294