Hi,
> Anyway, I think this explains why the non-SMS loop executes more quickly than GCC expects, and why the SMS loop is slower than it needs to be. It might be worth comparing the two loops with -mtune=cortex-a8.
Thanks for the detailed explanation!
I see this regression on cortex-a8 as well. Also, the SMS dumps still show a delay of 9 between the accumulators when running with -mtune=cortex-a8 -mcpu=cortex-a8.
Thanks, Revital