Re: Vector-alignment patch performance regressions

8 Aug 2012


On 8 August 2012 16:24, Ulrich Weigand Ulrich.Weigand@de.ibm.com wrote:
...
> Hello,
>
> I've had a look at the mp3player performance regressions (just with *some*
> data sets) with the vector-alignment patch.  Interestingly it turns out
> that the patch basically does not change the generated code for the hot
> spot (inv_mdct routine) at all.  (The *only* change is which bits of the
> incoming pointer the run-time alignment check generated by the vectorizer
> tests for.  But this has no practical consequences, since the check itself
> is not hot, and the *decision* made by the check is the same anyway --
> everything is in fact properly aligned at runtime.)
>
> The other difference, outside of code, introduced by the vector-alignment
> patch is that some global arrays used to be forcibly aligned to 16 bytes by
> the vectorizer, and they are now only aligned to 8 bytes.  To check whether
> this makes a difference, I've modified the compiler as a hack to always
> force all global arrays to be 16-byte aligned.  And interestingly enough,
> this appears to fix this particular performance regression ...
>
> Any thoughts as to why this might be the case?  What are the
> recommendations on the ARM hardware side as to what alignment is preferred?
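
For illustration only, here is a minimal C sketch of the two things described above: a run-time check that tests the low bits of an incoming pointer, and a global array whose alignment is forced to 16 bytes. The names `samples` and `is_aligned` are hypothetical, the array size is arbitrary, and the `aligned` attribute is the GCC extension. This is not the code the vectorizer actually emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical global array, forced to 16-byte alignment with the GCC
   `aligned` attribute (similar in effect to the compiler hack, which
   does this for every global array). */
float samples[36] __attribute__((aligned(16)));

/* Return 1 if p is a multiple of `alignment` bytes, else 0.
   `alignment` must be a power of two, so the test reduces to masking
   the low bits of the address. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

Checking for 16-byte rather than 8-byte alignment only changes the mask from 7 to 15, so a pointer that is 16-byte aligned passes both tests.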

This suggests to me that the data layout changes now mean that some of
the loads hit two cache lines (with 8-byte alignment) as opposed to one
(with 16-byte alignment).

The naive comment is that everything should be aligned to 32 bytes for
A9 (as that is the A9 cache line size:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388-/CIHGGJA...).
However, for loads of less than 32 bytes, alignment to the load size is
OK (as such a load shouldn't cross a cache line boundary).
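
To make the cache line argument concrete, here is a small C sketch that reports whether an access touches more than one line. It assumes the 32-byte line size mentioned above; `CACHE_LINE_BYTES` and `crosses_cache_line` are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cache line size assumed for Cortex-A9, per the discussion above. */
#define CACHE_LINE_BYTES 32u

/* Return 1 if an access of `size` bytes starting at `addr` touches more
   than one cache line, else 0. */
int crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0);
    uintptr_t first_line = addr / CACHE_LINE_BYTES;
    uintptr_t last_line = (addr + size - 1) / CACHE_LINE_BYTES;
    return first_line != last_line;
}
```

For example, a 16-byte load at offset 24 (only 8-byte aligned) covers bytes 24 to 39 and so touches two lines, whereas the same load at offset 16 stays within one.
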
Thanks,
Matt
-- 
Matthew Gretton-Dann
Linaro Toolchain Working Group
matthew.gretton-dann@linaro.org
