Hello,
I've had a look at the mp3player performance regressions (just with *some* data sets) with the vector-alignment patch. Interestingly it turns out that the patch basically does not change the generated code for the hot spot (inv_mdct routine) at all. (The *only* change is which bits of the incoming pointer the run-time alignment check generated by the vectorizer tests for. But this has no practical consequences, since the check itself is not hot, and the *decision* made by the check is the same anyway -- everything is in fact properly aligned at runtime.)
The other difference, outside of code, introduced by the vector-alignment patch is that some global arrays used to be forcibly aligned to 16 bytes by the vectorizer, and they are now only aligned to 8 bytes. To check whether this makes a difference, I've modified the compiler as a hack to always force all global arrays to be 16 byte aligned. And interestingly enough, this appears to fix this particular performance regression ...
Any thoughts as to why this might be the case? What are the recommendations on the ARM hardware side as to what alignment is prefered?
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
-- Dr. Ulrich Weigand | Phone: +49-7031/16-3727 STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E. IBM Deutschland Research & Development GmbH Vorsitzende des Aufsichtsrats: Martina Koederitz | Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht Stuttgart, HRB 243294
On 8 August 2012 16:24, Ulrich Weigand Ulrich.Weigand@de.ibm.com wrote:
Hello,
I've had a look at the mp3player performance regressions (just with *some* data sets) with the vector-alignment patch. Interestingly it turns out that the patch basically does not change the generated code for the hot spot (inv_mdct routine) at all. (The *only* change is which bits of the incoming pointer the run-time alignment check generated by the vectorizer tests for. But this has no practical consequences, since the check itself is not hot, and the *decision* made by the check is the same anyway -- everything is in fact properly aligned at runtime.)
The other difference, outside of code, introduced by the vector-alignment patch is that some global arrays used to be forcibly aligned to 16 bytes by the vectorizer, and they are now only aligned to 8 bytes. To check whether this makes a difference, I've modified the compiler as a hack to always force all global arrays to be 16 byte aligned. And interestingly enough, this appears to fix this particular performance regression ...
Any thoughts as to why this might be the case? What are the recommendations on the ARM hardware side as to what alignment is prefered?
This suggests to me that the data layout changes now mean that some of the loads hit two-cache lines (for 8-byte alignment) as opposed to one (for 16-byte alignment)
The naive comment is that everything should be aligned to 32-bytes for A9 (as that is the A9 cache line size: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388-/CIHGGJA...). However for loads of less than 32-bytes alignment to the load-size is OK (as that shouldn't cross a cache line boundary).
Thanks,
Matt
On 8 August 2012 16:48, Matthew Gretton-Dann matthew.gretton-dann@linaro.org wrote:
On 8 August 2012 16:24, Ulrich Weigand Ulrich.Weigand@de.ibm.com wrote:
Hello,
I've had a look at the mp3player performance regressions (just with *some* data sets) with the vector-alignment patch. Interestingly it turns out that the patch basically does not change the generated code for the hot spot (inv_mdct routine) at all. (The *only* change is which bits of the incoming pointer the run-time alignment check generated by the vectorizer tests for. But this has no practical consequences, since the check itself is not hot, and the *decision* made by the check is the same anyway -- everything is in fact properly aligned at runtime.)
The other difference, outside of code, introduced by the vector-alignment patch is that some global arrays used to be forcibly aligned to 16 bytes by the vectorizer, and they are now only aligned to 8 bytes. To check whether this makes a difference, I've modified the compiler as a hack to always force all global arrays to be 16 byte aligned. And interestingly enough, this appears to fix this particular performance regression ...
Any thoughts as to why this might be the case? What are the recommendations on the ARM hardware side as to what alignment is prefered?
This suggests to me that the data layout changes now mean that some of the loads hit two-cache lines (for 8-byte alignment) as opposed to one (for 16-byte alignment)
Is something doing unaligned 16-byte loads or stores? An unaligned access straddling cache lines can be very bad for performance.
The naive comment is that everything should be aligned to 32-bytes for A9 (as that is the A9 cache line size:
Too aggressively aligning things to 32 bytes will only end up wasting cache space. What you ideally want is to align everything (at least) to the access size while minimising the number of cache lines required.
On 8 August 2012 16:24, Ulrich Weigand Ulrich.Weigand@de.ibm.com wrote:
Hello,
I've had a look at the mp3player performance regressions (just with *some* data sets) with the vector-alignment patch. Interestingly it turns out that the patch basically does not change the generated code for the hot spot (inv_mdct routine) at all. (The *only* change is which bits of the incoming pointer the run-time alignment check generated by the vectorizer tests for. But this has no practical consequences, since the check itself is not hot, and the *decision* made by the check is the same anyway -- everything is in fact properly aligned at runtime.)
The other difference, outside of code, introduced by the vector-alignment patch is that some global arrays used to be forcibly aligned to 16 bytes by the vectorizer, and they are now only aligned to 8 bytes. To check whether this makes a difference, I've modified the compiler as a hack to always force all global arrays to be 16 byte aligned. And interestingly enough, this appears to fix this particular performance regression ...
Are those arrays involved in any of the hot code?
linaro-toolchain@lists.linaro.org