On Tue, Jan 5, 2016 at 5:52 AM, Xiaofeng Ren xiaofeng.ren@nxp.com wrote:
Gcc-5.1: 40110c: 3dc00c6c ldr q12, [x3,#48]
Gcc-4.8: 40135c: 4cdf78af ld1 {v15.4s}, [x5], #16
The ld1 and ldr instructions are effectively equivalent; both load a 16-byte value into an fp/simd register.
I do see a difference in the scheduling, though. The gcc-4.8 output has a series of shift/add/store instructions, while the gcc-5.1 output has a series of shift instructions followed by a series of store instructions. The gcc-5.1 ordering serializes the code: these are simd shifts, which can only execute one at a time, and stores can also only execute one at a time.

gcc-4.8 has no cortex-a53 pipeline description, so we appear to be getting good code by accident. gcc-5.1 has a cortex-a53 scheduler, but it doesn't handle simd instructions, so it isn't scheduling them correctly.

I see that there was a change added in November https://gcc.gnu.org/ml/gcc-patches/2015-10/msg00025.html that adds a new cortex-a53 pipeline description, and this one does handle simd instructions. With current sources, I see some shifts, then alternating shifts and stores, and then the last of the stores. This should give better performance than the gcc-5.1 code, but I haven't tried testing it on hardware.
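To illustrate the scheduling argument above, here is a rough toy model in Python. It is only a sketch under stated assumptions, not a Cortex-A53 simulation: it assumes an in-order core that issues at most two consecutive instructions per cycle, with at most one simd shift and one store per cycle, and it ignores latencies and data dependencies entirely. The function name count_cycles and the three instruction sequences are made up for the illustration and are not taken from the actual compiler output.

```python
def count_cycles(seq, width=2):
    """Count issue cycles for an in-order toy core.

    Assumptions (hypothetical, for illustration only):
    - at most `width` consecutive instructions issue per cycle
    - two instructions of the same kind cannot issue in the same cycle
    - latencies and data dependencies are ignored
    """
    cycles = 0
    i = 0
    while i < len(seq):
        used = set()   # instruction kinds already issued this cycle
        issued = 0
        # Issue in program order until the width or a busy unit stops us.
        while i < len(seq) and issued < width and seq[i] not in used:
            used.add(seq[i])
            issued += 1
            i += 1
        cycles += 1
    return cycles


# Shifts first, then stores (the shape described for gcc-5.1).
grouped = ["shift"] * 4 + ["store"] * 4

# Strictly alternating shifts and stores.
interleaved = ["shift", "store"] * 4

# Some shifts, alternating shifts and stores, then the last store
# (the shape described for current sources).
staggered = ["shift", "shift", "store", "shift", "store",
             "shift", "store", "store"]

print(count_cycles(grouped))      # 7
print(count_cycles(interleaved))  # 4
print(count_cycles(staggered))    # 5
```

In this model the grouped ordering can pair instructions only at the boundary between the shifts and the stores, while the mixed orderings pair a shift with a store in most cycles. Real hardware numbers will differ, since the model leaves out everything except the one-per-cycle restriction.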
Jim