Hello All,
I found a difference between gcc-linaro-5.1 and gcc-linaro-4.8 while running the lmbench benchmark on our LS1043 (Cortex-A53). gcc-linaro-4.8 generates Advanced SIMD instructions (such as ld1), but gcc-linaro-5.1 does not. This causes a large performance gap between gcc-4.8 and gcc-5.1 in the lmbench memory bandwidth "fcp" test (the bw_mem program).
My compiler flags are "-O3 -mcpu=cortex-a53". I also tried several other flag combinations ("-O3 -mcpu=cortex-a53+fp+simd", "-O2 -ftree-vectorize -mcpu=cortex-a53", "-O3 -ftree-vectorize -mcpu=cortex-a53"); none of them made a difference.
The gcc-5.1 toolchain was downloaded from the following link:
https://snapshots.linaro.org/openembedded/sources/gcc-linaro-5.1-snapshot-2015.06-1-x86_64_aarch64-linux-gnu.tar.xz
Can I have your comments on this?
Thanks Ron
Hello,
I'm not sure from the information below whether you have observed a performance gap, or are expecting to observe one. Have you seen a performance gap?
Regards,
Bernie
On 5 January 2016 at 10:29, Xiaofeng Ren xiaofeng.ren@nxp.com wrote:
linaro-toolchain mailing list linaro-toolchain@lists.linaro.org https://lists.linaro.org/mailman/listinfo/linaro-toolchain
Hello Bernie, Thanks for your quick response.
Yes, I observed a performance gap. The following data is what I measured on our LS1043A platform:
fcp for L1 cache with gcc-4.8: 5196.12 MB/s
fcp for L1 cache with gcc-5.1: 2983.11 MB/s
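For scale, the two figures above put gcc-5.1 at roughly 57% of the gcc-4.8 bandwidth (gcc-4.8 is about 1.74x faster). A minimal C sketch of that arithmetic follows; `bandwidth_ratio` is a hypothetical helper name used only for illustration and is not part of lmbench.

```c
#include <assert.h>

/* Ratio of the newer compiler's bandwidth to the older compiler's.
 * Hypothetical helper for illustration; not an lmbench function. */
double bandwidth_ratio(double new_mb_s, double old_mb_s)
{
    assert(old_mb_s > 0.0);
    return new_mb_s / old_mb_s;
}
```

Calling `bandwidth_ratio(2983.11, 5196.12)` with the numbers reported above gives a value just over 0.574.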
The following is part of the assembly code for the fcp function:
Gcc-5.1:
  40110c: 3dc00c6c ldr q12, [x3,#48]
  401110: 3dc0106b ldr q11, [x3,#64]
  401114: 3dc0146a ldr q10, [x3,#80]
  401118: 3dc01869 ldr q9, [x3,#96]
  40111c: 3dc01c68 ldr q8, [x3,#112]
  401120: 3dc0207f ldr q31, [x3,#128]
  401124: 3dc0247e ldr q30, [x3,#144]
  401128: 3dc0287d ldr q29, [x3,#160]
  40112c: 3dc02c7c ldr q28, [x3,#176]
  401130: 3dc0307b ldr q27, [x3,#192]
  401134: 3dc0347a ldr q26, [x3,#208]
  401138: 3dc03879 ldr q25, [x3,#224]
  40113c: 3dc03c78 ldr q24, [x3,#240]
  401140: 3dc04077 ldr q23, [x3,#256]
  401144: 3dc04476 ldr q22, [x3,#272]
  401148: 3dc04875 ldr q21, [x3,#288]
  40114c: 3dc04c74 ldr q20, [x3,#304]
  401150: 3dc05073 ldr q19, [x3,#320]
  401154: 3dc05472 ldr q18, [x3,#336]
  401158: 3dc05871 ldr q17, [x3,#352]
  40115c: 3dc05c70 ldr q16, [x3,#368]
  401160: 3dc06067 ldr q7, [x3,#384]
  401164: 3dc06466 ldr q6, [x3,#400]
  401168: 3dc06865 ldr q5, [x3,#416]
  40116c: 3dc06c64 ldr q4, [x3,#432]
  401170: 3dc07063 ldr q3, [x3,#448]
  401174: 3dc07462 ldr q2, [x3,#464]
  401178: 3dc07861 ldr q1, [x3,#480]
  40117c: 3dc07c60 ldr q0, [x3,#496]
  401180: 3dc0006f ldr q15, [x3]
  401184: 91080063 add x3, x3, #0x200
Gcc-4.8:
  40135c: 4cdf78af ld1 {v15.4s}, [x5], #16
  401360: 4c40790d ld1 {v13.4s}, [x8]
  401364: 4c4078ae ld1 {v14.4s}, [x5]
  401368: 9100c048 add x8, x2, #0x30
  40136c: 91010045 add x5, x2, #0x40
  401370: 4c40790c ld1 {v12.4s}, [x8]
  401374: 4c4078ab ld1 {v11.4s}, [x5]
  401378: 91014048 add x8, x2, #0x50
  40137c: 91018045 add x5, x2, #0x60
  401380: 4c40790a ld1 {v10.4s}, [x8]
  401384: 4c4078a9 ld1 {v9.4s}, [x5]
  401388: 9101c048 add x8, x2, #0x70
  40138c: 91020045 add x5, x2, #0x80
  401390: 4c407908 ld1 {v8.4s}, [x8]
  401394: 4c4078bf ld1 {v31.4s}, [x5]
  401398: 91024048 add x8, x2, #0x90
  40139c: 91028045 add x5, x2, #0xa0
  4013a0: 4c40791e ld1 {v30.4s}, [x8]
  4013a4: 4c4078bd ld1 {v29.4s}, [x5]
  4013a8: 9102c048 add x8, x2, #0xb0
  4013ac: 91030045 add x5, x2, #0xc0
Best Regards Ron
-----Original Message----- From: Bernie Ogden [mailto:bernie.ogden@linaro.org] Sent: Tuesday, January 05, 2016 6:36 PM To: Xiaofeng Ren xiaofeng.ren@nxp.com Cc: linaro-toolchain@lists.linaro.org Subject: Re: gcc-linaro-5.1 vs gcc-linaro-4.8
Hi Ron,
Is it possible to create a compilable testcase with "fcp" so that we can reproduce the above? It need not be an executable test-case.
Thanks, Kugan
Hello Kugan, Thanks a lot for your support.
I have attached the source code and the corresponding assembly generated by gcc-4.8 and gcc-5.1. The compiler flag is "-O3".
Best Regards Ron
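For readers without the attachment, a compilable stand-in for the kind of testcase requested can be sketched as a manually unrolled word copy. This is only an assumption about the general shape of an fcp-style loop, not the bw_mem source; `fcp_like` and `UNROLL` are hypothetical names, and the unroll factor of 8 is chosen purely for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define UNROLL 8 /* hypothetical unroll factor, chosen for brevity */

/* Copy n ints from src to dst with a manually unrolled body.
 * n must be a multiple of UNROLL. This is an illustrative stand-in,
 * not the lmbench implementation. */
void fcp_like(int *restrict dst, const int *restrict src, size_t n)
{
    size_t i;

    assert(n % UNROLL == 0);
    for (i = 0; i < n; i += UNROLL) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
        dst[i + 4] = src[i + 4];
        dst[i + 5] = src[i + 5];
        dst[i + 6] = src[i + 6];
        dst[i + 7] = src[i + 7];
    }
}
```

Compiling a file like this to assembly with each toolchain (for example with "-O3 -mcpu=cortex-a53 -S", plus "-std=gnu99" on gcc-4.8 because of the `restrict` keyword) would allow the generated loads and stores to be compared side by side without building all of lmbench.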
-----Original Message----- From: Kugan [mailto:kugan.vivekanandarajah@linaro.org] Sent: Tuesday, January 05, 2016 6:51 PM To: Xiaofeng Ren xiaofeng.ren@nxp.com; Bernie Ogden bernie.ogden@linaro.org Cc: linaro-toolchain@lists.linaro.org Subject: Re: gcc-linaro-5.1 vs gcc-linaro-4.8
On Tue, Jan 5, 2016 at 5:52 AM, Xiaofeng Ren xiaofeng.ren@nxp.com wrote:
Gcc-5.1: 40110c: 3dc00c6c ldr q12, [x3,#48]
Gcc-4.8: 40135c: 4cdf78af ld1 {v15.4s}, [x5], #16
The ld1 and ldr instructions are effectively equivalent: both load 16-byte values into fp/simd registers.
I do see a difference in the scheduling, though. The gcc-4.8 output has a series of shift/add/store instructions, while the gcc-5.1 output has a series of shift instructions followed by a series of store instructions. The gcc-5.1 ordering will serialize the code, because these are simd shifts, which can only execute one at a time, and stores can also only execute one at a time.
gcc-4.8 has no cortex-a53 pipeline description, so we appear to be getting good code by accident. gcc-5.1 has a cortex-a53 scheduler, but it doesn't handle simd instructions, so it isn't scheduling them correctly. A change added in November, https://gcc.gnu.org/ml/gcc-patches/2015-10/msg00025.html, adds a new a53 pipeline description that does handle simd instructions. With current sources, I see some shifts, then alternating shifts and stores, and then the last of the stores. This should give better performance than the gcc-5.1 code, though I haven't tested it on hardware.
Jim
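The serialization argument above can be illustrated with a toy issue model. The sketch below assumes an in-order core that can issue two adjacent instructions per cycle only when they need different units, with one simd shift unit and one store unit. It ignores latencies and data dependencies entirely, so it is not a description of the real Cortex-A53 pipeline or of GCC's scheduler; all names (`toy_cycles`, `fill_grouped`, `fill_interleaved`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Two instruction kinds, each assumed to have a single execution unit. */
enum op { OP_SHIFT, OP_STORE };

/* Fill prog with n/2 shifts followed by stores (the grouped order). */
void fill_grouped(enum op *prog, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        prog[i] = (i < n / 2) ? OP_SHIFT : OP_STORE;
}

/* Fill prog with alternating shift, store, shift, store, ... */
void fill_interleaved(enum op *prog, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        prog[i] = (i % 2 == 0) ? OP_SHIFT : OP_STORE;
}

/*
 * Toy in-order dual-issue model: each cycle issues the next instruction,
 * and also the one after it if that one needs the other unit.
 * Latencies and data dependencies are deliberately ignored.
 */
unsigned toy_cycles(const enum op *prog, size_t n)
{
    unsigned cycles = 0;
    size_t i = 0;

    assert(prog != NULL || n == 0);
    while (i < n) {
        if (i + 1 < n && prog[i + 1] != prog[i])
            i += 2; /* different units: the pair issues together */
        else
            i += 1; /* same unit (or last instruction): single issue */
        cycles++;
    }
    return cycles;
}
```

For 8 instructions, the grouped order takes 7 cycles in this model and the interleaved order takes 4, which is the kind of gap a scheduler that understands the simd units would be expected to close.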
Hello Jim, I appreciate your comments. I will try to apply that patch manually on my side and test it. By the way, may I know which released Linaro gcc version includes that patch (https://gcc.gnu.org/ml/gcc-patches/2015-10/msg00025.html)? Maybe I can download it and try it quickly.
Best Regards Ron
-----Original Message----- From: Jim Wilson [mailto:jim.wilson@linaro.org] Sent: Wednesday, January 06, 2016 7:49 AM To: Xiaofeng Ren xiaofeng.ren@nxp.com Cc: Kugan kugan.vivekanandarajah@linaro.org; Bernie Ogden bernie.ogden@linaro.org; linaro-toolchain@lists.linaro.org Subject: Re: gcc-linaro-5.1 vs gcc-linaro-4.8
On Tue, Jan 5, 2016 at 4:19 PM, Xiaofeng Ren xiaofeng.ren@nxp.com wrote:
Hello Jim, Appreciate for your comments. I will try to manually apply that patch on my side and try it. BTW, may I know which released Linaro gcc version include that patch? Maybe I can download it and try it quickly. https://gcc.gnu.org/ml/gcc-patches/2015-10/msg00025.html
Yvan backported it to our gcc-5 branch on Nov 24, which was after the latest release, 2015-11, was made. The patch is in the December snapshot, but I think that is a source-only release: http://snapshots.linaro.org/components/toolchain/gcc-linaro/5.3-2015.12/ You would have to build your own toolchain from that, perhaps by using abe.
Jim
Jim, Thanks a lot for your clarification.
Best Regards Ron
-----Original Message----- From: Jim Wilson [mailto:jim.wilson@linaro.org] Sent: Wednesday, January 06, 2016 10:45 AM To: Xiaofeng Ren xiaofeng.ren@nxp.com Cc: Kugan kugan.vivekanandarajah@linaro.org; Bernie Ogden bernie.ogden@linaro.org; linaro-toolchain@lists.linaro.org; Zhenhua Luo zhenhua.luo@nxp.com Subject: Re: gcc-linaro-5.1 vs gcc-linaro-4.8