Hi Yupeng,
Great testcase, thanks!
I've investigated this, and there are two separate changes between GCC 7 and GCC 8 each causing half of the regression.
The first regression is due to the compiler making unlucky decisions. Before the regression the compiler just got lucky, and I'll look into bringing these lucky decisions back.
The second regression is due to changed tuning of the compiler. The new setting is better on average, but this testcase happens to regress. I don't know whether we'll manage to fix that.
-- Maxim Kuvyrkov www.linaro.org
On Aug 21, 2019, at 9:55 AM, Yupeng Chang changyp6@gmail.com wrote:
Hi Maxim,
Attached is the testcase.
Please follow these steps to test:
- download GCC 8.3 from: https://developer.arm.com/-/media/Files/downloads/gnu-a/8.3-2019.03/binrel/g...
- download GCC 7.4 from: https://releases.linaro.org/components/toolchain/binaries/latest-7/aarch64-l...
- extract both toolchains into /usr/local
- extract the attached package arm-performance.tar.xz into your Linux machine's home folder
- cd into arm-performance and run "make" to generate the test program and the ASM code dumped from the .o file
On Tue, Aug 20, 2019 at 9:15 PM Maxim Kuvyrkov maxim.kuvyrkov@linaro.org wrote:
Hi Yupeng,
There are many changes from Linaro GCC 7.x to ARM GCC 8.x, so it is difficult to guess what may be going wrong.
Do you have a testcase that you can share? With a testcase we can investigate the problem and, possibly, fix it.
Regards,
-- Maxim Kuvyrkov www.linaro.org
On Aug 19, 2019, at 7:14 AM, Yupeng Chang changyp6@gmail.com wrote:
Dear Linaro Team,
I recently found a very strange code performance issue. I have a loop written with GCC NEON intrinsics. The binary generated from this code by Linaro GCC 7.x is much faster than the one generated by ARM GCC 8.x.
My CPU is an ARM Cortex-A53 (AArch64). The compile options are: -Wall -O3 -mcpu=cortex-a53+crypto
The code looks like this:

    for (uint32 c = 0; c < channels; c += 16, roi_result += 16) {
        int32x4_t S1, S2, S3, S4;
        int16x4_t DT;

        DT = vld1_s16(feature1 + c + 0);
        S1 = vmull_lane_s16(DT, SZ, 0);
        DT = vld1_s16(feature1 + c + 4);
        S2 = vmull_lane_s16(DT, SZ, 0);
        DT = vld1_s16(feature1 + c + 8);
        S3 = vmull_lane_s16(DT, SZ, 0);
        DT = vld1_s16(feature1 + c + 12);
        S4 = vmull_lane_s16(DT, SZ, 0);

        DT = vld1_s16(feature2 + c + 0);
        S1 = vmlal_lane_s16(S1, DT, SZ, 1);
        DT = vld1_s16(feature2 + c + 4);
        S2 = vmlal_lane_s16(S2, DT, SZ, 1);
        DT = vld1_s16(feature2 + c + 8);
        S3 = vmlal_lane_s16(S3, DT, SZ, 1);
        DT = vld1_s16(feature2 + c + 12);
        S4 = vmlal_lane_s16(S4, DT, SZ, 1);

        DT = vld1_s16(feature3 + c + 0);
        S1 = vmlal_lane_s16(S1, DT, SZ, 2);
        DT = vld1_s16(feature3 + c + 4);
        S2 = vmlal_lane_s16(S2, DT, SZ, 2);
        DT = vld1_s16(feature3 + c + 8);
        S3 = vmlal_lane_s16(S3, DT, SZ, 2);
        DT = vld1_s16(feature3 + c + 12);
        S4 = vmlal_lane_s16(S4, DT, SZ, 2);

        DT = vld1_s16(feature4 + c + 0);
        S1 = vmlal_lane_s16(S1, DT, SZ, 3);
        DT = vld1_s16(feature4 + c + 4);
        S2 = vmlal_lane_s16(S2, DT, SZ, 3);
        DT = vld1_s16(feature4 + c + 8);
        S3 = vmlal_lane_s16(S3, DT, SZ, 3);
        DT = vld1_s16(feature4 + c + 12);
        S4 = vmlal_lane_s16(S4, DT, SZ, 3);

        DT = vrshrn_n_s32(S1, Q_VALUE);
        vst1_s16(roi_result + 0, DT);
        DT = vrshrn_n_s32(S2, Q_VALUE);
        vst1_s16(roi_result + 4, DT);
        DT = vrshrn_n_s32(S3, Q_VALUE);
        vst1_s16(roi_result + 8, DT);
        DT = vrshrn_n_s32(S4, Q_VALUE);
        vst1_s16(roi_result + 12, DT);
    }
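For anyone comparing the two builds, a scalar reference of what the loop computes per channel can help check that both binaries produce the same output. The following is only a sketch under assumptions: the function name roi_blend_ref and the sz[] array (standing in for the four lanes of SZ) are hypothetical, and Q_VALUE is taken to be 13 because the disassembly shows "rshrn ... #13".

```c
#include <assert.h>
#include <stdint.h>

#define Q_VALUE 13 /* assumption: matches the "#13" in the rshrn disassembly */

/* Scalar model of one lane of vrshrn_n_s32: add 2^(q-1), shift right,
 * keep the low 16 bits (no saturation). Relies on arithmetic right shift
 * of negative values, which is what GCC provides. */
static int16_t rshrn_s32(int32_t acc, int q)
{
    assert(q >= 1 && q <= 16); /* valid shift range for the s32 -> s16 form */
    int64_t rounded = ((int64_t)acc + ((int64_t)1 << (q - 1))) >> q;
    return (int16_t)(uint16_t)rounded;
}

/* Hypothetical scalar reference: one output per channel instead of 16 at a
 * time. sz[0..3] stands in for the four 16-bit lanes of SZ. */
void roi_blend_ref(const int16_t *feature1, const int16_t *feature2,
                   const int16_t *feature3, const int16_t *feature4,
                   const int16_t sz[4], int16_t *roi_result,
                   uint32_t channels)
{
    for (uint32_t c = 0; c < channels; ++c) {
        /* Accumulate in unsigned so overflow wraps like the 32-bit
         * smull/smlal lanes rather than being undefined behaviour. */
        uint32_t acc = (uint32_t)((int32_t)feature1[c] * sz[0]);
        acc += (uint32_t)((int32_t)feature2[c] * sz[1]);
        acc += (uint32_t)((int32_t)feature3[c] * sz[2]);
        acc += (uint32_t)((int32_t)feature4[c] * sz[3]);
        roi_result[c] = rshrn_s32((int32_t)acc, Q_VALUE);
    }
}
```

Running this over the same inputs as the NEON loop and comparing roi_result element by element should show whether a given build changes results or only timing.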
Code generated by GCC7:

    294: 6b10031f cmp w24, w16
    298: fc606959 ldr d25, [x10, x0]
    29c: fc686922 ldr d2, [x9, x8]
    2a0: fc676921 ldr d1, [x9, x7]
    2a4: fc666920 ldr d0, [x9, x6]
    2a8: fc686958 ldr d24, [x10, x8]
    2ac: fc676957 ldr d23, [x10, x7]
    2b0: fc666956 ldr d22, [x10, x6]
    2b4: fc606855 ldr d21, [x2, x0]
    2b8: fc686854 ldr d20, [x2, x8]
    2bc: fc676853 ldr d19, [x2, x7]
    2c0: fc666852 ldr d18, [x2, x6]
    2c4: fc606891 ldr d17, [x4, x0]
    2c8: fc686890 ldr d16, [x4, x8]
    2cc: fc676887 ldr d7, [x4, x7]
    2d0: fc666885 ldr d5, [x4, x6]
    2d4: 0f44a063 smull v3.4s, v3.4h, v4.h[0]
    2d8: 0f44a042 smull v2.4s, v2.4h, v4.h[0]
    2dc: 0f44a021 smull v1.4s, v1.4h, v4.h[0]
    2e0: 0f44a000 smull v0.4s, v0.4h, v4.h[0]
    2e4: 0f542323 smlal v3.4s, v25.4h, v4.h[1]
    2e8: 0f542302 smlal v2.4s, v24.4h, v4.h[1]
    2ec: 0f5422e1 smlal v1.4s, v23.4h, v4.h[1]
    2f0: 0f5422c0 smlal v0.4s, v22.4h, v4.h[1]
    2f4: 0f6422a3 smlal v3.4s, v21.4h, v4.h[2]
    2f8: 0f642282 smlal v2.4s, v20.4h, v4.h[2]
    2fc: 0f642261 smlal v1.4s, v19.4h, v4.h[2]
    300: 0f642240 smlal v0.4s, v18.4h, v4.h[2]
    304: 0f742223 smlal v3.4s, v17.4h, v4.h[3]
    308: 0f742202 smlal v2.4s, v16.4h, v4.h[3]
    30c: 0f7420e1 smlal v1.4s, v7.4h, v4.h[3]
    310: 0f7420a0 smlal v0.4s, v5.4h, v4.h[3]
    314: 0f138c63 rshrn v3.4h, v3.4s, #13
    318: 0f138c42 rshrn v2.4h, v2.4s, #13
    31c: 0f138c21 rshrn v1.4h, v1.4s, #13
    320: 0f138c00 rshrn v0.4h, v0.4s, #13
    324: 6d3e0a63 stp d3, d2, [x19, #-32]
    328: 6d3f0261 stp d1, d0, [x19, #-16]
Code generated by GCC8:

    26c: 6b0b02ff cmp w23, w11
    270: fc606922 ldr d2, [x9, x0]
    274: fc666941 ldr d1, [x10, x6]
    278: fc666920 ldr d0, [x9, x6]
    27c: 0f44a000 smull v0.4s, v0.4h, v4.h[0]
    280: 0f542020 smlal v0.4s, v1.4h, v4.h[1]
    284: fc6668e1 ldr d1, [x7, x6]
    288: 0f642020 smlal v0.4s, v1.4h, v4.h[2]
    28c: fc646945 ldr d5, [x10, x4]
    290: fc666901 ldr d1, [x8, x6]
    294: 0f742020 smlal v0.4s, v1.4h, v4.h[3]
    298: fc646921 ldr d1, [x9, x4]
    29c: 0f44a021 smull v1.4s, v1.4h, v4.h[0]
    2a0: 0f5420a1 smlal v1.4s, v5.4h, v4.h[1]
    2a4: fc626945 ldr d5, [x10, x2]
    2a8: 0f138c03 rshrn v3.4h, v0.4s, #13
    2ac: fc626920 ldr d0, [x9, x2]
    2b0: 0f44a000 smull v0.4s, v0.4h, v4.h[0]
    2b4: 0f5420a0 smlal v0.4s, v5.4h, v4.h[1]
    2b8: fc606945 ldr d5, [x10, x0]
    2bc: 0f44a042 smull v2.4s, v2.4h, v4.h[0]
    2c0: 0f5420a2 smlal v2.4s, v5.4h, v4.h[1]
    2c4: fc6468e5 ldr d5, [x7, x4]
    2c8: 0f6420a1 smlal v1.4s, v5.4h, v4.h[2]
    2cc: fc6268e5 ldr d5, [x7, x2]
    2d0: 0f6420a0 smlal v0.4s, v5.4h, v4.h[2]
    2d4: fc6068e5 ldr d5, [x7, x0]
    2d8: 0f6420a2 smlal v2.4s, v5.4h, v4.h[2]
    2dc: fc646905 ldr d5, [x8, x4]
    2e0: 0f7420a1 smlal v1.4s, v5.4h, v4.h[3]
    2e4: fc626905 ldr d5, [x8, x2]
    2e8: 0f138c21 rshrn v1.4h, v1.4s, #13
    2ec: 0f7420a0 smlal v0.4s, v5.4h, v4.h[3]
    2f0: 0f138c00 rshrn v0.4h, v0.4s, #13
    2f4: fc606905 ldr d5, [x8, x0]
    2f8: 0f7420a2 smlal v2.4s, v5.4h, v4.h[3]
    2fc: 0f138c42 rshrn v2.4h, v2.4s, #13
    300: 6d000e62 stp d2, d3, [x19]
    304: 6d010261 stp d1, d0, [x19, #16]
    308: 91008273 add x19, x19, #0x20
I tested different compile options and found that with GCC 7 the option "-fschedule-insns" produces code that runs faster; if I disable it, GCC 7 generates the same code as GCC 8. However, this option does not seem to work with GCC 8: with "-fschedule-insns" enabled, the code generated by GCC 8 is even slower, and with it disabled, the generated code simply follows the statement order of the C code.
I compiled my code with -O3, which means -fschedule-insns is enabled by default.
With this option enabled, GCC 7 reschedules the instructions and appears to group instructions of the same kind together, whereas GCC 8 either does not do that or reschedules them in a worse way.
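To experiment with the scheduling pass on a single function instead of the whole file, GCC's optimize function attribute can apply an -f option locally. The sketch below is an illustration only: dot4_s16 and the NO_SCHED_INSNS macro are hypothetical names, and the GCC manual describes this attribute as intended for debugging rather than production code, so treat it as a way to narrow down the problem, not as a fix.

```c
#include <stdint.h>

/* GCC-only: strings passed to optimize() that do not start with "O" are
 * treated as -f options, so this requests -fno-schedule-insns for the
 * annotated function. Other compilers get an empty macro. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_SCHED_INSNS __attribute__((optimize("no-schedule-insns")))
#else
#define NO_SCHED_INSNS
#endif

/* Hypothetical stand-in for the real kernel: a four-term multiply-accumulate
 * compiled with the first scheduling pass disabled. */
NO_SCHED_INSNS
int64_t dot4_s16(const int16_t a[4], const int16_t w[4])
{
    int64_t acc = (int64_t)a[0] * w[0];
    acc += (int64_t)a[1] * w[1];
    acc += (int64_t)a[2] * w[2];
    acc += (int64_t)a[3] * w[3];
    return acc;
}
```

Comparing the disassembly of the annotated function against an unannotated copy shows what the scheduler changes for that function alone.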
My question is: is this behavior expected in GCC 8, GCC 9 and future versions? Is this change in GCC instruction scheduling related to the Spectre mitigation fixes?
If I want the same instruction scheduling behavior in GCC 8, what can I do?
Thank you for looking into this.
Looking forward to your reply!
Tomas Chang
Aug 19, 2019
_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
<arm-performance.tar.xz>