Hi,
I've now put this at :
https://wiki.linaro.org/WorkingGroups/ToolChain/Meetings/2011-11-15
Are there any other topics that folks want to bring up ?
The one thing worth thinking about ahead of time is whether we want to bring the call forward by an hour, to allow Michael to join at a less crazy hour for him. What do folks think of 9 a.m. UTC on Tuesdays / Wednesdays?
cheers Ramana
Hi,
Are there any other topics that folks want to bring up ?
There are some issues exposed while testing the register pressure estimation for SMS that I would like to get some feedback on:
As discussed off-line, one issue is related to the note_uses function, which currently does not take element zero into account when handling the ZERO_EXTRACT case (http://gcc.gnu.org/ml/gcc/2011-10/msg00419.html). I bootstrapped the solution of adding (*fun) (&XEXP (dest, 0), data); on PowerPC and I get a bootstrap failure (Internal error: abort in get_output_file_with_visibility, at gengtype.c:2093). I debugged it to the point where I know that the following expression triggers it (applying this change for the 8799th time, on this expression, causes the failure); however, I am not sure how to proceed, as the expression does not look faulty, so I would appreciate directions.
(zero_extract:DI (reg:DI 2829) (const_int 1 [0x1]) (const_int 3 [0x3]))
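For context, the relevant fragment of the SET case in note_uses (rtlanal.c) looks roughly like this, with the proposed extra call marked (reproduced from memory, so the surrounding code may differ slightly from trunk):

    case SET:
      {
	rtx dest = SET_DEST (body);

	/* For sets we walk everything in the source, plus registers in a
	   memory destination and the operands of a ZERO_EXTRACT.  */
	(*fun) (&SET_SRC (body), data);

	if (GET_CODE (dest) == ZERO_EXTRACT)
	  {
	    (*fun) (&XEXP (dest, 0), data);   /* proposed addition: a
						 ZERO_EXTRACT destination only
						 writes part of operand 0, so
						 operand 0 is also a use.  */
	    (*fun) (&XEXP (dest, 1), data);
	    (*fun) (&XEXP (dest, 2), data);
	  }

	while (GET_CODE (dest) == SUBREG || GET_CODE (dest) == STRICT_LOW_PART)
	  dest = XEXP (dest, 0);

	if (MEM_P (dest))
	  (*fun) (&XEXP (dest, 0), data);
      }
      return;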
Another issue is related to the regression I saw with SMS in libav's dsputil-ssd_int8_vs_int16_c. Consulting with Ayal about this, it seemed that the regression was due to a dependence between accumulations that can be avoided; more specifically, we had the following case in the vector code:
vec1 = vec1 + ...
...
vec1 = vec1 + ...
...
vec1 = vec1 + ...
...
vec1 = vec1 + ...
To resolve this, I implemented a hack similar to the MVE optimization in the loop unroller, as follows:
vec1 = vec1 + ...
...
vec2 = vec2 + ...
...
vec3 = vec3 + ...
...
vec4 = vec4 + ...
This gives a ~4.5% improvement to the non-SMSed version. I was thinking of submitting this patch, and I would appreciate thoughts about where to place it in the pass pipeline.
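To illustrate what the hack effectively does (the scalar code and names below are made up for illustration; the actual change works on the unrolled RTL), the single accumulator chain becomes independent partial sums that are combined once after the loop:

/* Before: a single accumulator serializes the four unrolled
   accumulations (n is assumed to be a multiple of 4).  */
int acc_before (const int *a, int n)
{
  int acc = 0;
  for (int i = 0; i < n; i += 4)
    {
      acc += a[i] * a[i];
      acc += a[i + 1] * a[i + 1];
      acc += a[i + 2] * a[i + 2];
      acc += a[i + 3] * a[i + 3];
    }
  return acc;
}

/* After: each unrolled copy gets its own accumulator, so the four
   multiply-accumulates are independent; the partial sums are
   combined after the loop.  */
int acc_after (const int *a, int n)
{
  int acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
  for (int i = 0; i < n; i += 4)
    {
      acc0 += a[i] * a[i];
      acc1 += a[i + 1] * a[i + 1];
      acc2 += a[i + 2] * a[i + 2];
      acc3 += a[i + 3] * a[i + 3];
    }
  return acc0 + acc1 + acc2 + acc3;
}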
Thanks, Revital
Revital Eres revital.eres@linaro.org writes:
Another issue is related to the regression I saw with SMS in libav's dsputil-ssd_int8_vs_int16_c. Consulting with Ayal about this, it seemed that the regression was due to a dependence between accumulations that can be avoided; more specifically, we had the following case in the vector code:
vec1 = vec1 + ...
...
vec1 = vec1 + ...
...
vec1 = vec1 + ...
...
vec1 = vec1 + ...
To resolve this, I implemented a hack similar to the MVE optimization in the loop unroller, as follows:
vec1 = vec1 + ...
...
vec2 = vec2 + ...
...
vec3 = vec3 + ...
...
vec4 = vec4 + ...
While I agree that's a useful transformation, do you have a few more details about the SMS regression? I assume both the non-SMS and SMS loops use the:
vec1 = vec1 + ...
...
vec1 = vec1 + ...
...
vec1 = vec1 + ...
...
vec1 = vec1 + ...
chain, so what makes the SMS version of it worse than the non-SMS version?
Richard
Hi Richard,
chain, so what makes the SMS version of it worse than the non-SMS version?
I attached the SMS dump file. The problematic loop is the one with "SMS succeeded 36 2" (there are three loops in total in this file). Due to these accumulators, the min ii is 36, which seems to cause SMS to make wrong decisions.
SMS iis 36 36 72 (rec_mii, mii, maxii)
By the way, examining the following loop without SMS, compiled with -c -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp -O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize, I see two vmla.i32 in a row at the end of it, and I wonder why they end up so close together (isn't there a delay between them that could be filled by moving a vsub between them?).
Thanks, Revital
.L7:
	mov	r1, ip
	vldmia	r5!, {d18-d19}
	vmovl.s8	q11, d18
	add	r0, r0, #1
	vld1.16	{q12}, [r1]!
	cmp	r0, r7
	vmovl.s8	q9, d19
	add	ip, ip, #32
	vmovl.s16	q14, d22
	vmovl.s16	q10, d24
	vmovl.s16	q13, d25
	vmovl.s16	q11, d23
	vsub.i32	q10, q14, q10
	vld1.16	{q12}, [r1]
	vsub.i32	q11, q11, q13
	vmla.i32	q8, q10, q10
	vmovl.s16	q13, d18
	vmovl.s16	q10, d24
	vmovl.s16	q9, d19
	vmovl.s16	q12, d25
	vsub.i32	q10, q13, q10
	vmla.i32	q8, q11, q11
	vsub.i32	q9, q9, q12
	vmla.i32	q8, q10, q10
	vmla.i32	q8, q9, q9
	bcc	.L7
Revital Eres revital.eres@linaro.org writes:
chain, so what makes the SMS version of it worse than the non-SMS version?
I attached the SMS dump file. The problematic loop is the one with "SMS succeeded 36 2" (there are three loops in total in this file). Due to these accumulators, the min ii is 36, which seems to cause SMS to make wrong decisions.
SMS iis 36 36 72 (rec_mii, mii, maxii)
OK, so the minimum ii comes from each dependency in the chain of 4 accumulations having a latency of 9 cycles. But the A9 TRM says:
If a multiply-accumulate follows a multiply or another multiply-accumulate, and depends on the result of that first instruction, then if the dependency between both instructions are of the same type and size, the processor uses a special multiplier accumulator forwarding. This special forwarding means the multiply instructions can issue back-to-back because the result of the first instruction in cycle 5 is forwarded to the accumulator of the second instruction in cycle 4. If the size and type of the instructions do not match, then Dd or Qd is required in cycle 3. This applies to combinations of the multiply-accumulate instructions VMLA, VMLS, VQDMLA, and VQDMLS, and the multiply instructions VMUL and VQDMUL.
So I think the problem is that successive VMLAs don't in fact have a latency of 9. However, this doesn't seem to be modelled in the ARM backend, either through bypasses or in a sched-reorder hook. In contrast, the A8 pipeline description has:
;; A multiply with a single-register result or an MLA, followed by an
;; MLA with an accumulator dependency, has its result forwarded so two
;; such instructions can issue back-to-back.
(define_bypass 1 "cortex_a8_mul,cortex_a8_mla,cortex_a8_smulwy"
               "cortex_a8_mla"
               "arm_mac_accumulator_is_mul_result")
I'm not sure from the A9 description whether "following" means "immediately following", or whether gaps between instructions are allowed (and, in the latter case, whether the gap can be filled with arbitrary instructions, or whether restrictions apply, such as "anything but another NEON multiplication"). Ramana, do you know?
Anyway, I think this explains why the non-SMS loop executes more quickly than GCC expects, and why the SMS loop is slower than it needs to be. It might be worth comparing the two loops with -mtune=cortex-a8.
Richard
Hi,
Anyway, I think this explains why the non-SMS loop executes more quickly than GCC expects, and why the SMS loop is slower than it needs to be. It might be worth comparing the two loops with -mtune=cortex-a8.
Thanks for the detailed explanation!
I see this regression on cortex-a8 as well. Also, a delay of 9 between the accumulators is still shown in the SMS dumps when running with -mtune=cortex-a8 -mcpu=cortex-a8.
Thanks, Revital
On 15 November 2011 09:19, Richard Sandiford richard.sandiford@linaro.org wrote:
Revital Eres revital.eres@linaro.org writes:
chain, so what makes the SMS version of it worse than the non-SMS version?
I attached the SMS dump file. The problematic loop is the one with "SMS succeeded 36 2" (there are three loops in total in this file). Due to these accumulators, the min ii is 36, which seems to cause SMS to make wrong decisions.
SMS iis 36 36 72 (rec_mii, mii, maxii)
OK, so the minimum ii comes from each dependency in the chain of 4 accumulations having a latency of 9 cycles. But the A9 TRM says:
If a multiply-accumulate follows a multiply or another multiply-accumulate, and depends on the result of that first instruction, then if the dependency between both instructions are of the same type and size, the processor uses a special multiplier accumulator forwarding. This special forwarding means the multiply instructions can issue back-to-back because the result of the first instruction in cycle 5 is forwarded to the accumulator of the second instruction in cycle 4. If the size and type of the instructions do not match, then Dd or Qd is required in cycle 3. This applies to combinations of the multiply-accumulate instructions VMLA, VMLS, VQDMLA, and VQDMLS, and the multiply instructions VMUL and VQDMUL.
So I think the problem is that successive VMLAs don't in fact have a latency of 9. However, this doesn't seem to be modelled in the ARM backend, either through bypasses or in a sched-reorder hook. In contrast, the A8 pipeline description has:
This should be identical for both the A8 and A9 descriptions.
;; Instructions using this reservation read their (D|Q)n operands at N2,
;; their (D|Q)m operands at N1, their (D|Q)d operands at N3, and
;; produce a result at N6 on cycle 4.
(define_insn_reservation "cortex_a8_neon_mla_qqq_32_qqd_32_scalar" 9
  (and (eq_attr "tune" "cortexa8")
       (eq_attr "neon_type" "neon_mla_qqq_32_qqd_32_scalar"))
  "cortex_a8_neon_dp_4")
I thought I spotted the bypass for this but you are right, there is no bypass that handles this particular case.
;; A multiply with a single-register result or an MLA, followed by an
;; MLA with an accumulator dependency, has its result forwarded so two
;; such instructions can issue back-to-back.
(define_bypass 1 "cortex_a8_mul,cortex_a8_mla,cortex_a8_smulwy"
               "cortex_a8_mla"
               "arm_mac_accumulator_is_mul_result")
But that is modelling only the scalar bypasses for the A8, indicating back-to-back issue of a multiply followed by an MLA. The A9 descriptions should handle this with appropriate issue restrictions.
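To illustrate the kind of bypass that would be needed (this is only a sketch: the reservation names below just follow the A8 naming convention and are placeholders, and the real names in cortex-a9-neon.md, the bypass latency, and whether the scalar guard is suitable for the NEON patterns would all need checking):

;; Hypothetical sketch only: forward the result of a NEON
;; multiply-accumulate to the accumulator operand of a dependent
;; multiply-accumulate of the same type and size, so the pair can
;; issue back-to-back as described in the A9 TRM.  The reservation
;; names are placeholders modelled on the A8 ones, and reusing the
;; scalar guard arm_mac_accumulator_is_mul_result would need to be
;; verified against the NEON patterns.
(define_bypass 1 "cortex_a9_neon_mla_qqq_32_qqd_32_scalar"
               "cortex_a9_neon_mla_qqq_32_qqd_32_scalar"
               "arm_mac_accumulator_is_mul_result")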
I'm not sure from the A9 description whether "following" means "immediately following", or whether gaps between instructions are allowed (and, in the latter case, whether the gap can be filled with arbitrary instructions, or whether restrictions apply, such as "anything but another NEON multiplication"). Ramana, do you know?
I don't know the answer to that specific question and will have to try a few experiments.
Anyway, I think this explains why the non-SMS loop executes more quickly than GCC expects, and why the SMS loop is slower than it needs to be. It might be worth comparing the two loops with -mtune=cortex-a8.
Richard