Re: Agenda for tomorrow's call.

15 Nov 2011


On 15 November 2011 09:19, Richard Sandiford <richard.sandiford@linaro.org> wrote:
...
Revital Eres <revital.eres@linaro.org> writes:
...
...
chain, so what makes the SMS version of it worse than the non-SMS version?
I attached the SMS dump file. The problematic loop is the one with
"SMS succeeded 36 2" (there are three loops in total in this file).
Due to these accumulators the minimum ii is 36, which seems to cause SMS to
make the wrong decisions.
SMS iis 36 36 72 (rec_mii, mii, maxii)
OK, so the minimum ii comes from each dependency in the chain of
4 accumulations having a latency of 9 cycles.  But the A9 TRM says:
If a multiply-accumulate follows a multiply or another
   multiply-accumulate, and depends on the result of that first
   instruction, then if the dependency between both instructions are of the
   same type and size, the processor uses a special multiplier accumulator
   forwarding. This special forwarding means the multiply instructions can
   issue back-to-back because the result of the first instruction in cycle
   5 is forwarded to the accumulator of the second instruction in cycle
   4. If the size and type of the instructions do not match, then Dd or Qd
   is required in cycle 3. This applies to combinations of the
   multiply-accumulate instructions VMLA, VMLS, VQDMLA, and VQDMLS, and the
   multiply instructions VMUL and VQDMUL.
So I think the problem is that successive VMLAs don't in fact have a
latency of 9.  However, this doesn't seem to be modelled in the ARM
backend, either through bypasses or in a sched-reorder hook.
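To make the arithmetic concrete, here is a rough sketch of how the recurrence bound falls out of the chain latencies. It is not GCC's modulo-sched code: the helper name rec_mii and the forwarded latency of 1 are assumptions for illustration only (the real A9 figure would need measuring), and the chain is assumed to be a single recurrence with an iteration distance of 1.

```python
def rec_mii(latencies, distance=1):
    """Recurrence-constrained minimum ii for one dependence cycle:
    ceil(sum of edge latencies / total iteration distance)."""
    total = sum(latencies)
    return -(-total // distance)  # integer ceiling division


# Chain of 4 dependent accumulations, as currently modelled (latency 9 each).
modelled = rec_mii([9, 9, 9, 9])

# Same chain if accumulator forwarding were modelled.
# The latency of 1 is a hypothetical value, not a measured A9 number.
forwarded = rec_mii([1, 1, 1, 1])

print(modelled)   # 36, matching the rec_mii in the SMS dump
print(forwarded)  # 4
```

With the modelled latency the recurrence alone forces ii up to 36; any bypass that shortens the accumulator dependency would lower that bound in proportion.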
In contrast, the A8 pipeline description has:
This should be identical for both the A8 and A9 descriptions.
;; Instructions using this reservation read their (D|Q)n operands at N2,
;; their (D|Q)m operands at N1, their (D|Q)d operands at N3, and
;; produce a result at N6 on cycle 4.
(define_insn_reservation "cortex_a8_neon_mla_qqq_32_qqd_32_scalar" 9
  (and (eq_attr "tune" "cortexa8")
       (eq_attr "neon_type" "neon_mla_qqq_32_qqd_32_scalar"))
  "cortex_a8_neon_dp_4")
I thought I spotted the bypass for this but you are right, there is no
bypass that handles this particular case.
...
;; A multiply with a single-register result or an MLA, followed by an
;; MLA with an accumulator dependency, has its result forwarded so two
;; such instructions can issue back-to-back.
(define_bypass 1 "cortex_a8_mul,cortex_a8_mla,cortex_a8_smulwy"
              "cortex_a8_mla"
              "arm_mac_accumulator_is_mul_result")
But that models only the scalar bypasses for the A8, indicating
back-to-back issue of a multiply followed by an MLA. The A9
descriptions should handle this with appropriate issue restrictions.
...
I'm not sure from the A9 description whether "following" means
"immediately following", or whether gaps between instructions are
allowed (and, in the latter case, whether the gap can be filled with
arbitrary instructions, or whether restrictions apply, such as
"anything but another NEON multiplication").  Ramana, do you know?
I don't know the answer to that specific question and will have to try
a few experiments.
...
Anyway, I think this explains why the non-SMS loop executes more
quickly than GCC expects, and why the SMS loop is slower than it
needs to be.  It might be worth comparing the two loops with
-mtune=cortex-a8.
Richard
