[RFC] NEON vs. ARM register selection - linaro-toolchain

2 Mar 2012


Hi All,
As you know, the compiler currently has difficulty deciding whether to 
do an operation in NEON or not.
As I see it there are three problems:
1. Simply, is it profitable?
NEON can do many DImode operations in one or two instructions
      where 2 to 10 normal ARM/Thumb instructions would be required
      (not to mention the added register pressure), but there is a
      cost associated with moving the inputs to NEON, and the results
      back.
If the data can stay in NEON for more than one operation,
      then that's even better.
If the data must be loaded from memory, and the result stored back
      to memory, then it's only a question of whether the register space
      is available, or not.
Currently these decisions are made in the IRA/reload passes.
2. Values that originate in hard-registers stay there.
This applies to function parameters, mostly, but also in general
      where the result of an operation is allocated first.
If there is no instruction that can use the value there then the
      value is 'reloaded' to a more suitable register. If there is any
      alternative that avoids the move then the register allocator will
      use it, regardless of the relative costs of the other
      alternatives.
This problem is reduced where an operation and move can happen in
      one instruction, but NEON instructions do not do this much. We can
      write insns that appear to do it, but these output multiple
      instructions (see my recent core-SI=>NEON-DI extend patch).
3. It all happens too late.
The decision whether to use NEON or not is not made until register
      allocation time. Naturally this means that most of the optimization
      passes are already completed.
Part of the problem is that the operation almost certainly needs
      splitting (into whatever form was chosen) and this might not be
      straightforward, post-reload. (However, the split1 pass is
      already quite late, so perhaps this isn't such a big deal.)
Another part of the problem is that passes such as the two
      lower-subreg passes make assumptions about the register width which
      are not accurate if the operation is to end up in NEON.
There are other, lesser problems, such as it being hard to adjust the 
costs for different cores (A8 in particular), and the fact that the cost 
of generating an immediate constant can't be known until it's known what 
instructions will be used to generate it.
These problems are not specific to NEON, of course. I believe IWMMXT 
suffers from the same issues. Likewise the C6X port, and also the i386 
MMX to some degree. Anything that has instructions that only operate on 
a subset of registers, basically.
So, Bernd has suggested an outline of a solution. I've quizzed him on 
this, added a few of my own ideas, and probably a good selection of 
misunderstandings, bad assumptions, and general cock ups, and come up 
with something I can write here for comment. I can post something to 
upstream later if it doesn't get totally shot down now.
The basic idea is that we add a new RTL optimization pass (or two) that 
assesses the usage of pseudo registers, and makes recommendations about 
what register class each should end up in, if there's a choice. These 
recommendations would then be used by later passes to make better use 
of NEON. I might call this the "prealloc" pass, or something.
Firstly, for each pseudo-register in a function, the pass would look at 
the insn constraints for each "def" and "use", and see how the registers 
relate to one another. This might determine things like "if rN is in 
class A, then rM must be also in class A".
E.g. if you have an insn whose two register operands have constraints 
like this:
      "r,w"
      "r,w"
.. (and 'r' and 'w' do not overlap) then you know that both registers 
must end up in the same class, and there is a choice between one class 
or the other, whereas this:
      "r,w,r,w"
      "r,w,w,r"
.. would impose no restrictions and we can carry on as normal.
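To make the constraint example concrete, here is a minimal sketch in Python (not GCC code). It assumes each alternative is a single constraint letter, that the classes do not overlap, and that modifiers such as '=' or '?' have already been stripped; the function names are made up for illustration.

```python
from itertools import product


def allowed_pairs(op_a, op_b):
    """Return the (class_a, class_b) pairs permitted by matching alternatives.

    op_a and op_b are comma-separated constraint strings, one letter per
    alternative (a simplifying assumption; real constraints are richer).
    """
    alts_a = op_a.split(",")
    alts_b = op_b.split(",")
    if len(alts_a) != len(alts_b):
        raise ValueError("operands must list the same number of alternatives")
    return set(zip(alts_a, alts_b))


def is_coupled(op_a, op_b):
    """True if choosing a class for one operand restricts the other."""
    pairs = allowed_pairs(op_a, op_b)
    classes_a = {a for a, _ in pairs}
    classes_b = {b for _, b in pairs}
    # If every combination of the classes seen is allowed, there is no
    # restriction; otherwise the two operands are tied together.
    return pairs != set(product(classes_a, classes_b))


print(is_coupled("r,w", "r,w"))          # True: both core or both NEON
print(is_coupled("r,w,r,w", "r,w,w,r"))  # False: any combination is fine
```

A real implementation would have to work on the insn alternatives as the register allocator sees them, which this sketch does not attempt.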
Having done that we'd end up with sets of pseudo-registers that must 
make a decision one way or the other, and we'd know where the operations 
are that would force a move from one class to the other.
There's a fair amount of handwavium in there at present, because I've 
not worked out what to do with overlapping register classes (think 
VFP_LO_REGS) and all the other complications.
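As a rough illustration of how the "sets of pseudo-registers that must make a decision" might be collected, the following Python sketch groups pseudos with a union-find structure. It assumes the coupled pairs have already been found by scanning the insn constraints, and it ignores overlapping register classes entirely; the pseudo names and the function name are hypothetical.

```python
def group_pseudos(pseudos, coupled_pairs):
    """Group pseudos that must share a register-class decision.

    pseudos: iterable of pseudo-register names.
    coupled_pairs: iterable of (a, b) pairs tied together by some insn.
    Returns a sorted list of sorted groups.
    """
    parent = {p: p for p in pseudos}

    def find(p):
        # Follow parent links to the representative, halving the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in coupled_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for p in parent:
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())


groups = group_pseudos(
    ["r100", "r101", "r102", "r103"],
    [("r100", "r101"), ("r101", "r102")],
)
print(groups)  # [['r100', 'r101', 'r102'], ['r103']]
```

Each resulting group would then get a single recommendation, with moves needed wherever a group meets an operation that forces the other class.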
Secondly, the pass would consider the costs of each alternative, and 
store a recommended register class for each pseudo-register in a table 
somewhere. It would also create new pseudos and insert extra move 
instructions at the register file boundaries where an existing register 
would have had split recommendations (this would solve problem 2 above).
Again, there's handwavium in "consider the costs". This isn't too hard 
for size-optimization (assuming the "length" attributes on the insns are 
correct), but more difficult for speed optimization. Factors to include 
would be the move costs (here the A8 issues would be addressed) and the 
relative speeds of the operations in both alternatives. Also, the 
various possible transition points between the two modes might need some 
comparisons.
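To show the shape of the "consider the costs" step, here is a deliberately simple Python sketch. It assumes costs are plain instruction counts, that every transfer between register files costs the same, and that ties go to the core registers; all the numbers are invented for illustration and are not measurements from any core.

```python
def recommend_class(ops, inputs_from_core, outputs_to_core, move_cost):
    """Compare doing a chain of operations in core registers vs. NEON.

    ops: list of (core_cost, neon_cost) tuples, one per operation.
    inputs_from_core / outputs_to_core: number of values that would have
    to cross the register-file boundary if NEON were chosen.
    move_cost: assumed cost of one cross-file move.
    Returns (recommendation, core_total, neon_total).
    """
    core_total = sum(core for core, _ in ops)
    neon_total = sum(neon for _, neon in ops)
    neon_total += (inputs_from_core + outputs_to_core) * move_cost
    choice = "NEON" if neon_total < core_total else "CORE"
    return choice, core_total, neon_total


# One DImode operation: the moves outweigh the saving.
print(recommend_class([(3, 1)], 2, 1, 2))      # ('CORE', 3, 7)
# Four chained operations: the data stays in NEON long enough to pay off.
print(recommend_class([(3, 1)] * 4, 2, 1, 2))  # ('NEON', 12, 10)
```

For speed optimization the per-operation and per-move costs would need to come from the core's tuning data rather than instruction counts, and the comparison would have to be repeated for each candidate transition point.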
Thirdly, the subsequent passes would need to be modified, as would some 
of the back-end bits and bobs.
1. Lower-subreg would need to detect 'word_mode' based on the 
recommended register class, not the global value.
2. The many split patterns in the machine description could be adjusted 
so that, instead of simply conditionalizing on "reload_completed", they 
split at split1 if that's the best option. (Maybe it would be profitable 
to insert a new, earlier split pass specifically for this case to take 
advantage of the likes of combine? I mean, ideally this decision would 
have been made at expand time, if it could have been?) It might be 
useful to *not* split too soon, in some cases, so that the register 
allocator can still make the final decision based on register pressure, 
and whatever other factors it uses. Of course, the existing late-split 
option would need to be retained in case the prealloc pass is disabled, 
in any case.
3. Various passes would have to be taught not to remove seemingly 
superfluous register moves where they actually move between register 
classes.
4. Pretty much nothing would need doing to register allocation! The 
extra moves should make allocation a register pressure management issue, 
rather than a question of making it work. DImode operations preallocated 
to core-registers may already have been lowered, one way or the other 
(by split1) so there's no decision left there, and if no lowering was 
necessary then that option ought to be obviously cheaper. If it insists 
on making contrary decisions then it can be taught to use the 
recommendation as a hint, perhaps? In specific problem cases it would 
also be possible to use instruction attributes to disable (or strongly 
discourage) certain alternatives based on the recommended class.
5. The existing 'onlya8'/'nota8' nonsense can be removed.
6. The register move cost can be set correctly for each core.
7. If a constant is most likely destined for a NEON register, 
arm_gen_constant can use the NEON immediate rules to determine the cost.
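To illustrate how later passes might consult the recommendations (items 1 and 3 above), here is a small Python sketch of a per-pseudo table with two query helpers. The class labels "CORE" and "NEON", the table layout and the helper names are all hypothetical; the only facts assumed are that core registers are 32 bits wide and NEON D registers are 64 bits wide.

```python
# Assumed word width, in bits, for each hypothetical recommended class.
WORD_BITS = {"CORE": 32, "NEON": 64}


def word_bits_for(pseudo, recommended, default_bits=32):
    """Width a lower-subreg-like pass might use for this pseudo.

    Falls back to the global default when there is no recommendation.
    """
    return WORD_BITS.get(recommended.get(pseudo), default_bits)


def is_removable_move(dst, src, recommended):
    """True if a register copy is still a candidate for removal.

    A move between two pseudos recommended for different classes is
    deliberate and should be kept; without recommendations, behave as
    before and leave the decision to the existing pass.
    """
    class_dst = recommended.get(dst)
    class_src = recommended.get(src)
    return class_dst is None or class_src is None or class_dst == class_src


recommended = {"r100": "NEON", "r101": "CORE", "r102": "NEON"}
print(word_bits_for("r100", recommended))              # 64
print(is_removable_move("r100", "r101", recommended))  # False
print(is_removable_move("r100", "r102", recommended))  # True
```

Keeping the fallback behaviour identical to today's would let the existing passes work unchanged when the prealloc pass is disabled.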
There's clearly a lot of thought that needs to go into the 
pseudo-register scan and decision making logic, but the whole thing 
doesn't look like it'll boil down to very much code in the end.
There's also the question of where to put the pass. Too early and you'd 
need to put a second one in to reassess the much changed RTL later, and 
too late and lower-subreg won't be able to use it.
It's possible that it might be better to treat it more like the 
data-flow analysis where it's not actually a stand-alone pass, but 
rather a tool other passes can use? That might depend on how 
computationally expensive it is.
Any thoughts anyone? Might something like this actually work? Would it 
be worth spending the time on this?
Andrew