Hi All,
As you know, the compiler currently has difficulty deciding whether or not to do an operation in NEON.
As I see it there are three problems:
1. Simply, is it profitable?
NEON can do many DImode operations in one or two instructions where 2 to 10 normal ARM/Thumb instructions would be required (not to mention the added register pressure), but there is a cost associated with moving the inputs to NEON, and the results back.
If the data can stay in NEON for more than one operation, then that's even better.
If the data must be loaded from memory, and the result stored back to memory, then it's only a question of whether the register space is available, or not.
Currently these decisions are made in the IRA/reload passes.
2. Values that originate in hard-registers stay there.
This applies to function parameters, mostly, but also in general where the result of an operation is allocated first.
If there is no instruction that can use the value there then the value is 'reloaded' to a more suitable register. If there is any alternative that avoids the move then the register allocator will use it, regardless of the relative costs of the other alternatives.
This problem is reduced where an operation and move can happen in one instruction, but NEON instructions do not do this much. We can write insns that appear to do it, but these output multiple instructions (see my recent core-SI=>NEON-DI extend patch).
3. It all happens too late.
The decision whether to use NEON or not is not made until register allocation time. Naturally this means that most of the optimization passes are already completed.
Part of the problem is that the operation almost certainly needs splitting (into whatever form was chosen) and this might not be straightforward, post-reload. (However, the split1 pass is already quite late, so perhaps this isn't such a big deal.)
Another part of the problem is that passes such as the two lower-subreg passes make assumptions about the register width which are not accurate if the operation is to end up in NEON.
There are other, lesser problems, such as it being hard to adjust the costs for different cores (A8 in particular) and the cost of generating an immediate constant can't be known until it's known what instructions will be used to generate it.
These problems are not specific to NEON, of course. I believe IWMMXT suffers from the same issues. Likewise the C6X port, and also the i386 MMX to some degree. Anything that has instructions that only operate on a subset of registers, basically.
So, Bernd has suggested an outline of a solution. I've quizzed him on this, added a few of my own ideas, and probably a good selection of misunderstandings, bad assumptions, and general cock ups, and come up with something I can write here for comment. I can post something to upstream later if it doesn't get totally shot down now.
The basic idea is that we add a new RTL optimization pass (or two) that assesses the usage of pseudo registers, and makes recommendations about what register class each should end up in, if there's a choice. These recommendations would then be used by later passes to get a better use of NEON. I might call this the "prealloc" pass, or something.
Firstly, for each pseudo-register in a function, the pass would look at the insn constraints for each "def" and "use", and see how the registers relate to one another. This might determine things like "if rN is in class A, then rM must be also in class A".
E.g. if you have two registers with constraints like this:
"r,w" "r,w"
.. (and 'r' and 'w' do not overlap) then you know that both registers must end up in the same class, and the only choice is which one, whereas this:
"r,w,r,w" "r,w,w,r"
.. would impose no restrictions and we can carry on as normal.
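To make that check a bit more concrete, here's a rough sketch in Python (not GCC code). It assumes each alternative is a single class letter per operand, with no modifiers such as "=" or "&", and that the classes do not overlap; the function names are made up for illustration. It pairs up the alternatives column by column and asks whether every combination of classes is available.

```python
from itertools import product

def allowed_pairs(op0, op1):
    """Return the set of (class0, class1) pairs offered by the alternatives."""
    alts0 = op0.split(",")
    alts1 = op1.split(",")
    if len(alts0) != len(alts1):
        raise ValueError("operands must list the same number of alternatives")
    # Alternative i of operand 0 goes with alternative i of operand 1.
    return set(zip(alts0, alts1))

def is_unrestricted(op0, op1):
    """True if every combination of the classes appears as some alternative."""
    pairs = allowed_pairs(op0, op1)
    classes0 = {a for a, _ in pairs}
    classes1 = {b for _, b in pairs}
    return pairs == set(product(classes0, classes1))

print(is_unrestricted("r,w", "r,w"))          # False: the operands are tied together
print(is_unrestricted("r,w,r,w", "r,w,w,r"))  # True: any mix is allowed
```

A real implementation would have to cope with multi-letter constraints, modifiers, and overlapping classes, which this deliberately ignores.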
Having done that we'd end up with sets of pseudo-registers that must make a decision one way or the other, and we'd know where the operations are that would force a move from one class to the other.
There's a fair amount of handwavium in there at present, because I've not worked out what to do with overlapping register classes (think VFP_LO_REGS) and all the other complications.
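For the simple, non-overlapping case, building those sets could look something like the union-find sketch below. This is Python for illustration only; the pseudo-register numbers and the idea of feeding in a plain list of "must share a class" pairs are my assumptions, not anything that exists in GCC today.

```python
def group_pseudos(couplings):
    """Group pseudo-register numbers into sets that must share a class.

    couplings: iterable of (pseudo_a, pseudo_b) pairs, each meaning that
    some insn only offers alternatives placing both in the same class.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in couplings:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # merge the two sets

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical pseudos: 100-102 are chained together, 200-201 are separate.
print(group_pseudos([(100, 101), (101, 102), (200, 201)]))
# [[100, 101, 102], [200, 201]]
```

Each resulting set is then one unit for the cost decision; the insns that link two different sets are the places where a cross-class move might be needed.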
Secondly, the pass would consider the costs of each alternative, and store a recommended register class for each pseudo-register in a table somewhere. It would also create new pseudos and insert extra move instructions at the register file boundaries where an existing register would have had split recommendations (this would solve problem 2 above).
Again, there's handwavium in "consider the costs". This isn't too hard for size-optimization (assuming the "length" attribute on each insn is correct), but more difficult for speed optimization. Factors to include would be the move costs (here the A8 issues would be addressed) and the relative speeds of the operations in both alternatives. Also, the various possible transition points between the two modes might need some comparisons.
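For the size-optimization case, the comparison for one set of pseudos might be as simple as the following Python sketch. All the byte counts are invented for illustration (8 bytes for a two-insn core sequence, 4 bytes for a single NEON insn, 4 bytes per cross-register-file move), and `recommend_class` is a hypothetical name.

```python
def recommend_class(ops, moves_if_core, moves_if_neon, move_len=4):
    """Pick the smaller option for one set of pseudos, by code size only.

    ops: list of (core_length, neon_length) in bytes, one per operation.
    moves_if_core / moves_if_neon: number of cross-register-file moves
    needed if the set lives in core or in NEON registers respectively.
    """
    core = sum(c for c, _ in ops) + moves_if_core * move_len
    neon = sum(n for _, n in ops) + moves_if_neon * move_len
    # Ties go to core registers, i.e. the status quo.
    return ("neon" if neon < core else "core"), core, neon

# One operation: moving in and out of NEON costs more than it saves.
print(recommend_class([(8, 4)], 0, 2))      # ('core', 8, 12)
# Three chained operations: the moves are amortized.
print(recommend_class([(8, 4)] * 3, 0, 2))  # ('neon', 24, 20)
```

A speed-oriented version would swap the lengths for per-core latencies and move costs, which is where the hard part lies.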
Thirdly, the subsequent passes would need to be modified, as would some of the back-end bits and bobs.
1. Lower-subreg would need to detect 'word_mode' based on the recommended register class, not the global value.
2. The many split patterns in the machine description could be adjusted so that, instead of simply conditionalizing on "reload_completed", they split at split1 if that's the best option. (Maybe it would be profitable to insert a new, earlier split pass specifically for this case to take advantage of the likes of combine? I mean, ideally this decision would have been made at expand time, if it could have been?) It might be useful to *not* split too soon, in some cases, so that the register allocator can still make the final decision based on register pressure, and whatever other factors it uses. Of course, the existing late-split option would need to be retained in case the prealloc pass is disabled.
3. Various passes would have to be taught not to remove seemingly superfluous register moves where they actually move between register classes.
4. Pretty much nothing would need doing to register allocation! The extra moves should make allocation a register pressure management issue, rather than a question of making it work. DImode operations preallocated to core-registers may already have been lowered, one way or the other (by split1) so there's no decision left there, and if no lowering was necessary then that option ought to be obviously cheaper. If it insists on making contrary decisions then it can be taught to use the recommendation as a hint, perhaps? In specific problem cases it would also be possible to use instruction attributes to disable (or strongly discourage) certain alternatives based on the recommended class.
5. The existing 'onlya8'/'nota8' nonsense can be removed.
6. The register move cost can be set correctly for each core.
7. If a constant is most likely destined for a NEON register, arm_gen_constant can use the NEON immediate rules to determine the cost.
There's clearly a lot of thought that needs to go into the pseudo-register scan and decision making logic, but the whole thing doesn't look like it'll boil down to very much code in the end.
There's also the question of where to put the pass. Too early and you'd need to put a second one in to reassess the much changed RTL later, and too late and lower-subreg won't be able to use it.
It's possible that it might be better to treat it more like the data-flow analysis, where it's not actually a stand-alone pass but rather a tool other passes can use. That might depend on how computationally expensive it is.
Any thoughts anyone? Might something like this actually work? Would it be worth spending the time on this?
Andrew