The basic idea is that we add a new RTL optimization pass (or two) that assesses the usage of pseudo registers, and makes recommendations about what register class each should end up in, if there's a choice. These recommendations would then be used by later passes to get a better use of NEON. I might call this the "prealloc" pass, or something.
That sounds very much like the pre-reload that "new-ra" had at one point (http://gcc.gnu.org/viewcvs/branches/new-regalloc-branch/gcc/pre-reload.c). The problem with pre-reload for new-ra was that it was basically reload instead of something nicer and cleaner. It also only ran just before the register allocator, which is too late for the problem you are trying to solve.
Firstly, for each pseudo-register in a function, the pass would look at the insn constraints for each "def" and "use", and see how the registers relate to one another. This might determine things like "if rN is in class A, then rM must also be in class A".
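To make the "if rN is in class A, then rM must also be in class A" relation concrete, here is a minimal sketch in Python. It is a toy model, not GCC code: the insn is reduced to a list of constraint alternatives, each mapping a pseudo to the class that alternative demands. The function name `implied_classes` and the class names GENERAL and NEON are assumptions made up for illustration.

```python
# Toy model (hypothetical, not GCC's data structures): one insn is a list of
# constraint alternatives; each alternative maps a pseudo to the register
# class that alternative requires for it.
add_insn = [
    {"r1": "NEON", "r2": "NEON"},        # vector form: both operands in NEON
    {"r1": "GENERAL", "r2": "GENERAL"},  # scalar form: both in core registers
]

def implied_classes(alternatives, pseudo, cls, other):
    """Return the classes `other` may take in the alternatives where
    `pseudo` is placed in `cls`."""
    return {
        alt[other]
        for alt in alternatives
        if alt.get(pseudo) == cls and other in alt
    }

# If r1 ends up in NEON, the only alternative left forces r2 into NEON too.
r2_if_r1_neon = implied_classes(add_insn, "r1", "NEON", "r2")
```

A real pass would have to derive these alternatives from the constraint strings in the machine description, which this sketch does not attempt.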
At SUSE I tried to do this with the webizer pass (web.c). I wrote down the ideas we implemented at the time (see http://gcc.gnu.org/ml/gcc/2005-01/msg00179.html):
- web class, to replace regclass and choose register classes for webs instead of pseudos. This also includes splitting webs if a register in a web really wants to be in two different classes to satisfy constraints in two different insns. Right now, as far as I understand, regclass just picks one and lets reload figure out how to fix up that mistake.
- A semi-strict RTL mode. Right now there is just strict and non-strict. On the branch there is a semi-strict mode which is the same as strict RTL except that pseudo-registers are still allowed.
- pre-reload (which is related to web class) to make sure as many insn constraints as possible are satisfied before the register allocator goes to work. Basically, after pre-reload the insn stream should be in semi-strict RTL form.
I used the webizer to unify defs and uses. I would split a web if it needed multiple register classes (I inserted a mov, without checking that a move existed from the source to the target register class), and I put pseudos r1 and r2 in the same register class if there was an insn (set (r1) (r2)) somewhere. The selection of the register classes had a cost function, but I used rtx_cost, which is not very effective, really.

But I never took this experiment very far, because for x86-64 the plan didn't work as well as I had hoped. I don't remember the details, but the biggest problem I had with the experimental implementation of these ideas (apart from lots of trouble with recog for semi-strict RTL) was that there is a bit of an ordering problem between combine on the one hand and web-based register classes on the other. If you assign classes too early and don't allow things to change, then combine fails too often. If you assign register classes after combine, you may not get the instructions selected the way you want them to be.
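As a rough illustration of the "same class if there is a (set (r1) (r2))" rule and of where a split would be needed, here is a small union-find sketch in Python. It is not the web.c code: `unify_by_moves`, the `wants` dictionary and the two class names are hypothetical, and the cost function is left out entirely.

```python
ALL_CLASSES = frozenset({"GENERAL", "NEON"})  # assumed toy class set

def unify_by_moves(moves, wants):
    """Group pseudos connected by register-to-register moves.

    moves: list of (dst, src) pairs, one per (set (dst) (src)) insn.
    wants: pseudo -> set of classes its other insns allow.
    Returns (pseudo -> classes allowed for its group, moves left cross-class).
    """
    parent, allowed = {}, {}

    def find(r):
        if r not in parent:
            parent[r] = r
            allowed[r] = set(wants.get(r, ALL_CLASSES))
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    cross_class = []
    for dst, src in moves:
        a, b = find(dst), find(src)
        if a == b:
            continue
        common = allowed[a] & allowed[b]
        if common:
            # Both sides can share a class: merge the groups.
            parent[b] = a
            allowed[a] = common
        else:
            # No common class: keep the groups apart; this move is the copy
            # between classes (the "split" point).
            cross_class.append((dst, src))
    return {r: allowed[find(r)] for r in list(parent)}, cross_class

classes, splits = unify_by_moves(
    moves=[("r1", "r2"), ("r3", "r2")],
    wants={"r2": {"NEON"}, "r3": {"GENERAL"}},
)
```

Here r1 and r2 end up together in NEON, while the move into r3 stays as a copy between classes. Like the experiment described above, the sketch does not check that such a move actually exists on the target.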
This was when GCC still had the old local-alloc.c and global.c allocators. Things may be different (better) with IRA and the upcoming LRA stuff.
If you plan to work on this, I would suggest you also discuss the plan on the GCC mailing list, with Jeff Law and Vladimir Makarov in CC, because they are working on a reload rewrite (LRA).
Ciao! Steven