Here's an implementation of an 8x8 integer DCT done with NEON intrinsics -- essentially a translation of the assembly version in libjpeg-turbo trunk:
https://github.com/mkedwards/crosstool-ng/blob/master/patches/libjpeg-turbo/...
It compiles (with Linaro 2011.05 GCC 4.5, anyway; a recent Linaro 4.6 snapshot ICEs on it) but is otherwise untested. Still, it's interesting to compare the assembly it generates against the hand-written version. I thought I'd give linaro-toolchain a heads-up in case y'all could use a test case that puts plenty of pressure on the VFP/NEON register bank. (I intend to use it to see how much performance difference there really is, on the A8 and A9, between NEON code compiled for 16 vs. 32 registers.)
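To give a rough idea of where the register pressure comes from, here is a minimal sketch. It is not the code at the link above: it is only the generic first butterfly stage that fast 8-point DCTs typically start with, applied to all eight columns at once. The function name butterfly_stage and the row-major int16_t layout are assumptions made for illustration. Holding the 8x8 block of 16-bit values takes eight q registers (sixteen d registers' worth) before any temporaries or constants are allocated. The scalar fallback is there only so the sketch builds on non-ARM hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/*
 * Hypothetical helper (not taken from the linked patch).
 * In-place first butterfly stage over an 8x8 block stored row-major
 * as block[row * 8 + col]:
 *   rows 0..3 receive row[i] + row[7 - i]
 *   rows 4..7 receive row[i] - row[7 - i]   (for i = 0..3)
 */
void butterfly_stage(int16_t block[64])
{
    assert(block != NULL);
#if HAVE_NEON
    /* Eight live q registers: the whole block is resident at once. */
    int16x8_t r0 = vld1q_s16(block + 0 * 8);
    int16x8_t r1 = vld1q_s16(block + 1 * 8);
    int16x8_t r2 = vld1q_s16(block + 2 * 8);
    int16x8_t r3 = vld1q_s16(block + 3 * 8);
    int16x8_t r4 = vld1q_s16(block + 4 * 8);
    int16x8_t r5 = vld1q_s16(block + 5 * 8);
    int16x8_t r6 = vld1q_s16(block + 6 * 8);
    int16x8_t r7 = vld1q_s16(block + 7 * 8);

    /* Sums go to rows 0..3, differences to rows 4..7. */
    vst1q_s16(block + 0 * 8, vaddq_s16(r0, r7));
    vst1q_s16(block + 1 * 8, vaddq_s16(r1, r6));
    vst1q_s16(block + 2 * 8, vaddq_s16(r2, r5));
    vst1q_s16(block + 3 * 8, vaddq_s16(r3, r4));
    vst1q_s16(block + 4 * 8, vsubq_s16(r0, r7));
    vst1q_s16(block + 5 * 8, vsubq_s16(r1, r6));
    vst1q_s16(block + 6 * 8, vsubq_s16(r2, r5));
    vst1q_s16(block + 7 * 8, vsubq_s16(r3, r4));
#else
    /* Portable fallback with the same results, one column at a time. */
    for (int c = 0; c < 8; c++) {
        int16_t v[8];
        for (int i = 0; i < 8; i++)
            v[i] = block[i * 8 + c];   /* copy first: the update is in place */
        for (int i = 0; i < 4; i++) {
            block[i * 8 + c]       = (int16_t)(v[i] + v[7 - i]);
            block[(4 + i) * 8 + c] = (int16_t)(v[i] - v[7 - i]);
        }
    }
#endif
}
```

Building something like this with and without a restriction on the upper d registers, then diffing the generated assembly, is the sort of comparison I have in mind; the real DCT carries far more live values through its later stages than this one stage does.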
Cheers, - Michael