Jonathan -
I'm inviting you to this conversation (and to linaro-mm-sig, if you'd care to participate!), because I'd really like your commentary on what it takes to make write-combining fully effective on various ARMv7 implementations.
The current threads:
http://lists.linaro.org/pipermail/linaro-mm-sig/2011-June/000334.html
http://lists.linaro.org/pipermail/linaro-mm-sig/2011-June/000263.html
Archive link for a related discussion: http://lists.linaro.org/pipermail/linaro-mm-sig/2011-April/000003.html
Getting full write-combining performance on Intel architectures involves a somewhat delicate dance: http://software.intel.com/en-us/articles/copying-accelerated-video-decode-fr...
And I expect something similar to be necessary in order to avoid the read-modify-write penalty for write-combining buffers on ARMv7. (NEON store-multiple operations can fill an entire 64-byte entry in the victim buffer in one opcode; I don't know whether this is enough to stop the L3 memory system from reading the data before clobbering it.)
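To make the question concrete, here is a rough sketch of the kind of whole-entry store loop I have in mind. It is only an illustration under stated assumptions: the 64-byte entry size is the figure mentioned above, and the names wc_copy_lines and WC_LINE are made up for this example. With intrinsics, the compiler decides whether the four 16-byte stores are emitted as a single store-multiple; guaranteeing one opcode (for example a VSTM of eight doubleword registers) would take hand-written assembly. On a build without NEON the sketch falls back to memcpy so it still compiles.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Assumed size of one write-combining entry, per the discussion above. */
enum { WC_LINE = 64 };

/*
 * Hypothetical helper: copy n bytes to dst one full 64-byte line at a time,
 * so that no line is ever written only partially.
 * dst must be 64-byte aligned and n a multiple of 64.
 * Returns 0 on success, -1 if those preconditions are not met.
 */
static int wc_copy_lines(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    size_t off;

    if (((uintptr_t)dst % WC_LINE) != 0 || (n % WC_LINE) != 0)
        return -1;

    for (off = 0; off < n; off += WC_LINE) {
#if defined(__ARM_NEON)
        /* Load the whole line first, then issue the four stores back to back. */
        uint8x16_t q0 = vld1q_u8(s + off);
        uint8x16_t q1 = vld1q_u8(s + off + 16);
        uint8x16_t q2 = vld1q_u8(s + off + 32);
        uint8x16_t q3 = vld1q_u8(s + off + 48);
        vst1q_u8(d + off,      q0);
        vst1q_u8(d + off + 16, q1);
        vst1q_u8(d + off + 32, q2);
        vst1q_u8(d + off + 48, q3);
#else
        /* Portable fallback; makes no claim about how the stores are issued. */
        memcpy(d + off, s + off, WC_LINE);
#endif
    }
    return 0;
}
```

Whether stores issued this way actually avoid the read-before-write on a given ARMv7 implementation is exactly the open question; the sketch only shows the access pattern, not a measured result.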
Cheers,
- Michael