I've created a blueprint covering the basic functions in libav that are implemented as inline assembly: https://blueprints.launchpad.net/gcc-linaro/+spec/investigate-libav-inline-a...
These are a mix of multiplies, clipping, byte swap, and unaligned access. We do OK on about half of them, but at least byte swap and the 32x32 -> top half of 64-bit multiply aren't as good as they could be.
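For reference, here is a rough sketch of what plain C versions of these four operations might look like, i.e. the sort of code we would like the compiler to turn into the same instructions the inline assembly uses. The function names are my own placeholders, not libav's, and the instructions named in the comments are what I would hope for on ARM rather than what GCC is known to emit today.

```c
#include <stdint.h>
#include <string.h>

/* Byte swap of a 32-bit word. Hoped-for result on ARMv6 and later:
 * a single REV. (Hypothetical name, not the libav macro.) */
static inline uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* 32x32 -> top half of the 64-bit product. Hoped-for result: one
 * SMULL (or SMMUL where available), with no extra shifts or moves.
 * Assumes >> on a negative int64_t is an arithmetic shift, as it is
 * with GCC. */
static inline int32_t mulh32(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Clip x into the range [lo, hi]. */
static inline int32_t clip32(int32_t x, int32_t lo, int32_t hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Read a 32-bit value from a possibly unaligned address. memcpy keeps
 * this well defined in C; ideally it becomes a single load on cores
 * that allow unaligned access. */
static inline uint32_t read_unaligned32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Comparing the assembly GCC generates for functions like these against the hand-written versions in libav would be one way to structure the investigation.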
Let's discuss the investigation at the next performance meeting.
-- Michael