On Thu, Aug 18, 2011 at 11:11 AM, Michael Hope <michael.hope@linaro.org> wrote:
On Tue, Aug 16, 2011 at 11:32 PM, Richard Sandiford <richard.sandiford@linaro.org> wrote:
Michael Hope <michael.hope@linaro.org> writes:
I put a build harness around libav and gathered some profiling data. See: bzr branch lp:~linaro-toolchain-dev/+junk/libav-suite
It includes a Makefile that builds a C only, h.264 only decoder and two Creative Commons licensed videos to use as input.
Thanks for putting this together.
README.rst has the basic commands for running ffmpeg and initial perf results showing the hot functions. Dave, 20 % of the time is spent in memcpy() so you might want to have a look.
The vectoriser has no effect. GCC 4.5 is ~17 % faster than 4.6. I'll look into extracting and harnessing the functions themselves later this week.
I had a look at why auto-vectorisation wasn't having much effect. It looks from your profile that most of the hot functions are operating on 16x16 blocks of pixels with an unknown line stride. So the C code looks like:

    for (i = 0; i < 16; i++) {
        x[0] = OP (x[0]);
        ...
        x[15] = OP (x[15]);
        x += stride;
    }
Because of the unknown stride, we're relying on SLP rather than loop-based vectorisation to handle this kind of loop. The problem is that SLP is being run _as_ a loop optimisation. At the moment, the gimple data-ref analysis code assumes that, during a loop optimisation, only simple induction variables are of interest, so it treats all of the x[...] references above as unrepresentable. If I move SLP outside the loop optimisations (just as a proof of concept), then that problem goes away.
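For anyone wanting to reproduce this outside libav, the pattern above can be written out as a small standalone function. This is only a sketch: op() is a hypothetical stand-in for whatever OP expands to, and block16_op() is a made-up name, not something from the libav sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the OP() used in the loop above. */
static inline uint8_t op(uint8_t v)
{
    return (uint8_t)(255 - v);
}

/* Apply op() to a 16x16 block whose rows are `stride` bytes apart.
   The 16 statements in each row are independent and touch adjacent
   bytes, which is the straight-line pattern SLP looks for.  The row
   step is only known at run time. */
void block16_op(uint8_t *x, ptrdiff_t stride)
{
    assert(stride >= 16);   /* rows must not overlap */
    for (int i = 0; i < 16; i++) {
        x[0] = op(x[0]);
        x[1] = op(x[1]);
        x[2] = op(x[2]);
        x[3] = op(x[3]);
        x[4] = op(x[4]);
        x[5] = op(x[5]);
        x[6] = op(x[6]);
        x[7] = op(x[7]);
        x[8] = op(x[8]);
        x[9] = op(x[9]);
        x[10] = op(x[10]);
        x[11] = op(x[11]);
        x[12] = op(x[12]);
        x[13] = op(x[13]);
        x[14] = op(x[14]);
        x[15] = op(x[15]);
        x += stride;
    }
}
```

Building this at -O3 with the vectoriser's report option turned on (-ftree-vectorizer-verbose=N on GCC of this vintage, -fopt-info-vec on later releases) should show whether the row body is picked up by SLP.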
I talked about this with Ira, who said that SLP had been placed where it is because ivopts (a later loop optimisation) obfuscates things too much. As Ira said, we should probably look at (conditionally) removing the assumption that only IVs are of interest during loop optimisations.
Another problem is that SLP supports a much smaller range of optimisations than the loop-based vectoriser. There's no support for promotion, demotion, or conditional expressions. This affects things like the weight_h264_pixels* functions, which contain conditional moves.
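To make the three missing features concrete, here is a rough sketch of a weighting operation that needs all of them. It is not the libav code: weight_row16(), clip_u8() and the parameter names are hypothetical, and it is written as a loop only for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an int to 0..255.  The two comparisons are the conditional
   expressions, which typically end up as conditional moves. */
static inline uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical weighting of one 16-pixel row; not the libav code. */
void weight_row16(uint8_t *px, int weight, int offset, int log2_denom)
{
    assert(log2_denom >= 0 && log2_denom < 16);
    for (int i = 0; i < 16; i++) {
        /* Promotion: the uint8_t pixel is widened to int. */
        int wide = px[i] * weight + offset;
        /* Conditional clamp, then demotion back to uint8_t. */
        px[i] = clip_u8(wide >> log2_denom);
    }
}
```

Each pixel is widened before the multiply, clamped with a pair of conditionals, and narrowed again on the store, so a vectoriser has to handle promotion, conditional expressions and demotion to cover it.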
I had a poke about. GCC isn't too happy about unrolled loops either. put_h264_chroma_mc8_8_c() is defined via a macro in dsputil_template.c and is manually unwound by eight as:
    for(i=0; i<h; i++){\
        OP(dst[0], (A*src[0] + B*src[1] + C*src[stride+0] + D*src[stride+1]));\
        OP(dst[1], (A*src[1] + B*src[2] + C*src[stride+1] + D*src[stride+2]));\
        OP(dst[2], (A*src[2] + B*src[3] + C*src[stride+2] + D*src[stride+3]));\
        OP(dst[3], (A*src[3] + B*src[4] + C*src[stride+3] + D*src[stride+4]));\
        OP(dst[4], (A*src[4] + B*src[5] + C*src[stride+4] + D*src[stride+5]));\
        OP(dst[5], (A*src[5] + B*src[6] + C*src[stride+5] + D*src[stride+6]));\
        OP(dst[6], (A*src[6] + B*src[7] + C*src[stride+6] + D*src[stride+7]));\
        OP(dst[7], (A*src[7] + B*src[8] + C*src[stride+7] + D*src[stride+8]));\
        dst+= stride;\
        src+= stride;\
    }\
where OP is an assignment.
Reducing this to:
    #include <stdint.h>

    #define A 3
    #define B 4

    void unrolled(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
    {
        h /= 8;
        for (int i = 0; i < h; i++) {
            dst[0] = A*src[0] + B*src[0+1];
            dst[1] = A*src[1] + B*src[1+1];
            dst[2] = A*src[2] + B*src[2+1];
            dst[3] = A*src[3] + B*src[3+1];
            dst[4] = A*src[4] + B*src[4+1];
            dst[5] = A*src[5] + B*src[5+1];
            dst[6] = A*src[6] + B*src[6+1];
            dst[7] = A*src[7] + B*src[7+1];
            dst += 8;
            src += 8;
        }
    }

    void plain(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
    {
        for (int i = 0; i < h; i++) {
            dst[i] = A*src[i] + B*src[i+1];
        }
    }
plain() gets vectorised where unrolled() doesn't.
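Before comparing the generated code it is worth confirming that the two variants really compute the same thing. The following is a minimal check, assuming h is a multiple of 8 (unrolled() rounds h down, plain() does not). variants_agree() is a hypothetical helper added for this sketch; the two functions are copied from the reduction above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define A 3
#define B 4

static void unrolled(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
{
    h /= 8;
    for (int i = 0; i < h; i++) {
        dst[0] = A*src[0] + B*src[0+1];
        dst[1] = A*src[1] + B*src[1+1];
        dst[2] = A*src[2] + B*src[2+1];
        dst[3] = A*src[3] + B*src[3+1];
        dst[4] = A*src[4] + B*src[4+1];
        dst[5] = A*src[5] + B*src[5+1];
        dst[6] = A*src[6] + B*src[6+1];
        dst[7] = A*src[7] + B*src[7+1];
        dst += 8;
        src += 8;
    }
}

static void plain(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
{
    for (int i = 0; i < h; i++) {
        dst[i] = A*src[i] + B*src[i+1];
    }
}

/* Hypothetical helper: returns 1 when both variants write the same n
   bytes.  src must hold at least n + 1 bytes because of src[i+1]. */
int variants_agree(uint8_t *src, int n)
{
    uint8_t out_u[256], out_p[256];

    assert(n >= 0 && n <= 256 && n % 8 == 0);
    memset(out_u, 0, sizeof out_u);
    memset(out_p, 0, sizeof out_p);
    unrolled(out_u, src, n);
    plain(out_p, src, n);
    return memcmp(out_u, out_p, (size_t)n) == 0;
}
```

With the outputs known to match, any difference between the two at -O3 comes down to how the vectoriser treats the hand-unrolled body.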
How can I tell the vectoriser that an input is a multiple of something? For example, this code:

    struct image {
        uint8_t d[4096];
    } __attribute__((aligned(128)));

    void fixed(struct image * __restrict dst, struct image * __restrict src, int h)
    {
        for (int i = 0; i < 16; i++) {
            dst->d[i] = A*src->d[i] + B*src->d[i+1];
        }
    }
is lovely with no peeling or argument checking.
I'd like to do a specialisation of a function where I assert that the height is a multiple of 16 without unrolling the loop myself. Something like:
    void multiple(struct image * __restrict dst, struct image * __restrict src, int h)
    {
        h &= ~15;

        for (int i = 0; i < h; i++) {
            dst->d[i] = A*src->d[i] + B*src->d[i+1];
        }
    }
The inner loop looks good but it still includes a prologue that tests for h < vector size and an epilogue that handles any remaining bytes. The epilogue is only a code size problem as it's normally skipped. Still, the skipping requires a branch...
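One possible workaround, offered only as a sketch, is to pass the height in units of 16 and keep a constant-trip-count inner loop like the one in fixed(). by_blocks() and nblocks are hypothetical names, and whether GCC then drops the prologue and epilogue will depend on the compiler version and on what it does with the outer loop.

```c
#include <assert.h>
#include <stdint.h>

#define A 3
#define B 4

struct image {
    uint8_t d[4096];
} __attribute__((aligned(128)));

/* Hypothetical variant of multiple(): the caller passes the height in
   units of 16, so the inner loop always runs exactly 16 times. */
void by_blocks(struct image * __restrict dst,
               struct image * __restrict src, int nblocks)
{
    /* The last read is src->d[nblocks * 16], which must stay in range. */
    assert(nblocks >= 0 && nblocks * 16 < 4096);
    for (int b = 0; b < nblocks; b++) {
        uint8_t *d = dst->d + b * 16;
        const uint8_t *s = src->d + b * 16;
        for (int i = 0; i < 16; i++)
            d[i] = A * s[i] + B * s[i + 1];
    }
}
```

Another thing that might be worth trying is stating the assumption directly in multiple() with `if (h & 15) __builtin_unreachable();`. Whether the vectoriser makes use of that information is untested here.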
-- Michael