Re: Basic libav profiling

17 Aug 2011


On Thu, Aug 18, 2011 at 11:11 AM, Michael Hope michael.hope@linaro.org wrote:
...
On Tue, Aug 16, 2011 at 11:32 PM, Richard Sandiford
richard.sandiford@linaro.org wrote:
...
Michael Hope michael.hope@linaro.org writes:
...
I put a build harness around libav and gathered some profiling data.  See:
 bzr branch lp:~linaro-toolchain-dev/+junk/libav-suite
It includes a Makefile that builds a C only, h.264 only decoder and
two Creative Commons licensed videos to use as input.
Thanks for putting this together.
...
README.rst has the basic commands for running ffmpeg and initial perf
results showing the hot functions.  Dave, 20 % of the time is spent in
memcpy() so you might want to have a look.
The vectoriser has no effect.  GCC 4.5 is ~17 % faster than 4.6.  I'll
look into extracting and harnessing the functions themselves later
this week.
I had a look at why auto-vectorisation wasn't having much effect.
It looks from your profile that most of the hot functions are
operating on 16x16 blocks of pixels with an unknown line stride.
So the C code looks like:
for (i = 0; i < 16; i++)
     {
       x[0] = OP (x[0]);
       ...
       x[15] = OP (x[15]);
       x += stride;
     }
Because of the unknown stride, we're relying on SLP rather than
loop-based vectorisation to handle this kind of loop.  The problem
is that SLP is being run _as_ a loop optimisation.  At the moment,
the gimple data-ref analysis code assumes that, during a loop
optimisation, only simple induction variables are of interest,
so it treats all of the x[...] references above as unrepresentable.
If I move SLP outside the loop optimisations (just as a proof of concept),
then that problem goes away.
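For reference, the loop shape described above can be written out in full as the sketch below. The function name invert_block_16x16 and the choice of OP (a pixel inversion) are assumptions made for illustration; this is not libav code. The point is only that the 16 statements within a row touch contiguous bytes, while consecutive rows are separated by a stride known only at run time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-pixel operation standing in for OP in the mail. */
#define OP(v) ((uint8_t)(255 - (v)))

/* Process a 16x16 block whose rows are `stride` bytes apart.
   The stride is a run-time value, so only the 16 statements inside
   one iteration access adjacent memory. */
void invert_block_16x16(uint8_t *x, ptrdiff_t stride)
{
    for (int i = 0; i < 16; i++) {
        x[0]  = OP(x[0]);
        x[1]  = OP(x[1]);
        x[2]  = OP(x[2]);
        x[3]  = OP(x[3]);
        x[4]  = OP(x[4]);
        x[5]  = OP(x[5]);
        x[6]  = OP(x[6]);
        x[7]  = OP(x[7]);
        x[8]  = OP(x[8]);
        x[9]  = OP(x[9]);
        x[10] = OP(x[10]);
        x[11] = OP(x[11]);
        x[12] = OP(x[12]);
        x[13] = OP(x[13]);
        x[14] = OP(x[14]);
        x[15] = OP(x[15]);
        x += stride;   /* step to the next row */
    }
}
```

Each iteration is one group of 16 isomorphic statements, which is the unit SLP would have to pack into a single vector operation.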
I talked about this with Ira, who said that SLP had been placed
where it is because ivopts (a later loop optimisation) obfuscates
things too much.  As Ira said, we should probably look at (conditionally)
removing the assumption that only IVs are of interest during loop
optimisations.
Another problem is that SLP supports a much smaller range of
optimisations than the loop-based vectoriser.  There's no support
for promotion, demotion, or conditional expressions.  This affects
things like the weight_h264_pixels* functions, which contain
conditional moves.
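To make the three missing features concrete, here is a minimal sketch of a kernel that needs all of them. weight_row8 and clip_uint8 are hypothetical names and the arithmetic is a simplification chosen for illustration, not the actual weight_h264_pixels* code.

```c
#include <stdint.h>

/* Clamp an int to the pixel range 0..255. The two comparisons are the
   conditional expressions in question. */
static inline uint8_t clip_uint8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical weighting of one 8-pixel row (not libav code).
   Each statement promotes a uint8_t to int, does the arithmetic in int,
   clamps with a conditional, and demotes the result back to uint8_t. */
void weight_row8(uint8_t *x, int weight, int offset)
{
    x[0] = clip_uint8(x[0] * weight + offset);
    x[1] = clip_uint8(x[1] * weight + offset);
    x[2] = clip_uint8(x[2] * weight + offset);
    x[3] = clip_uint8(x[3] * weight + offset);
    x[4] = clip_uint8(x[4] * weight + offset);
    x[5] = clip_uint8(x[5] * weight + offset);
    x[6] = clip_uint8(x[6] * weight + offset);
    x[7] = clip_uint8(x[7] * weight + offset);
}
```

A vectoriser that cannot widen the loads, narrow the stores, or express the clamp has to leave all eight statements scalar.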
I had a poke about.  GCC isn't too happy about unrolled loops either.
put_h264_chroma_mc8_8_c() is defined via a macro in dsputil_template.c
and is manually unwound by eight as:
for(i=0; i<h; i++){\
    OP(dst[0], (A*src[0] + B*src[1] + C*src[stride+0] + D*src[stride+1]));\
    OP(dst[1], (A*src[1] + B*src[2] + C*src[stride+1] + D*src[stride+2]));\
    OP(dst[2], (A*src[2] + B*src[3] + C*src[stride+2] + D*src[stride+3]));\
    OP(dst[3], (A*src[3] + B*src[4] + C*src[stride+3] + D*src[stride+4]));\
    OP(dst[4], (A*src[4] + B*src[5] + C*src[stride+4] + D*src[stride+5]));\
    OP(dst[5], (A*src[5] + B*src[6] + C*src[stride+5] + D*src[stride+6]));\
    OP(dst[6], (A*src[6] + B*src[7] + C*src[stride+6] + D*src[stride+7]));\
    OP(dst[7], (A*src[7] + B*src[8] + C*src[stride+7] + D*src[stride+8]));\
    dst+= stride;\
    src+= stride;\
}\
where OP is an assignment.
Reducing this to:
#include <stdint.h>
#define A 3
#define B 4
void unrolled(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
{
   h /= 8;
   for (int i = 0; i < h; i++) {
       dst[0] = A*src[0] + B*src[0+1];
       dst[1] = A*src[1] + B*src[1+1];
       dst[2] = A*src[2] + B*src[2+1];
       dst[3] = A*src[3] + B*src[3+1];
       dst[4] = A*src[4] + B*src[4+1];
       dst[5] = A*src[5] + B*src[5+1];
       dst[6] = A*src[6] + B*src[6+1];
       dst[7] = A*src[7] + B*src[7+1];
       dst += 8;
       src += 8;
   }
}
void plain(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
{
   for (int i = 0; i < h; i++) {
       dst[i] = A*src[i] + B*src[i+1];
   }
}
plain() gets vectorised where unrolled() doesn't.
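The two functions compute the same bytes when h is a multiple of 8, so the difference is purely one of code shape. A small check along the following lines confirms that. same_output and MAX_H are names invented for this sketch, and the two kernels are repeated only so that the block compiles on its own.

```c
#include <stdint.h>
#include <string.h>

#define A 3
#define B 4
#define MAX_H 256   /* assumed upper bound for this sketch */

static void unrolled(uint8_t * __restrict dst, const uint8_t * __restrict src, int h)
{
    h /= 8;
    for (int i = 0; i < h; i++) {
        dst[0] = A*src[0] + B*src[1];
        dst[1] = A*src[1] + B*src[2];
        dst[2] = A*src[2] + B*src[3];
        dst[3] = A*src[3] + B*src[4];
        dst[4] = A*src[4] + B*src[5];
        dst[5] = A*src[5] + B*src[6];
        dst[6] = A*src[6] + B*src[7];
        dst[7] = A*src[7] + B*src[8];
        dst += 8;
        src += 8;
    }
}

static void plain(uint8_t * __restrict dst, const uint8_t * __restrict src, int h)
{
    for (int i = 0; i < h; i++) {
        dst[i] = A*src[i] + B*src[i+1];
    }
}

/* Returns 1 when both variants write identical output for h pixels.
   Rejects h that is negative, above MAX_H, or not a multiple of 8. */
int same_output(int h)
{
    uint8_t src[MAX_H + 1], a[MAX_H], b[MAX_H];

    if (h < 0 || h > MAX_H || h % 8 != 0)
        return 0;
    /* Both kernels read src[h], hence the extra element. */
    for (int i = 0; i <= h; i++)
        src[i] = (uint8_t)(i * 7 + 1);
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
    unrolled(a, src, h);
    plain(b, src, h);
    return memcmp(a, b, (size_t)h) == 0;
}
```

To see what the vectoriser did with each function, GCC releases of that era accepted -ftree-vectorizer-verbose=N; later releases report the same information through -fopt-info-vec.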
How can I tell the vectoriser that an input is a multiple of something?
For example, this code:
struct image
{
    uint8_t d[4096];
} __attribute__((aligned(128)));
void fixed(struct image * __restrict dst, struct image * __restrict src, int h)
{
    for (int i = 0; i < 16; i++) {
        dst->d[i] = A*src->d[i] + B*src->d[i+1];
    }
}
is lovely with no peeling or argument checking.
I'd like to do a specialisation of a function where I assert that the
height is a multiple of 16 without unrolling the loop myself.
Something like:
void multiple(struct image * __restrict dst, struct image * __restrict src, int h)
{
    h &= ~15;
    for (int i = 0; i < h; i++) {
        dst->d[i] = A*src->d[i] + B*src->d[i+1];
    }
}
The inner loop looks good but it still includes a prologue that tests
for h < vector size and an epilogue that handles any remaining bytes.
The epilogue is only a code size problem as it's normally skipped.
Still, the skipping requires a branch...
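One possible way to express the multiple-of-16 constraint without unrolling by hand is to split the loop into blocks, so that the inner loop has the same constant trip count as fixed(). The sketch below is only an illustration of that idea: blocked is a hypothetical name, and whether a given GCC version then drops the prologue and epilogue is not something the sketch establishes.

```c
#include <stdint.h>

#define A 3
#define B 4

struct image
{
    uint8_t d[4096];
} __attribute__((aligned(128)));

/* Hypothetical variant of multiple(): the outer loop walks whole blocks
   of 16 and ignores any remainder, which has the same effect as
   h &= ~15 for non-negative h. The inner loop always runs exactly 16
   times. Caller must keep h <= 4080 so that d[base + i + 1] stays in
   bounds. */
void blocked(struct image * __restrict dst, const struct image * __restrict src, int h)
{
    for (int base = 0; base + 16 <= h; base += 16) {
        for (int i = 0; i < 16; i++) {
            dst->d[base + i] = A*src->d[base + i] + B*src->d[base + i + 1];
        }
    }
}
```

The outer loop still needs its own exit test, so this trades the vector-size check for a block-count check rather than removing branches altogether.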
-- Michael
