Basic libav profiling


Michael Hope

16 Aug 2011

4:44 a.m.

I put a build harness around libav and gathered some profiling data. See: bzr branch lp:~linaro-toolchain-dev/+junk/libav-suite

It includes a Makefile that builds a C-only, H.264-only decoder, and two Creative Commons licensed videos to use as input.

README.rst has the basic commands for running ffmpeg and initial perf results showing the hot functions. Dave, 20 % of the time is spent in memcpy() so you might want to have a look.

The vectoriser has no effect. GCC 4.5 is ~17 % faster than 4.6. I'll look into extracting and harnessing the functions themselves later this week.

-- Michael


Richard Sandiford

16 Aug 2011

11:32 a.m.

Michael Hope michael.hope@linaro.org writes:

> I put a build harness around libav and gathered some profiling data. See: bzr branch lp:~linaro-toolchain-dev/+junk/libav-suite
>
> It includes a Makefile that builds a C only, h.264 only decoder and two Creative Commons licensed videos to use as input.

Thanks for putting this together.

> README.rst has the basic commands for running ffmpeg and initial perf results showing the hot functions. Dave, 20 % of the time is spent in memcpy() so you might want to have a look.
>
> The vectoriser has no effect. GCC 4.5 is ~17 % faster than 4.6. I'll look into extracting and harnessing the functions themselves later this week.

I had a look at why auto-vectorisation wasn't having much effect. It looks from your profile that most of the hot functions are operating on 16x16 blocks of pixels with an unknown line stride. So the C code looks like:

    for (i = 0; i < 16; i++)
      {
        x[0] = OP (x[0]);
        ...
        x[15] = OP (x[15]);
        x += stride;
      }

Because of the unknown stride, we're relying on SLP rather than loop-based vectorisation to handle this kind of loop. The problem is that SLP is being run _as_ a loop optimisation. At the moment, the gimple data-ref analysis code assumes that, during a loop optimisation, only simple induction variables are of interest, so it treats all of the x[...] references above as unrepresentable. If I move SLP outside the loop optimisations (just as a proof of concept), then that problem goes away.
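
For illustration, here is a minimal compilable sketch of that loop shape next to a rolled equivalent. The OP macro (a halving of the pixel value) and the names block_unrolled and block_rolled are assumptions made for this example, not code from libav. Both functions touch the same 16 bytes on each of 16 rows; only the first presents the row as straight-line statements, which is the form that falls to SLP.

```c
#include <stdint.h>

/* Hypothetical stand-in for OP: halve the pixel value. */
#define OP(v) ((uint8_t)((v) >> 1))

/* Hand-unrolled row: sixteen straight-line statements per row, with a
   row-to-row step (stride) that is only known at run time. */
void block_unrolled(uint8_t *x, int stride)
{
    for (int i = 0; i < 16; i++) {
        x[0]  = OP(x[0]);   x[1]  = OP(x[1]);
        x[2]  = OP(x[2]);   x[3]  = OP(x[3]);
        x[4]  = OP(x[4]);   x[5]  = OP(x[5]);
        x[6]  = OP(x[6]);   x[7]  = OP(x[7]);
        x[8]  = OP(x[8]);   x[9]  = OP(x[9]);
        x[10] = OP(x[10]);  x[11] = OP(x[11]);
        x[12] = OP(x[12]);  x[13] = OP(x[13]);
        x[14] = OP(x[14]);  x[15] = OP(x[15]);
        x += stride;
    }
}

/* Rolled equivalent: the row is an inner loop with a known trip count. */
void block_rolled(uint8_t *x, int stride)
{
    for (int i = 0; i < 16; i++) {
        for (int j = 0; j < 16; j++)
            x[j] = OP(x[j]);
        x += stride;
    }
}
```

Whether either form is actually vectorised depends on the compiler version and flags, so it is worth checking the vectoriser's report (-ftree-vectorizer-verbose=N on GCC 4.x, -fopt-info-vec on later releases) rather than assuming.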

I talked about this with Ira, who said that SLP had been placed where it is because ivopts (a later loop optimisation) obfuscates things too much. As Ira said, we should probably look at (conditionally) removing the assumption that only IVs are of interest during loop optimisations.

Another problem is that SLP supports a much smaller range of optimisations than the loop-based vectoriser. There's no support for promotion, demotion, or conditional expressions. This affects things like the weight_h264_pixels* functions, which contain conditional moves.
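
To show why those three gaps matter together, here is a hedged sketch of a weighting row in the general style of such functions. The names clip_uint8 and weight_row8 and the exact arithmetic are assumptions for illustration, not the libav source. A single statement needs promotion (uint8_t to int for the multiply), a conditional expression (the clip), and demotion (int back to uint8_t for the store).

```c
#include <stdint.h>

/* Clamp to the 0..255 pixel range. The two comparisons are the
   conditional expressions; compilers often turn them into conditional moves. */
static inline uint8_t clip_uint8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical 8-pixel weighting row. */
void weight_row8(uint8_t *block, int weight, int offset, int log2_denom)
{
    for (int x = 0; x < 8; x++) {
        /* Promotion: block[x] is widened to int before the multiply. */
        int v = (block[x] * weight + offset) >> log2_denom;
        /* Conditional clip, then demotion back to uint8_t on the store. */
        block[x] = clip_uint8(v);
    }
}
```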

So, maybe some nice areas for future work.

Richard

Michael Hope

17 Aug 2011

11:11 p.m.

I had a poke about. GCC isn't too happy about unrolled loops either. put_h264_chroma_mc8_8_c() is defined via a macro in dsputil_template.c and is manually unwound by eight as:

    for(i=0; i<h; i++){\
        OP(dst[0], (A*src[0] + B*src[1] + C*src[stride+0] + D*src[stride+1]));\
        OP(dst[1], (A*src[1] + B*src[2] + C*src[stride+1] + D*src[stride+2]));\
        OP(dst[2], (A*src[2] + B*src[3] + C*src[stride+2] + D*src[stride+3]));\
        OP(dst[3], (A*src[3] + B*src[4] + C*src[stride+3] + D*src[stride+4]));\
        OP(dst[4], (A*src[4] + B*src[5] + C*src[stride+4] + D*src[stride+5]));\
        OP(dst[5], (A*src[5] + B*src[6] + C*src[stride+5] + D*src[stride+6]));\
        OP(dst[6], (A*src[6] + B*src[7] + C*src[stride+6] + D*src[stride+7]));\
        OP(dst[7], (A*src[7] + B*src[8] + C*src[stride+7] + D*src[stride+8]));\
        dst+= stride;\
        src+= stride;\
    }\

where OP is an assignment.

Reducing this to:

    #include <stdint.h>

    #define A 3
    #define B 4

    /* Hand-unrolled by eight: h / 8 iterations of eight stores each. */
    void unrolled(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
    {
      h /= 8;
      for (int i = 0; i < h; i++) {
        dst[0] = A*src[0] + B*src[0+1];
        dst[1] = A*src[1] + B*src[1+1];
        dst[2] = A*src[2] + B*src[2+1];
        dst[3] = A*src[3] + B*src[3+1];
        dst[4] = A*src[4] + B*src[4+1];
        dst[5] = A*src[5] + B*src[5+1];
        dst[6] = A*src[6] + B*src[6+1];
        dst[7] = A*src[7] + B*src[7+1];
        dst += 8;
        src += 8;
      }
    }

    /* Same arithmetic, one store per iteration. */
    void plain(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
    {
      for (int i = 0; i < h; i++) {
        dst[i] = A*src[i] + B*src[i+1];
      }
    }

plain() gets vectorised where unrolled() doesn't.
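
Before comparing the generated code for two variants like these, it can help to confirm that they really compute the same bytes. The following is a minimal sketch of such a check; kernel_fn, kernels_agree, plain_ref, unrolled_ref, MAX_H and the STEP macro are all names invented for this example. The two kernels restate the reduced test case so that the block compiles on its own, and the comparison is only meaningful when h is a multiple of 8, because the unrolled form drops any remainder.

```c
#include <stdint.h>
#include <string.h>

#define A 3
#define B 4

typedef void (*kernel_fn)(uint8_t * __restrict dst, uint8_t * __restrict src, int h);

/* One element per iteration. */
void plain_ref(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
{
    for (int i = 0; i < h; i++)
        dst[i] = A*src[i] + B*src[i+1];
}

/* Eight elements per iteration; STEP only keeps the listing short. */
#define STEP(k) dst[k] = A*src[k] + B*src[(k)+1]
void unrolled_ref(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
{
    h /= 8;
    for (int i = 0; i < h; i++) {
        STEP(0); STEP(1); STEP(2); STEP(3);
        STEP(4); STEP(5); STEP(6); STEP(7);
        dst += 8;
        src += 8;
    }
}

enum { MAX_H = 256 };

/* Returns 1 if f and g write identical output for h input elements. */
int kernels_agree(kernel_fn f, kernel_fn g, int h)
{
    uint8_t src[MAX_H + 1], out_f[MAX_H], out_g[MAX_H];

    if (h < 0 || h > MAX_H)
        return 0;
    for (int i = 0; i <= MAX_H; i++)
        src[i] = (uint8_t)(i * 37 + 11);   /* arbitrary test pattern */
    memset(out_f, 0, sizeof out_f);
    memset(out_g, 0, sizeof out_g);
    f(out_f, src, h);
    g(out_g, src, h);
    return memcmp(out_f, out_g, sizeof out_f) == 0;
}
```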

-- Michael

Michael Hope

11:43 p.m.

How can I tell the vectoriser that an input is a multiple of something? For example, this code:

    struct image {
      uint8_t d[4096];
    } __attribute__((aligned(128)));

    void fixed(struct image * __restrict dst, struct image * __restrict src, int h)
    {
      for (int i = 0; i < 16; i++) {
        dst->d[i] = A*src->d[i] + B*src->d[i+1];
      }
    }

is lovely with no peeling or argument checking.

I'd like to do a specialisation of a function where I assert that the height is a multiple of 16 without unrolling the loop myself. Something like:

    void multiple(struct image * __restrict dst, struct image * __restrict src, int h)
    {
      h &= ~15;

      for (int i = 0; i < h; i++) {
        dst->d[i] = A*src->d[i] + B*src->d[i+1];
      }
    }

The inner loop looks good but it still includes a prologue that tests for h < vector size and an epilogue that handles any remaining bytes. The epilogue is only a code size problem as it's normally skipped. Still, the skipping requires a branch...
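
As a rough picture of that structure, the sketch below writes out by hand the shape a vectorised loop with an unknown trip count tends to take. It is an illustration in plain C, not GCC's actual output; strip_mined and the vectorisation factor VF of 16 are assumptions for the example, and the inner j loop merely stands in for one vector operation.

```c
#include <stdint.h>

#define A 3
#define B 4
#define VF 16   /* assumed vectorisation factor: 16 bytes per vector */

void strip_mined(uint8_t * __restrict dst, const uint8_t * __restrict src, int h)
{
    int i = 0;

    /* Guard: skip the vector loop when there is less than one full
       vector of work (the "h < vector size" test). */
    if (h >= VF) {
        int vec_end = h - (h % VF);
        for (; i < vec_end; i += VF)
            for (int j = 0; j < VF; j++)   /* stands in for one vector op */
                dst[i + j] = A*src[i + j] + B*src[i + j + 1];
    }

    /* Epilogue: scalar clean-up for the remaining h % VF elements.
       It never runs when h is a multiple of VF, but the test that
       skips it is still a branch in the generated code. */
    for (; i < h; i++)
        dst[i] = A*src[i] + B*src[i + 1];
}
```

If the compiler could prove that h is a non-zero multiple of VF, both the guard and the epilogue would be dead code, which is the specialisation being asked for here.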

-- Michael

Ira Rosen

18 Aug 2011

5:56 a.m.

On 18 August 2011 02:43, Michael Hope michael.hope@linaro.org wrote:

> How can I tell the vectoriser that an input is a multiple of something?

Unfortunately, I don't think you can.

> For example, this code:
>
>     struct image {
>       uint8_t d[4096];
>     } __attribute__((aligned(128)));
>
>     void fixed(struct image * __restrict dst, struct image * __restrict src, int h)
>     {
>       for (int i = 0; i < 16; i++) {
>         dst->d[i] = A*src->d[i] + B*src->d[i+1];
>       }
>     }
>
> is lovely with no peeling or argument checking.
>
> I'd like to do a specialisation of a function where I assert that the height is a multiple of 16 without unrolling the loop myself. Something like:
>
>     void multiple(struct image * __restrict dst, struct image * __restrict src, int h)
>     {
>       h &= ~15;
>
>       for (int i = 0; i < h; i++) {
>         dst->d[i] = A*src->d[i] + B*src->d[i+1];
>       }
>     }
>
> The inner loop looks good but it still includes a prologue that tests for h < vector size and an epilogue that handles any remaining bytes. The epilogue is only a code size problem as it's normally skipped. Still, the skipping requires a branch...

Yes, that would be a nice feature, although I think such hints are rare.

Ira


linaro-toolchain mailing list linaro-toolchain@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-toolchain

Andrew Stubbs

9:45 a.m.

On 18/08/11 06:56, Ira Rosen wrote:

>> How can I tell the vectoriser that an input is a multiple of something?
>
> Unfortunately, I don't think you can.

I think you can do something like this:

    void multiple(struct image * __restrict dst, struct image * __restrict src, int h)
    {
      if (h & 0xf)
        __gcc_unreachable ();

      for (int i = 0; i < h; i++) {
        dst->d[i] = A*src->d[i] + B*src->d[i+1];
      }
    }

[Just off the top of my head - you'd have to check the syntax for gcc_unreachable.]

That should allow the value range propagation to do the right thing whilst inserting no real code, but whether that's properly hooked into vectorization I have no idea.
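
On the syntax question: the spelling GCC provides is __builtin_unreachable() (available since GCC 4.5). A compilable version of the idea might look like the sketch below, which reuses the struct and constants from earlier in the thread; the const qualifier on src is an addition for this example. Reaching the builtin is undefined behaviour, so callers must really guarantee that h is a multiple of 16, and whether any given GCC version uses the hint when vectorising is a separate question.

```c
#include <stdint.h>

#define A 3
#define B 4

struct image {
    uint8_t d[4096];
} __attribute__((aligned(128)));

void multiple(struct image * __restrict dst, const struct image * __restrict src, int h)
{
    /* Promise to the compiler: h is a multiple of 16. Calling this
       with any other h is undefined behaviour. */
    if (h & 0xf)
        __builtin_unreachable();

    for (int i = 0; i < h; i++) {
        dst->d[i] = A*src->d[i] + B*src->d[i+1];
    }
}
```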

Andrew

Ira Rosen

10:16 a.m.

On 18 August 2011 12:45, Andrew Stubbs andrew.stubbs@linaro.org wrote:

> That should allow the value range propagation to do the right thing whilst inserting no real code, but whether that's properly hooked into vectorization I have no idea?

Yes, the problem is that the vectorizer (or more precisely loop iteration analysis in tree-ssa-loop-niter.c) doesn't use this information.

Ira


Richard Sandiford

8:21 a.m.

Michael Hope michael.hope@linaro.org writes:

> I had a poke about. GCC isn't too happy about unrolled loops either.

Right. Sorry, I should have been clearer, but this hand-unrolling was the trigger for this loop being SLP's job, rather than the normal loop vectoriser's. So the loop above was exactly the kind of loop you describe (OP was the same for each x[...]).

SLP should still (in theory) be able to optimise the loop body as straight-line code. The problem is that it doesn't yet support the same range of operations.

Richard

Revital1 Eres

7:04 a.m.

Hi,

> I put a build harness around libav and gathered some profiling data. See: bzr branch lp:~linaro-toolchain-dev/+junk/libav-suite

Thanks!

> README.rst has the basic commands for running ffmpeg and initial perf results showing the hot functions. Dave, 20 % of the time is spent in memcpy() so you might want to have a look.

FWIW I usually get suspicious when the profiling info shows that helper functions are the hottest. It might be that a bigger input should be used to stress the functions that do the real computation.

Thanks, Revital
