I ran my usual set of benchmarks of libav compiled with the current gcc releases (hand-written assembly disabled). The results are in this spreadsheet: https://docs.google.com/spreadsheet/ccc?key=0AguHvNGaLXy9dHExeWZ1YWZ1c0s2Vnp...
First the good news: almost everything is faster with 4.6+ than with linaro-4.5.
The bad news is that some things have regressed since 4.6, even if not all the way back to 4.5 levels. A few especially problematic pieces stand out:
- The mp3 test performs 5-15% worse. This regression is (mostly) attributable to the ff_mpadsp_apply_window_fixed [1] function. We have looked at this one before.
- FLAC is 9% slower in upstream 4.7/4.8 compared to Linaro releases. Here flac_lpc_16_c [2] and flac_decorrelate_indep_c_16 [3] are mainly to blame.
- MPEG2/MPEG4 decoding is ~10% slower with vectorisation turned on. The culprit here is the ff_simple_idct_8_c [4] function.
- H.264 and DTS seem 1-2% slower, although this could be just noise.
- Code size has increased by ~10% in all post-4.6 releases.
All builds were compiled with -O3 -mcpu=cortex-a9. The vectorised builds all use -fvect-cost-model. Without this flag the results are much worse.
[1] http://git.libav.org/?p=libav.git%3Ba=blob%3Bf=libavcodec/mpegaudiodsp_templ...
[2] http://git.libav.org/?p=libav.git%3Ba=blob%3Bf=libavcodec/flacdsp.c%3Bh=a2e3...
[3] http://git.libav.org/?p=libav.git%3Ba=blob%3Bf=libavcodec/flacdsp_template.c...
[4] http://git.libav.org/?p=libav.git%3Ba=blob%3Bf=libavcodec/simple_idct_templa...
Thanks for doing this yet again.
On 6 July 2012 16:52, Mans Rullgard mans.rullgard@linaro.org wrote:
I ran my usual set of benchmarks of libav compiled with the current gcc releases (hand-written assembly disabled). The results are in this spreadsheet: https://docs.google.com/spreadsheet/ccc?key=0AguHvNGaLXy9dHExeWZ1YWZ1c0s2Vnp...
First the good news, almost everything is faster with 4.6+ than with linaro-4.5.
The bad news is that some things have regressed since 4.6, even if not all the way back to 4.5 levels. A few especially problematic pieces stand out:
Could you pull out the pre-processed output?
- The mp3 test performs 5-15% worse. This regression is (mostly) attributable to the ff_mpadsp_apply_window_fixed [1] function. We have looked at this one before.
Yes, we do need to create a testcase out of this one and work through it.
- FLAC is 9% slower in upstream 4.7/4.8 compared to Linaro releases. Here flac_lpc_16_c [2] and flac_decorrelate_indep_c_16 [3] are mainly to blame.
I'll look at these when I do the sched-pressure stuff upstream.
regards, Ramana
On 6 July 2012 16:52, Mans Rullgard mans.rullgard@linaro.org wrote:
FLAC is 9% slower in upstream 4.7/4.8 compared to Linaro releases. Here flac_lpc_16_c [2] and flac_decorrelate_indep_c_16 [3] are mainly to blame.
Looking at this in the middle of the summit: in the flac_lpc_16_c code, in the vectorized case, could you take a look with perf and say which part is hot?
Is it the top-level nested loop over i and j, or is it the loop that does a summation when i < len?
The non-vectorized case looks interesting because it might be fallout from sched-pressure.
linaro-toolchain mailing list linaro-toolchain@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-toolchain
On 10 July 2012 14:57, Ramana Radhakrishnan ramana.radhakrishnan@linaro.org wrote:
Looking at this in the middle of the summit: in the flac_lpc_16_c code, in the vectorized case, could you take a look with perf and say which part is hot?
Is it the top-level nested loop over i and j, or is it the loop that does a summation when i < len?
The non-vectorized case looks interesting because it might be fallout from sched-pressure.
Here's the perf annotate output for that function from 4.8 trunk with vectorisation enabled:
Percent | Source code & Disassembly of avconv ------------------------------------------------ : : : : Disassembly of section .text: : : 002aa55c <flac_lpc_16_c>: : #define SAMPLE_SIZE 32 : #include "flacdsp_template.c" : : static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32], : int pred_order, int qlevel, int len) : { 0.02 : 2aa55c: push {r4, r5, r6, r7, r8, r9, sl, fp} 0.00 : 2aa560: sub sp, sp, #80 ; 0x50 0.00 : 2aa564: str r0, [sp, #68] ; 0x44 : int i, j; : : for (i = pred_order; i < len - 1; i += 2) { 0.00 : 2aa568: ldr r0, [sp, #112] ; 0x70 : #define SAMPLE_SIZE 32 : #include "flacdsp_template.c" : : static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32], : int pred_order, int qlevel, int len) : { 0.00 : 2aa56c: str r2, [sp, #60] ; 0x3c 0.00 : 2aa570: str r1, [sp, #52] ; 0x34 : int i, j; : : for (i = pred_order; i < len - 1; i += 2) { 0.00 : 2aa574: sub r0, r0, #1 : #define SAMPLE_SIZE 32 : #include "flacdsp_template.c" : : static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32], : int pred_order, int qlevel, int len) : { 0.00 : 2aa578: str r3, [sp, #56] ; 0x38 : int i, j; : : for (i = pred_order; i < len - 1; i += 2) { 0.00 : 2aa57c: cmp r2, r0 0.00 : 2aa580: str r0, [sp, #72] ; 0x48 0.00 : 2aa584: bge 2aa93c <flac_lpc_16_c+0x3e0> : : #undef SAMPLE_SIZE : #define SAMPLE_SIZE 32 : #include "flacdsp_template.c" : : static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32], 0.00 : 2aa588: add r3, r2, #4 0.00 : 2aa58c: mov r8, r2 0.00 : 2aa590: lsl r3, r3, #2 0.00 : 2aa594: sub r2, r2, #10 0.00 : 2aa598: ldr sl, [sp, #68] ; 0x44 0.00 : 2aa59c: bic r2, r2, #7 0.00 : 2aa5a0: ldr ip, [sp, #68] ; 0x44 0.00 : 2aa5a4: mov r0, r1 0.00 : 2aa5a8: rsb r2, r2, r8 0.00 : 2aa5ac: sub r1, r3, #16 0.00 : 2aa5b0: sub r9, r8, #1 0.00 : 2aa5b4: add sl, sl, #16 0.00 : 2aa5b8: add r3, ip, r3 0.00 : 2aa5bc: add r1, r0, r1 0.00 : 2aa5c0: sub r2, r2, #9 0.00 : 2aa5c4: str r9, [sp, #64] ; 0x40 0.00 : 2aa5c8: str sl, [sp, #44] ; 0x2c 0.00 : 2aa5cc: 
str r3, [sp, #36] ; 0x24 0.00 : 2aa5d0: str r1, [sp, #76] ; 0x4c 0.00 : 2aa5d4: str r2, [sp, #40] ; 0x28 0.00 : 2aa5d8: str r8, [sp, #48] ; 0x30 : : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { 0.00 : 2aa5dc: ldr r1, [sp, #64] ; 0x40 1.03 : 2aa5e0: ldr r2, [sp, #44] ; 0x2c 0.00 : 2aa5e4: cmp r1, #0 0.96 : 2aa5e8: pld [r2] : { : int i, j; : : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; 0.00 : 2aa5ec: ldr r4, [r2, #-16] : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { 1.19 : 2aa5f0: ble 2aa930 <flac_lpc_16_c+0x3d4> 0.00 : 2aa5f4: ldr r8, [sp, #60] ; 0x3c 1.68 : 2aa5f8: mov r3, #0 0.00 : 2aa5fc: cmp r8, #9 1.37 : 2aa600: ble 2aa924 <flac_lpc_16_c+0x3c8> : : #undef SAMPLE_SIZE : #define SAMPLE_SIZE 32 : #include "flacdsp_template.c" : : static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32], 0.00 : 2aa604: ldr r0, [sp, #76] ; 0x4c 0.70 : 2aa608: mov r5, r1 0.00 : 2aa60c: sub r6, r2, #16 : int i, j; : : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; 0.56 : 2aa610: mov fp, r3 0.00 : 2aa614: mov r1, r4 0.67 : 2aa618: mov r7, r5 : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; 0.00 : 2aa61c: ldr r8, [r0, #-4] : : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { 2.31 : 2aa620: sub r7, r7, #8 : c = coeffs[j]; : s0 += c*d; : d = decoded[i-j]; 0.00 : 2aa624: ldr r5, [r6, #4] 0.61 : 2aa628: pld [r0, #-64] ; 0x40 0.00 : 2aa62c: ldr r9, [r6, #8] 0.58 : 2aa630: pld [r6, #64] ; 0x40 : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; 0.00 : 2aa634: ldr r2, [r0, #-8] 0.72 : 2aa638: ldr ip, [r0, #-12] : s0 += c*d; 0.00 : 2aa63c: mla r1, r1, r8, 
fp : d = decoded[i-j]; 0.00 : 2aa640: str r9, [sp, #16] 1.30 : 2aa644: ldr r9, [r6, #12] : s1 += c*d; 0.00 : 2aa648: mla fp, r5, r8, r3 : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; : s0 += c*d; : d = decoded[i-j]; 0.79 : 2aa64c: ldr r3, [r6, #20] : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; 1.75 : 2aa650: str ip, [sp, #8] : s0 += c*d; : d = decoded[i-j]; 0.00 : 2aa654: str r9, [sp, #20] : s1 += c*d; 0.54 : 2aa658: ldr r9, [sp, #16] : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; : s0 += c*d; 0.00 : 2aa65c: mla r8, r2, r5, r1 : d = decoded[i-j]; 0.00 : 2aa660: str r3, [sp, #28] : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; : s0 += c*d; 1.23 : 2aa664: ldr r3, [sp, #8] 0.00 : 2aa668: ldr sl, [sp, #40] ; 0x28 : d = decoded[i-j]; : s1 += c*d; 0.70 : 2aa66c: mla r2, r9, r2, fp : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; 0.00 : 2aa670: ldr ip, [r0, #-20] 1.23 : 2aa674: ldr r5, [r0, #-24] 0.00 : 2aa678: cmp r7, sl 0.70 : 2aa67c: ldr sl, [r0, #-16] : s0 += c*d; 0.00 : 2aa680: mla r8, r3, r9, r8 : d = decoded[i-j]; 0.00 : 2aa684: ldr r4, [r6, #24] : s1 += c*d; 0.96 : 2aa688: str r2, [sp, #32] 0.00 : 2aa68c: ldr r9, [sp, #32] : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; 0.54 : 2aa690: str sl, [sp, #24] : s0 += c*d; 0.00 : 2aa694: str r8, [sp, #16] : d = decoded[i-j]; : s1 += c*d; 0.58 : 2aa698: ldr r8, [sp, #20] : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = 
coeffs[j]; : s0 += c*d; : d = decoded[i-j]; 0.00 : 2aa69c: ldr sl, [r6, #16] 0.58 : 2aa6a0: ldr fp, [r6, #28] : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; 0.00 : 2aa6a4: str ip, [sp, #4] : s0 += c*d; : d = decoded[i-j]; : s1 += c*d; 1.17 : 2aa6a8: mla r8, r8, r3, r9 : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; : s0 += c*d; 0.54 : 2aa6ac: ldr r9, [sp, #16] 1.19 : 2aa6b0: ldr r3, [sp, #20] : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; 0.00 : 2aa6b4: ldr ip, [r0, #-28] : s0 += c*d; : d = decoded[i-j]; 1.28 : 2aa6b8: ldr r1, [r6, #32]! : s1 += c*d; 0.00 : 2aa6bc: str r8, [sp, #32] : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; : s0 += c*d; 0.63 : 2aa6c0: ldr r8, [sp, #24] : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; 0.00 : 2aa6c4: ldr r2, [r0, #-32]! 
: s0 += c*d; 0.72 : 2aa6c8: mla r3, r8, r3, r9 0.52 : 2aa6cc: str r3, [sp, #16] : d = decoded[i-j]; : s1 += c*d; 1.28 : 2aa6d0: ldr r3, [sp, #32] 0.00 : 2aa6d4: mla r9, sl, r8, r3 : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; : s0 += c*d; 0.47 : 2aa6d8: ldr r8, [sp, #4] 1.44 : 2aa6dc: ldr r3, [sp, #16] 0.00 : 2aa6e0: mla sl, r8, sl, r3 : d = decoded[i-j]; : s1 += c*d; 0.00 : 2aa6e4: ldr r3, [sp, #28] 1.66 : 2aa6e8: mla r9, r3, r8, r9 : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; : s0 += c*d; 0.61 : 2aa6ec: mla r3, r5, r3, sl : d = decoded[i-j]; : s1 += c*d; 0.61 : 2aa6f0: mla r5, r4, r5, r9 : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; : s0 += c*d; 0.72 : 2aa6f4: mla r4, ip, r4, r3 : d = decoded[i-j]; : s1 += c*d; 0.94 : 2aa6f8: mla ip, fp, ip, r5 : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; : s0 += c*d; 2.80 : 2aa6fc: mla fp, r2, fp, r4 : d = decoded[i-j]; : s1 += c*d; 1.30 : 2aa700: mla r3, r1, r2, ip 1.26 : 2aa704: bne 2aa61c <flac_lpc_16_c+0xc0> 1.79 : 2aa708: mov r4, r1 0.00 : 2aa70c: mov r5, r7 : : #undef SAMPLE_SIZE : #define SAMPLE_SIZE 32 : #include "flacdsp_template.c" : : static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32], 1.03 : 2aa710: ldr r8, [sp, #48] ; 0x30 0.00 : 2aa714: add r0, r5, #1 1.73 : 2aa718: ldr r9, [sp, #52] ; 0x34 0.00 : 2aa71c: ldr sl, [sp, #68] ; 0x44 1.88 : 2aa720: rsb r1, r5, r8 0.00 : 2aa724: sub r1, r1, #-1073741823 ; 0xc0000001 2.00 : 2aa728: add r0, r9, r0, lsl #2 0.00 : 2aa72c: add r1, sl, r1, lsl #2 : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { : c = coeffs[j]; 8.21 : 2aa730: ldr r2, [r0, #-4]! 
: : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { 0.00 : 2aa734: sub r5, r5, #1 3.93 : 2aa738: cmp r5, #0 : c = coeffs[j]; : s0 += c*d; 0.00 : 2aa73c: mla fp, r4, r2, fp : d = decoded[i-j]; 1.77 : 2aa740: ldr r4, [r1, #4]! : s1 += c*d; 9.29 : 2aa744: mla r3, r4, r2, r3 : : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { 4.04 : 2aa748: bgt 2aa730 <flac_lpc_16_c+0x1d4> : c = coeffs[j]; : s0 += c*d; : d = decoded[i-j]; : s1 += c*d; : } : c = coeffs[0]; 2.89 : 2aa74c: ldr ip, [sp, #52] ; 0x34 : s0 += c*d; : d = decoded[i] += s0 >> qlevel; 0.04 : 2aa750: ldr r8, [sp, #36] ; 0x24 : static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32], : int pred_order, int qlevel, int len) : { : int i, j; : : for (i = pred_order; i < len - 1; i += 2) { 0.99 : 2aa754: ldr r0, [sp, #48] ; 0x30 : c = coeffs[j]; : s0 += c*d; : d = decoded[i-j]; : s1 += c*d; : } : c = coeffs[0]; 0.07 : 2aa758: ldr r1, [ip] : s0 += c*d; : d = decoded[i] += s0 >> qlevel; 1.17 : 2aa75c: ldr r2, [r8, #-16] 0.09 : 2aa760: pld [r8] 1.32 : 2aa764: ldr ip, [sp, #56] ; 0x38 : static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32], : int pred_order, int qlevel, int len) : { : int i, j; : : for (i = pred_order; i < len - 1; i += 2) { 0.04 : 2aa768: add r0, r0, #2 1.30 : 2aa76c: ldr r9, [sp, #72] ; 0x48 : s0 += c*d; : d = decoded[i-j]; : s1 += c*d; : } : c = coeffs[0]; : s0 += c*d; 0.02 : 2aa770: mla fp, r4, r1, fp : static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32], : int pred_order, int qlevel, int len) : { : int i, j; : : for (i = pred_order; i < len - 1; i += 2) { 0.38 : 2aa774: str r0, [sp, #48] ; 0x30 2.96 : 2aa778: ldr sl, [sp, #44] ; 0x2c 0.00 : 2aa77c: cmp r0, r9 : } : c = coeffs[0]; : s0 += c*d; : d = decoded[i] += s0 >> qlevel; : s1 += c*d; : decoded[i+1] += s1 >> qlevel; 1.14 
: 2aa780: ldr r0, [r8, #-12] : d = decoded[i-j]; : s1 += c*d; : } : c = coeffs[0]; : s0 += c*d; : d = decoded[i] += s0 >> qlevel; 0.00 : 2aa784: add fp, r2, fp, asr ip 0.04 : 2aa788: add sl, sl, #8 2.42 : 2aa78c: str sl, [sp, #44] ; 0x2c : s1 += c*d; 0.00 : 2aa790: mla r3, fp, r1, r3 : d = decoded[i-j]; : s1 += c*d; : } : c = coeffs[0]; : s0 += c*d; : d = decoded[i] += s0 >> qlevel; 0.00 : 2aa794: str fp, [r8, #-16] : s1 += c*d; : decoded[i+1] += s1 >> qlevel; 2.47 : 2aa798: add r3, r0, r3, asr ip 1.05 : 2aa79c: str r3, [r8, #-12] 2.13 : 2aa7a0: add r8, r8, #8 0.00 : 2aa7a4: str r8, [sp, #36] ; 0x24 : static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32], : int pred_order, int qlevel, int len) : { : int i, j; : : for (i = pred_order; i < len - 1; i += 2) { 0.94 : 2aa7a8: blt 2aa5dc <flac_lpc_16_c+0x80> : : #undef SAMPLE_SIZE : #define SAMPLE_SIZE 32 : #include "flacdsp_template.c" : : static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32], 0.00 : 2aa7ac: ldr r8, [sp, #112] ; 0x70 0.00 : 2aa7b0: ldr r9, [sp, #60] ; 0x3c 0.00 : 2aa7b4: sub r3, r8, #2 0.00 : 2aa7b8: mov sl, r8 0.00 : 2aa7bc: rsb r3, r9, r3 0.00 : 2aa7c0: add r4, r9, #2 0.00 : 2aa7c4: bic r3, r3, #1 0.00 : 2aa7c8: add r4, r4, r3 : s0 += c*d; : d = decoded[i] += s0 >> qlevel; : s1 += c*d; : decoded[i+1] += s1 >> qlevel; : } : if (i < len) { 0.00 : 2aa7cc: cmp sl, r4 0.00 : 2aa7d0: ble 2aa918 <flac_lpc_16_c+0x3bc> : int sum = 0; : for (j = 0; j < pred_order; j++) 0.00 : 2aa7d4: ldr ip, [sp, #60] ; 0x3c 0.00 : 2aa7d8: cmp ip, #0 0.00 : 2aa7dc: ble 2aa948 <flac_lpc_16_c+0x3ec> : : #undef SAMPLE_SIZE : #define SAMPLE_SIZE 32 : #include "flacdsp_template.c" : : static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32], 0.00 : 2aa7e0: ldr r1, [sp, #52] ; 0x34 0.00 : 2aa7e4: ubfx r0, r1, #2, #2 0.00 : 2aa7e8: rsb r0, r0, #0 0.00 : 2aa7ec: and r0, r0, #3 0.00 : 2aa7f0: cmp r0, ip 0.00 : 2aa7f4: movcs r0, ip 0.00 : 2aa7f8: cmp ip, #5 0.00 : 2aa7fc: ldrls r0, [sp, #60] ; 0x3c 0.00 : 
2aa800: bhi 2aa950 <flac_lpc_16_c+0x3f4> 0.00 : 2aa804: ldr r8, [sp, #68] ; 0x44 0.00 : 2aa808: mov r2, #0 0.00 : 2aa80c: ldr r9, [sp, #52] ; 0x34 0.00 : 2aa810: mov r3, r2 0.00 : 2aa814: add ip, r8, r4, lsl #2 0.00 : 2aa818: sub r1, r9, #4 : decoded[i+1] += s1 >> qlevel; : } : if (i < len) { : int sum = 0; : for (j = 0; j < pred_order; j++) : sum += coeffs[j] * decoded[i-j-1]; 0.00 : 2aa81c: ldr r5, [ip, #-4]! : s1 += c*d; : decoded[i+1] += s1 >> qlevel; : } : if (i < len) { : int sum = 0; : for (j = 0; j < pred_order; j++) 0.00 : 2aa820: add r3, r3, #1 : sum += coeffs[j] * decoded[i-j-1]; 0.00 : 2aa824: ldr r6, [r1, #4]! 0.00 : 2aa828: cmp r3, r0 0.00 : 2aa82c: mla r2, r6, r5, r2 0.00 : 2aa830: bne 2aa81c <flac_lpc_16_c+0x2c0> 0.00 : 2aa834: ldr sl, [sp, #60] ; 0x3c 0.00 : 2aa838: cmp sl, r3 0.00 : 2aa83c: beq 2aa900 <flac_lpc_16_c+0x3a4> 0.00 : 2aa840: ldr ip, [sp, #60] ; 0x3c 0.00 : 2aa844: rsb r7, r0, ip 0.00 : 2aa848: lsr ip, r7, #2 0.00 : 2aa84c: lsls r6, ip, #2 0.00 : 2aa850: beq 2aa8cc <flac_lpc_16_c+0x370> 0.00 : 2aa854: ldr r9, [sp, #68] ; 0x44 0.00 : 2aa858: lsl r5, r0, #2 0.00 : 2aa85c: ldr sl, [sp, #52] ; 0x34 : : #undef SAMPLE_SIZE : #define SAMPLE_SIZE 32 : #include "flacdsp_template.c" : : static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32], 0.00 : 2aa860: mov r1, #0 0.00 : 2aa864: eor r5, r5, #3 0.00 : 2aa868: vmov.i32 q8, #0 ; 0x00000000 0.00 : 2aa86c: mvn r5, r5 : decoded[i+1] += s1 >> qlevel; : } : if (i < len) { : int sum = 0; : for (j = 0; j < pred_order; j++) : sum += coeffs[j] * decoded[i-j-1]; 0.00 : 2aa870: vldr d24, [pc, #240] ; 2aa968 <flac_lpc_16_c+0x40c> 0.00 : 2aa874: vldr d25, [pc, #244] ; 2aa970 <flac_lpc_16_c+0x414> 0.00 : 2aa878: add r8, r9, r4, lsl #2 0.00 : 2aa87c: sub r5, r5, #12 0.00 : 2aa880: add r0, sl, r0, lsl #2 0.00 : 2aa884: add r5, r8, r5 0.00 : 2aa888: vld1.32 {d22-d23}, [r5] 0.00 : 2aa88c: add r1, r1, #1 0.00 : 2aa890: cmp r1, ip 0.00 : 2aa894: vldmia r0!, {d18-d19} 0.00 : 2aa898: vtbl.8 d20, {d22-d23}, 
d24 0.00 : 2aa89c: sub r5, r5, #16 0.00 : 2aa8a0: vtbl.8 d21, {d22-d23}, d25 0.00 : 2aa8a4: vmla.i32 q8, q10, q9 0.00 : 2aa8a8: bcc 2aa888 <flac_lpc_16_c+0x32c> 0.00 : 2aa8ac: vadd.i32 d16, d16, d17 0.00 : 2aa8b0: cmp r7, r6 0.00 : 2aa8b4: vmov.i32 q9, #0 ; 0x00000000 0.00 : 2aa8b8: add r3, r3, r6 0.00 : 2aa8bc: vpadd.i32 d18, d16, d16 0.00 : 2aa8c0: vmov.32 r1, d18[0] 0.00 : 2aa8c4: add r2, r2, r1 0.00 : 2aa8c8: beq 2aa900 <flac_lpc_16_c+0x3a4> : : #undef SAMPLE_SIZE : #define SAMPLE_SIZE 32 : #include "flacdsp_template.c" : : static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32], 0.00 : 2aa8cc: ldr ip, [sp, #52] ; 0x34 0.00 : 2aa8d0: sub r0, r3, #-1073741823 ; 0xc0000001 0.00 : 2aa8d4: ldr r8, [sp, #68] ; 0x44 0.00 : 2aa8d8: rsb r1, r3, r4 0.00 : 2aa8dc: ldr r6, [sp, #60] ; 0x3c 0.00 : 2aa8e0: add r0, ip, r0, lsl #2 0.00 : 2aa8e4: add r1, r8, r1, lsl #2 : decoded[i+1] += s1 >> qlevel; : } : if (i < len) { : int sum = 0; : for (j = 0; j < pred_order; j++) : sum += coeffs[j] * decoded[i-j-1]; 0.00 : 2aa8e8: ldr ip, [r1, #-4]! : s1 += c*d; : decoded[i+1] += s1 >> qlevel; : } : if (i < len) { : int sum = 0; : for (j = 0; j < pred_order; j++) 0.00 : 2aa8ec: add r3, r3, #1 : sum += coeffs[j] * decoded[i-j-1]; 0.00 : 2aa8f0: ldr r5, [r0, #4]! 
: s1 += c*d; : decoded[i+1] += s1 >> qlevel; : } : if (i < len) { : int sum = 0; : for (j = 0; j < pred_order; j++) 0.00 : 2aa8f4: cmp r6, r3 : sum += coeffs[j] * decoded[i-j-1]; 0.00 : 2aa8f8: mla r2, r5, ip, r2 : s1 += c*d; : decoded[i+1] += s1 >> qlevel; : } : if (i < len) { : int sum = 0; : for (j = 0; j < pred_order; j++) 0.00 : 2aa8fc: bgt 2aa8e8 <flac_lpc_16_c+0x38c> 0.02 : 2aa900: ldr r9, [sp, #56] ; 0x38 0.00 : 2aa904: asr r2, r2, r9 : sum += coeffs[j] * decoded[i-j-1]; : decoded[i] += sum >> qlevel; 0.00 : 2aa908: ldr sl, [sp, #68] ; 0x44 0.00 : 2aa90c: ldr r3, [sl, r4, lsl #2] 0.00 : 2aa910: add r3, r3, r2 0.00 : 2aa914: str r3, [sl, r4, lsl #2] : } : } 0.02 : 2aa918: add sp, sp, #80 ; 0x50 0.00 : 2aa91c: pop {r4, r5, r6, r7, r8, r9, sl, fp} 0.00 : 2aa920: bx lr : : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; : for (j = pred_order-1; j > 0; j--) { 0.00 : 2aa924: ldr r5, [sp, #64] ; 0x40 0.36 : 2aa928: mov fp, r3 0.00 : 2aa92c: b 2aa710 <flac_lpc_16_c+0x1b4> : int i, j; : : for (i = pred_order; i < len - 1; i += 2) { : int c; : int d = decoded[i-pred_order]; : int s0 = 0, s1 = 0; 0.00 : 2aa930: mov r3, #0 0.00 : 2aa934: mov fp, r3 0.00 : 2aa938: b 2aa74c <flac_lpc_16_c+0x1f0> 0.00 : 2aa93c: mov r4, r2 0.00 : 2aa940: ldr sl, [sp, #112] ; 0x70 0.00 : 2aa944: b 2aa7cc <flac_lpc_16_c+0x270> : s1 += c*d; : decoded[i+1] += s1 >> qlevel; : } : if (i < len) { : int sum = 0; : for (j = 0; j < pred_order; j++) 0.00 : 2aa948: mov r2, #0 0.00 : 2aa94c: b 2aa908 <flac_lpc_16_c+0x3ac> 0.00 : 2aa950: cmp r0, #0 0.00 : 2aa954: bne 2aa804 <flac_lpc_16_c+0x2a8> : : #undef SAMPLE_SIZE : #define SAMPLE_SIZE 32 : #include "flacdsp_template.c" : : static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32], 0.00 : 2aa958: mov r2, r0 0.00 : 2aa95c: mov r3, r0 0.00 : 2aa960: b 2aa840 <flac_lpc_16_c+0x2e4> 0.00 : 2aa964: nop {0} 0.00 : 2aa968: .word 0x0f0e0d0c 0.00 : 2aa96c: .word 0x0b0a0908 0.00 : 2aa970: .word 
0x07060504 0.00 : 2aa974: .word 0x03020100
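For easier reading, the C source lines interleaved in the annotation above can be pieced together into the function below. This is a sketch reassembled from those fragments, not a copy of the libav file: the `static` qualifier is dropped so the function can be called from a separate test, and `lpc_ref` is a hypothetical helper (not part of libav) that applies the tail-loop formula to every sample, as a cross-check. Read against the listing, the samples appear to land almost entirely in the top-level nested loop over i and j. The short loop at 2aa730..2aa748, which looks like the scalar remainder of the unrolled inner j loop, accounts for roughly 27% on its own. The NEON instructions (vld1.32, vtbl.8, vmla.i32) all sit in the `if (i < len)` summation and show 0.00%. Sample attribution in perf can skid by an instruction or two, so the per-instruction figures are approximate.

```c
#include <stdint.h>

/* Reassembled from the source lines shown in the perf annotation.
 * The original in libavcodec/flacdsp.c is declared static. */
void flac_lpc_16_c(int32_t *decoded, const int coeffs[32],
                   int pred_order, int qlevel, int len)
{
    int i, j;

    /* Main loop: produces two output samples per iteration,
     * sharing each coefficient load between s0 and s1. */
    for (i = pred_order; i < len - 1; i += 2) {
        int c;
        int d = decoded[i-pred_order];
        int s0 = 0, s1 = 0;
        for (j = pred_order-1; j > 0; j--) {
            c = coeffs[j];
            s0 += c*d;
            d = decoded[i-j];
            s1 += c*d;
        }
        c = coeffs[0];
        s0 += c*d;
        d = decoded[i] += s0 >> qlevel;
        s1 += c*d;
        decoded[i+1] += s1 >> qlevel;
    }
    /* Tail: at most one leftover sample. This is the summation
     * loop that shows up with NEON instructions in the listing. */
    if (i < len) {
        int sum = 0;
        for (j = 0; j < pred_order; j++)
            sum += coeffs[j] * decoded[i-j-1];
        decoded[i] += sum >> qlevel;
    }
}

/* Hypothetical reference, not part of libav: one sample per
 * iteration, using the same formula as the tail above. */
void lpc_ref(int32_t *decoded, const int *coeffs,
             int pred_order, int qlevel, int len)
{
    for (int i = pred_order; i < len; i++) {
        int sum = 0;
        for (int j = 0; j < pred_order; j++)
            sum += coeffs[j] * decoded[i-j-1];
        decoded[i] += sum >> qlevel;
    }
}
```

Running both functions over copies of the same small buffer with an odd number of remaining samples exercises the paired main loop and the single-sample tail. If the tail is only ever reached once per call, as the structure suggests, vectorising it cannot pay for its setup code.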