Hi,
I have just switched to gcc 5.2 from 4.9.2 and the code quality does seem to have improved significantly. For example, it now seems much better at using ldp/stp and it seems to have stopped the gratuitous use of the SIMD registers.
However, I still have a few whinges:-)
See attached copy.c / copy.s (This is a performance critical function from OpenJDK)
pd_disjoint_words:
        cmp     x2, 8          <<< (1)
        sub     sp, sp, #64    <<< (2)
        bhi     .L2
        cmp     w2, 8          <<< (1)
        bls     .L15
.L2:
        add     sp, sp, 64     <<< (2)
(1) If count as a 64 bit unsigned is <= 8 then it is probably still <= 8 as a 32 bit unsigned.
(2) Nowhere in the function does it store anything on the stack, so why drop and restore the stack every time? Also, a minor quibble in the disassembly: why does sub use '#64' whereas add uses just '64'? (I appreciate this is probably binutils, not gcc.)
.L15:
        adrp    x3, .L4
        add     x3, x3, :lo12:.L4
        ldrb    w2, [x3,w2,uxtw]    <<< (3)
        adr     x3, .Lrtx4
        add     x2, x3, w2, sxtb #2
        br      x2
(3) Why use a byte table? This is not some sort of embedded system. Use a word table and this becomes:
.L15:
        adrp    x3, .L4
        add     x3, x3, :lo12:.L4
        ldr     x2, [x3, x2, lsl #3]
        br      x2
An aligned word load takes exactly the same time as a byte load and we save the faffing about calculating the address.
.L10:
        ldp     x6, x7, [x0]
        ldp     x4, x5, [x0, 16]
        ldp     x2, x3, [x0, 32]    <<< (4)
        stp     x2, x3, [x1, 32]    <<< (4)
        stp     x6, x7, [x1]
        stp     x4, x5, [x1, 16]
(4) There seems to be something wrong with the load scheduler here. Why not move the stp x2, x3 to the end? It does this repeatedly.
Unfortunately, as this function is performance critical, I will probably end up writing it in inline assembler, which is time consuming, error prone and non-portable.
* Whinge mode off
Ed
On 2 March 2016 at 12:35, Edward Nevill edward.nevill@linaro.org wrote:
Hi,
I have just switched to gcc 5.2 from 4.9.2 and the code quality does seem to have improved significantly. For example, it now seems much better at using ldp/stp and it seems to have stopped the gratuitous use of the SIMD registers.
Hi Ed,
Thanks for the feedback.
Can you be more specific on the GCC versions you are using? Do you mean Linaro TCWG releases?
However, I still have a few whinges:-)
See attached copy.c / copy.s (This is a performance critical function from OpenJDK)
pd_disjoint_words:
        cmp     x2, 8          <<< (1)
        sub     sp, sp, #64    <<< (2)
        bhi     .L2
        cmp     w2, 8          <<< (1)
        bls     .L15
.L2:
        add     sp, sp, 64     <<< (2)
(1) If count as a 64 bit unsigned is <= 8 then it is probably still <= 8 as a 32 bit unsigned.
(2) Nowhere in the function does it store anything on the stack, so why drop and restore the stack every time? Also, a minor quibble in the disassembly: why does sub use '#64' whereas add uses just '64'? (I appreciate this is probably binutils, not gcc.)
.L15:
        adrp    x3, .L4
        add     x3, x3, :lo12:.L4
        ldrb    w2, [x3,w2,uxtw]    <<< (3)
        adr     x3, .Lrtx4
        add     x2, x3, w2, sxtb #2
        br      x2
(3) Why use a byte table? This is not some sort of embedded system. Use a word table and this becomes:
.L15:
        adrp    x3, .L4
        add     x3, x3, :lo12:.L4
        ldr     x2, [x3, x2, lsl #3]
        br      x2
An aligned word load takes exactly the same time as a byte load and we save the faffing about calculating the address.
.L10:
        ldp     x6, x7, [x0]
        ldp     x4, x5, [x0, 16]
        ldp     x2, x3, [x0, 32]    <<< (4)
        stp     x2, x3, [x1, 32]    <<< (4)
        stp     x6, x7, [x1]
        stp     x4, x5, [x1, 16]
(4) There seems to be something wrong with the load scheduler here. Why not move the stp x2, x3 to the end? It does this repeatedly.
Unfortunately, as this function is performance critical, I will probably end up writing it in inline assembler, which is time consuming, error prone and non-portable.
- Whinge mode off
I've just tried with our 5.3-2015.12 snapshot, and all your comments except (4) are still valid.
The same is true with today's trunk. FWIW, (4) now looks like this:

.L10:
        ldp     x6, x7, [x0]
        ldp     x4, x5, [x0, 16]
        ldp     x2, x3, [x0, 32]
        stp     x6, x7, [x1]
        stp     x4, x5, [x1, 16]
        stp     x2, x3, [x1, 32]
Christophe
Ed
On Mar 2, 2016, at 4:05 PM, Christophe Lyon christophe.lyon@linaro.org wrote:
On 2 March 2016 at 12:35, Edward Nevill edward.nevill@linaro.org wrote:
Hi,
I have just switched to gcc 5.2 from 4.9.2 and the code quality does seem to have improved significantly. For example, it now seems much better at using ldp/stp and it seems to have stopped the gratuitous use of the SIMD registers.
Hi Ed,
Thanks for the feedback.
Can you be more specific on the GCC versions you are using? Do you mean Linaro TCWG releases?
However, I still have a few whinges:-)
See attached copy.c / copy.s (This is a performance critical function from OpenJDK)
pd_disjoint_words:
        cmp     x2, 8          <<< (1)
        sub     sp, sp, #64    <<< (2)
        bhi     .L2
        cmp     w2, 8          <<< (1)
        bls     .L15
.L2:
        add     sp, sp, 64     <<< (2)
(1) If count as a 64 bit unsigned is <= 8 then it is probably still <= 8 as a 32 bit unsigned.
(2) Nowhere in the function does it store anything on the stack, so why drop and restore the stack every time? Also, a minor quibble in the disassembly: why does sub use '#64' whereas add uses just '64'? (I appreciate this is probably binutils, not gcc.)
.L15:
        adrp    x3, .L4
        add     x3, x3, :lo12:.L4
        ldrb    w2, [x3,w2,uxtw]    <<< (3)
        adr     x3, .Lrtx4
        add     x2, x3, w2, sxtb #2
        br      x2
(3) Why use a byte table? This is not some sort of embedded system. Use a word table and this becomes:
.L15:
        adrp    x3, .L4
        add     x3, x3, :lo12:.L4
        ldr     x2, [x3, x2, lsl #3]
        br      x2
An aligned word load takes exactly the same time as a byte load and we save the faffing about calculating the address.
.L10:
        ldp     x6, x7, [x0]
        ldp     x4, x5, [x0, 16]
        ldp     x2, x3, [x0, 32]    <<< (4)
        stp     x2, x3, [x1, 32]    <<< (4)
        stp     x6, x7, [x1]
        stp     x4, x5, [x1, 16]
(4) There seems to be something wrong with the load scheduler here. Why not move the stp x2, x3 to the end? It does this repeatedly.
Unfortunately, as this function is performance critical, I will probably end up writing it in inline assembler, which is time consuming, error prone and non-portable.
- Whinge mode off
I've just tried with our 5.3-2015.12 snapshot, and all your comments except (4) are still valid.
Edward,
This is very useful information. Would you please copy-paste this into one or more bugzilla entries at bugs.linaro.org? TCWG will then evaluate which of the issues are still present in GCC trunk, and we are going to put them into our optimization pipeline.
The same is true with today's trunk. FWIW, (4) now looks like this:

.L10:
        ldp     x6, x7, [x0]
        ldp     x4, x5, [x0, 16]
        ldp     x2, x3, [x0, 32]
        stp     x6, x7, [x1]
        stp     x4, x5, [x1, 16]
        stp     x2, x3, [x1, 32]
Right, this is fixed in GCC 6 (and backported to Linaro 5.3, AFAIK), where the compiler tries to sort memory references in order of increasing address to accommodate CPUs with a hardware cache auto-prefetcher.
Thank you,
-- Maxim Kuvyrkov www.linaro.org
On 2 March 2016 at 11:35, Edward Nevill edward.nevill@linaro.org wrote:
cmp x2, 8 <<< (1)
(1) If count as a 64 bit unsigned is <= 8 then it is probably still <= 8 as a 32 bit unsigned.
You mean to use "cmp w2, 8" instead? Is there any difference?
(2) Nowhere in the function does it store anything on the stack, so why drop and restore the stack every time? Also, a minor quibble in the disassembly: why does sub use '#64' whereas add uses just '64'? (I appreciate this is probably binutils, not gcc.)
My reading of the AAPCS64 is that it's not necessary to have a frame at all, only that if you do, it must be quad-word aligned.
Clang/LLVM doesn't seem to bother with the push and pop, but it also uses "cmp x".
.L15:
        adrp    x3, .L4
        add     x3, x3, :lo12:.L4
        ldr     x2, [x3, x2, lsl #3]
        br      x2
Hum, this is *exactly* what Clang generates... :)
(4) There seems to be something wrong with the load scheduler here. Why not move the stp x2, x3 to the end? It does this repeatedly.
Again, Clang seems to do what you want...
Have you tried building OpenJDK with Clang?
cheers, --renato
On Wed, 2016-03-02 at 14:25 +0000, Renato Golin wrote:
On 2 March 2016 at 11:35, Edward Nevill edward.nevill@linaro.org wrote:
cmp x2, 8 <<< (1)
(1) If count as a 64 bit unsigned is <= 8 then it is probably still <= 8 as a 32 bit unsigned.
You mean to use "cmp w2, 8" instead? Is there any difference?
No. Look at the assembler again
pd_disjoint_words:
        cmp     x2, 8          <<< (1)
        sub     sp, sp, #64    <<< (2)
        bhi     .L2
        cmp     w2, 8          <<< (1)
        bls     .L15
It does cmp x2, 8 and then, a few instructions later, without modifying x2/w2 and without any intervening branch destinations, it does cmp w2, 8. I assert that the second cmp w2, 8 and the bls are redundant, because we already know the value is (unsigned) <= 8.
Have you tried building OpenJDK with Clang?
No. I might if you can provide me a binary.
Thanks, Ed.
On 2 March 2016 at 14:36, Edward Nevill edward.nevill@linaro.org wrote:
It does cmp x2, 8 and then, a few instructions later, without modifying x2/w2 and without any intervening branch destinations, it does cmp w2, 8. I assert that the second cmp w2, 8 and the bls are redundant, because we already know the value is (unsigned) <= 8.
Of course, I missed that. Clang, obviously, doesn't do that. :)
No. I might if you can provide me a binary.
All releases are cross-compilers, so you just need to use "-target aarch64-linux-gnu" and possibly set the sysroot for binutils / libraries.
3.8.0 is just around the corner, and will be available on the same page.
cheers, --renato
On 02/03/16 14:25, Renato Golin wrote:
On 2 March 2016 at 11:35, Edward Nevill edward.nevill@linaro.org wrote:
cmp x2, 8 <<< (1)
(1) If count as a 64 bit unsigned is <= 8 then it is probably still <= 8 as a 32 bit unsigned.
You mean to use "cmp w2, 8" instead? Is there any difference?
No, it's code equivalent to:

  unsigned long x;

  if (x <= 8) {
      if ((unsigned) x <= 8) {
          ...
      }
  }
Where the inner test is clearly redundant (for unsigned).
R.
(2) Nowhere in the function does it store anything on the stack, so why drop and restore the stack every time? Also, a minor quibble in the disassembly: why does sub use '#64' whereas add uses just '64'? (I appreciate this is probably binutils, not gcc.)
My reading of the AAPCS64 is that it's not necessary to have a frame at all, only that if you do, it must be quad-word aligned.
Clang/LLVM doesn't seem to bother with the push and pop, but it also uses "cmp x".
.L15:
        adrp    x3, .L4
        add     x3, x3, :lo12:.L4
        ldr     x2, [x3, x2, lsl #3]
        br      x2
Hum, this is *exactly* what Clang generates... :)
(4) There seems to be something wrong with the load scheduler here. Why not move the stp x2, x3 to the end? It does this repeatedly.
Again, Clang seems to do what you want...
Have you tried building OpenJDK with Clang?
cheers, --renato
On 02/03/16 11:35, Edward Nevill wrote:
Hi,
I have just switched to gcc 5.2 from 4.9.2 and the code quality does seem to have improved significantly. For example, it now seems much better at using ldp/stp and it seems to have stopped the gratuitous use of the SIMD registers.
However, I still have a few whinges:-)
See attached copy.c / copy.s (This is a performance critical function from OpenJDK)
pd_disjoint_words:
        cmp     x2, 8          <<< (1)
        sub     sp, sp, #64    <<< (2)
        bhi     .L2
        cmp     w2, 8          <<< (1)
        bls     .L15
.L2:
        add     sp, sp, 64     <<< (2)
(1) If count as a 64 bit unsigned is <= 8 then it is probably still <= 8 as a 32 bit unsigned.
Agreed. This could probably be done by the mid-end based on value range propagation. Please can you file a report in gcc bugzilla?
(2) Nowhere in the function does it store anything on the stack, so why drop and restore the stack every time? Also, a minor quibble in the disassembly: why does sub use '#64' whereas add uses just '64'? (I appreciate this is probably binutils, not gcc.)
This is a known problem. What's happened is that in the early phase of compilation you had an object that appeared to need stack space. Later on that was optimized away, but the stack slot is not freed. In large functions where there is often other data on the stack anyway this equates to little more than some wasted stack space, but in small functions it can often make the difference between needing stack adjustments and not.
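For illustration, here is a minimal sketch of the pattern (my reduction, assuming the behaviour described above; the 64-byte frame in copy.s plausibly corresponds to the case-8 temporary, 8 x 8 bytes, which is later expanded into ldp/stp and never actually spilled):

typedef long HeapWord;

/* The aggregate temporary 't' looks like it needs a stack slot early on,
 * but SRA/register allocation later turn the copy into plain ldp/stp.
 * Whether a small standalone reduction like this still shows the
 * redundant sub/add of sp depends on the GCC version; copy.c's larger
 * cases are a more reliable reproducer. */
void copy2(HeapWord *from, HeapWord *to)
{
    struct unit { HeapWord a, b; } *p, *q, t;
    p = (struct unit *)from;
    q = (struct unit *)to;
    t = *p;
    *q = t;
}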
.L15:
        adrp    x3, .L4
        add     x3, x3, :lo12:.L4
        ldrb    w2, [x3,w2,uxtw]    <<< (3)
        adr     x3, .Lrtx4
        add     x2, x3, w2, sxtb #2
        br      x2
(3) Why use a byte table? This is not some sort of embedded system. Use a word table and this becomes:
.L15:
        adrp    x3, .L4
        add     x3, x3, :lo12:.L4
        ldr     x2, [x3, x2, lsl #3]
        br      x2
An aligned word load takes exactly the same time as a byte load and we save the faffing about calculating the address.
That doesn't work for PIC (or PIE) and can also significantly increase cache pressure.
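(For scale, and assuming the table stays in .rodata as in copy.s: the byte table here is 9 entries x 1 byte = 9 bytes and is position-independent as written, whereas a table of absolute 64-bit targets would be 9 x 8 = 72 bytes and, in a PIC/PIE build, each entry would also need a dynamic relocation at load time.)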
.L10:
        ldp     x6, x7, [x0]
        ldp     x4, x5, [x0, 16]
        ldp     x2, x3, [x0, 32]    <<< (4)
        stp     x2, x3, [x1, 32]    <<< (4)
        stp     x6, x7, [x1]
        stp     x4, x5, [x1, 16]
(4) There seems to be something wrong with the load scheduler here. Why not move the stp x2, x3 to the end? It does this repeatedly.
You don't say what compilation options you used, but a simple build with -O3 on gcc trunk shows the stores in the correct order.
Unfortunately, as this function is performance critical, I will probably end up writing it in inline assembler, which is time consuming, error prone and non-portable.
- Whinge mode off
Ed
copy.c
#include <stddef.h>
typedef long HeapWord;
extern void _Copy_disjoint_words(HeapWord* from, HeapWord* to, size_t count);
void pd_disjoint_words(HeapWord* from, HeapWord* to, size_t count) {
  switch (count) {
  case 0:
    return;
  case 1:
    to[0] = from[0];
    return;
  case 2: {
    struct unit { HeapWord a, b; } *p, *q, t;
    p = (struct unit *)from; q = (struct unit *)to;
    t = *p; *q = t;
    return;
  }
  case 3: {
    struct unit { HeapWord a, b, c; } *p, *q, t;
    p = (struct unit *)from; q = (struct unit *)to;
    t = *p; *q = t;
    return;
  }
  case 4: {
    struct unit { HeapWord a, b, c, d; } *p, *q, t;
    p = (struct unit *)from; q = (struct unit *)to;
    t = *p; *q = t;
    return;
  }
  case 5: {
    struct unit { HeapWord a, b, c, d, e; } *p, *q, t;
    p = (struct unit *)from; q = (struct unit *)to;
    t = *p; *q = t;
    return;
  }
  case 6: {
    struct unit { HeapWord a, b, c, d, e, f; } *p, *q, t;
    p = (struct unit *)from; q = (struct unit *)to;
    t = *p; *q = t;
    return;
  }
  case 7: {
    struct unit { HeapWord a, b, c, d, e, f, g; } *p, *q, t;
    p = (struct unit *)from; q = (struct unit *)to;
    t = *p; *q = t;
    return;
  }
  case 8: {
    struct unit { HeapWord a, b, c, d, e, f, g, h; } *p, *q, t;
    p = (struct unit *)from; q = (struct unit *)to;
    t = *p; *q = t;
    return;
  }
  default:
    _Copy_disjoint_words(from, to, count);
  }
}
copy.s
        .cpu    generic+fp+simd
        .file   "copy.c"
        .text
        .align  2
        .p2align 3,,7
        .global pd_disjoint_words
        .type   pd_disjoint_words, %function
pd_disjoint_words:
        cmp     x2, 8
        sub     sp, sp, #64
        bhi     .L2
        cmp     w2, 8
        bls     .L15
.L2:
        add     sp, sp, 64
        b       _Copy_disjoint_words
        .p2align 3
.L15:
        adrp    x3, .L4
        add     x3, x3, :lo12:.L4
        ldrb    w2, [x3,w2,uxtw]
        adr     x3, .Lrtx4
        add     x2, x3, w2, sxtb #2
        br      x2
.Lrtx4:
        .section        .rodata
        .align  0
        .align  2
.L4:
        .byte   (.L1 - .Lrtx4) / 4
        .byte   (.L5 - .Lrtx4) / 4
        .byte   (.L6 - .Lrtx4) / 4
        .byte   (.L7 - .Lrtx4) / 4
        .byte   (.L8 - .Lrtx4) / 4
        .byte   (.L9 - .Lrtx4) / 4
        .byte   (.L10 - .Lrtx4) / 4
        .byte   (.L11 - .Lrtx4) / 4
        .byte   (.L12 - .Lrtx4) / 4
        .text
        .p2align 3
.L10:
        ldp     x6, x7, [x0]
        ldp     x4, x5, [x0, 16]
        ldp     x2, x3, [x0, 32]
        stp     x2, x3, [x1, 32]
        stp     x6, x7, [x1]
        stp     x4, x5, [x1, 16]
.L1:
        add     sp, sp, 64
        ret
        .p2align 3
.L11:
        ldp     x6, x7, [x0]
        ldp     x4, x5, [x0, 16]
        ldp     x2, x3, [x0, 32]
        ldr     x0, [x0, 48]
        str     x0, [x1, 48]
        stp     x6, x7, [x1]
        stp     x4, x5, [x1, 16]
        stp     x2, x3, [x1, 32]
        add     sp, sp, 64
        ret
        .p2align 3
.L9:
        ldp     x4, x5, [x0]
        ldp     x2, x3, [x0, 16]
        ldr     x0, [x0, 32]
        str     x0, [x1, 32]
        stp     x4, x5, [x1]
        stp     x2, x3, [x1, 16]
        add     sp, sp, 64
        ret
        .p2align 3
.L8:
        ldp     x4, x5, [x0]
        ldp     x2, x3, [x0, 16]
        stp     x2, x3, [x1, 16]
        stp     x4, x5, [x1]
        add     sp, sp, 64
        ret
        .p2align 3
.L7:
        ldp     x2, x3, [x0]
        ldr     x0, [x0, 16]
        str     x0, [x1, 16]
        stp     x2, x3, [x1]
        add     sp, sp, 64
        ret
        .p2align 3
.L6:
        ldr     q0, [x0]
        str     q0, [x1]
        add     sp, sp, 64
        ret
        .p2align 3
.L5:
        ldr     x0, [x0]
        str     x0, [x1]
        add     sp, sp, 64
        ret
        .p2align 3
.L12:
        ldp     x8, x9, [x0]
        ldp     x6, x7, [x0, 16]
        ldp     x4, x5, [x0, 32]
        ldp     x2, x3, [x0, 48]
        stp     x2, x3, [x1, 48]
        stp     x8, x9, [x1]
        stp     x6, x7, [x1, 16]
        stp     x4, x5, [x1, 32]
        add     sp, sp, 64
        ret
        .size   pd_disjoint_words, .-pd_disjoint_words
        .ident  "GCC: (GNU) 5.2.0"
        .section        .note.GNU-stack,"",%progbits
I have just switched to gcc 5.2 from 4.9.2 and the code quality does seem to have improved significantly. For example, it now seems much better at using ldp/stp and it seems to have stopped the gratuitous use of the SIMD registers.
However, I still have a few whinges:-)
See attached copy.c / copy.s (This is a performance critical function from OpenJDK)
pd_disjoint_words:
        cmp     x2, 8          <<< (1)
        sub     sp, sp, #64    <<< (2)
        bhi     .L2
        cmp     w2, 8          <<< (1)
        bls     .L15
.L2:
        add     sp, sp, 64     <<< (2)
(1) If count as a 64 bit unsigned is <= 8 then it is probably still <= 8 as a 32 bit unsigned.
Agreed. This could probably be done by the mid-end based on value range propagation. Please can you file a report in gcc bugzilla?
I am not sure how we can do this in VRP. It seems that this is generated at RTL expansion time. Maybe it has to be done during expansion. The optimized tree looks like:
;; Function pd_disjoint_words (pd_disjoint_words, funcdef_no=0, decl_uid=2763, cgraph_uid=0, symbol_order=0)
Removing basic block 13
pd_disjoint_words (HeapWord * from, HeapWord * to, size_t count)
{
  long int t$b;
  long int t$a;
  struct unit t;
  struct unit t;
  struct unit t;
  struct unit t;
  struct unit t;
  struct unit t;
  long int _5;

  <bb 2>:
  switch (count_2(D)) <default: <L16>, case 0: <L18>, case 1: <L1>, case 2: <L2>, case 3: <L4>, case 4: <L6>, case 5: <L8>, case 6: <L10>, case 7: <L12>, case 8: <L14>>

<L1>:
  _5 = *from_4(D);
  *to_6(D) = _5;
  goto <bb 12> (<L18>);

<L2>:
  t$a_8 = MEM[(struct unit *)from_4(D)];
  t$b_9 = MEM[(struct unit *)from_4(D) + 8B];
  MEM[(struct unit *)to_6(D)] = t$a_8;
  MEM[(struct unit *)to_6(D) + 8B] = t$b_9;
  goto <bb 12> (<L18>);

<L4>:
  t = MEM[(struct unit *)from_4(D)];
  MEM[(struct unit *)to_6(D)] = t;
  t ={v} {CLOBBER};
  goto <bb 12> (<L18>);

<L6>:
  t = MEM[(struct unit *)from_4(D)];
  MEM[(struct unit *)to_6(D)] = t;
  t ={v} {CLOBBER};
  goto <bb 12> (<L18>);

<L8>:
  t = MEM[(struct unit *)from_4(D)];
  MEM[(struct unit *)to_6(D)] = t;
  t ={v} {CLOBBER};
  goto <bb 12> (<L18>);

<L10>:
  t = MEM[(struct unit *)from_4(D)];
  MEM[(struct unit *)to_6(D)] = t;
  t ={v} {CLOBBER};
  goto <bb 12> (<L18>);

<L12>:
  t = MEM[(struct unit *)from_4(D)];
  MEM[(struct unit *)to_6(D)] = t;
  t ={v} {CLOBBER};
  goto <bb 12> (<L18>);

<L14>:
  t = MEM[(struct unit *)from_4(D)];
  MEM[(struct unit *)to_6(D)] = t;
  t ={v} {CLOBBER};
  goto <bb 12> (<L18>);

<L16>:
  _Copy_disjoint_words (from_4(D), to_6(D), count_2(D)); [tail call]

<L18>:
  return;
}
Thanks, Kugan
On 03/03/16 00:44, kugan wrote:
I have just switched to gcc 5.2 from 4.9.2 and the code quality does seem to have improved significantly. For example, it now seems much better at using ldp/stp and it seems to have stopped the gratuitous use of the SIMD registers.
However, I still have a few whinges:-)
See attached copy.c / copy.s (This is a performance critical function from OpenJDK)
pd_disjoint_words:
        cmp     x2, 8          <<< (1)
        sub     sp, sp, #64    <<< (2)
        bhi     .L2
        cmp     w2, 8          <<< (1)
        bls     .L15
.L2:
        add     sp, sp, 64     <<< (2)
(1) If count as a 64 bit unsigned is <= 8 then it is probably still <= 8 as a 32 bit unsigned.
Agreed. This could probably be done by the mid-end based on value range propagation. Please can you file a report in gcc bugzilla?
I am not sure how we can do this in VRP. It seems that this is generated at RTL expansion time. Maybe it has to be done during expansion. The optimized tree looks like:
Ramana and I looked further into this last night. It turns out this is due to the way we expand switch tables. The ARM and AArch64 back-ends both use the casesi pattern, which is defined to do a range check and a branch into the table. The range check is based on a 32-bit value.
Because this example uses a 64-bit type as the controlling expression, the mid-end has to insert another check that the original value is within range; this renders the second check redundant but there's then no way to remove that. You're correct that VRP isn't going to help here.
We're looking at whether we can adjust things to use the tablejump expander, since that should eliminate the need for the second check.
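A reduced reproducer of just this aspect might look like the following (a sketch with hypothetical helper functions, perhaps suitable for a bugzilla entry; it assumes the distinct case bodies are enough to force the table-based expansion):

/* A switch on a 64-bit controlling expression that should expand via a
 * jump table.  With the casesi-based expansion described above, the
 * 64-bit range check inserted by the mid-end is followed by the
 * pattern's own 32-bit check, giving the redundant cmp w../bls seen
 * in copy.s. */
extern void f0(void), f1(void), f2(void), f3(void), f4(void);
extern void f5(void), f6(void), f7(void), f8(void);
extern void fallback(unsigned long count);

void dispatch(unsigned long count)
{
    switch (count) {
    case 0: f0(); break;
    case 1: f1(); break;
    case 2: f2(); break;
    case 3: f3(); break;
    case 4: f4(); break;
    case 5: f5(); break;
    case 6: f6(); break;
    case 7: f7(); break;
    case 8: f8(); break;
    default: fallback(count); break;
    }
}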
;; Function pd_disjoint_words (pd_disjoint_words, funcdef_no=0, decl_uid=2763, cgraph_uid=0, symbol_order=0)
Removing basic block 13
pd_disjoint_words (HeapWord * from, HeapWord * to, size_t count)
{
  long int t$b;
  long int t$a;
  struct unit t;
  struct unit t;
  struct unit t;
  struct unit t;
  struct unit t;
  struct unit t;
  long int _5;

  <bb 2>:
  switch (count_2(D)) <default: <L16>, case 0: <L18>, case 1: <L1>, case 2: <L2>, case 3: <L4>, case 4: <L6>, case 5: <L8>, case 6: <L10>, case 7: <L12>, case 8: <L14>>

<L1>:
  _5 = *from_4(D);
  *to_6(D) = _5;
  goto <bb 12> (<L18>);

<L2>:
  t$a_8 = MEM[(struct unit *)from_4(D)];
  t$b_9 = MEM[(struct unit *)from_4(D) + 8B];
  MEM[(struct unit *)to_6(D)] = t$a_8;
  MEM[(struct unit *)to_6(D) + 8B] = t$b_9;
  goto <bb 12> (<L18>);

<L4>:
  t = MEM[(struct unit *)from_4(D)];
  MEM[(struct unit *)to_6(D)] = t;
  t ={v} {CLOBBER};
  goto <bb 12> (<L18>);

<L6>:
  t = MEM[(struct unit *)from_4(D)];
  MEM[(struct unit *)to_6(D)] = t;
  t ={v} {CLOBBER};
  goto <bb 12> (<L18>);

<L8>:
  t = MEM[(struct unit *)from_4(D)];
  MEM[(struct unit *)to_6(D)] = t;
  t ={v} {CLOBBER};
  goto <bb 12> (<L18>);

<L10>:
  t = MEM[(struct unit *)from_4(D)];
  MEM[(struct unit *)to_6(D)] = t;
  t ={v} {CLOBBER};
  goto <bb 12> (<L18>);

<L12>:
  t = MEM[(struct unit *)from_4(D)];
  MEM[(struct unit *)to_6(D)] = t;
  t ={v} {CLOBBER};
  goto <bb 12> (<L18>);

<L14>:
  t = MEM[(struct unit *)from_4(D)];
  MEM[(struct unit *)to_6(D)] = t;
  t ={v} {CLOBBER};
  goto <bb 12> (<L18>);

<L16>:
  _Copy_disjoint_words (from_4(D), to_6(D), count_2(D)); [tail call]

<L18>:
  return;
}
Thanks, Kugan