This wiki page came up during the toolchain call: https://wiki.linaro.org/Internal/People/KenWerner/AtomicMemoryOperations/
It gives the code generated for __sync_val_compare_and_swap as including a push {r4} / pop {r4} pair, because the sequence needs more temporaries than fit in the registers it can clobber without saving (r0-r3). I think you can tweak it a bit to get rid of that:
# int __sync_val_compare_and_swap (int *mem, int old, int new);
# if the current value of *mem is old, then write new into *mem
# r0: mem, r1: old, r2: new
        mov     r3, r0          # move r0 into r3
        dmb     sy              # full memory barrier
.LSYT7:
        ldrex   r0, [r3]        # load (exclusive) from memory pointed to by r3 into r0
        cmp     r0, r1          # compare contents of r0 (mem) with r1 (old) -> updates the condition flags
        bne     .LSYB7          # branch to .LSYB7 if mem != old
        # This strex trashes the r0 we just loaded, but since we didn't take
        # the branch we know that r0 == r1
        strex   r0, r2, [r3]    # store r2 (new) into memory pointed to by r3 (mem);
                                # r0 contains 0 if the store was successful, otherwise 1
        teq     r0, #0          # compare contents of r0 with zero -> updates the condition flags
        bne     .LSYT7          # branch to .LSYT7 if r0 != 0 (the store wasn't successful)
        # Move the value that was in memory into the right register to return it
        mov     r0, r1
        dmb     sy              # full memory barrier
.LSYB7:
        bx      lr              # return
I think you can do a similar trick with __sync_fetch_and_add (although you have to use a subtract to regenerate r0 from r1 and r2).
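Something like this, say (the .LSYT8 label and the exact register assignments here are mine for illustration, not any compiler's actual output):

# int __sync_fetch_and_add (int *mem, int val);
# r0: mem, r1: val
        mov     r3, r0          # free up r0; r3 now holds mem
        dmb     sy              # full memory barrier
.LSYT8:
        ldrex   r2, [r3]        # r2 = old value of *mem
        add     r2, r2, r1      # r2 = old + val
        strex   r0, r2, [r3]    # try to store the sum; r0 = 0 on success
        teq     r0, #0
        bne     .LSYT8          # lost the exclusive reservation, retry
        sub     r0, r2, r1      # regenerate the old value: (old + val) - val
        dmb     sy              # full memory barrier
        bx      lr              # return the old value in r0

The point is that the sum stays live in r2 after the loop exits, so the old value never needs a register of its own.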
On the other hand I just looked at the gcc code that does this and it's not simply dumping canned sequences out to the assembler, so maybe it's not worth the effort just to drop a stack push/pop.
-- PMM
(I've logged this as a potential speed improvement at LP: #681138 so we don't lose it)
-- Michael
On Wednesday, November 24, 2010 8:29:35 pm Peter Maydell wrote:
This wiki page came up during the toolchain call: https://wiki.linaro.org/Internal/People/KenWerner/AtomicMemoryOperations/
The page was just moved to: https://wiki.linaro.org/WorkingGroups/ToolChain/AtomicMemoryOperations
strex r0, r2, [r3] # store r2 (new) into memory pointed to by r3 (mem)
Initially I thought r2 could be used for the result, but strex doesn't allow its status register to be the same as the register being stored. Using r0 instead is a good idea.
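To illustrate the constraint (these two lines are just examples, not taken from the generated code) -- the strex status register must be distinct from both the register being stored and the base register:

        strex   r2, r2, [r3]    # not allowed: status register same as the stored register
        strex   r0, r2, [r3]    # fine: r0 is free at this point, since it compared equal to r1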
Thanks! Ken
On 24 November 2010 21:18, Ken Werner ken.werner@linaro.org wrote:
On Wednesday, November 24, 2010 8:29:35 pm Peter Maydell wrote:
strex r0, r2, [r3] # store r2 (new) into memory pointed to by r3 (mem)
(Apologies for the linewrap damage, by the way -- blame google mail.)
Initially I thought r2 could be used for the result but strex doesn't allow the return register to be the same as the store register.
You can't put the result in r2 anyway, because you're going to need r2's current value again if the strex fails and you have to loop back to retry the load...
(I think the push/pop is unavoidable for the sync_fetch_and_foo where foo is a non-reversible operation like or/and.)
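A sketch of the or case shows why (again, the .LSYT9 label and register assignments are illustrative only): at the strex, five values are live -- mem, val, the old value for the return, the new value, and the store status -- one more than r0-r3 can hold.

# int __sync_fetch_and_or (int *mem, int val);
# r0: mem, r1: val
        push    {r4}            # need a fifth register: the old value can't be
                                # recovered from (old | val) and val
        mov     r3, r0          # r3 = mem
        dmb     sy              # full memory barrier
.LSYT9:
        ldrex   r0, [r3]        # r0 = old value (must survive to be returned)
        orr     r2, r0, r1      # r2 = old | val
        strex   r4, r2, [r3]    # r4 = 0 if the store succeeded
        teq     r4, #0
        bne     .LSYT9          # retry; r1 and r3 are still needed, r0 is reloaded
        dmb     sy              # full memory barrier
        pop     {r4}
        bx      lr              # return the old value in r0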
-- PMM
Hi,
Attached is a small GCC patch that attempts to optimize the __sync_* builtins as described above. Since "or" and "(n)and" are non-reversible, the corresponding builtins still need the push/pop instructions. Any suggestions or comments are welcome.
Regards Ken