On Wednesday, November 24, 2010 8:29:35 pm Peter Maydell wrote:
This wiki page came up during the toolchain call: https://wiki.linaro.org/Internal/People/KenWerner/AtomicMemoryOperations/
It gives the code generated for __sync_val_compare_and_swap as including a push {r4} / pop {r4} pair, because it uses too many temporaries to fit them all in the caller-saved (call-clobbered) registers. I think you can tweak it a bit to get rid of that:
# int __sync_val_compare_and_swap (int *mem, int old, int new);
# if the current value of *mem is old, then write new into *mem
# r0: mem, r1: old, r2: new
        mov     r3, r0          # move r0 into r3
        dmb     sy              # full memory barrier
.LSYT7:
        ldrex   r0, [r3]        # load (exclusive) from memory pointed to by r3 into r0
        cmp     r0, r1          # compare contents of r0 (mem) with r1 (old) -> updates the condition flags
        bne     .LSYB7          # branch to .LSYB7 if mem != old
        # This strex trashes the r0 we just loaded, but since we didn't take
        # the branch we know that r0 == r1
        strex   r0, r2, [r3]    # store r2 (new) into memory pointed to by r3 (mem);
                                # r0 contains 0 if the store was successful, otherwise 1
        teq     r0, #0          # compares contents of r0 with zero -> updates the condition flags
        bne     .LSYT7          # branch to .LSYT7 if r0 != 0 (if the store wasn't successful)
        # Move the value that was in memory into the right register to return it
        mov     r0, r1
        dmb     sy              # full memory barrier
.LSYB7:
        bx      lr              # return
I think you can do a similar trick with __sync_fetch_and_add (although you have to use a subtract to regenerate r0 from r1 and r2).
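Something like this, say (an untested sketch following the same pattern; the .LSYT8 label and exact scheduling are illustrative, not actual GCC output):

# int __sync_fetch_and_add (int *mem, int val);
# r0: mem, r1: val
        mov     r3, r0          # free up r0; r3 = mem
        dmb     sy              # full memory barrier
.LSYT8:
        ldrex   r0, [r3]        # r0 = old value of *mem
        add     r2, r0, r1      # r2 = old + val (the new value)
        strex   r0, r2, [r3]    # store r2; trashes r0 with the status code
        teq     r0, #0          # did the store succeed?
        bne     .LSYT8          # retry if not
        sub     r0, r2, r1      # regenerate the old value: r0 = new - val
        dmb     sy              # full memory barrier
        bx      lr              # return the old value

The sub works even if the add wrapped around, since the arithmetic is modulo 2^32 either way.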
On the other hand, I just looked at the gcc code that generates these sequences, and it's not simply dumping canned text out to the assembler, so maybe it's not worth the effort just to drop a stack push/pop.
Hi,
Attached is a small GCC patch that attempts to optimize the __sync_* builtins as described above. Since "or" and "(n)and" are not reversible (the old value cannot be recomputed from the result and the operand), the corresponding builtins still need the push/pop instructions; a sketch of the "or" case is below. Any suggestions or comments are welcome.
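For illustration, roughly what the non-reversible case looks like (untested sketch, not the patch's actual output; the .LSYT9 label is illustrative):

# int __sync_fetch_and_or (int *mem, int val);
# r0: mem, r1: val
        push    {r4}            # need one extra temporary
        mov     r3, r0          # r3 = mem
        dmb     sy              # full memory barrier
.LSYT9:
        ldrex   r0, [r3]        # r0 = old value of *mem
        orr     r4, r0, r1      # r4 = old | val; old can't be recovered from r4 and r1,
                                # so r0 has to stay live across the strex
        strex   r2, r4, [r3]    # status goes into r2 instead of trashing r0
        teq     r2, #0          # did the store succeed?
        bne     .LSYT9          # retry if not
        dmb     sy              # full memory barrier
        pop     {r4}
        bx      lr              # return the old value (still in r0)

Five values (old, val, new, address, status) are live at the strex, which is one more than the four argument/scratch registers the sequence otherwise uses, hence the push/pop.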
Regards,
Ken