This wiki page came up during the toolchain call: https://wiki.linaro.org/Internal/People/KenWerner/AtomicMemoryOperations/
It gives the code generated for __sync_val_compare_and_swap as including a push {r4} / pop {r4} pair, because the sequence needs more temporaries than fit in the registers it can clobber without saving (r0-r3). I think you can tweak it a bit to get rid of that:
# int __sync_val_compare_and_swap (int *mem, int old, int new);
# if the current value of *mem is old, then write new into *mem
# r0: mem, r1: old, r2: new
        mov     r3, r0          # move r0 into r3
        dmb     sy              # full memory barrier
.LSYT7:
        ldrex   r0, [r3]        # load (exclusive) from memory pointed to by r3 into r0
        cmp     r0, r1          # compare contents of r0 (mem) with r1 (old) -> updates the condition flags
        bne     .LSYB7          # branch to .LSYB7 if mem != old
        # This strex trashes the r0 we just loaded, but since we didn't take
        # the branch we know that r0 == r1
        strex   r0, r2, [r3]    # store r2 (new) into memory pointed to by r3 (mem);
                                # r0 contains 0 if the store was successful, otherwise 1
        teq     r0, #0          # compare contents of r0 with zero -> updates the condition flags
        bne     .LSYT7          # branch to .LSYT7 if r0 != 0 (the store wasn't successful)
        # Move the value that was in memory into the right register to return it
        mov     r0, r1
        dmb     sy              # full memory barrier
.LSYB7:
        bx      lr              # return
I think you can do a similar trick with __sync_fetch_and_add (although you have to use a subtract to regenerate r0 from r1 and r2).
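Something like this, say (the .LSYT8 label and the exact register assignments here are mine for illustration, not any compiler's actual output):

# int __sync_fetch_and_add (int *mem, int val);
# r0: mem, r1: val
        mov     r3, r0          # free up r0; r3 now holds mem
        dmb     sy              # full memory barrier
.LSYT8:
        ldrex   r2, [r3]        # r2 = old value of *mem
        add     r2, r2, r1      # r2 = old + val
        strex   r0, r2, [r3]    # try to store the sum; r0 = 0 on success
        teq     r0, #0
        bne     .LSYT8          # lost the exclusive reservation, retry
        sub     r0, r2, r1      # regenerate the old value: (old + val) - val
        dmb     sy              # full memory barrier
        bx      lr              # return the old value in r0

The point is that the sum stays live in r2 after the loop exits, so the old value never needs a register of its own.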
On the other hand I just looked at the gcc code that does this and it's not simply dumping canned sequences out to the assembler, so maybe it's not worth the effort just to drop a stack push/pop.
-- PMM
(I've logged this as a potential speed improvement at LP: #681138 so we don't lose it)
-- Michael
On Wednesday, November 24, 2010 8:29:35 pm Peter Maydell wrote:
This wiki page came up during the toolchain call: https://wiki.linaro.org/Internal/People/KenWerner/AtomicMemoryOperations/
The page was just moved to: https://wiki.linaro.org/WorkingGroups/ToolChain/AtomicMemoryOperations
strex r0, r2, [r3] # store r2 (new) into memory pointed to by r3 (mem)
Initially I thought r2 could be used for the result, but strex doesn't allow its status register to be the same as the register being stored. Using r0 instead is a good idea.
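To illustrate the constraint (these two lines are just examples, not taken from the generated code) -- the strex status register must be distinct from both the register being stored and the base register:

        strex   r2, r2, [r3]    # not allowed: status register same as the stored register
        strex   r0, r2, [r3]    # fine: r0 is free at this point, since it compared equal to r1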
Thanks! Ken
On 24 November 2010 21:18, Ken Werner ken.werner@linaro.org wrote:
On Wednesday, November 24, 2010 8:29:35 pm Peter Maydell wrote:
strex r0, r2, [r3] # store r2 (new) into memory pointed to by r3 (mem)
(Apologies for the linewrap damage, by the way -- blame google mail.)
Initially I thought r2 could be used for the result but strex doesn't allow the return register to be the same as the store register.
You can't put the result in r2 anyway, because you're going to need r2's current value again if the strex fails and you have to loop back to retry the load...
(I think the push/pop is unavoidable for the sync_fetch_and_foo where foo is a non-reversible operation like or/and.)
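A sketch of the or case shows why (again, the .LSYT9 label and register assignments are illustrative only): at the strex, five values are live -- mem, val, the old value for the return, the new value, and the store status -- one more than r0-r3 can hold.

# int __sync_fetch_and_or (int *mem, int val);
# r0: mem, r1: val
        push    {r4}            # need a fifth register: the old value can't be
                                # recovered from (old | val) and val
        mov     r3, r0          # r3 = mem
        dmb     sy              # full memory barrier
.LSYT9:
        ldrex   r0, [r3]        # r0 = old value (must survive to be returned)
        orr     r2, r0, r1      # r2 = old | val
        strex   r4, r2, [r3]    # r4 = 0 if the store succeeded
        teq     r4, #0
        bne     .LSYT9          # retry; r1 and r3 are still needed, r0 is reloaded
        dmb     sy              # full memory barrier
        pop     {r4}
        bx      lr              # return the old value in r0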
-- PMM
Hi,
Attached is a small GCC patch that attempts to optimize the __sync_* builtins as described above. Since "or" and "(n)and" are non-reversible, the corresponding builtins still need the push/pop instructions. Any suggestions or comments are welcome.
Regards Ken