Hi,
I think I have identified some issues with the atomic builtins, but I would like your advice. For instance:
A: __atomic_store_n (addr, val, __ATOMIC_SEQ_CST);
gives the armv7 code:
        DMB     sy
        STR     r1, [r0]
        DMB     sy
but if I have understood correctly, the DMB instructions only provide the property that the code is sequentially consistent, not the atomicity, for which we have to use the LDREX/STREX instructions. Thus I think that the code should be:
        DMB     sy
1:      LDREX   r2, [r0]
        STREX   r1, r2, [r0]
        TEQ     r1, #0
        BNE     1b
B: __atomic_load_n (addr, __ATOMIC_ACQUIRE);
gives the armv7 code:
        DMB     sy
        LDR     r0, [r0]
but the load-acquire semantics specify that all loads and stores appearing in program order after the load-acquire will be observed after it, so shouldn't the DMB be after the LDR?
-- Yvan
Hi Yvan,
There is no issue in the first case. You are correct that the DMBs there are to ensure the sequential consistency you'd want to see with __ATOMIC_SEQ_CST in the call to the builtin. However, what you must remember is that STRs are guaranteed to be single-copy atomic by the architecture on v7-A. In general, on v7-A there is single-copy atomicity for loads and stores of bytes, halfwords (2-byte aligned) and words (4-byte aligned), i.e. ldrb, ldrh and ldr.
On systems with LPAE enabled, ldrd and strd are also guaranteed to be atomic copies from 64-bit aligned addresses; thus in general on v7-A you should be using the ldrexd / strexd variant for the 64-bit case. The code for the first example should therefore work correctly as is and does *not* require an ldrex / strex implementation.
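For concreteness, here is a rough illustration of the 64-bit case. The function name and the commented assembly are only a sketch of the shape the expansion should take, not the actual GCC output:

    #include <stdint.h>

    /* Sketch only: on ARMv7-A a 64-bit __ATOMIC_SEQ_CST store is expected to
       expand to an ldrexd/strexd retry loop bracketed by dmb barriers, along
       the lines of:

               dmb     sy
          1:   ldrexd  r2, r3, [r0]        @ take the exclusive reservation
               strexd  r2, r4, r5, [r0]    @ try to store the new value
               teq     r2, #0
               bne     1b
               dmb     sy

       rather than to a plain strd, which is only single-copy atomic when the
       implementation includes LPAE.  */
    void
    store64_seq_cst (uint64_t *addr, uint64_t val)
    {
      __atomic_store_n (addr, val, __ATOMIC_SEQ_CST);
    }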
The second case always hurts my head. The short story is that you are right and it looks like a bug. The barrier needs to be after the load and not before, because `Acquire' semantics imply that no reads in the current thread dependent on the value currently loaded can be reordered before this load. What this means is that loads before this operation can percolate downwards towards the barrier.
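A minimal sketch of the expansion the acquire semantics call for, written as C with an explicit barrier purely for illustration (ARMv7-specific, and not what the compiler currently emits):

    /* Sketch of what __atomic_load_n (addr, __ATOMIC_ACQUIRE) should produce
       on ARMv7-A: the load first, then the barrier, so that later accesses
       cannot be observed before it.  */
    static inline int
    load_acquire (int *addr)
    {
      int val = *(volatile int *) addr;               /* LDR r0, [r0] */
      __asm__ __volatile__ ("dmb sy" ::: "memory");   /* DMB sy       */
      return val;
    }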
The other clue to this is the definition around the `Release' semantics, where no writes in the current thread can be reordered after this store. This implies that the write operation puts out the barrier *before* the write, which is what we do with __atomic_store_n (addr, 0, __ATOMIC_RELEASE), and that appears to be correct.
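The mirror image for release, again only a sketch for illustration, matches what the current expansion already does:

    /* Sketch of the release store: barrier first, then a plain store, so that
       earlier accesses cannot be observed after it.  */
    static inline void
    store_release (int *addr, int val)
    {
      __asm__ __volatile__ ("dmb sy" ::: "memory");   /* DMB sy        */
      *(volatile int *) addr = val;                   /* STR r1, [r0]  */
    }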
However, what's not clear to me is why there is this deliberate choice in the default implementation of expand_atomic_load to put out a memory fence before the load, and that seems to have percolated back down into the backend for atomic_loaddi. We should take this up upstream as a bug.
Regards, Ramana
P.S. 32 bit v8 would make life oh, so much easier in this area with proper load acquire and store release.
On 22 November 2012 20:31, Ramana Radhakrishnan <Ramana.Radhakrishnan@arm.com> wrote:
On systems with LPAE enabled, ldrd and strd are also guaranteed to be atomic copies from 64 bit aligned addresses
Not quite (though this is splitting a rather fine hair). ldrd/strd are 64 bit single-copy atomic for 64 bit aligned addresses on implementations which *include* LPAE, whether LPAE happens to be enabled/in use by the OS or not.
-- PMM
Hi Ramana and Peter,
There is no issue in the first case. You are correct that the DMBs there are to ensure the sequential consistency you'd want to see with __ATOMIC_SEQ_CST in the call to the builtin. However, what you must remember is that STRs are guaranteed to be single-copy atomic by the architecture on v7-A. In general, on v7-A there is single-copy atomicity for loads and stores of bytes, halfwords (2-byte aligned) and words (4-byte aligned), i.e. ldrb, ldrh and ldr.
OK, but if I understand the ARM ARM atomicity chapter (A3.5.3) correctly, single-copy atomicity is not guaranteed for unaligned accesses, so maybe I missed something, but if it is allowed to call an atomic builtin on an unaligned address, we could have an issue.
On systems with LPAE enabled, ldrd and strd are also guaranteed to be atomic copies from 64-bit aligned addresses; thus in general on v7-A you should be using the ldrexd / strexd variant for the 64-bit case. The code for the first example should therefore work correctly as is and does *not* require an ldrex / strex implementation.
The second case always hurts my head. The short story is that you are right and it looks like a bug. The barrier needs to be after the load and not before, because `Acquire' semantics imply that no reads in the current thread dependent on the value currently loaded can be reordered before this load. What this means is that loads before this operation can percolate downwards towards the barrier.
The other clue to this is the definition around the `Release' semantics, where no writes in the current thread can be reordered after this store. This implies that the write operation puts out the barrier *before* the write, which is what we do with __atomic_store_n (addr, 0, __ATOMIC_RELEASE), and that appears to be correct.
I agree.
However, what's not clear to me is why there is this deliberate choice in the default implementation of expand_atomic_load to put out a memory fence before the load, and that seems to have percolated back down into the backend for atomic_loaddi. We should take this up upstream as a bug.
It is not clear to me either; I would have written expand_atomic_load like this:
  if (model == MEMMODEL_SEQ_CST)
    expand_mem_thread_fence (model);

  emit_move_insn (target, mem);

  if (model == MEMMODEL_SEQ_CST || model == MEMMODEL_ACQUIRE)
    expand_mem_thread_fence (model);
I'll ask the question upstream.
Yvan
Yvan,
That is correct. The addresses need to be aligned as per the restrictions in the architecture. Yes, we could have an issue, but software writers need to deal with it because IIUC you either get a huge performance penalty (x86 / lock prefix) or correctness issues (ARM, PowerPC).
Cheers, Ramana
OK, thanks. I buy the performance argument, and I found on the wiki page below that these issues will be addressed in GCC/libatomic in the 4.9 release.
http://gcc.gnu.org/wiki/Atomic/GCCMM/UnalignedPolicy
Cheers, Yvan
It's a correctness issue on ARM. It won't work until the architecture changes. If it isn't correct, there's no point in being fast :). Taking this off list as I wasn't 100% sure what you meant there.
If you really need something like that, you'll need a library call, and the library call deals with it by doing an atomic access of the correspondingly rounded-up size from an appropriately rounded-down address.
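Roughly, the idea is something like the sketch below. The helper name is made up and the real libatomic routines differ; it also assumes a little-endian target and a halfword that does not straddle the containing 32-bit word (otherwise the next larger access size is needed):

    #include <stdint.h>

    /* Hypothetical sketch of the rounding trick for an unaligned 16-bit atomic
       store: round the address down to a 4-byte boundary, then run a word-sized
       compare-and-swap loop that only replaces the two bytes of interest.  */
    static void
    unaligned_store16 (uint16_t *addr, uint16_t val)
    {
      uintptr_t a = (uintptr_t) addr;
      uint32_t *word = (uint32_t *) (a & ~(uintptr_t) 3);  /* round the address down  */
      unsigned shift = (unsigned) (a & 3) * 8;             /* byte offset in the word */
      uint32_t mask = (uint32_t) 0xffff << shift;

      uint32_t old = __atomic_load_n (word, __ATOMIC_RELAXED);
      uint32_t new_word;
      do
        {
          /* Splice the new halfword into the containing word.  */
          new_word = (old & ~mask) | ((uint32_t) val << shift);
        }
      while (!__atomic_compare_exchange_n (word, &old, new_word, 0,
                                           __ATOMIC_SEQ_CST, __ATOMIC_RELAXED));
    }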
regards, Ramana
OK, sorry if I wasn't clear. I understand the mechanism with the library call, but what I don't understand is why the code below, where the load exclusive establishes the address reservation and we retry until the store exclusive succeeds, doesn't fix the problem.
1:      LDREX   r2, [r0]
        STREX   r1, r2, [r0]
        TEQ     r1, #0
        BNE     1b
Cheers, Yvan