On Tue, Aug 16, 2022 at 7:38 PM Will Deacon <will@kernel.org> wrote:
On Tue, Aug 16, 2022 at 11:30:45PM +0900, Hector Martin wrote:
On 16/08/2022 23.04, Will Deacon wrote:
diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h
index 3096f086b5a3..71ab4ba9c25d 100644
--- a/include/asm-generic/bitops/atomic.h
+++ b/include/asm-generic/bitops/atomic.h
@@ -39,9 +39,6 @@ arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
 	unsigned long mask = BIT_MASK(nr);
 
 	p += BIT_WORD(nr);
-	if (READ_ONCE(*p) & mask)
-		return 1;
-
 	old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
 	return !!(old & mask);
 }
@@ -53,9 +50,6 @@ arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p)
 	unsigned long mask = BIT_MASK(nr);
 
 	p += BIT_WORD(nr);
-	if (!(READ_ONCE(*p) & mask))
-		return 0;
-
 	old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
 	return !!(old & mask);
 }
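For reference, with those fast paths gone the asm-generic helper unconditionally does the atomic RmW, i.e. arch_test_and_set_bit() ends up roughly as below (arch_test_and_clear_bit() is analogous with fetch_andnot):

static __always_inline int
arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
{
	long old;
	unsigned long mask = BIT_MASK(nr);

	p += BIT_WORD(nr);
	/* no READ_ONCE() fast path any more: always do the atomic fetch_or */
	old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
	return !!(old & mask);
}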
I suppose one sad thing about this is that, on arm64, we could reasonably keep the READ_ONCE() path with a DMB LD (R->RW) barrier before the return, but I don't think we can express that in the Linux memory model, so we end up in RmW territory every time.
You'd need a barrier *before* the READ_ONCE(), since what we're trying to prevent is the consumer writing to the value without having observed the writes that happened before it, while this side reads back the old value. A barrier after the READ_ONCE() doesn't do anything, as that read is the last memory operation in this thread (of the problematic sequence).
Right, having gone back to your litmus test, I now realise it's the "SB" (store buffering) shape from the memory ordering terminology. It's funny because the arm64 acquire/release instructions are RCsc, so upgrading the READ_ONCE() to an *arm64* acquire instruction would work for your specific case, but only because the preceding store is a release.
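To make the shape concrete, here is a rough kernel-C sketch of the SB pattern the old fast path exposed; the names (data, pending, submitter, owner) are illustrative, not the exact variables from the earlier litmus test:

static int data;		/* payload written by the submitter */
static unsigned long pending;	/* word carrying the PENDING-style bit */

/* CPU0 / submitter: publish the work, then peek at the bit (old fast path). */
static void submitter(void)
{
	WRITE_ONCE(data, 1);		/* W data */
	if (READ_ONCE(pending) & 1)	/* R pending -- no W->R ordering */
		return;			/* bit set: assume the owner will see data */
	/* otherwise: atomically set the bit and queue the work */
}

/* CPU1 / owner: clear the bit, then look for published work. */
static void owner(void)
{
	clear_bit(0, &pending);		/* W pending */
	if (!READ_ONCE(data))		/* R data */
		return;			/* sees data == 0: nothing to do */
	/* otherwise: process the work */
}

Without W->R ordering on either side, the outcome "submitter sees the stale set bit" and "owner sees data == 0" is allowed, so the work is never processed.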
At that point, I'm not sure DMB LD / early read / LSE atomic would be any faster than just always doing the LSE atomic?
It depends a lot on the configuration of the system and the state of the relevant cacheline, but generally avoiding an RmW by introducing a barrier is likely to be a win. It just gets ugly here as we'd want to avoid the DMB in the case where we end up doing the RmW. Possibly we could do something funky like a test-and-test-and-test-and-set (!) where we do the DMB+READ_ONCE() only if the first READ_ONCE() has the bit set, but even just typing that is horrible and I'd _absolutely_ want to see perf numbers to show that it's a benefit once you start taking into account things like branch prediction.
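In asm-generic terms the funky version would look something like the sketch below; the smp_mb() is just a stand-in for the DMB discussed above (which barrier flavour actually suffices, and whether any of this beats the plain RmW, is precisely what the perf numbers would have to show):

static __always_inline int
arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
{
	long old;
	unsigned long mask = BIT_MASK(nr);

	p += BIT_WORD(nr);
	if (READ_ONCE(*p) & mask) {		/* cheap first look, unordered */
		smp_mb();			/* pay for the barrier only on this path */
		if (READ_ONCE(*p) & mask)	/* ordered second look */
			return 1;
	}
	old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
	return !!(old & mask);
}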
Anywho, since Linus has applied the patch and it should work, this is just an interesting aside.
Will
It is moot if Linus has already taken the patch, but with a stock kernel config I am still seeing a slight performance dip, though only ~1-2% in the specific tests I was running. Sorry about the noise; I will need to look at my kernel builder and see what went wrong when I have more time.
Cheers,
Jon