On Tue, Aug 16, 2022 at 7:38 PM Will Deacon <will@kernel.org> wrote:
On Tue, Aug 16, 2022 at 11:30:45PM +0900, Hector Martin wrote:
On 16/08/2022 23.04, Will Deacon wrote:
diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h
index 3096f086b5a3..71ab4ba9c25d 100644
--- a/include/asm-generic/bitops/atomic.h
+++ b/include/asm-generic/bitops/atomic.h
@@ -39,9 +39,6 @@ arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
 	unsigned long mask = BIT_MASK(nr);
 
 	p += BIT_WORD(nr);
-	if (READ_ONCE(*p) & mask)
-		return 1;
-
 	old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
 	return !!(old & mask);
 }
@@ -53,9 +50,6 @@ arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p)
 	unsigned long mask = BIT_MASK(nr);
 
 	p += BIT_WORD(nr);
-	if (!(READ_ONCE(*p) & mask))
-		return 0;
-
 	old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
 	return !!(old & mask);
 }
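For reference, with those fast paths gone the asm-generic helper unconditionally does the atomic RmW, i.e. arch_test_and_set_bit() ends up roughly as below (arch_test_and_clear_bit() is analogous with fetch_andnot):

static __always_inline int
arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
{
	long old;
	unsigned long mask = BIT_MASK(nr);

	p += BIT_WORD(nr);
	/* no READ_ONCE() fast path any more: always do the atomic fetch_or */
	old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
	return !!(old & mask);
}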
I suppose one sad thing about this is that, on arm64, we could reasonably keep the READ_ONCE() path with a DMB LD (R->RW) barrier before the return, but I don't think we can express that in the Linux memory model, so we end up in RmW territory every time.
You'd need a barrier *before* the READ_ONCE(), since what we're trying to prevent is the consumer writing to the value without having observed the writes that happened before it, while this side reads back the old value. A barrier after the READ_ONCE() doesn't do anything, as that read is the last memory operation in this thread (of the problematic sequence).
Right, having gone back to your litmus test, I now realise it's the "SB" (store buffering) shape from the memory ordering terminology. It's funny because the arm64 acquire/release instructions are RCsc, so upgrading the READ_ONCE() to an *arm64* acquire instruction would work for your specific case, but only because the preceding store is a release.
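To make the shape concrete, here is a rough kernel-C sketch of the SB pattern the old fast path exposed; the names (data, pending, submitter, owner) are illustrative, not the exact variables from the earlier litmus test:

static int data;		/* payload written by the submitter */
static unsigned long pending;	/* word carrying the PENDING-style bit */

/* CPU0 / submitter: publish the work, then peek at the bit (old fast path). */
static void submitter(void)
{
	WRITE_ONCE(data, 1);		/* W data */
	if (READ_ONCE(pending) & 1)	/* R pending -- no W->R ordering */
		return;			/* bit set: assume the owner will see data */
	/* otherwise: atomically set the bit and queue the work */
}

/* CPU1 / owner: clear the bit, then look for published work. */
static void owner(void)
{
	clear_bit(0, &pending);		/* W pending */
	if (!READ_ONCE(data))		/* R data */
		return;			/* sees data == 0: nothing to do */
	/* otherwise: process the work */
}

Without W->R ordering on either side, the outcome "submitter sees the stale set bit" and "owner sees data == 0" is allowed, so the work is never processed.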
At that point, I'm not sure DMB LD / early read / LSE atomic would be any faster than just always doing the LSE atomic?
It depends a lot on the configuration of the system and the state of the relevant cacheline, but generally avoiding an RmW by introducing a barrier is likely to be a win. It just gets ugly here as we'd want to avoid the DMB in the case where we end up doing the RmW. Possibly we could do something funky like a test-and-test-and-test-and-set (!) where we do the DMB+READ_ONCE() only if the first READ_ONCE() has the bit set, but even just typing that is horrible and I'd _absolutely_ want to see perf numbers to show that it's a benefit once you start taking into account things like branch prediction.
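In asm-generic terms the funky version would look something like the sketch below; the smp_mb() is just a stand-in for the DMB discussed above (which barrier flavour actually suffices, and whether any of this beats the plain RmW, is precisely what the perf numbers would have to show):

static __always_inline int
arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
{
	long old;
	unsigned long mask = BIT_MASK(nr);

	p += BIT_WORD(nr);
	if (READ_ONCE(*p) & mask) {		/* cheap first look, unordered */
		smp_mb();			/* pay for the barrier only on this path */
		if (READ_ONCE(*p) & mask)	/* ordered second look */
			return 1;
	}
	old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
	return !!(old & mask);
}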
Anywho, since Linus has applied the patch and it should work, this is just an interesting aside.
Will
It is moot if Linus has already taken the patch, but with a stock kernel config I am still seeing a slight performance dip, though only ~1-2% in the specific tests I was running. Sorry about the noise; I will need to look at my kernel builder and see what went wrong when I have more time.
Cheers,
Jon