These operations are documented as always ordered in include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer type use cases where one side needs to ensure a flag is left pending after some shared data was updated rely on this ordering, even in the failure case.
This is the case with the workqueue code, which currently suffers from a reproducible ordering violation on Apple M1 platforms (which are notoriously out-of-order) that ends up causing the TTY layer to fail to deliver data to userspace properly under the right conditions. This change fixes that bug.
Change the documentation to restrict the "no order on failure" story to the _lock() variant (for which it makes sense), and remove the early-exit from the generic implementation, which is what causes the missing barrier semantics in that case. Without this, the remaining atomic op is fully ordered (including on ARM64 LSE, as of recent versions of the architecture spec).
Suggested-by: Linus Torvalds torvalds@linux-foundation.org Cc: stable@vger.kernel.org Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs") Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()") Signed-off-by: Hector Martin marcan@marcan.st --- Documentation/atomic_bitops.txt | 2 +- include/asm-generic/bitops/atomic.h | 6 ------ 2 files changed, 1 insertion(+), 7 deletions(-)
diff --git a/Documentation/atomic_bitops.txt b/Documentation/atomic_bitops.txt index 093cdaefdb37..d8b101c97031 100644 --- a/Documentation/atomic_bitops.txt +++ b/Documentation/atomic_bitops.txt @@ -59,7 +59,7 @@ Like with atomic_t, the rule of thumb is: - RMW operations that have a return value are fully ordered.
- RMW operations that are conditional are unordered on FAILURE, - otherwise the above rules apply. In the case of test_and_{}_bit() operations, + otherwise the above rules apply. In the case of test_and_set_bit_lock(), if the bit in memory is unchanged by the operation then it is deemed to have failed.
diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h index 3096f086b5a3..71ab4ba9c25d 100644 --- a/include/asm-generic/bitops/atomic.h +++ b/include/asm-generic/bitops/atomic.h @@ -39,9 +39,6 @@ arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p) unsigned long mask = BIT_MASK(nr);
p += BIT_WORD(nr); - if (READ_ONCE(*p) & mask) - return 1; - old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p); return !!(old & mask); } @@ -53,9 +50,6 @@ arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p) unsigned long mask = BIT_MASK(nr);
p += BIT_WORD(nr); - if (!(READ_ONCE(*p) & mask)) - return 0; - old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p); return !!(old & mask); }
On Tue, Aug 16, 2022 at 9:03 AM Hector Martin marcan@marcan.st wrote:
These operations are documented as always ordered in include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer type use cases where one side needs to ensure a flag is left pending after some shared data was updated rely on this ordering, even in the failure case.
This is the case with the workqueue code, which currently suffers from a reproducible ordering violation on Apple M1 platforms (which are notoriously out-of-order) that ends up causing the TTY layer to fail to deliver data to userspace properly under the right conditions. This change fixes that bug.
Change the documentation to restrict the "no order on failure" story to the _lock() variant (for which it makes sense), and remove the early-exit from the generic implementation, which is what causes the missing barrier semantics in that case. Without this, the remaining atomic op is fully ordered (including on ARM64 LSE, as of recent versions of the architecture spec).
Suggested-by: Linus Torvalds torvalds@linux-foundation.org Cc: stable@vger.kernel.org Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs") Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()") Signed-off-by: Hector Martin marcan@marcan.st
Documentation/atomic_bitops.txt | 2 +- include/asm-generic/bitops/atomic.h | 6 ------
I double-checked all the architecture specific implementations to ensure that the asm-generic one is the only one that needs the fix.
I assume this gets merged through the locking tree or that Linus picks it up directly, not through my asm-generic tree.
Reviewed-by: Arnd Bergmann arnd@arndb.de
On Tue, Aug 16, 2022 at 10:17 AM Arnd Bergmann arnd@arndb.de wrote:
On Tue, Aug 16, 2022 at 9:03 AM Hector Martin marcan@marcan.st wrote:
These operations are documented as always ordered in include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer type use cases where one side needs to ensure a flag is left pending after some shared data was updated rely on this ordering, even in the failure case.
This is the case with the workqueue code, which currently suffers from a reproducible ordering violation on Apple M1 platforms (which are notoriously out-of-order) that ends up causing the TTY layer to fail to deliver data to userspace properly under the right conditions. This change fixes that bug.
Change the documentation to restrict the "no order on failure" story to the _lock() variant (for which it makes sense), and remove the early-exit from the generic implementation, which is what causes the missing barrier semantics in that case. Without this, the remaining atomic op is fully ordered (including on ARM64 LSE, as of recent versions of the architecture spec).
Suggested-by: Linus Torvalds torvalds@linux-foundation.org Cc: stable@vger.kernel.org Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs") Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()") Signed-off-by: Hector Martin marcan@marcan.st
Documentation/atomic_bitops.txt | 2 +- include/asm-generic/bitops/atomic.h | 6 ------
I double-checked all the architecture specific implementations to ensure that the asm-generic one is the only one that needs the fix.
I assume this gets merged through the locking tree or that Linus picks it up directly, not through my asm-generic tree.
Reviewed-by: Arnd Bergmann arnd@arndb.de
linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Testing this patch on pre Armv8.1 specifically Cortex-A72 and Cortex-A53 cores I am seeing a huge performance drop with this patch applied. Perf is showing lock_is_held_type() as the worst offender but that could just be the function getting blamed. The most obvious indicator of the performance loss is ssh throughput. With the patch I am only able to achieve around 20MB/s and without the patch I am able to transfer around 112MB/s, no other changes.
When I have more time I can do some more in depth testing, but for now I just wanted to bring this issue up so perhaps others can chime in regarding how it performs on their hardware.
-Jon
On Tue, Aug 16, 2022 at 02:29:49PM +0200, Jon Nettleton wrote:
On Tue, Aug 16, 2022 at 10:17 AM Arnd Bergmann arnd@arndb.de wrote:
On Tue, Aug 16, 2022 at 9:03 AM Hector Martin marcan@marcan.st wrote:
These operations are documented as always ordered in include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer type use cases where one side needs to ensure a flag is left pending after some shared data was updated rely on this ordering, even in the failure case.
This is the case with the workqueue code, which currently suffers from a reproducible ordering violation on Apple M1 platforms (which are notoriously out-of-order) that ends up causing the TTY layer to fail to deliver data to userspace properly under the right conditions. This change fixes that bug.
Change the documentation to restrict the "no order on failure" story to the _lock() variant (for which it makes sense), and remove the early-exit from the generic implementation, which is what causes the missing barrier semantics in that case. Without this, the remaining atomic op is fully ordered (including on ARM64 LSE, as of recent versions of the architecture spec).
Suggested-by: Linus Torvalds torvalds@linux-foundation.org Cc: stable@vger.kernel.org Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs") Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()") Signed-off-by: Hector Martin marcan@marcan.st
Documentation/atomic_bitops.txt | 2 +- include/asm-generic/bitops/atomic.h | 6 ------
I double-checked all the architecture specific implementations to ensure that the asm-generic one is the only one that needs the fix.
I assume this gets merged through the locking tree or that Linus picks it up directly, not through my asm-generic tree.
Reviewed-by: Arnd Bergmann arnd@arndb.de
linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Testing this patch on pre Armv8.1 specifically Cortex-A72 and Cortex-A53 cores I am seeing a huge performance drop with this patch applied. Perf is showing lock_is_held_type() as the worst offender
Hmm, that should only exist if LOCKDEP is enabled and performance tends to go out of the window if you have that on. Can you reproduce the same regression with lockdep disabled?
Will
On Tue, Aug 16, 2022 at 3:01 PM Will Deacon will@kernel.org wrote:
On Tue, Aug 16, 2022 at 02:29:49PM +0200, Jon Nettleton wrote:
On Tue, Aug 16, 2022 at 10:17 AM Arnd Bergmann arnd@arndb.de wrote:
On Tue, Aug 16, 2022 at 9:03 AM Hector Martin marcan@marcan.st wrote:
These operations are documented as always ordered in include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer type use cases where one side needs to ensure a flag is left pending after some shared data was updated rely on this ordering, even in the failure case.
This is the case with the workqueue code, which currently suffers from a reproducible ordering violation on Apple M1 platforms (which are notoriously out-of-order) that ends up causing the TTY layer to fail to deliver data to userspace properly under the right conditions. This change fixes that bug.
Change the documentation to restrict the "no order on failure" story to the _lock() variant (for which it makes sense), and remove the early-exit from the generic implementation, which is what causes the missing barrier semantics in that case. Without this, the remaining atomic op is fully ordered (including on ARM64 LSE, as of recent versions of the architecture spec).
Suggested-by: Linus Torvalds torvalds@linux-foundation.org Cc: stable@vger.kernel.org Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs") Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()") Signed-off-by: Hector Martin marcan@marcan.st
Documentation/atomic_bitops.txt | 2 +- include/asm-generic/bitops/atomic.h | 6 ------
I double-checked all the architecture specific implementations to ensure that the asm-generic one is the only one that needs the fix.
I assume this gets merged through the locking tree or that Linus picks it up directly, not through my asm-generic tree.
Reviewed-by: Arnd Bergmann arnd@arndb.de
linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Testing this patch on pre Armv8.1 specifically Cortex-A72 and Cortex-A53 cores I am seeing a huge performance drop with this patch applied. Perf is showing lock_is_held_type() as the worst offender
Hmm, that should only exist if LOCKDEP is enabled and performance tends to go out of the window if you have that on. Can you reproduce the same regression with lockdep disabled?
Will
Yep I am working on it. We should note that
config LOCKDEP_SUPPORT def_bool y
is the default for arm64
-Jon
On Tue, 16 Aug 2022 14:05:54 +0100, Jon Nettleton jon@solid-run.com wrote:
On Tue, Aug 16, 2022 at 3:01 PM Will Deacon will@kernel.org wrote:
On Tue, Aug 16, 2022 at 02:29:49PM +0200, Jon Nettleton wrote:
On Tue, Aug 16, 2022 at 10:17 AM Arnd Bergmann arnd@arndb.de wrote:
On Tue, Aug 16, 2022 at 9:03 AM Hector Martin marcan@marcan.st wrote:
These operations are documented as always ordered in include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer type use cases where one side needs to ensure a flag is left pending after some shared data was updated rely on this ordering, even in the failure case.
This is the case with the workqueue code, which currently suffers from a reproducible ordering violation on Apple M1 platforms (which are notoriously out-of-order) that ends up causing the TTY layer to fail to deliver data to userspace properly under the right conditions. This change fixes that bug.
Change the documentation to restrict the "no order on failure" story to the _lock() variant (for which it makes sense), and remove the early-exit from the generic implementation, which is what causes the missing barrier semantics in that case. Without this, the remaining atomic op is fully ordered (including on ARM64 LSE, as of recent versions of the architecture spec).
Suggested-by: Linus Torvalds torvalds@linux-foundation.org Cc: stable@vger.kernel.org Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs") Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()") Signed-off-by: Hector Martin marcan@marcan.st
Documentation/atomic_bitops.txt | 2 +- include/asm-generic/bitops/atomic.h | 6 ------
I double-checked all the architecture specific implementations to ensure that the asm-generic one is the only one that needs the fix.
I assume this gets merged through the locking tree or that Linus picks it up directly, not through my asm-generic tree.
Reviewed-by: Arnd Bergmann arnd@arndb.de
linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Testing this patch on pre Armv8.1 specifically Cortex-A72 and Cortex-A53 cores I am seeing a huge performance drop with this patch applied. Perf is showing lock_is_held_type() as the worst offender
Hmm, that should only exist if LOCKDEP is enabled and performance tends to go out of the window if you have that on. Can you reproduce the same regression with lockdep disabled?
Will
Yep I am working on it. We should note that
config LOCKDEP_SUPPORT def_bool y
is the default for arm64
Yes, as the architecture supports LOCKDEP. However, you probably have something like CONFIG_PROVE_LOCKING to see such a performance hit (and that's definitely not on by default).
M.
On Tue, Aug 16, 2022 at 10:16:04AM +0200, Arnd Bergmann wrote:
On Tue, Aug 16, 2022 at 9:03 AM Hector Martin marcan@marcan.st wrote:
These operations are documented as always ordered in include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer type use cases where one side needs to ensure a flag is left pending after some shared data was updated rely on this ordering, even in the failure case.
This is the case with the workqueue code, which currently suffers from a reproducible ordering violation on Apple M1 platforms (which are notoriously out-of-order) that ends up causing the TTY layer to fail to deliver data to userspace properly under the right conditions. This change fixes that bug.
Change the documentation to restrict the "no order on failure" story to the _lock() variant (for which it makes sense), and remove the early-exit from the generic implementation, which is what causes the missing barrier semantics in that case. Without this, the remaining atomic op is fully ordered (including on ARM64 LSE, as of recent versions of the architecture spec).
Suggested-by: Linus Torvalds torvalds@linux-foundation.org Cc: stable@vger.kernel.org Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs") Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()") Signed-off-by: Hector Martin marcan@marcan.st
Documentation/atomic_bitops.txt | 2 +- include/asm-generic/bitops/atomic.h | 6 ------
I double-checked all the architecture specific implementations to ensure that the asm-generic one is the only one that needs the fix.
I couldn't figure out parisc -- do you know what ordering their spinlocks provide? They have a comment talking about a release, but I don't know what the ordering guarantees of an "ldcw" are.
Will
On Tue, Aug 16, 2022 at 03:06:41PM +0100, Will Deacon wrote:
On Tue, Aug 16, 2022 at 10:16:04AM +0200, Arnd Bergmann wrote:
On Tue, Aug 16, 2022 at 9:03 AM Hector Martin marcan@marcan.st wrote:
These operations are documented as always ordered in include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer type use cases where one side needs to ensure a flag is left pending after some shared data was updated rely on this ordering, even in the failure case.
This is the case with the workqueue code, which currently suffers from a reproducible ordering violation on Apple M1 platforms (which are notoriously out-of-order) that ends up causing the TTY layer to fail to deliver data to userspace properly under the right conditions. This change fixes that bug.
Change the documentation to restrict the "no order on failure" story to the _lock() variant (for which it makes sense), and remove the early-exit from the generic implementation, which is what causes the missing barrier semantics in that case. Without this, the remaining atomic op is fully ordered (including on ARM64 LSE, as of recent versions of the architecture spec).
Suggested-by: Linus Torvalds torvalds@linux-foundation.org Cc: stable@vger.kernel.org Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs") Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()") Signed-off-by: Hector Martin marcan@marcan.st
Documentation/atomic_bitops.txt | 2 +- include/asm-generic/bitops/atomic.h | 6 ------
I double-checked all the architecture specific implementations to ensure that the asm-generic one is the only one that needs the fix.
I couldn't figure out parisc -- do you know what ordering their spinlocks provide? They have a comment talking about a release, but I don't know what the ordering guarantees of an "ldcw" are.
"The semaphore operation is strongly ordered" (that's from the description of the LDCW instruction)
On Tue, Aug 16, 2022 at 04:03:11PM +0900, Hector Martin wrote:
These operations are documented as always ordered in include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer type use cases where one side needs to ensure a flag is left pending after some shared data was updated rely on this ordering, even in the failure case.
This is the case with the workqueue code, which currently suffers from a reproducible ordering violation on Apple M1 platforms (which are notoriously out-of-order) that ends up causing the TTY layer to fail to deliver data to userspace properly under the right conditions. This change fixes that bug.
Change the documentation to restrict the "no order on failure" story to the _lock() variant (for which it makes sense), and remove the early-exit from the generic implementation, which is what causes the missing barrier semantics in that case. Without this, the remaining atomic op is fully ordered (including on ARM64 LSE, as of recent versions of the architecture spec).
Suggested-by: Linus Torvalds torvalds@linux-foundation.org Cc: stable@vger.kernel.org Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs") Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()") Signed-off-by: Hector Martin marcan@marcan.st
Documentation/atomic_bitops.txt | 2 +- include/asm-generic/bitops/atomic.h | 6 ------ 2 files changed, 1 insertion(+), 7 deletions(-)
diff --git a/Documentation/atomic_bitops.txt b/Documentation/atomic_bitops.txt index 093cdaefdb37..d8b101c97031 100644 --- a/Documentation/atomic_bitops.txt +++ b/Documentation/atomic_bitops.txt @@ -59,7 +59,7 @@ Like with atomic_t, the rule of thumb is:
- RMW operations that have a return value are fully ordered.
- RMW operations that are conditional are unordered on FAILURE,
- otherwise the above rules apply. In the case of test_and_{}_bit() operations,
- otherwise the above rules apply. In the case of test_and_set_bit_lock(), if the bit in memory is unchanged by the operation then it is deemed to have failed.
The next sentence is:
| Except for a successful test_and_set_bit_lock() which has ACQUIRE | semantics and clear_bit_unlock() which has RELEASE semantics.
so I think it reads a bit strangely now. How about something like:
diff --git a/Documentation/atomic_bitops.txt b/Documentation/atomic_bitops.txt index 093cdaefdb37..3b516729ec81 100644 --- a/Documentation/atomic_bitops.txt +++ b/Documentation/atomic_bitops.txt @@ -59,12 +59,15 @@ Like with atomic_t, the rule of thumb is: - RMW operations that have a return value are fully ordered.
- RMW operations that are conditional are unordered on FAILURE, - otherwise the above rules apply. In the case of test_and_{}_bit() operations, - if the bit in memory is unchanged by the operation then it is deemed to have - failed. + otherwise the above rules apply. For the purposes of ordering, the + test_and_{}_bit() operations are treated as unconditional.
-Except for a successful test_and_set_bit_lock() which has ACQUIRE semantics and -clear_bit_unlock() which has RELEASE semantics. +Except for: + + - test_and_set_bit_lock() which has ACQUIRE semantics on success and is + unordered on failure; + + - clear_bit_unlock() which has RELEASE semantics.
Since a platform only has a single means of achieving atomic operations the same barriers as for atomic_t are used, see atomic_t.txt.
?
diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h index 3096f086b5a3..71ab4ba9c25d 100644 --- a/include/asm-generic/bitops/atomic.h +++ b/include/asm-generic/bitops/atomic.h @@ -39,9 +39,6 @@ arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p) unsigned long mask = BIT_MASK(nr); p += BIT_WORD(nr);
- if (READ_ONCE(*p) & mask)
return 1;
- old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p); return !!(old & mask);
} @@ -53,9 +50,6 @@ arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p) unsigned long mask = BIT_MASK(nr); p += BIT_WORD(nr);
- if (!(READ_ONCE(*p) & mask))
return 0;
- old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p); return !!(old & mask);
I suppose one sad thing about this is that, on arm64, we could reasonably keep the READ_ONCE() path with a DMB LD (R->RW) barrier before the return but I don't think we can express that in the Linux memory model so we end up in RmW territory every time. Perhaps we should roll our own implementation in the arch code?
In any case, this should fix the problem so:
Acked-by: Will Deacon will@kernel.org
Will
On 16/08/2022 23.04, Will Deacon wrote:
diff --git a/Documentation/atomic_bitops.txt b/Documentation/atomic_bitops.txt index 093cdaefdb37..d8b101c97031 100644 --- a/Documentation/atomic_bitops.txt +++ b/Documentation/atomic_bitops.txt @@ -59,7 +59,7 @@ Like with atomic_t, the rule of thumb is:
- RMW operations that have a return value are fully ordered.
- RMW operations that are conditional are unordered on FAILURE,
- otherwise the above rules apply. In the case of test_and_{}_bit() operations,
- otherwise the above rules apply. In the case of test_and_set_bit_lock(), if the bit in memory is unchanged by the operation then it is deemed to have failed.
The next sentence is:
| Except for a successful test_and_set_bit_lock() which has ACQUIRE | semantics and clear_bit_unlock() which has RELEASE semantics.
so I think it reads a bit strangely now. How about something like:
diff --git a/Documentation/atomic_bitops.txt b/Documentation/atomic_bitops.txt index 093cdaefdb37..3b516729ec81 100644 --- a/Documentation/atomic_bitops.txt +++ b/Documentation/atomic_bitops.txt @@ -59,12 +59,15 @@ Like with atomic_t, the rule of thumb is:
- RMW operations that have a return value are fully ordered.
- RMW operations that are conditional are unordered on FAILURE,
- otherwise the above rules apply. In the case of test_and_{}_bit() operations,
- if the bit in memory is unchanged by the operation then it is deemed to have
- failed.
- otherwise the above rules apply. For the purposes of ordering, the
- test_and_{}_bit() operations are treated as unconditional.
-Except for a successful test_and_set_bit_lock() which has ACQUIRE semantics and -clear_bit_unlock() which has RELEASE semantics. +Except for:
- test_and_set_bit_lock() which has ACQUIRE semantics on success and is
- unordered on failure;
- clear_bit_unlock() which has RELEASE semantics.
Since a platform only has a single means of achieving atomic operations the same barriers as for atomic_t are used, see atomic_t.txt.
Makes sense! I'll send a v2 with that in a couple of days if nothing else comes up.
diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h index 3096f086b5a3..71ab4ba9c25d 100644 --- a/include/asm-generic/bitops/atomic.h +++ b/include/asm-generic/bitops/atomic.h @@ -39,9 +39,6 @@ arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p) unsigned long mask = BIT_MASK(nr); p += BIT_WORD(nr);
- if (READ_ONCE(*p) & mask)
return 1;
- old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p); return !!(old & mask);
} @@ -53,9 +50,6 @@ arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p) unsigned long mask = BIT_MASK(nr); p += BIT_WORD(nr);
- if (!(READ_ONCE(*p) & mask))
return 0;
- old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p); return !!(old & mask);
I suppose one sad thing about this is that, on arm64, we could reasonably keep the READ_ONCE() path with a DMB LD (R->RW) barrier before the return but I don't think we can express that in the Linux memory model so we end up in RmW territory every time.
You'd need a barrier *before* the READ_ONCE(), since what we're trying to prevent is a consumer from writing to the value without being able to observe the writes that happened prior, while this side read the old value. A barrier after the READ_ONCE() doesn't do anything, as that read is the last memory operation in this thread (of the problematic sequence).
At that point, I'm not sure DMB LD / early read / LSE atomic would be any faster than just always doing the LSE atomic?
- Hector
On Tue, Aug 16, 2022 at 11:30:45PM +0900, Hector Martin wrote:
On 16/08/2022 23.04, Will Deacon wrote:
diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h index 3096f086b5a3..71ab4ba9c25d 100644 --- a/include/asm-generic/bitops/atomic.h +++ b/include/asm-generic/bitops/atomic.h @@ -39,9 +39,6 @@ arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p) unsigned long mask = BIT_MASK(nr); p += BIT_WORD(nr);
- if (READ_ONCE(*p) & mask)
return 1;
- old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p); return !!(old & mask);
} @@ -53,9 +50,6 @@ arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p) unsigned long mask = BIT_MASK(nr); p += BIT_WORD(nr);
- if (!(READ_ONCE(*p) & mask))
return 0;
- old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p); return !!(old & mask);
I suppose one sad thing about this is that, on arm64, we could reasonably keep the READ_ONCE() path with a DMB LD (R->RW) barrier before the return but I don't think we can express that in the Linux memory model so we end up in RmW territory every time.
You'd need a barrier *before* the READ_ONCE(), since what we're trying to prevent is a consumer from writing to the value without being able to observe the writes that happened prior, while this side read the old value. A barrier after the READ_ONCE() doesn't do anything, as that read is the last memory operation in this thread (of the problematic sequence).
Right, having gone back to your litmus test, I now realise it's the "SB" shape from the memory ordering terminology. It's funny because the arm64 acquire/release instructions are RCsc and so upgrading the READ_ONCE() to an *arm64* acquire instruction would work for your specific case, but only because the preceeding store is a release.
At that point, I'm not sure DMB LD / early read / LSE atomic would be any faster than just always doing the LSE atomic?
It depends a lot on the configuration of the system and the state of the relevant cacheline, but generally avoiding an RmW by introducing a barrier is likely to be a win. It just gets ugly here as we'd want to avoid the DMB in the case where we end up doing the RmW. Possibly we could do something funky like a test-and-test-and-test-and-set (!) where we do the DMB+READ_ONCE() only if the first READ_ONCE() has the bit set, but even just typing that is horrible and I'd _absolutely_ want to see perf numbers to show that it's a benefit once you start taking into account things like branch prediction.
Anywho, since Linus has applied the patch and it should work, this is just an interesting aside.
Will
On Tue, Aug 16, 2022 at 7:38 PM Will Deacon will@kernel.org wrote:
On Tue, Aug 16, 2022 at 11:30:45PM +0900, Hector Martin wrote:
On 16/08/2022 23.04, Will Deacon wrote:
diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h index 3096f086b5a3..71ab4ba9c25d 100644 --- a/include/asm-generic/bitops/atomic.h +++ b/include/asm-generic/bitops/atomic.h @@ -39,9 +39,6 @@ arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p) unsigned long mask = BIT_MASK(nr);
p += BIT_WORD(nr);
- if (READ_ONCE(*p) & mask)
return 1;
- old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p); return !!(old & mask);
} @@ -53,9 +50,6 @@ arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p) unsigned long mask = BIT_MASK(nr);
p += BIT_WORD(nr);
- if (!(READ_ONCE(*p) & mask))
return 0;
- old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p); return !!(old & mask);
I suppose one sad thing about this is that, on arm64, we could reasonably keep the READ_ONCE() path with a DMB LD (R->RW) barrier before the return but I don't think we can express that in the Linux memory model so we end up in RmW territory every time.
You'd need a barrier *before* the READ_ONCE(), since what we're trying to prevent is a consumer from writing to the value without being able to observe the writes that happened prior, while this side read the old value. A barrier after the READ_ONCE() doesn't do anything, as that read is the last memory operation in this thread (of the problematic sequence).
Right, having gone back to your litmus test, I now realise it's the "SB" shape from the memory ordering terminology. It's funny because the arm64 acquire/release instructions are RCsc and so upgrading the READ_ONCE() to an *arm64* acquire instruction would work for your specific case, but only because the preceeding store is a release.
At that point, I'm not sure DMB LD / early read / LSE atomic would be any faster than just always doing the LSE atomic?
It depends a lot on the configuration of the system and the state of the relevant cacheline, but generally avoiding an RmW by introducing a barrier is likely to be a win. It just gets ugly here as we'd want to avoid the DMB in the case where we end up doing the RmW. Possibly we could do something funky like a test-and-test-and-test-and-set (!) where we do the DMB+READ_ONCE() only if the first READ_ONCE() has the bit set, but even just typing that is horrible and I'd _absolutely_ want to see perf numbers to show that it's a benefit once you start taking into account things like branch prediction.
Anywho, since Linus has applied the patch and it should work, this is just an interesting aside.
Will
It is moot if Linus has already taken the patch, but with a stock kernel config I am still seeing a slight performance dip but only ~1-2% in the specific tests I was running. Sorry about the noise I will need to look at my kernel builder and see what went wrong when I have more time.
Cheers, Jon
On Tue, Aug 16, 2022 at 10:49 AM Jon Nettleton jon@solid-run.com wrote:
It is moot if Linus has already taken the patch, but with a stock kernel config I am still seeing a slight performance dip but only ~1-2% in the specific tests I was running.
It would be interesting to hear if you can pinpoint in the profiles where the time is spent.
It might be some random place that really doesn't care about ordering at all, and then we could easily rewrite _that_ particular case to do the unordered test explicitly, ie something like
- if (test_and_set_bit()) ... + if (test_bit() || test_and_set_bit()) ...
or even introduce an explicitly unordered "test_and_set_bit_relaxed()" thing.
Linus
On Tue, Aug 16, 2022 at 8:02 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, Aug 16, 2022 at 10:49 AM Jon Nettleton jon@solid-run.com wrote:
It is moot if Linus has already taken the patch, but with a stock kernel config I am still seeing a slight performance dip but only ~1-2% in the specific tests I was running.
It would be interesting to hear if you can pinpoint in the profiles where the time is spent.
It might be some random place that really doesn't care about ordering at all, and then we could easily rewrite _that_ particular case to do the unordered test explicitly, ie something like
if (test_and_set_bit()) ...
if (test_bit() || test_and_set_bit()) ...
or even introduce an explicitly unordered "test_and_set_bit_relaxed()" thing.
Linus
This is very interesting, the additional performance overhead doesn't seem to be coming from within the kernel but from userspace. Comparing patched and unpatched kernels I am seeing more cycles being taken up by glibc atomics like __aarch64_cas4_acq and __aarch64_ldadd4_acq_rel.
I need to test further to see if there is less effect on a system with less cores, This is a 16-core Cortex-A72, it is possible this is less of an issue on 4 core A72's and A53's.
-Jon
...
p += BIT_WORD(nr);
- if (READ_ONCE(*p) & mask)
return 1;
- old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p); return !!(old & mask);
}
This looks like the same pattern (attempting to avoid a locked bus cycle) that caused the qdisc code to sit on transmit packets (even on x86). That had some barriers in it (possibly nops on x86) that didn't help - although the comments suggested otherwise.
I wonder if the pattern has been used anywhere else?
David
- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
linux-stable-mirror@lists.linaro.org