The work queueing operation relies on the atomic test-and-set operation to set the PENDING bit behaving as a full barrier, even if it fails. Otherwise, the PENDING state may be observed before memory writes pertaining to the work complete, as they are allowed to be reordered. That can lead to work being processed before all prior writes are observable, and no new work being queued to ensure they are observed at some point.
This has been broken since the dawn of time, and it was incompletely fixed by 346c09f80459, which added the necessary barriers in the work execution path but failed to account for the missing barrier in the test_and_set_bit() failure case. Fix it by switching to atomic_long_fetch_or(), which does have unconditional barrier semantics regardless of whether the bit was already set or not (this is actually just test_and_set_bit() minus the early exit path).
Discovered [1] on Apple M1 platforms, which are ridiculously out-of-order and managed to trigger this in the TTY core, of all places. Easily reproducible by running this m1n1 client script on one M1 machine connected to another one running the m1n1 bootloader in proxy mode:
=============
from m1n1.setup import *

i = 0
while True:
    a = iface.readmem(u.base, 1170)
    print(i)
    i += 1
=============
The script will hang when the TTY layer fails to push a buffer of data into the ldisc in a timely manner in tty_flip_buffer_push(), which writes a buffer pointer and then queue_work()s the ldisc push.
(Note: reproducibility depends on .config options)
Additionally, properly document that queue_work() has guarantees even when work is already queued (it doesn't make any sense for it not to, the comment in set_work_pool_and_clear_pending() already implies it does, and the TTY core and probably quite a few other places rely on this).
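For illustration only (this sketch is not part of the patch; the my_dev/my_producer names are hypothetical), this is the shape of code that relies on the guarantee being documented here:

    #include <linux/workqueue.h>

    /* Hypothetical driver state; INIT_WORK(&dev->work, my_work_fn) is assumed
     * to have been done at init time. */
    struct my_dev {
            int data;                       /* payload published before queueing */
            struct work_struct work;
    };

    static void my_work_fn(struct work_struct *work)
    {
            struct my_dev *dev = container_of(work, struct my_dev, work);

            if (READ_ONCE(dev->data)) {
                    /* consume dev->data */
            }
    }

    static void my_producer(struct my_dev *dev)
    {
            WRITE_ONCE(dev->data, 1);       /* store that @work must observe */
            /*
             * If PENDING was already set, nothing new is queued and we rely on
             * the already-pending execution to see the store above. Without
             * barrier semantics on the failed test-and-set, it may not, and no
             * further execution will ever be queued to pick it up.
             */
            queue_work(system_wq, &dev->work);
    }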
[1] https://lore.kernel.org/lkml/6c089268-4f2c-9fdf-7bcb-107b611fbc21@marcan.st/...
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: 346c09f80459 ("workqueue: fix ghost PENDING flag while doing MQ IO")
Cc: stable@vger.kernel.org
Signed-off-by: Hector Martin <marcan@marcan.st>
---
 include/linux/workqueue.h | 15 ++++++++++-----
 kernel/workqueue.c        | 39 +++++++++++++++++++++++++++++++--------
 2 files changed, 41 insertions(+), 13 deletions(-)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index a0143dd24430..d9ea73813a3c 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -484,18 +484,23 @@ extern void wq_worker_comm(char *buf, size_t size, struct task_struct *task);
  * We queue the work to the CPU on which it was submitted, but if the CPU dies
  * it can be processed by another CPU.
  *
- * Memory-ordering properties: If it returns %true, guarantees that all stores
- * preceding the call to queue_work() in the program order will be visible from
- * the CPU which will execute @work by the time such work executes, e.g.,
+ * Memory-ordering properties: Guarantees that all stores preceding the call to
+ * queue_work() in the program order will be visible from the CPU which will
+ * execute @work by the time such work executes, e.g.,
  *
  * { x is initially 0 }
  *
  *   CPU0                               CPU1
  *
  *   WRITE_ONCE(x, 1);                  [ @work is being executed ]
- *   r0 = queue_work(wq, work);           r1 = READ_ONCE(x);
+ *   queue_work(wq, work);                r0 = READ_ONCE(x);
  *
- * Forbids: r0 == true && r1 == 0
+ * Forbids: r0 == 0 for the currently pending execution of @work after
+ * queue_work() completes.
+ *
+ * If @work was already pending (ret == false), that execution is guaranteed
+ * to observe x == 1. If @work was newly queued (ret == true), the newly
+ * queued execution is guaranteed to observe x == 1.
  */
 static inline bool queue_work(struct workqueue_struct *wq,
                               struct work_struct *work)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index aeea9731ef80..01bc03eed649 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -655,7 +655,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
 {
         /*
          * The following wmb is paired with the implied mb in
-         * test_and_set_bit(PENDING) and ensures all updates to @work made
+         * atomic_long_fetch_or(PENDING) and ensures all updates to @work made
          * here are visible to and precede any updates by the next PENDING
          * owner.
          */
@@ -673,7 +673,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
          *
          * 1  STORE event_indicated
          * 2  queue_work_on() {
-         * 3    test_and_set_bit(PENDING)
+         * 3    atomic_long_fetch_or(PENDING)
          * 4 }                             set_..._and_clear_pending() {
          * 5                                 set_work_data() # clear bit
          * 6                                 smp_mb()
@@ -688,6 +688,15 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
          * finish the queued @work. Meanwhile CPU#1 does not see
          * event_indicated is set, because speculative LOAD was executed
          * before actual STORE.
+         *
+         * Line 3 requires barrier semantics, even on failure. If it were
+         * implemented with test_and_set_bit() (which does not have
+         * barrier semantics on failure), that would allow the STORE to
+         * be reordered after it, and it could be observed by CPU#1 after
+         * it has executed all the way through to line 8 (and cleared the
+         * PENDING bit in the process). At this point, CPU#0 would not have
+         * queued new work (having observed PENDING set), and CPU#1 would not
+         * have observed the event_indicated store in the last work execution.
          */
         smp_mb();
 }
@@ -1276,8 +1285,9 @@ static int try_to_grab_pending(struct work_struct *work, bool is_dwork,
                 return 1;
         }

-        /* try to claim PENDING the normal way */
-        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
+        /* try to claim PENDING the normal way, see queue_work_on() */
+        if (!(atomic_long_fetch_or(WORK_STRUCT_PENDING, &work->data)
+              & WORK_STRUCT_PENDING))
                 return 0;

         rcu_read_lock();
@@ -1541,7 +1551,14 @@ bool queue_work_on(int cpu, struct workqueue_struct *wq,

         local_irq_save(flags);

-        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+        /*
+         * We need unconditional barrier semantics, even on failure,
+         * to avoid racing set_work_pool_and_clear_pending(). Hence,
+         * this has to be atomic_long_fetch_or(), not test_and_set_bit()
+         * which elides the barrier on failure.
+         */
+        if (!(atomic_long_fetch_or(WORK_STRUCT_PENDING, &work->data)
+              & WORK_STRUCT_PENDING)) {
                 __queue_work(cpu, wq, work);
                 ret = true;
         }
@@ -1623,7 +1640,9 @@ bool queue_work_node(int node, struct workqueue_struct *wq,

         local_irq_save(flags);

-        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+        /* see queue_work_on() */
+        if (!(atomic_long_fetch_or(WORK_STRUCT_PENDING, &work->data)
+              & WORK_STRUCT_PENDING)) {
                 int cpu = workqueue_select_cpu_near(node);

                 __queue_work(cpu, wq, work);
@@ -1697,7 +1716,9 @@ bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
         /* read the comment in __queue_work() */
         local_irq_save(flags);

-        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+        /* see queue_work_on() */
+        if (!(atomic_long_fetch_or(WORK_STRUCT_PENDING, &work->data)
+              & WORK_STRUCT_PENDING)) {
                 __queue_delayed_work(cpu, wq, dwork, delay);
                 ret = true;
         }
@@ -1769,7 +1790,9 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
 {
         struct work_struct *work = &rwork->work;

-        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+        /* see queue_work_on() */
+        if (!(atomic_long_fetch_or(WORK_STRUCT_PENDING, &work->data)
+              & WORK_STRUCT_PENDING)) {
                 rwork->wq = wq;
                 call_rcu(&rwork->rcu, rcu_work_rcufn);
                 return true;
--
2.35.1
On Tue, Aug 16, 2022 at 02:58:10AM +0900, Hector Martin wrote:
This has been broken since the dawn of time, and it was incompletely fixed by 346c09f80459, which added the necessary barriers in the work execution path but failed to account for the missing barrier in the test_and_set_bit() failure case. Fix it by switching to atomic_long_fetch_or(), which does have unconditional barrier semantics regardless of whether the bit was already set or not (this is actually just test_and_set_bit() minus the early exit path).
...
Oh, tricky one and yeah you're absolutely right that it makes no sense to not guarantee barrier semantics when already pending. I didn't even know test_and_set_bit() wasn't a barrier when it failed. Thanks a lot for hunting down and fixing this. Applied to wq/for-6.0-fixes.
Thanks.
Tejun Heo tj@kernel.org wrote:
Oh, tricky one and yeah you're absolutely right that it makes no sense to not guarantee barrier semantics when already pending. I didn't even know test_and_set_bit() wasn't a barrier when it failed. Thanks a lot for hunting down and fixing this. Applied to wq/for-6.0-fixes.
Please revert this as test_and_set_bit was always supposed to be a full memory barrier. This is an arch bug.
Cheers,
On Mon, Aug 15, 2022 at 9:15 PM Herbert Xu herbert@gondor.apana.org.au wrote:
Please revert this as test_and_set_bit was always supposed to be a full memory barrier. This is an arch bug.
Yes, the bitops are kind of strange for various legacy reasons:
- set/clear_bit is atomic, but without a memory barrier, and need a "smp_mb__before_atomic()" or similar for barriers
- test_and_set/clear_bit() are atomic, _and_ are memory barriers
- test_and_set_bit_lock and test_and_clear_bit_unlock are atomic and _weaker_ than full memory barriers, but sufficient for locking (ie acquire/release)
Does any of this make any sense at all? No. But those are the documented semantics exactly because people were worried about test_and_set_bit being used for locking, since on x86 all the atomics are also memory barriers.
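To make those three classes concrete, here is a small editorial sketch (the flags word and bit numbers are made up; the APIs used are existing kernel ones, with clear_bit_unlock() standing in for the unlock side):

    #include <linux/bitops.h>
    #include <linux/processor.h>

    static unsigned long flags_word;    /* made-up example state */

    static void bitops_classes_example(void)
    {
            /* Class 1: atomic but unordered; add barriers explicitly if needed. */
            smp_mb__before_atomic();
            set_bit(0, &flags_word);

            /* Class 2: documented as a full memory barrier around the RMW. */
            if (!test_and_set_bit(1, &flags_word)) {
                    /* we were the first to set bit 1 */
            }

            /* Class 3: acquire/release, enough for a bit-spinlock style lock. */
            while (test_and_set_bit_lock(2, &flags_word))
                    cpu_relax();
            /* ... critical section ... */
            clear_bit_unlock(2, &flags_word);
    }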
From looking at it, the asm-generic implementation is a bit questionable, though. In particular, it does
if (READ_ONCE(*p) & mask) return 1;
so it's *entirely* unordered for the "bit was already set" case.
That looks very wrong to me, since it basically means that the test_and_set_bit() can return "bit was already set" based on an old value - not a memory barrier at all.
So if you use "test_and_set_bit()" as some kind of "I've done my work, now I am going to set the bit to tell people to pick it up", then that early "bit was already set" code completely breaks it.
Now, arguably our atomic bitop semantics are very very odd, and it might be time to revisit them. But that code looks very very buggy to me.
The bug seems to go back to commit e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs"), and the fix looks to be as simple as just removing that early READ_ONCE return case (test_and_clear has the same bug).
Will?
IOW, the proper fix for this should, I think, look something like this (whitespace mangled) thing:
--- a/include/asm-generic/bitops/atomic.h
+++ b/include/asm-generic/bitops/atomic.h
@@ -39,9 +39,6 @@ arch_test_and_set_bit(
         unsigned long mask = BIT_MASK(nr);

         p += BIT_WORD(nr);
-        if (READ_ONCE(*p) & mask)
-                return 1;
-
         old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
         return !!(old & mask);
 }
@@ -53,9 +50,6 @@ arch_test_and_clear_bit
         unsigned long mask = BIT_MASK(nr);

         p += BIT_WORD(nr);
-        if (!(READ_ONCE(*p) & mask))
-                return 0;
-
         old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
         return !!(old & mask);
 }
but the above is not just whitespace-damaged, it's entirely untested and based purely on me looking at that code.
Linus
On 16/08/2022 14.27, Linus Torvalds wrote:
On Mon, Aug 15, 2022 at 9:15 PM Herbert Xu herbert@gondor.apana.org.au wrote:
Please revert this as test_and_set_bit was always supposed to be a full memory barrier. This is an arch bug.
<snip>
From looking at it, the asm-generic implementation is a bit questionable, though. In particular, it does
if (READ_ONCE(*p) & mask) return 1;
so it's *entirely* unordered for the "bit was already set" case.
These ops are documented in Documentation/atomic_bitops.txt as being unordered in the failure ("bit was already set" case), and that matches the generic implementation (which arm64 uses).
On the other hand, Twitter just pointed out that contradicting documentation exists (I believe this was the source of the third party doc I found that claimed it's always a barrier):
include/asm-generic/bitops/instrumented-atomic.h
So either one doc and the implementation are wrong, or the other doc is wrong.
--- a/include/asm-generic/bitops/atomic.h
+++ b/include/asm-generic/bitops/atomic.h
@@ -39,9 +39,6 @@ arch_test_and_set_bit(
         unsigned long mask = BIT_MASK(nr);

         p += BIT_WORD(nr);
-        if (READ_ONCE(*p) & mask)
-                return 1;
-
         old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
         return !!(old & mask);
 }
@@ -53,9 +50,6 @@ arch_test_and_clear_bit
         unsigned long mask = BIT_MASK(nr);

         p += BIT_WORD(nr);
-        if (!(READ_ONCE(*p) & mask))
-                return 0;
-
         old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
         return !!(old & mask);
 }
but the above is not just whitespace-damaged, it's entirely untested and based purely on me looking at that code.
That does work, I in fact tried that fix first to prove that the early-exit was the problem.
I don't have a particularly strong opinion on which way is right here, I just saw the implementation matched the docs (that I found) and Will commented on the other thread implying this was all deliberate.
- Hector
On Mon, Aug 15, 2022 at 10:36 PM Hector Martin marcan@marcan.st wrote:
These ops are documented in Documentation/atomic_bitops.txt as being unordered in the failure ("bit was already set" case), and that matches the generic implementation (which arm64 uses).
Yeah, that documentation is pure garbage. We need to fix it.
I think that "unordered on failure" was added at the same time that the generic implementation was rewritten.
IOW, the documentation was changed to match that broken implementation, but it's clearly completely broken.
I think I understand *why* it's broken - it looks like a "harmless" optimization. After all, if the bitop doesn't do anything, there's nothing to order it with.
It makes a certain amount of sense - as long as you don't think about it too hard.
The reason it is completely and utterly broken is that it's not actually just "the bitop doesn't do anything". Even when it doesn't change the bit value, just the ordering of the read of the old bit value can be meaningful, exactly for that case of "I added more work to the queue, I need to set the bit to tell the consumers, and if I'm the first person to set the bit I may need to wake the consumer up".
On the other hand, Twitter just pointed out that contradicting documentation exists (I believe this was the source of the third party doc I found that claimed it's always a barrier):
It's not just that other documentation exists - it's literally that the unordered semantics don't even make sense, and don't match reality and history.
And nobody thought about it or caught it at the time.
The Xen people seem to have noticed at some point, and tried to introduce a "sync_set_bit()".
So either one doc and the implementation are wrong, or the other doc is wrong.
That doc and the generic implementation is clearly wrong.
Linus
On 2022/08/16 14:52, Linus Torvalds wrote:
I think I understand *why* it's broken - it looks like a "harmless" optimization. After all, if the bitop doesn't do anything, there's nothing to order it with.
It makes a certain amount of sense - as long as you don't think about it too hard.
The reason it is completely and utterly broken is that it's not actually just "the bitop doesn't do anything". Even when it doesn't change the bit value, just the ordering of the read of the old bit value can be meaningful, exactly for that case of "I added more work to the queue, I need to set the bit to tell the consumers, and if I'm the first person to set the bit I may need to wake the consumer up".
This is the same reason I argued queue_work() itself needs to have a similar guarantee, even when it doesn't queue work (and I updated the doc to match). If test_and_set_bit() is used in this kind of context often in the kernel, clearly the current implementation/doc clashes with that.
As I said, I don't have any particular beef in this fight, but this is horribly broken on M1/2 right now, so I'll send a patch to change the bitops instead and you all can fight it out over which way is correct :)
On Tue, Aug 16, 2022 at 03:28:50PM +0900, Hector Martin wrote:
This is the same reason I argued queue_work() itself needs to have a similar guarantee, even when it doesn't queue work (and I updated the doc to match). If test_and_set_bit() is used in this kind of context often in the kernel, clearly the current implementation/doc clashes with that.
Kernel code all over the place relies on the fact that test_and_set_bit provides a memory barrier. So this bug that you've discovered is not at all isolated to the workqueue system. It'll break the kernel in lots of places in exactly the same way.
As I said, I don't have any particular beef in this fight, but this is horribly broken on M1/2 right now, so I'll send a patch to change the bitops instead and you all can fight it out over which way is correct :)
Please do.
Thanks,
On 2022/08/16 16:48, Herbert Xu wrote:
On Tue, Aug 16, 2022 at 03:28:50PM +0900, Hector Martin wrote:
This is the same reason I argued queue_work() itself needs to have a similar guarantee, even when it doesn't queue work (and I updated the doc to match). If test_and_set_bit() is used in this kind of context often in the kernel, clearly the current implementation/doc clashes with that.
Kernel code all over the place relies on the fact that test_and_set_bit provides a memory barrier. So this bug that you've discovered is not at all isolated to the workqueue system. It'll break the kernel in lots of places in exactly the same way.
Now I'm surprised this isn't failing all over the place, given that... these things are annoyingly subtle.
Still would want Will & Peter to chime in, of course.
As I said, I don't have any particular beef in this fight, but this is horribly broken on M1/2 right now, so I'll send a patch to change the bitops instead and you all can fight it out over which way is correct :)
Please do.
Already did, but I just realized I forgot to Cc you. Sorry about that, hope you can pick it up through the MLs:
https://lore.kernel.org/asahi/20220816070311.89186-1-marcan@marcan.st/T/#u
- Hector
On Mon, Aug 15, 2022 at 10:27:10PM -0700, Linus Torvalds wrote:
The bug seems to go back to commit e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs"), and the fix looks to be as simple as just removing that early READ_ONCE return case (test_and_clear has the same bug).
Will?
I think this is the source of all this:
commit 61e02392d3c7ecac1f91c0a90a8043d67e081846
Author: Will Deacon <will@kernel.org>
Date:   Tue Feb 13 13:30:19 2018 +0000
locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()
Unfortunately it doesn't work because lots of kernel code rely on the memory barrier semantics of test_and_set_bit.
If ARM really wants this change, then eitehr create a new API for it or audit every single existing use in the kernel.
Patching the documentation and then relying on it is magical thinking.
Cheers,
On Mon, Aug 15, 2022 at 10:49 PM Herbert Xu herbert@gondor.apana.org.au wrote:
I think this is the source of all this: [ commit 61e02392d3c7 ]
Yes, that doc update seems to actually predate the broken code.
But you can actually see how that commit clearly was only meant to change the "test_and_set_bit_lock()" case. IOW, in that commit, the test_and_set_bit() comments clearly say
 * test_and_set_bit - Set a bit and return its old value
 * This operation is atomic and cannot be reordered.
 * It may be reordered on other architectures than x86.
 * It also implies a memory barrier.
(that "It may be reordered on other architectures than x86" does show clear confusion, since it also says "It also implies a memory barrier")
And the very commit that adds that
- RMW operations that are conditional are unordered on FAILURE,
language then updates the comment only above test_and_set_bit_lock().
And yes, on failure to lock, ordering doesn't matter. You didn't get the bit lock, there is no ordering for you.
But then I think that mistake in the documentation (the relaxed semantics was only for the "lock" version of the bitops, not for the rest ot it) then later led to the mistake in the code.
End result: test_and_set_bit_lock() does indeed only imply an ordering if it got the lock (ie it was successful).
But test_and_set_bit() itself is always "successful". If the bit was already set, that's not any indication of any "failure". It's just an indication that it was set. Nothing more, nothing less, and the memory barrier is still required regardless.
Linus
On Mon, Aug 15, 2022 at 10:27:10PM -0700, Linus Torvalds wrote:
On Mon, Aug 15, 2022 at 9:15 PM Herbert Xu herbert@gondor.apana.org.au wrote:
Please revert this as test_and_set_bit was always supposed to be a full memory barrier. This is an arch bug.
Yes, the bitops are kind of strange for various legacy reasons:
- set/clear_bit is atomic, but without a memory barrier, and need a
  "smp_mb__before_atomic()" or similar for barriers

- test_and_set/clear_bit() are atomic, _and_ are memory barriers

- test_and_set_bit_lock and test_and_clear_bit_unlock are atomic and
  _weaker_ than full memory barriers, but sufficient for locking (ie
  acquire/release)
Does any of this make any sense at all? No. But those are the documented semantics exactly because people were worried about test_and_set_bit being used for locking, since on x86 all the atomics are also memory barriers.
From looking at it, the asm-generic implementation is a bit questionable, though. In particular, it does
if (READ_ONCE(*p) & mask) return 1;
so it's *entirely* unordered for the "bit was already set" case.
That looks very wrong to me, since it basically means that the test_and_set_bit() can return "bit was already set" based on an old value - not a memory barrier at all.
So if you use "test_and_set_bit()" as some kind of "I've done my work, now I am going to set the bit to tell people to pick it up", then that early "bit was already set" code completely breaks it.
Now, arguably our atomic bitop semantics are very very odd, and it might be time to revisit them. But that code looks very very buggy to me.
The bug seems to go back to commit e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs"), and the fix looks to be as simple as just removing that early READ_ONCE return case (test_and_clear has the same bug).
Will?
Right, this looks like it's all my fault, so sorry about that.
In an effort to replace the spinlock-based atomic bitops with a version based on atomic instructions in e986a0d6cb36, I inadvertently added this READ_ONCE() shortcut to test_and_set_bit() because at the time that's what we had (incorrectly) documented in our attempts at cleaning things up in this area. I confess that I have never been comfortable with the comment for test_and_set_bit() prior to my problematic patch:
/**
 * test_and_set_bit - Set a bit and return its old value
 * @nr: Bit to set
 * @addr: Address to count from
 *
 * This operation is atomic and cannot be reordered.
 * It may be reordered on other architectures than x86.
 * It also implies a memory barrier.
 */
so while Peter and I were trying to improve the documentation for atomics and memory barriers we clearly ended up making the wrong call trying to treat this like e.g. a cmpxchg() (which has the unordered-on-failure semantics).
It's worth noting that with the spinlock-based implementation (i.e. prior to e986a0d6cb36) then we would have the same problem on architectures that implement spinlocks with acquire/release semantics; accesses from outside of the critical section can drift in and reorder with each other there, so the conversion looked legitimate to me in isolation and I vaguely remember going through callers looking for potential issues. Alas, I obviously missed this case.
So it looks to me like we need to:
1. Upgrade test_and_{set,clear}_bit() to have a full memory barrier regardless of the value which is read from memory. The lock/unlock flavours can remain as-is.
2. Fix the documentation
3. Figure out what to do about architectures building atomics out of spinlocks (probably ok as lock+unlock == full barrier there?)
4. Accept my sincerest apologies for the mess!
IOW, the proper fix for this should, I think, look something like this (whitespace mangled) thing:
--- a/include/asm-generic/bitops/atomic.h
+++ b/include/asm-generic/bitops/atomic.h
@@ -39,9 +39,6 @@ arch_test_and_set_bit(
         unsigned long mask = BIT_MASK(nr);

         p += BIT_WORD(nr);
-        if (READ_ONCE(*p) & mask)
-                return 1;
-
         old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
         return !!(old & mask);
 }
@@ -53,9 +50,6 @@ arch_test_and_clear_bit
         unsigned long mask = BIT_MASK(nr);

         p += BIT_WORD(nr);
-        if (!(READ_ONCE(*p) & mask))
-                return 0;
-
         old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
         return !!(old & mask);
 }
but the above is not just whitespace-damaged, it's entirely untested and based purely on me looking at that code.
Yes, I think that's step 1, thanks! I'm a bit worried about the perf numbers on the other thread, but we can get to the bottom of that separately.
Will
On Tue, Aug 16, 2022 at 02:41:57PM +0100, Will Deacon wrote:
On Mon, Aug 15, 2022 at 10:27:10PM -0700, Linus Torvalds wrote:
On Mon, Aug 15, 2022 at 9:15 PM Herbert Xu herbert@gondor.apana.org.au wrote:
Please revert this as test_and_set_bit was always supposed to be a full memory barrier. This is an arch bug.
Yes, the bitops are kind of strange for various legacy reasons:
- set/clear_bit is atomic, but without a memory barrier, and need a
  "smp_mb__before_atomic()" or similar for barriers

- test_and_set/clear_bit() are atomic, _and_ are memory barriers

- test_and_set_bit_lock and test_and_clear_bit_unlock are atomic and
  _weaker_ than full memory barriers, but sufficient for locking (ie
  acquire/release)
Does any of this make any sense at all? No. But those are the documented semantics exactly because people were worried about test_and_set_bit being used for locking, since on x86 all the atomics are also memory barriers.
From looking at it, the asm-generic implementation is a bit questionable, though. In particular, it does
if (READ_ONCE(*p) & mask) return 1;
so it's *entirely* unordered for the "bit was already set" case.
That looks very wrong to me, since it basically means that the test_and_set_bit() can return "bit was already set" based on an old value - not a memory barrier at all.
So if you use "test_and_set_bit()" as some kind of "I've done my work, now I am going to set the bit to tell people to pick it up", then that early "bit was already set" code completely breaks it.
Now, arguably our atomic bitop semantics are very very odd, and it might be time to revisit them. But that code looks very very buggy to me.
The bug seems to go back to commit e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs"), and the fix looks to be as simple as just removing that early READ_ONCE return case (test_and_clear has the same bug).
Will?
Right, this looks like it's all my fault, so sorry about that.
In an effort to replace the spinlock-based atomic bitops with a version based on atomic instructions in e986a0d6cb36, I inadvertently added this READ_ONCE() shortcut to test_and_set_bit() because at the time that's what we had (incorrectly) documented in our attempts at cleaning things up in this area. I confess that I have never been comfortable with the comment for test_and_set_bit() prior to my problematic patch:
/**
 * test_and_set_bit - Set a bit and return its old value
 * @nr: Bit to set
 * @addr: Address to count from
 *
 * This operation is atomic and cannot be reordered.
 * It may be reordered on other architectures than x86.
 * It also implies a memory barrier.
 */
so while Peter and I were trying to improve the documentation for atomics and memory barriers we clearly ended up making the wrong call trying to treat this like e.g. a cmpxchg() (which has the unordered-on-failure semantics).
It's worth noting that with the spinlock-based implementation (i.e. prior to e986a0d6cb36) then we would have the same problem on architectures that implement spinlocks with acquire/release semantics; accesses from outside of the critical section can drift in and reorder with each other there, so the conversion looked legitimate to me in isolation and I vaguely remember going through callers looking for potential issues. Alas, I obviously missed this case.
I just want to mention that although spinlock-based atomic bitops don't provide the full barrier in test_and_set_bit(), they don't have the problem spotted by Hector, because test_and_set_bit() and clear_bit() sync with each other via locks:

    test_and_set_bit():
      lock(..);
      old = *p; // mask is already set by other test_and_set_bit()
      *p = old | mask;
      unlock(...);
                                clear_bit():
                                  lock(..);
                                  *p &= ~mask;
                                  unlock(..);

so "having a full barrier before test_and_set_bit()" may not be the exact thing we need here; as long as a test_and_set_bit() can sync with a clear_bit() unconditionally, the world is safe. For example, we can make test_and_set_bit() RELEASE, and clear_bit() ACQUIRE on ARM64:

    test_and_set_bit():
      atomic_long_fetch_or_release(..); // pair with clear_bit()
                                        // guarantee everything is
                                        // observed.
                                clear_bit():
                                  atomic_long_fetch_andnot_acquire(..);

Maybe that's somewhat cheaper than a full barrier implementation.
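For concreteness, here is an editorial sketch of what that RELEASE/ACQUIRE pairing could look like against the generic bitops implementation. This is an illustration of the suggestion above, not a patch that was posted, and the sketch_ names are made up:

    #include <linux/atomic.h>
    #include <linux/bits.h>

    /* Sketch only: test_and_set_bit() as a RELEASE, paired with an ACQUIRE clear. */
    static inline int sketch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
    {
            long old;
            unsigned long mask = BIT_MASK(nr);

            p += BIT_WORD(nr);
            /* RELEASE: all prior accesses are ordered before the bit becomes set. */
            old = atomic_long_fetch_or_release(mask, (atomic_long_t *)p);
            return !!(old & mask);
    }

    static inline void sketch_clear_bit(unsigned int nr, volatile unsigned long *p)
    {
            p += BIT_WORD(nr);
            /* ACQUIRE: pairs with the release in sketch_test_and_set_bit(). */
            atomic_long_fetch_andnot_acquire(BIT_MASK(nr), (atomic_long_t *)p);
    }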
Thoughts? Just to find the exact ordering requirement for bitops.
Regards, Boqun
So it looks to me like we need to:
1. Upgrade test_and_{set,clear}_bit() to have a full memory barrier regardless of the value which is read from memory. The lock/unlock flavours can remain as-is.

2. Fix the documentation

3. Figure out what to do about architectures building atomics out of spinlocks (probably ok as lock+unlock == full barrier there?)

4. Accept my sincerest apologies for the mess!
IOW, the proper fix for this should, I think, look something like this (whitespace mangled) thing:
--- a/include/asm-generic/bitops/atomic.h
+++ b/include/asm-generic/bitops/atomic.h
@@ -39,9 +39,6 @@ arch_test_and_set_bit(
         unsigned long mask = BIT_MASK(nr);

         p += BIT_WORD(nr);
-        if (READ_ONCE(*p) & mask)
-                return 1;
-
         old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
         return !!(old & mask);
 }
@@ -53,9 +50,6 @@ arch_test_and_clear_bit
         unsigned long mask = BIT_MASK(nr);

         p += BIT_WORD(nr);
-        if (!(READ_ONCE(*p) & mask))
-                return 0;
-
         old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
         return !!(old & mask);
 }
but the above is not just whitespace-damaged, it's entirely untested and based purely on me looking at that code.
Yes, I think that's step 1, thanks! I'm a bit worried about the perf numbers on the other thread, but we can get to the bottom of that separately.
Will
On 16/08/2022 23.55, Boqun Feng wrote:
On Tue, Aug 16, 2022 at 02:41:57PM +0100, Will Deacon wrote:
It's worth noting that with the spinlock-based implementation (i.e. prior to e986a0d6cb36) then we would have the same problem on architectures that implement spinlocks with acquire/release semantics; accesses from outside of the critical section can drift in and reorder with each other there, so the conversion looked legitimate to me in isolation and I vaguely remember going through callers looking for potential issues. Alas, I obviously missed this case.
I just want to mention that although spinlock-based atomic bitops don't provide the full barrier in test_and_set_bit(), they don't have the problem spotted by Hector, because test_and_set_bit() and clear_bit() sync with each other via locks:

    test_and_set_bit():
      lock(..);
      old = *p; // mask is already set by other test_and_set_bit()
      *p = old | mask;
      unlock(...);
                                clear_bit():
                                  lock(..);
                                  *p &= ~mask;
                                  unlock(..);

so "having a full barrier before test_and_set_bit()" may not be the exact thing we need here; as long as a test_and_set_bit() can sync with a clear_bit() unconditionally, the world is safe. For example, we can make test_and_set_bit() RELEASE, and clear_bit() ACQUIRE on ARM64:

    test_and_set_bit():
      atomic_long_fetch_or_release(..); // pair with clear_bit()
                                        // guarantee everything is
                                        // observed.
                                clear_bit():
                                  atomic_long_fetch_andnot_acquire(..);

Maybe that's somewhat cheaper than a full barrier implementation.
Thoughts? Just to find the exact ordering requirement for bitops.
It's worth pointing out that the workqueue code does *not* pair test_and_set_bit() with clear_bit(). It does an atomic_long_set() instead (and then there are explicit barriers around it, which are expected to pair with the implicit barrier in test_and_set_bit()). If we define test_and_set_bit() to only sync with clear_bit() and not necessarily be a true barrier, that breaks the usage of the workqueue code.
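The pairing Hector describes looks roughly like this on the execution side (abridged and paraphrased from set_work_pool_and_clear_pending() in kernel/workqueue.c; the upstream comments are longer):

    /*
     * Abridged sketch: the PENDING bit is cleared via a plain
     * atomic_long_set() inside set_work_data(), bracketed by explicit
     * barriers that are expected to pair with the implied full barrier of
     * the test_and_set_bit() on the queueing side.
     */
    static void set_work_pool_and_clear_pending(struct work_struct *work,
                                                int pool_id)
    {
            /* paired with the implied mb in test_and_set_bit(PENDING) */
            smp_wmb();
            /* set_work_data() -> atomic_long_set(&work->data, ...), clearing PENDING */
            set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
            /* keep the PENDING clear ordered before the work function's accesses */
            smp_mb();
    }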
- Hector
On Wed, Aug 17, 2022 at 01:22:09AM +0900, Hector Martin wrote:
On 16/08/2022 23.55, Boqun Feng wrote:
On Tue, Aug 16, 2022 at 02:41:57PM +0100, Will Deacon wrote:
It's worth noting that with the spinlock-based implementation (i.e. prior to e986a0d6cb36) then we would have the same problem on architectures that implement spinlocks with acquire/release semantics; accesses from outside of the critical section can drift in and reorder with each other there, so the conversion looked legitimate to me in isolation and I vaguely remember going through callers looking for potential issues. Alas, I obviously missed this case.
I just want to mention that although spinlock-based atomic bitops don't provide the full barrier in test_and_set_bit(), they don't have the problem spotted by Hector, because test_and_set_bit() and clear_bit() sync with each other via locks:

    test_and_set_bit():
      lock(..);
      old = *p; // mask is already set by other test_and_set_bit()
      *p = old | mask;
      unlock(...);
                                clear_bit():
                                  lock(..);
                                  *p &= ~mask;
                                  unlock(..);

so "having a full barrier before test_and_set_bit()" may not be the exact thing we need here; as long as a test_and_set_bit() can sync with a clear_bit() unconditionally, the world is safe. For example, we can make test_and_set_bit() RELEASE, and clear_bit() ACQUIRE on ARM64:

    test_and_set_bit():
      atomic_long_fetch_or_release(..); // pair with clear_bit()
                                        // guarantee everything is
                                        // observed.
                                clear_bit():
                                  atomic_long_fetch_andnot_acquire(..);

Maybe that's somewhat cheaper than a full barrier implementation.
Thoughts? Just to find the exact ordering requirement for bitops.
It's worth pointing out that the workqueue code does *not* pair test_and_set_bit() with clear_bit(). It does an atomic_long_set() instead (and then there are explicit barriers around it, which are expected to pair with the implicit barrier in test_and_set_bit()). If we define test_and_set_bit() to only sync with clear_bit() and not necessarily be a true barrier, that breaks the usage of the workqueue code.
Ah, I missed that, but that means the old spinlock-based atomics are totally broken unless spinlock means full barriers on these archs.
But still, if we define test_and_set_bit() as RELEASE atomic instead of a full barrier + atomic, it should work for workqueue, right? Do we actually need extra ordering here?
    WRITE_ONCE(*x, 1);    // A
    test_and_set_bit(..); // a full barrier will order A & B
    WRITE_ONCE(*y, 1);    // B
That's something I want to figure out.
Regards, Boqun
- Hector
On Tue, Aug 16, 2022 at 9:22 AM Hector Martin marcan@marcan.st wrote:
It's worth pointing out that the workqueue code does *not* pair test_and_set_bit() with clear_bit(). It does an atomic_long_set() instead
Yes. That code is much too subtle.
And yes, I think those barriers are
(a) misleading
(b) don't work with the "serialize bits using spinlock" model at all
It's a good example of "we need to really have a better model for this".
Linus
On Tue, Aug 16, 2022 at 6:42 AM Will Deacon will@kernel.org wrote:
Will?
Right, this looks like it's all my fault, so sorry about that.
It's interesting how this bug has existed for basically four years.
I suspect that the thing is fairly hard to hit in practice, particularly on the kinds of devices most common (ie limited OoO windows on most phone SoCs).
Together with phone loads probably not generally being all that exciting from a kernel standpoint (a lot of driver work, probably not a lot of workqueue stuff).
- Upgrade test_and_{set,clear}_bit() to have a full memory barrier regardless of the value which is read from memory. The lock/unlock flavours can remain as-is.
I've applied Hector's "locking/atomic: Make test_and_*_bit() ordered on failure".
- Fix the documentation
That patch had a bit of a fix, but it is probably worth looking at more.
- Figure out what to do about architectures building atomics out of spinlocks (probably ok as lock+unlock == full barrier there?)
Yeah, I wonder how we should describe the memory ordering here. We've always punted on it, saying it's a "memory barrier", but that has been partly a "look, that's the traditional x86 model".
And the traditional x86 model really sees the locked RMW operations as being one single operation - in ways that most other architectures don't.
So on x86, it's more than "this implies a memory barrier" - it's also that there is no visible load-modify-store sequence where you can start asking "what about the ordering of the _load_ wrt the preceding memory operations".
That makes the x86 behavior very easy to think about, but means that when you have bitops implemented other ways, you have questions that are much harder to answer.
So in a Lock+read+op+write+unlock sequence, you can have preceding operations move into the locked region, and mix with the read+op+write side.
- Accept my sincerest apologies for the mess!
I don't think you were the source of the mess. The *source* of the mess is that we've always had very messy rules about the bitops in particular.
I think the *code* is fixed (at least wrt the generic implementation, I think the other models are up for discussion), but I think the real issue is how we should describe the requirements.
So I think we have at least three cases we need to deal with:
(a) the people who just want atomics, and don't care about any ordering. They are bound to exist.
(b) the people who want "acquire"-like semantics. I think the existing test_and_set_bit_lock() is fine
(c) the people who want "release"-like semantics, where the "test-and-set-bit" is for announcing "I'm done, was I the first one?".
(d) the full barrier case
I think we currently actively have (b) and (d), but it's possible that the workqueue case really is only (c).
And I think the spinlock implementation really most naturally has that (c) semantics - you don't necessarily get some theoretical full memory barrier, but you *do* get those "handover" semantics.
Maybe we never really want or need (d) at all?
So I get the feeling that we really shouldn't specify "test_and_set_bit()" in terms of memory barriers at all. I *think* it would be more natural to describe them in terms of "handover" (ie acquire/release), and then the spinlock semantics are obviously fine.
So I think the code problem is easy, I think the real problem here has always been bad documentation, and it would be really good to clarify that.
Comments?
Linus
On Tue, Aug 16, 2022 at 09:41:52AM -0700, Linus Torvalds wrote:
So I think the code problem is easy, I think the real problem here has always been bad documentation, and it would be really good to clarify that.
Comments?
The problem is that test_and_set_bit has been unambiguously documented to have memory barriers since 2005:
commit 3085f02b869d980c5588f3e8fb136b0b465a2759
Author: David S. Miller <davem@nuts.davemloft.net>
Date:   Fri Feb 4 23:39:15 2005 -0800
[DOC]: Add asm/atomic.h asm/bitops.h implementation specification.
And this is what it says:
+       int test_and_set_bit(unsigned long nr, volatils unsigned long *addr);
+       int test_and_clear_bit(unsigned long nr, volatils unsigned long *addr);
+       int test_and_change_bit(unsigned long nr, volatils unsigned long *addr);
...snip...
+These routines, like the atomic_t counter operations returning values,
+require explicit memory barrier semantics around their execution. All
+memory operations before the atomic bit operation call must be made
+visible globally before the atomic bit operation is made visible.
+Likewise, the atomic bit operation must be visible globally before any
+subsequent memory operation is made visible. For example:
+
+       obj->dead = 1;
+       if (test_and_set_bit(0, &obj->flags))
+               /* ... */;
+       obj->killed = 1;
This file wasn't removed until 16/11/2020 by f0400a77ebdc.
In that time people who wrote code using test_and_set_bit could have legitimately relied on the memory barrier as documented. Changing this restrospectively is dangerous.
I'm fine with introducing new primitives that have different properties, and then converting the existing users of test_and_set_bit over on a case-by-case basis.
Cheers,
Hello, Will.
On Tue, Aug 16, 2022 at 02:41:57PM +0100, Will Deacon wrote:
/**
 * test_and_set_bit - Set a bit and return its old value
 * @nr: Bit to set
 * @addr: Address to count from
 *
 * This operation is atomic and cannot be reordered.
 * It may be reordered on other architectures than x86.
 * It also implies a memory barrier.
 */
so while Peter and I were trying to improve the documentation for atomics and memory barriers we clearly ended up making the wrong call trying to treat this like e.g. a cmpxchg() (which has the unordered-on-failure semantics).
I think the doc can be improved here. atomic_t.txt says under ORDERING:
- RMW operations that have a return value are fully ordered;
- RMW operations that are conditional are unordered on FAILURE, otherwise the above rules apply.
But nothing spells out what's conditional. Maybe it's okay to expect people to read this doc and extrapolate how it applies, but I think it'd be better if we spell it out clearly per operation so that readers can search for a specific operation and then follow what the rules are from there.
It bothers me that there doesn't seem to be a comprehensive operation-indexed doc on the subject. memory-barriers.txt doesn't cover which operations imply which barriers. atomic_t.txt and atomic_bitops.txt cover the atomic_t operations but not in a comprehensive or searchable manner (e.g. does test_and_set_bit() return 0 or 1 on success?), and it's not clear where to look for non-atomic_t atomic operations. I guess it's implied that they follow the same rules as their atomic_t counterparts, but I can't seem to find that spelled out anywhere. The source code docs are layered and dispersed across generic and arch implementations, making them difficult to follow.
It'd be awesome if the documentation situation can be improved.
Thanks.
Hello,
On Tue, Aug 16, 2022 at 12:15:38PM +0800, Herbert Xu wrote:
Tejun Heo tj@kernel.org wrote:
Oh, tricky one and yeah you're absolutely right that it makes no sense to not guarantee barrier semantics when already pending. I didn't even know test_and_set_bit() wasn't a barrier when it failed. Thanks a lot for hunting down and fixing this. Applied to wq/for-6.0-fixes.
Please revert this as test_and_set_bit was always supposed to be a full memory barrier. This is an arch bug.
Alright, reverting.
Thanks.
On 17/08/2022 01.26, Tejun Heo wrote:
Hello,
On Tue, Aug 16, 2022 at 12:15:38PM +0800, Herbert Xu wrote:
Tejun Heo tj@kernel.org wrote:
Oh, tricky one and yeah you're absolutely right that it makes no sense to not guarantee barrier semantics when already pending. I didn't even know test_and_set_bit() wasn't a barrier when it failed. Thanks a lot for hunting down and fixing this. Applied to wq/for-6.0-fixes.
Please revert this as test_and_set_bit was always supposed to be a full memory barrier. This is an arch bug.
Alright, reverting.
FYI, Linus applied the alternate fix directly to his tree already, so this is fixed on M1 either way. However, you may want to pay attention to the discussion about the subtlety of the workqueue code pairing test_and_set_bit() not with clear_bit(), but rather atomic_long_set(), since it *could* imply it is still broken or might be broken in the future, on other architectures.
- Hector
Hector Martin marcan@marcan.st wrote:
This has been broken since the dawn of time, and it was incompletely fixed by 346c09f80459, which added the necessary barriers in the work execution path but failed to account for the missing barrier in the test_and_set_bit() failure case. Fix it by switching to atomic_long_fetch_or(), which does have unconditional barrier semantics regardless of whether the bit was already set or not (this is actually just test_and_set_bit() minus the early exit path).
test_and_set_bit is supposed to contain a full memory barrier. If it doesn't then your arch is broken and needs to be fixed.
Changing this one spot is pointless because such assumptions are all over the kernel.
Cheers,
On 16/08/2022 13.14, Herbert Xu wrote:
Hector Martin marcan@marcan.st wrote:
This has been broken since the dawn of time, and it was incompletely fixed by 346c09f80459, which added the necessary barriers in the work execution path but failed to account for the missing barrier in the test_and_set_bit() failure case. Fix it by switching to atomic_long_fetch_or(), which does have unconditional barrier semantics regardless of whether the bit was already set or not (this is actually just test_and_set_bit() minus the early exit path).
test_and_set_bit is supposed to contain a full memory barrier. If it doesn't then your arch is broken and needs to be fixed.
Changing this one spot is pointless because such assumptions are all over the kernel.
Documentation/atomic_bitops.txt and the asm-generic implementation disagree with you, so this isn't quite as simple as "your arch is broken" :-)
- Hector