On Tue, Aug 16, 2022 at 6:42 AM Will Deacon will@kernel.org wrote:
Will?
Right, this looks like it's all my fault, so sorry about that.
It's interesting how this bug has existed for basically four years.
I suspect that the thing is fairly hard to hit in practice, particularly on the kinds of devices most common (ie limited OoO windows on most phone SoCs).
Together with phone loads probably not generally being all that exciting from a kernel standpoint (a lot of driver work, probably not a lot of workqueue stuff).
- Upgrade test_and_{set,clear}_bit() to have a full memory barrier regardless of the value which is read from memory. The lock/unlock flavours can remain as-is.
I've applied Hector's "locking/atomic: Make test_and_*_bit() ordered on failure".
- Fix the documentation
That patch had a bit of a fix, but it is probably worth looking at more.
- Figure out what to do about architectures building atomics out of spinlocks (probably ok as lock+unlock == full barrier there?)
Yeah, I wonder how we should describe the memory ordering here. We've always punted on it, saying it's a "memory barrier", but that has been partly a "look, that's the traditional x86 model".
And the traditional x86 model really sees the locked RMW operations as being one single operation - in ways that most other architectures don't.
So on x86, it's more than "this implies a memory barrier" - it's also that there is no visible load-modify-store sequence where you can start asking "what about the ordering of the _load_ wrt the preceding memory operations".
That makes the x86 behavior very easy to think about, but means that when you have bitops implemented other ways, you have questions that are much harder to answer.
So in a Lock+read+op+write+unlock sequence, you can have preceding operations move into the locked region, and mix with the read+op+write side.
- Accept my sincerest apologies for the mess!
I don't think you were the source of the mess. The *source* of the mess is that we've always had very messy rules about the bitops in particular.
I think the *code* is fixed (at least wrt the generic implementation, I think the other models are up for discussion), but I think the real issue is how we should describe the requirements.
So I think we have at least three cases we need to deal with:
(a) the people who just want atomics, and don't care about any ordering. They are bound to exist.
(b) the people who want "acquire"-like semantics. I think the existing test_and_set_bit_lock() is fine
(c) the people who want "release"-like semantics, where the "test-and-set-bit" is for announcing "I'm done, was I the first one?".
(d) the full barrier case
I think we currently actively have (b) and (d), but it's possible that the workqueue case really is only (c).
And I think the spinlock implementation really most naturally has that (c) semantics - you don't necessarily get some theoretical full memory barrier, but you *do* get those "handover" semantics.
Maybe we never really want or need (d) at all?
So I get the feeling that we really shouldn't specify "test_and_set_bit()" in terms of memory barriers at all. I *think* it would be more natural to describe them in terms of "handover" (ie acquire/release), and then the spinlock semantics are obviously fine.
So I htink the code problem is easy, I think the real problem here has always been bad documentation, and it would be really good to clarify that.
Comments?
Linus