The spsc_queue is an unlocked, highly asynchronous piece of infrastructure. Its inline function spsc_queue_peek() obtains the head entry of the queue.
This access is performed without READ_ONCE() and is, therefore, undefined behavior. In order to prevent the compiler from ever reordering that access, or even optimizing it away, a READ_ONCE() is strictly necessary. This is easily proven by the fact that spsc_queue_pop() uses this very pattern to access the head.
Add READ_ONCE() to spsc_queue_peek().
Cc: stable@vger.kernel.org # v4.16+
Fixes: 27105db6c63a ("drm/amdgpu: Add SPSC queue to scheduler.")
Signed-off-by: Philipp Stanner phasta@kernel.org
---
I think this makes it less broken, but I'm not even sure if it's enough or more memory barriers or an rcu_dereference() would be correct. The spsc_queue is, of course, not documented and the existing barrier comments are either false or not telling.
If someone has an idea, shoot us the info. Otherwise I think this is the right thing to do for now.
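For illustration, this is the kind of transformation a plain load permits. The caller below is hypothetical, not actual scheduler code; it only assumes include/drm/spsc_queue.h:

/*
 * Hypothetical consumer-side polling loop. With a plain "return queue->head"
 * the compiler may load head once, cache it in a register and never reload
 * it, so this loop can spin forever even after the producer pushed a node.
 * With READ_ONCE() in spsc_queue_peek(), every iteration does a fresh load.
 */
static void wait_for_work(struct spsc_queue *queue)
{
	while (!spsc_queue_peek(queue))
		cpu_relax();
}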
P.

---
 include/drm/spsc_queue.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/drm/spsc_queue.h b/include/drm/spsc_queue.h
index ee9df8cc67b7..39bada748ffc 100644
--- a/include/drm/spsc_queue.h
+++ b/include/drm/spsc_queue.h
@@ -54,7 +54,7 @@ static inline void spsc_queue_init(struct spsc_queue *queue)
 
 static inline struct spsc_node *spsc_queue_peek(struct spsc_queue *queue)
 {
-	return queue->head;
+	return READ_ONCE(queue->head);
 }
 
 static inline int spsc_queue_count(struct spsc_queue *queue)
As far as I can see that is not correct or rather not complete.
The peek function should only be used to opportunistically look at the top of the queue. It would only be problematic if it returns a non NULL value once and then a NULL value later.
The whole idea of the SPSC is that it is barrier-free and the signaling of new entries to the consumer side is providing the barrier.
So basically on the provider side you have

spsc_push(entry)
wake_up(consumer)
And on the consumer side you have:
woken_up_by_provider()
{
	entry = spsc_peek();
	...
	spsc_pop();
}
The problem we are facing here is that the spsc only provides the guarantee that you see the entry pointer, but not the content of entry itself.
So use cases like:
woken_up_by_provider()
{
	while (entry = spsc_peek()) {
		...
		spsc_pop();
	}
}
Are illegal since you don't have the correct memory barriers any more.
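To make that concrete (the field and helper names here are made up, this is not the actual scheduler code):

/* provider */
job->payload = data;                    /* A: fill in the job              */
spsc_push(&entity->queue, &job->node);  /* publish the node                */
wake_up(&consumer);                     /* the barrier pairing lives here  */

/*
 * consumer, second iteration of the while loop: there was no new
 * wake_up()/wait_event() pair, so nothing orders the peek against A.
 */
node = spsc_peek(&entity->queue);
job = container_of(node, struct job, node);
use(job->payload);                      /* may observe the pre-A contents  */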
Took me an eternity to understand that as well, so bear with me that I didn't previously explain that.
Question is what should we do?
Regards, Christian.
On 11/10/25 09:19, Philipp Stanner wrote:
The spsc_queue is an unlocked, highly asynchronous piece of infrastructure. Its inline function spsc_queue_peek() obtains the head entry of the queue.
This access is performed without READ_ONCE() and is, therefore, undefined behavior. In order to prevent the compiler from ever reordering that access, or even optimizing it away, a READ_ONCE() is strictly necessary. This is easily proven by the fact that spsc_queue_pop() uses this very pattern to access the head.
Add READ_ONCE() to spsc_queue_peek().
Cc: stable@vger.kernel.org # v4.16+ Fixes: 27105db6c63a ("drm/amdgpu: Add SPSC queue to scheduler.") Signed-off-by: Philipp Stanner phasta@kernel.org
I think this makes it less broken, but I'm not even sure if it's enough or more memory barriers or an rcu_dereference() would be correct. The spsc_queue is, of course, not documented and the existing barrier comments are either false or not telling.
If someone has an idea, shoot us the info. Otherwise I think this is the right thing to do for now.
P.
 include/drm/spsc_queue.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/drm/spsc_queue.h b/include/drm/spsc_queue.h
index ee9df8cc67b7..39bada748ffc 100644
--- a/include/drm/spsc_queue.h
+++ b/include/drm/spsc_queue.h
@@ -54,7 +54,7 @@ static inline void spsc_queue_init(struct spsc_queue *queue)
 
 static inline struct spsc_node *spsc_queue_peek(struct spsc_queue *queue)
 {
-	return queue->head;
+	return READ_ONCE(queue->head);
 }
 
 static inline int spsc_queue_count(struct spsc_queue *queue)
Please don't top-post :( FDFY:
On 11/10/25 09:19, Philipp Stanner wrote:
The spsc_queue is an unlocked, highly asynchronous piece of infrastructure. Its inline function spsc_queue_peek() obtains the head entry of the queue.
This access is performed without READ_ONCE() and is, therefore, undefined behavior. In order to prevent the compiler from ever reordering that access, or even optimizing it away, a READ_ONCE() is strictly necessary. This is easily proven by the fact that spsc_queue_pop() uses this very pattern to access the head.
Add READ_ONCE() to spsc_queue_peek().
Cc: stable@vger.kernel.org # v4.16+ Fixes: 27105db6c63a ("drm/amdgpu: Add SPSC queue to scheduler.") Signed-off-by: Philipp Stanner phasta@kernel.org
I think this makes it less broken, but I'm not even sure if it's enough or more memory barriers or an rcu_dereference() would be correct. The spsc_queue is, of course, not documented and the existing barrier comments are either false or not telling.
If someone has an idea, shoot us the info. Otherwise I think this is the right thing to do for now.
P.
 include/drm/spsc_queue.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/drm/spsc_queue.h b/include/drm/spsc_queue.h
index ee9df8cc67b7..39bada748ffc 100644
--- a/include/drm/spsc_queue.h
+++ b/include/drm/spsc_queue.h
@@ -54,7 +54,7 @@ static inline void spsc_queue_init(struct spsc_queue *queue)
 
 static inline struct spsc_node *spsc_queue_peek(struct spsc_queue *queue)
 {
-	return queue->head;
+	return READ_ONCE(queue->head);
 }
On Mon, 2025-11-10 at 12:24 +0100, Christian König wrote:
As far as I can see that is not correct or rather not complete.
It cannot be incorrect by definition, because it simply ensures that the load will actually take place there.
Incomplete it can be.
The peek function should only be used to opportunistically look at the top of the queue. It would only be problematic if it returns a non NULL value once and then a NULL value later.
The whole idea of the SPSC is that it is barrier-free and the signaling of new entries to the consumer side is providing the barrier.
So basically on the provider side you have

spsc_push(entry)
wake_up(consumer)
And on the consumer side you have:
woken_up_by_provider()
{
	entry = spsc_peek();
	...
	spsc_pop();
}
Well no, the scheduler can pick up the next job whenever it feels like it. Restarting it for example will have it peek into your queue, regardless of wake events.
In any case this is a highly fragile design. See below, too.
The problem we are facing here is that the spsc only provides the guarantee that you see the entry pointer, but not the content of entry itself.
So use cases like:
woken_up_by_provider()
{
	while (entry = spsc_peek()) {
		...
		spsc_pop();
	}
}
Are illegal since you don't have the correct memory barriers any more.
I can't follow. Are you saying that spsc_queue_peek() is correct because you know for sure that when it's used no one else might be changing that pointer?
Even if that were true this design is highly fragile.
Took me an eternity to understand that as well, so bear with me that I didn't previously explain that.
s/explain/document :)
As discussed a few weeks ago with Sima and Tvrtko, what we really need to move to in all of DRM is this development pattern:
1. For parallel code, at first by default use a boring, horribly slow (sarcasm) spinlock. BTW I'm not even convinced that a spinlock is slower than lockless tricks. Paul's book states that a CAS atomic instruction takes about 60 cycles, and taking a lock takes 100.
2. If you want to do parallelism without locks, you need to justify it really well. "rmb() so list_empty() works without a lock" doesn't qualify, but real use case speedups.
3. If you write lockless infrastructure, you need to document it really well. In particular you need to state:
   1. How it works
   2. What the rules are
See llist.h as an example. It clearly states when you need a lock and when you don't. Or RCU. No one could use it without such good documentation.
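To show what I mean, this is roughly the contract llist.h documents (a sketch with made-up names; the comment block in include/linux/llist.h is the authoritative source):

#include <linux/llist.h>
#include <linux/slab.h>

struct submission {
	struct llist_node link;
	/* payload ... */
};

static LLIST_HEAD(pending);

/* llist_add() may be called from any number of producers without a lock. */
static void producer(struct submission *s)
{
	llist_add(&s->link, &pending);
}

/*
 * llist_del_all() needs no lock either; only llist_del_first() would
 * require a single consumer or external locking.
 */
static void consumer(void)
{
	struct llist_node *batch = llist_del_all(&pending);
	struct submission *s, *tmp;

	llist_for_each_entry_safe(s, tmp, batch, link)
		kfree(s);
}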
I have no idea whether spsc_queue is correctly implemented (I doubt it), and if even a highly experienced dev like you takes "an eternity" (quote) to understand it, one is really tempted to dream of spinlock_t, which has very clear semantics and is easily understood even by beginners.
Question is what should we do?
Easy!
Mid-term, we should replace spsc_queue with a boring, locked, super- slow linked list ^_^
The spsc_queue was designed and – perhaps – perf-measured when RR was the only scheduling policy.
Since the FIFO rework, where FIFO became the default policy, we now access our super efficient super fast lockless queue most of the time with the spinlock being taken immediately afterwards anyways. So we almost have a locked lockless queue now.
https://elixir.bootlin.com/linux/v6.18-rc4/source/drivers/gpu/drm/scheduler/...
Only push_job() often (?) still runs locklessly, but I'm not at all convinced that this is worth it. Not even performance-wise.
If you look at spsc_queue_push() you see that it
1. disables preemption,
2. uses atomic instructions and
3. has a ton of memory barriers
– in other words, this is almost literally a manual re-implementation of a spinlock, just without mutual exclusion…
And even if something like spsc_queue would make sense and be actually worth it, then it should be provided to the entire kernel as common infrastructure, like llist.h, not hidden somewhere in DRM.
P.
On 11/10/25 13:27, Philipp Stanner wrote:
Please don't top-post :( FDFY:
On 11/10/25 09:19, Philipp Stanner wrote:
The spsc_queue is an unlocked, highly asynchronous piece of infrastructure. Its inline function spsc_queue_peek() obtains the head entry of the queue.
This access is performed without READ_ONCE() and is, therefore, undefined behavior. In order to prevent the compiler from ever reordering that access, or even optimizing it away, a READ_ONCE() is strictly necessary. This is easily proven by the fact that spsc_queue_pop() uses this very pattern to access the head.
Add READ_ONCE() to spsc_queue_peek().
Cc: stable@vger.kernel.org # v4.16+ Fixes: 27105db6c63a ("drm/amdgpu: Add SPSC queue to scheduler.") Signed-off-by: Philipp Stanner phasta@kernel.org
I think this makes it less broken, but I'm not even sure if it's enough or more memory barriers or an rcu_dereference() would be correct. The spsc_queue is, of course, not documented and the existing barrier comments are either false or not telling.
If someone has an idea, shoot us the info. Otherwise I think this is the right thing to do for now.
P.
 include/drm/spsc_queue.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/drm/spsc_queue.h b/include/drm/spsc_queue.h
index ee9df8cc67b7..39bada748ffc 100644
--- a/include/drm/spsc_queue.h
+++ b/include/drm/spsc_queue.h
@@ -54,7 +54,7 @@ static inline void spsc_queue_init(struct spsc_queue *queue)
 
 static inline struct spsc_node *spsc_queue_peek(struct spsc_queue *queue)
 {
-	return queue->head;
+	return READ_ONCE(queue->head);
 }
On Mon, 2025-11-10 at 12:24 +0100, Christian König wrote:
As far as I can see that is not correct or rather not complete.
It cannot be incorrect by definition, because it simply ensures that the load will actually take place there.
Incomplete it can be.
The peek function should only be used to opportunistically look at the top of the queue. It would only be problematic if it returns a non NULL value once and then a NULL value later.
The whole idea of the SPSC is that it is barrier-free and the signaling of new entries to the consumer side is providing the barrier.
So basically on the provider side you have

spsc_push(entry)
wake_up(consumer)
And on the consumer side you have:
woken_up_by_provider()
{
	entry = spsc_peek();
	...
	spsc_pop();
}
Well no, the scheduler can pick up the next job whenever it feels like it. Restarting it for example will have it peek into your queue, regardless of wake events.
In any case this is a highly fragile design. See below, too.
The problem we are facing here is that the spsc only provides the guarantee that you see the entry pointer, but not the content of entry itself.
So use cases like:
woken_up_by_provider()
{
	while (entry = spsc_peek()) {
		...
		spsc_pop();
	}
}
Are illegal since you don't have the correct memory barriers any more.
I can't follow. Are you saying that spsc_queue_peek() is correct because you know for sure that when it's used no one else might be changing that pointer?
Correct, yes. That's the whole idea. I mean SPSC stands for single producer single consumer.
Even if that were true this design is highly fragile.
Took me an eternity to understand that as well, so bear with me that I didn't previously explain that.
s/explain/document :)
As discussed a few weeks ago with Sima and Tvrtko, what we really need to move to in all of DRM is this development pattern:
- For parallel code, at first by default use a boring, horribly slow (sarcasm) spinlock. BTW I'm not even convinced that a spinlock is slower than lockless tricks. Paul's book states that a CAS atomic instruction takes about 60 cycles, and taking a lock takes 100.
The problem isn't the burned CPU cycles, but rather the cache lines moved between CPUs.
Keep in mind that you can do a fused multiply-add on a full 4x4 matrix in less time than a single cache line miss costs.
- If you want to do parallelism without locks, you need to justify it really well. "rmb() so list_empty() works without a lock" doesn't qualify, but real use case speedups.
- If you write lockless infrastructure, you need to document it really well. In particular you need to state:
- How it works
- What the rules are
See llist.h as an example. It clearly states when you need a lock and when you don't.
The llist.h is actually pretty similar to the SPSC. I'm wondering why they don't have the same issues? E.g. is xchg() providing the memory barriers?
Or RCU. No one could use it without such good documentation.
I have no idea whether spsc_queue is correctly implemented (I doubt it), and if even a highly experienced dev like you takes "an eternity" (quote) to understand it, one is really tempted to dream of spinlock_t, which has very clear semantics and is easily understood even by beginners.
Question is what should we do?
Easy!
Mid-term, we should replace spsc_queue with a boring, locked, super- slow linked list ^_^
That's what the scheduler started with, and the reason why that linked list was replaced, first with a KFIFO and then with the SPSC, was lacking performance.
We could go back to the KFIFO design again, but a (double?) linked list is clearly going to show the same performance problems which originally triggered moving away from it.
The spsc_queue was designed and – perhaps – perf-measured when RR was the only scheduling policy.
Since the FIFO rework, where FIFO became the default policy, we now access our super efficient super fast lockless queue most of the time with the spinlock being taken immediately afterwards anyways. So we almost have a locked lockless queue now.
https://elixir.bootlin.com/linux/v6.18-rc4/source/drivers/gpu/drm/scheduler/...
That is not that relevant.
Only push_job() often (?) still runs locklessly, but I'm not at all convinced that this is worth it. Not even performance-wise.
That is what is relevant. And yes the difference was totally measurable, that's why this was originally implemented.
If you look at spsc_queue_push() you see that it
- disables preemption,
- uses atomic instructions and
- has a ton of memory barriers
– in other words, this is almost literally a manual re-implementation of a spinlock, just without mutual exclusion…
The problem is really to separate the push from the pop side so that as few cache lines as possible are transferred from one CPU to another.
Regards, Christian.
And even if something like spsc_queue would make sense and be actually worth it, then it should be provided to the entire kernel as common infrastructure, like llist.h, not hidden somewhere in DRM.
P.
On Mon, 2025-11-10 at 15:07 +0100, Christian König wrote:
On 11/10/25 13:27, Philipp Stanner wrote:
Please don't top-post :( FDFY:
On 11/10/25 09:19, Philipp Stanner wrote:
The spsc_queue is an unlocked, highly asynchronous piece of infrastructure. Its inline function spsc_queue_peek() obtains the head entry of the queue.
This access is performed without READ_ONCE() and is, therefore, undefined behavior. In order to prevent the compiler from ever reordering that access, or even optimizing it away, a READ_ONCE() is strictly necessary. This is easily proven by the fact that spsc_queue_pop() uses this very pattern to access the head.
Add READ_ONCE() to spsc_queue_peek().
Cc: stable@vger.kernel.org # v4.16+ Fixes: 27105db6c63a ("drm/amdgpu: Add SPSC queue to scheduler.") Signed-off-by: Philipp Stanner phasta@kernel.org
[…]
Are illegal since you don't have the correct memory barriers any more.
I can't follow. Are you saying that spsc_queue_peek() is correct because you know for sure that when it's used no one else might be changing that pointer?
Correct, yes. That's the whole idea. I mean SPSC stands for single producer single consumer.
I know that it means that (although it's not documented and, funnily enough, I one day realized the meaning while standing under the shower after work).
Anyways, it's not documented that a user must not call drm_sched_entity_push_job() in parallel. It currently basically just works by the coincidence of no one doing that or rather no one running into the race.
Even if that were true this design is highly fragile.
Took me an eternity to understand that as well, so bear with me that I didn't previously explain that.
s/explain/document :)
As discussed a few weeks ago with Sima and Tvrtko, what we really need to move to in all of DRM is this development pattern:
1. For parallel code, at first by default use a boring, horribly slow (sarcasm) spinlock. BTW I'm not even convinced that a spinlock is slower than lockless tricks. Paul's book states that a CAS atomic instruction takes about 60 cycles, and taking a lock takes 100.
The problem isn't the burned CPU cycles, but rather the cache lines moved between CPUs.
Which cache lines? The spinlock's?
The queue data needs to move from one CPU to the other in either case. It's the same data that is being moved with spinlock protection.
A spinlock doesn't lead to more cache line moves as long as there's still just a single consumer / producer.
Keep in mind that you can do a fused multiply-add on a full 4x4 matrix in less time than a single cache line miss costs.
2. If you want to do parallelism without locks, you need to justify it really well. "rmb() so list_empty() works without a lock" doesn't qualify, but real use case speedups.
3. If you write lockless infrastructure, you need to document it really well. In particular you need to state:
   1. How it works
   2. What the rules are
See llist.h as an example. It clearly states when you need a lock and when you don't.
The llist.h is actually pretty similar to the SPSC. I'm wondering why they don't have the same issues? E.g. is xchg() providing the memory barriers?
I think that some atomic instructions contain barriers. But I'd have to check.
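From memory, and this would need double-checking against lib/llist.c, the add side boils down to something like the following; the fully ordered cmpxchg() would then be what publishes the node's contents:

/* Not the verbatim kernel code, just the shape of llist_add(): */
static inline void lockless_push(struct llist_node *new, struct llist_head *head)
{
	struct llist_node *first;

	do {
		new->next = first = READ_ONCE(head->first);
	} while (cmpxchg(&head->first, first, new) != first);

	/*
	 * cmpxchg() is fully ordered on success, so everything written to
	 * the node before this point is visible once the node is reachable.
	 */
}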
Or RCU. No one could use it without such good documentation.
I have no idea whether spsc_queue is correctly implemented (I doubt it), and if even a highly experienced dev like you takes "an eternity" (quote) to understand it, one is really tempted to dream of spinlock_t, which has very clear semantics and is easily understood even by beginners.
Question is what should we do?
Easy!
Mid-term, we should replace spsc_queue with a boring, locked, super- slow linked list ^_^
That's what the scheduler started with, and the reason why that linked list was replaced, first with a KFIFO and then with the SPSC, was lacking performance.
Such performance measurements must be included in the commit message, since they justify the entire commit.
Yet this is the entire justification given:
commit 27105db6c63a571b91d01e749d026105a1e63bcf
Author: Andrey Grodzovsky Andrey.Grodzovsky@amd.com
Date:   Thu Oct 12 16:41:39 2017 -0400
drm/amdgpu: Add SPSC queue to scheduler.
It is intended to sabstitute the bounded fifo we are currently using.
Signed-off-by: Andrey Grodzovsky Andrey.Grodzovsky@amd.com
Reviewed-by: Christian König christian.koenig@amd.com
Signed-off-by: Alex Deucher alexander.deucher@amd.com
We could go back to the KFIFO design again, but a (double?) linked list is clearly going to show the same performance problems which originally triggered moving away from it.
The spsc_queue was designed and – perhaps – perf-measured when RR was the only scheduling policy.
Since the FIFO rework, where FIFO became the default policy, we now access our super efficient super fast lockless queue most of the time with the spinlock being taken immediately afterwards anyways. So we almost have a locked lockless queue now.
https://elixir.bootlin.com/linux/v6.18-rc4/source/drivers/gpu/drm/scheduler/...
That is not that relevant.
If the lock being there is not relevant, then why can't we just take it and get rid of all those barriers and READ_ONCE's once and for all?
Only push_job() often (?) still runs locklessly, but I'm not at all convinced that this is worth it. Not even performance-wise.
That is what is relevant. And yes the difference was totally measurable, that's why this was originally implemented.
If you look at spsc_queue_push() you see that it
1. disables preemption,
2. uses atomic instructions and
3. has a ton of memory barriers
– in other words, this is almost literally a manual re-implementation of a spinlock, just without mutual exclusion…
The problem is really to separate the push from the pop side so that as few cache lines as possible are transferred from one CPU to another.
With a doubly linked list you can attach at the front and pull from the tail. How is that transferring many cache lines?
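I.e. something like this (sketch; the field names are made up):

/* push side: attach at the front */
spin_lock(&entity->lock);
list_add(&job->list, &entity->jobs);
spin_unlock(&entity->lock);

/*
 * pop side: pull from the tail; only the lock, the list head and the
 * popped entry are touched.
 */
spin_lock(&entity->lock);
job = NULL;
if (!list_empty(&entity->jobs)) {
	job = list_last_entry(&entity->jobs, struct drm_sched_job, list);
	list_del(&job->list);
}
spin_unlock(&entity->lock);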
P.
On 11/10/25 15:20, Philipp Stanner wrote:
On Mon, 2025-11-10 at 15:07 +0100, Christian König wrote:
On 11/10/25 13:27, Philipp Stanner wrote:
Please don't top-post :( FDFY:
On 11/10/25 09:19, Philipp Stanner wrote:
The spsc_queue is an unlocked, highly asynchronous piece of infrastructure. Its inline function spsc_queue_peek() obtains the head entry of the queue.
This access is performed without READ_ONCE() and is, therefore, undefined behavior. In order to prevent the compiler from ever reordering that access, or even optimizing it away, a READ_ONCE() is strictly necessary. This is easily proven by the fact that spsc_queue_pop() uses this very pattern to access the head.
Add READ_ONCE() to spsc_queue_peek().
Cc: stable@vger.kernel.org # v4.16+ Fixes: 27105db6c63a ("drm/amdgpu: Add SPSC queue to scheduler.") Signed-off-by: Philipp Stanner phasta@kernel.org
[…]
Are illegal since you don't have the correct memory barriers any more.
I can't follow. Are you saying that spsc_queue_peek() is correct because you know for sure that when it's used no one else might be changing that pointer?
Correct, yes. That's the whole idea. I mean SPSC stands for single producer single consumer.
I know that it means that (although it's not documented and, funnily enough, I one day realized the meaning while standing under the shower after work).
Anyways, it's not documented that a user must not call drm_sched_entity_push_job() in parallel. It currently basically just works by the coincidence of no one doing that or rather no one running into the race.
Yeah, completely agree.
Even if that were true this design is highly fragile.
Took me an eternity to understand that as well, so bear with me that I didn't previously explain that.
s/explain/document :)
As discussed a few weeks ago with Sima and Tvrtko, what we really need to move to in all of DRM is this development pattern:
1. For parallel code, at first by default use a boring, horribly slow (sarcasm) spinlock. BTW I'm not even convinced that a spinlock is slower than lockless tricks. Paul's book states that a CAS atomic instruction takes about 60 cycles, and taking a lock takes 100.
The problem isn't the burned CPU cycles, but rather the cache lines moved between CPUs.
Which cache lines? The spinlock's?
The queue data needs to move from one CPU to the other in either case. It's the same data that is being moved with spinlock protection.
A spinlock doesn't lead to more cache line moves as long as there's still just a single consumer / producer.
Looking at a couple of examples:
1. spinlock + double linked list (which is what the scheduler used initially).
You have to touch 3-4 different cache lines, the lock, the previous, the current and the next element (next and prev are usually the same with the lock).
2. kfifo (attempt #2):
3 cache lines, one for the array, one for the rptr/wptr and one for the element. Plus the problem that you need to come up with some upper bound for it.
3. spsc (attempt #3)
2-3 cache lines, one for the queue (head/tail), one for the element and one for the previous element (but it is quite likely that this is pre-fetched).
Saying this I just realized we could potentially trivially replace the spsc with a singly linked list + pointer to the end + spinlock and have the same efficiency. We don't need all the lockless stuff for that at all.
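Roughly something like this (untested sketch, reusing struct spsc_node just for its next pointer):

struct job_queue {
	spinlock_t lock;
	struct spsc_node *head;
	struct spsc_node **tail;	/* &head while empty, &last->next otherwise */
};

/* init: head = NULL; tail = &head; spin_lock_init(&lock); */

static void job_queue_push(struct job_queue *q, struct spsc_node *node)
{
	node->next = NULL;

	spin_lock(&q->lock);
	*q->tail = node;
	q->tail = &node->next;
	spin_unlock(&q->lock);
}

static struct spsc_node *job_queue_pop(struct job_queue *q)
{
	struct spsc_node *node;

	spin_lock(&q->lock);
	node = q->head;
	if (node) {
		q->head = node->next;
		if (!q->head)
			q->tail = &q->head;
	}
	spin_unlock(&q->lock);

	return node;
}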
Keep in mind that you can do a fused multiply-add on a full 4x4 matrix in less time than a single cache line miss costs.
2. If you want to do parallelism without locks, you need to justify it really well. "rmb() so list_empty() works without a lock" doesn't qualify, but real use case speedups.
3. If you write lockless infrastructure, you need to document it really well. In particular you need to state:
   1. How it works
   2. What the rules are
See llist.h as an example. It clearly states when you need a lock and when you don't.
The llist.h is actually pretty similar to the SPSC. I'm wondering why they don't have the same issues? E.g. is xchg() providing the memory barriers?
I think that some atomic instructions contain barriers. But I'd have to check.
Or RCU. No one could use it without such good documentation.
I have no idea whether spsc_queue is correctly implemented (I doubt it), and if even a highly experienced dev like you takes "an eternity" (quote) to understand it, one is really tempted to dream of spinlock_t, which has very clear semantics and is easily understood even by beginners.
Question is what should we do?
Easy!
Mid-term, we should replace spsc_queue with a boring, locked, super- slow linked list ^_^
That's what the scheduler started with, and the reason why that linked list was replaced, first with a KFIFO and then with the SPSC, was lacking performance.
Such performance measurements must be included in the commit message, since they justify the entire commit.
Yet this is the entire justification given:
commit 27105db6c63a571b91d01e749d026105a1e63bcf
Author: Andrey Grodzovsky Andrey.Grodzovsky@amd.com
Date:   Thu Oct 12 16:41:39 2017 -0400

drm/amdgpu: Add SPSC queue to scheduler.

It is intended to sabstitute the bounded fifo we are currently using.

Signed-off-by: Andrey Grodzovsky Andrey.Grodzovsky@amd.com
Reviewed-by: Christian König christian.koenig@amd.com
Signed-off-by: Alex Deucher alexander.deucher@amd.com
We could go back to the KFIFO design again, but a (double?) linked list is clearly going to show the same performance problems which originally triggered moving away from it.
The spsc_queue was designed and – perhaps – perf-measured when RR was the only scheduling policy.
Since the FIFO rework, where FIFO became the default policy, we now access our super efficient super fast lockless queue most of the time with the spinlock being taken immediately afterwards anyways. So we almost have a locked lockless queue now.
https://elixir.bootlin.com/linux/v6.18-rc4/source/drivers/gpu/drm/scheduler/...
That is not that relevant.
If the lock being there is not relevant, then why can't we just take it and get rid of all those barriers and READ_ONCE's once and for all?
I think so, yes.
Only push_job() often (?) still runs locklessly, but I'm not at all convinced that this is worth it. Not even performance-wise.
That is what is relevant. And yes the difference was totally measurable, that's why this was originally implemented.
If you look at spsc_queue_push() you see that it
1. disables preemption,
2. uses atomic instructions and
3. has a ton of memory barriers
– in other words, this is almost literally a manual re-implementation of a spinlock, just without mutual exclusion…
The problem is really to separate the push from the pop side so that as few cache lines as possible are transferred from one CPU to another.
With a doubly linked list you can attach at the front and pull from the tail. How is that transferring many cache lines?
See above.
We have some tests for old and trivial use cases (e.g. GLmark2) which by today's standards pretty much only depend on how fast you can push things to the HW.
We could just extend the scheduler test cases to see how many submissions per second we can pump through a dummy implementation where both producer and consumer are nailed to separate CPUs.
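Something along these lines (sketch only, the names are made up and error handling is omitted):

/*
 * Pin a producer and a consumer kthread to different CPUs and count how
 * many dummy submissions per second make it through the queue.
 */
static int __init queue_bench_init(void)
{
	struct task_struct *prod, *cons;

	prod = kthread_create(bench_producer_fn, &bench_ctx, "qbench-prod");
	cons = kthread_create(bench_consumer_fn, &bench_ctx, "qbench-cons");

	kthread_bind(prod, 0);
	kthread_bind(cons, 1);

	wake_up_process(prod);
	wake_up_process(cons);

	return 0;
}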
Regards, Christian.
P.
On Mon, 2025-11-10 at 16:14 +0100, Christian König wrote:
On 11/10/25 15:20, Philipp Stanner wrote:
On Mon, 2025-11-10 at 15:07 +0100, Christian König wrote:
On 11/10/25 13:27, Philipp Stanner wrote: The problem isn't the burned CPU cycles, but rather the cache lines moved between CPUs.
Which cache lines? The spinlock's?
The queue data needs to move from one CPU to the other in either case. It's the same data that is being moved with spinlock protection.
A spinlock doesn't lead to more cache line moves as long as there's still just a single consumer / producer.
Looking at a couple of examples:
- spinlock + double linked list (which is what the scheduler used initially).
You have to touch 3-4 different cache lines, the lock, the previous, the current and the next element (next and prev are usually the same with the lock).
list when pushing:
Lock + head (same cache line) + head->next head->next->next
when popping:
Lock + head + head->previous head->previous->previous
I don't see why you need a "current" element when you're always only touching head or tail.
- kfifo (attempt #2):
3 cache lines, one for the array, one for the rptr/wptr and one for the element. Plus the problem that you need to come up with some upper bound for it.
- spsc (attempt #3)
2-3 cache lines, one for the queue (head/tail), one for the element and one for the previous element (but it is quite likely that this is pre-fetched).
Saying this I just realized we could potentially trivially replace the spsc with a singly linked list + pointer to the end + spinlock and have the same efficiency. We don't need all the lockless stuff for that at all.
Now we're speaking mostly the same language :]
If you could RB my DRM TODO patches we'd have a section for drm/sched, and there we could then soonish add an item for getting rid of spsc.
https://lore.kernel.org/dri-devel/20251107135701.244659-2-phasta@kernel.org/
[…]
The problem is really to separate the push from the pop side so that as few cache lines as possible are transferred from one CPU to another.
With a doubly linked list you can attach at the front and pull from the tail. How is that transferring many cache lines?
See above.
We have some tests for old and trivial use cases (e.g. GLmark2) which by today's standards pretty much only depend on how fast you can push things to the HW.
We could just extend the scheduler test cases to see how many submissions per second we can pump through a dummy implementation where both producer and consumer are nailed to separate CPUs.
I disagree. That would be a microbenchmark for a very narrow use case. It would only tell us that a specific patch slows things down for the microbenchmark, and we could only detect that if a developer runs the unit tests with and without his patches.
The few major reworks that touch such essentials have good realistic tests anyways, see Tvrtko's CFS series.
Lockless magic should always be justified by real world use cases.
By the way, back when spsc_queue was implemented, how large were the real world performance gains you measured by saving that 1 cache line?
P.
On 11/10/25 16:55, Philipp Stanner wrote:
On Mon, 2025-11-10 at 16:14 +0100, Christian König wrote:
On 11/10/25 15:20, Philipp Stanner wrote:
On Mon, 2025-11-10 at 15:07 +0100, Christian König wrote:
On 11/10/25 13:27, Philipp Stanner wrote: The problem isn't the burned CPU cycles, but rather the cache lines moved between CPUs.
Which cache lines? The spinlock's?
The queue data needs to move from one CPU to the other in either case. It's the same data that is being moved with spinlock protection.
A spinlock doesn't lead to more cache line moves as long as there's still just a single consumer / producer.
Looking at a couple of examples:
- spinlock + double linked list (which is what the scheduler used initially).
You have to touch 3-4 different cache lines, the lock, the previous, the current and the next element (next and prev are usually the same with the lock).
list when pushing:
Lock + head (same cache line) + head->next head->next->next
when popping:
Lock + head + head->previous head->previous->previous
I don't see why you need a "current" element when you're always only touching head or tail.
The current element is the one you insert or remove.
- kfifo (attempt #2):
3 cache lines, one for the array, one for the rptr/wptr and one for the element. Plus the problem that you need to come up with some upper bound for it.
- spsc (attempt #3)
2-3 cache lines, one for the queue (head/tail), one for the element and one for the previous element (but it is quite likely that this is pre-fetched).
Saying this I just realized we could potentially trivially replace the spsc with an single linked list+pointer to the end+spinlock and have the same efficiency. We don't need all the lockless stuff for that at all.
Now we're speaking mostly the same language :]
If you could RB my DRM TODO patches we'd have a section for drm/sched, and there we could then soonish add an item for getting rid of spsc.
https://lore.kernel.org/dri-devel/20251107135701.244659-2-phasta@kernel.org/
I can't find that in my inbox anywhere. Can you send it out one more time with my AMD mail address on explicit CC? Thanks in advance.
The problem is really to separate the push from the pop side so that as few cache lines as possible are transferred from one CPU to another.
With a doubly linked list you can attach at the front and pull from the tail. How is that transferring many cache lines?
See above.
We have some tests for old and trivial use cases (e.g. GLmark2) which by today's standards pretty much only depend on how fast you can push things to the HW.
We could just extend the scheduler test cases to see how many submissions per second we can pump through a dummy implementation where both producer and consumer are nailed to separate CPUs.
I disagree. That would be a microbenchmark for a very narrow use case.
That is actually a rather common use case (unfortunately).
It would only tell us that a specific patch slows things down for the microbenchmark, and we could only detect that if a developer runs the unit tests with and without his patches.
I could trigger adding that to AMD's CI systems.
The few major reworks that touch such essentials have good realistic tests anyways, see Tvrtko's CFS series.
Lockless magic should always be justified by real world use cases.
By the way, back when spsc_queue was implemented, how large were the real world performance gains you measured by saving that 1 cache line?
That was actually quite a bit. If you want a real world test case use glMark2 on any modern HW.
And yeah I know how ridiculous that is; the problem is that we still have people using this as an indicator for the command submission overhead.
Regards, Christian.
P.
On Mon, 2025-11-10 at 17:08 +0100, Christian König wrote:
On 11/10/25 16:55, Philipp Stanner wrote:
Lock + head (same cache line) + head->next head->next->next
when popping:
Lock + head + head->previous head->previous->previous
I don't see why you need a "current" element when you're always only touching head or tail.
The current element is the one you insert or remove.
That won't cause a cache miss because you have just created that element in the submitting CPU, owning it exclusively.
Now we're speaking mostly the same language :]
If you could RB my DRM TODO patches we'd have a section for drm/sched, and there we could then soonish add an item for getting rid of spsc.
https://lore.kernel.org/dri-devel/20251107135701.244659-2-phasta@kernel.org/
I can't find that in my inbox anywhere. Can you send it out one more with my AMD mail address on explicit CC? Thanks in advance.
I can see to it. But can't you download the mbox file on the link and import it in your mail client?
Lockless magic should always be justified by real world use cases.
By the way, back when spsc_queue was implemented, how large were the real world performance gains you measured by saving that 1 cache line?
That was actually quite a bit. If you want a real world test case use glMark2 on any modern HW.
And yeah I know how ridiculous that is; the problem is that we still have people using this as an indicator for the command submission overhead.
If we were living in a world where you'd always need 5 cache lines, then that would just be the reality. And 5 would already be better than 8. So what's the deal? It seems this was not about "too slow" but about "faster than".
There are two topics which often make us pay the high price of bugginess and low maintainability. One of them is limitless performance optimization.
I think that correctness always trumps speed. How happy does it make your customer if your driver delivers 5 fps more, but the game crashes 2 times per hour? (which btw happens to me with Steam on my amd card. Sometimes there are even hangs without a reset happening, which is strange. I'll open a ticket next time I see it happen).
P.