On 11/10/25 16:55, Philipp Stanner wrote:
On Mon, 2025-11-10 at 16:14 +0100, Christian König wrote:
On 11/10/25 15:20, Philipp Stanner wrote:
On Mon, 2025-11-10 at 15:07 +0100, Christian König wrote:
On 11/10/25 13:27, Philipp Stanner wrote:
The problem isn't the burned CPU cycles, but rather the cache lines moved between CPUs.
Which cache lines? The spinlock's?
The queue data needs to move from one CPU to the other in either case. It's the same data that is being moved with spinlock protection.
A spinlock doesn't lead to more cache line moves as long as there's still just a single consumer / producer.
Looking at a couple of examples:
- spinlock + double linked list (which is what the scheduler used initially).
You have to touch 3-4 different cache lines: the lock, the previous, the current and the next element (the head's next and prev usually sit on the same cache line as the lock). See the sketch after the examples below.
When pushing to the list:
Lock + head (same cache line) + head->next + head->next->next

When popping:
Lock + head + head->previous + head->previous->previous
I don't see why you need a "current" element when you're always only touching head or tail.
The current element is the one you insert or remove.
- kfifo (attempt #2):
3 cache lines, one for the array, one for the rptr/wptr and one for the element. Plus the problem that you need to come up with some upper bound for it.
- spsc (attempt #3)
2-3 cache lines, one for the queue (head/tail), one for the element and one for the previous element (but it is quite likely that this is pre-fetched).
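To make the cache line accounting concrete, here is a minimal sketch of the spinlock + list variant (the job_queue/queued_job names are made up for illustration, and the comments just follow the counting above):

#include <linux/list.h>
#include <linux/spinlock.h>

struct job_queue {
	spinlock_t lock;		/* head usually shares this cache line */
	struct list_head head;
};

struct queued_job {
	struct list_head node;		/* lives in the element's own cache line */
	/* payload ... */
};

static void job_queue_push(struct job_queue *q, struct queued_job *job)
{
	spin_lock(&q->lock);
	/*
	 * list_add_tail() writes job->node, q->head and the old tail
	 * element's pointers, i.e. up to 3-4 distinct cache lines
	 * including the lock.
	 */
	list_add_tail(&job->node, &q->head);
	spin_unlock(&q->lock);
}

static struct queued_job *job_queue_pop(struct job_queue *q)
{
	struct queued_job *job = NULL;

	spin_lock(&q->lock);		/* same lock line, now bouncing to the consumer CPU */
	if (!list_empty(&q->head)) {
		job = list_first_entry(&q->head, struct queued_job, node);
		/* touches the element, the head and the element's neighbour */
		list_del(&job->node);
	}
	spin_unlock(&q->lock);

	return job;
}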
Saying this I just realized we could potentially trivially replace the spsc with a singly linked list + pointer to the tail + spinlock and have the same efficiency (see the sketch below). We don't need all the lockless stuff for that at all.
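Roughly something like this (only a sketch, all names made up, not a drop-in for the existing spsc_queue API):

#include <linux/spinlock.h>

/*
 * Singly linked list + tail pointer + spinlock. The producer only appends
 * at the tail, the consumer only pops at the head, so each side mostly
 * touches its own end of the queue plus the lock.
 */
struct sl_node {
	struct sl_node *next;
};

struct sl_queue {
	spinlock_t lock;
	struct sl_node *head;		/* consumer side */
	struct sl_node *tail;		/* producer side */
};

static void sl_queue_push(struct sl_queue *q, struct sl_node *node)
{
	node->next = NULL;

	spin_lock(&q->lock);
	if (q->tail)
		q->tail->next = node;	/* previous tail element */
	else
		q->head = node;		/* queue was empty */
	q->tail = node;
	spin_unlock(&q->lock);
}

static struct sl_node *sl_queue_pop(struct sl_queue *q)
{
	struct sl_node *node;

	spin_lock(&q->lock);
	node = q->head;
	if (node) {
		q->head = node->next;
		if (!q->head)
			q->tail = NULL;	/* popped the last element */
	}
	spin_unlock(&q->lock);

	return node;
}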
Now we're speaking mostly the same language :]
If you could RB my DRM TODO patches we'd have a section for drm/sched, and there we could then soonish add an item for getting rid of spsc.
https://lore.kernel.org/dri-devel/20251107135701.244659-2-phasta@kernel.org/
I can't find that in my inbox anywhere. Can you send it out once more with my AMD mail address on explicit CC? Thanks in advance.
The problem is really to separate the push from the pop side so that as few cache lines as possible are transferred from one CPU to another.
With a doubly linked list you can attach at the front and pull from the tail. How is that transferring many cache lines?
See above.
We have some tests for old and trivial use cases (e.g. glmark2) which by today's standards pretty much only depend on how fast you can push things to the HW.
We could just extend the scheduler test cases to see how many submissions per second we can pump through a dummy implementation where both producer and consumer are nailed to separate CPUs.
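Roughly along these lines (a sketch only, not existing test code; it reuses the sl_queue sketch from above, and the CPU numbers and names are arbitrary; cleanup and error handling are omitted, hook the run function up via module_init() in a real test module):

#include <linux/kthread.h>
#include <linux/ktime.h>
#include <linux/slab.h>

#define BENCH_NODES (1 << 16)

static struct sl_node *bench_nodes;

static int bench_producer(void *data)
{
	struct sl_queue *q = data;
	size_t i;

	for (i = 0; i < BENCH_NODES && !kthread_should_stop(); i++)
		sl_queue_push(q, &bench_nodes[i]);
	return 0;
}

static int bench_consumer(void *data)
{
	struct sl_queue *q = data;
	ktime_t start = ktime_get();
	size_t seen = 0;

	while (seen < BENCH_NODES && !kthread_should_stop()) {
		if (sl_queue_pop(q))
			seen++;
		else
			cpu_relax();
	}
	pr_info("queue bench: %zu pops in %lld us\n",
		seen, ktime_us_delta(ktime_get(), start));
	return 0;
}

static int queue_bench_run(void)
{
	static struct sl_queue q;
	struct task_struct *prod, *cons;

	spin_lock_init(&q.lock);
	bench_nodes = kcalloc(BENCH_NODES, sizeof(*bench_nodes), GFP_KERNEL);
	if (!bench_nodes)
		return -ENOMEM;

	/* Any two distinct CPUs will do, ideally on different cores/packages. */
	cons = kthread_create(bench_consumer, &q, "queue-bench-cons");
	prod = kthread_create(bench_producer, &q, "queue-bench-prod");
	kthread_bind(cons, 0);
	kthread_bind(prod, 1);
	wake_up_process(cons);
	wake_up_process(prod);

	return 0;
}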
I disagree. That would be a microbenchmark for a very narrow use case.
That is actually a rather common use case (unfortunately).
It would only tell us that a specific patch slows things down for the microbenchmark, and we could only detect that if a developer runs the unit tests with and without his patches.
I could trigger adding that to AMD's CI systems.
The few major reworks that touch such essentials have good realistic tests anyway, see Tvrtko's CFS series.
Lockless magic should always be justified by real world use cases.
By the way, back when spsc_queue was implemented, how large were the real-world performance gains you measured by saving that 1 cache line?
That was actually quite a bit. If you want a real-world test case, use glmark2 on any modern HW.
And yeah, I know how ridiculous that is; the problem is that we still have people using this as an indicator for the command submission overhead.
Regards, Christian.
P.