On 11/26/25 13:37, Philipp Stanner wrote:
On Wed, 2025-11-26 at 13:31 +0100, Christian König wrote:
On 11/25/25 18:02, Lucas Stach wrote:
I agree that distinguishing the use case that way is not ideal. However, who has the knowledge of how the hardware is being used by customers / users, if not the driver?
Well the end user.
Maybe we should move the whole timeout topic into the DRM layer or the scheduler component.
Something like a 2 second default (which BTW is also the default on Windows), which can be overridden on a global, per-device, or per-queue-name basis.
And a 10 second maximum, with only a warning that a non-default timeout is in use; everything above 10 seconds taints the kernel and should really only be used for testing/debugging.
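Roughly, such a policy could be centralized in a small validation helper. A minimal sketch, assuming a hypothetical drm_sched_validate_timeout() and made-up constants (nothing like this exists in the scheduler today):

#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/panic.h>

/* Made-up constants for the proposed policy. */
#define DRM_SCHED_TIMEOUT_DEFAULT_MS	2000	/* matches the Windows default */
#define DRM_SCHED_TIMEOUT_MAX_MS	10000

/* Hypothetical helper: fall back to the default, warn on any override,
 * taint the kernel above the maximum.
 */
static unsigned long drm_sched_validate_timeout(unsigned int timeout_ms)
{
	if (!timeout_ms)
		timeout_ms = DRM_SCHED_TIMEOUT_DEFAULT_MS;

	if (timeout_ms != DRM_SCHED_TIMEOUT_DEFAULT_MS)
		pr_warn("drm/sched: non-default job timeout of %u ms in use\n",
			timeout_ms);

	if (timeout_ms > DRM_SCHED_TIMEOUT_MAX_MS) {
		pr_warn("drm/sched: timeout above 10s, use only for testing/debugging\n");
		add_taint(TAINT_USER, LOCKDEP_STILL_OK);
	}

	return msecs_to_jiffies(timeout_ms);
}

The global/per-device/per-queue override lookup would happen before this, with the winning value passed in as timeout_ms.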
The question really is what you want to do after you hit the (lowered) timeout. Users get grumpy if you block things for 10 seconds, but they get equally grumpy, if not more so, when you kick out a valid workload that just happens to need a lot of GPU time.
Yeah, exactly that summarizes the problem pretty well.
Fences are only defined to signal eventually, with no real concept of a timeout. IMO all timeouts waiting for fences should be long enough to only be considered a last resort. You may want to give the user some indication of a failed fence wait instead of stalling indefinitely, but you really only want to do this after a quite long timeout, not in the sense of "Sorry, I ran out of patience after 2 seconds".
Sure memory management depends on fences making forward progress, but mm also depends on scheduled writeback making forward progress. You don't kick out writeback requests after an arbitrary timeout just because the backing storage happens to be loaded heavily.
This BTW is also why etnaviv has always had a quite short timeout of 500ms, with the option to extend the timeout when the GPU is still making progress. We don't ever want to shoot down valid workloads (we have some that need a few seconds to upload textures, etc. on our wimpy GPU), but we also don't want to wait multiple seconds until we detect a real GPU hang.
That is a really good point. We considered that as well, but then abandoned the idea, see below for the background.
What we could also do is set a flag on the fence when a process is killed and then wait for that fence to signal so that it can clean up. Going to prototype that.
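The rough shape of that idea as a sketch: a driver-private bit in dma_fence::flags (the flag name and the kill hook are made up; only DMA_FENCE_FLAG_USER_BITS is real):

#include <linux/dma-fence.h>

/* Driver-private bit, allocated from the user range of dma_fence::flags. */
#define MY_FENCE_FLAG_KILLED	(DMA_FENCE_FLAG_USER_BITS + 0)

/* Hypothetical hook, called when the submitting process gets killed. */
static void my_fence_mark_killed(struct dma_fence *fence)
{
	set_bit(MY_FENCE_FLAG_KILLED, &fence->flags);
}

/* Cleanup path: block on flagged fences with no timeout, relying only on
 * the eventual-signaling guarantee.
 */
static long my_fence_cleanup_wait(struct dma_fence *fence)
{
	if (!test_bit(MY_FENCE_FLAG_KILLED, &fence->flags))
		return 0;

	return dma_fence_wait(fence, true);
}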
So we use the short scheduler timeout to check in on the GPU and see if it is still making progress (for graphics workloads, by looking at the frontend position within the command buffer and the current primitive ID). If we can deduce that the GPU is stuck, we do the usual reset/recovery dance within a reaction time that is acceptable to users hitting a real GPU hang. But if the GPU is making progress, we will grant an unlimited number of timeout extensions with no global timeout at all, only fulfilling the eventual-signaling guarantee of the fence.
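Heavily simplified, the handler is shaped roughly like this (the my_gpu struct and the accessors are stand-ins for driver specifics, not the actual etnaviv code):

#include <drm/gpu_scheduler.h>

struct my_gpu {
	u32 hangcheck_fe_addr;	/* frontend position at the last check */
	u32 hangcheck_prim_id;	/* primitive ID at the last check */
};

/* Hypothetical accessors standing in for the real register reads. */
u32 my_gpu_read_fe_address(struct my_gpu *gpu);
u32 my_gpu_read_primitive_id(struct my_gpu *gpu);
struct my_gpu *to_my_gpu(struct drm_sched_job *job);
void my_gpu_recover(struct my_gpu *gpu);

static enum drm_gpu_sched_stat
my_sched_timedout_job(struct drm_sched_job *sched_job)
{
	struct my_gpu *gpu = to_my_gpu(sched_job);
	u32 fe_addr = my_gpu_read_fe_address(gpu);
	u32 prim_id = my_gpu_read_primitive_id(gpu);

	/* Frontend or primitive ID moved: the GPU is still making progress,
	 * so extend the timeout instead of resetting (job re-queueing
	 * details elided).
	 */
	if (fe_addr != gpu->hangcheck_fe_addr ||
	    prim_id != gpu->hangcheck_prim_id) {
		gpu->hangcheck_fe_addr = fe_addr;
		gpu->hangcheck_prim_id = prim_id;
		return DRM_GPU_SCHED_STAT_NOMINAL;
	}

	/* No progress: the usual reset/recovery dance. */
	my_gpu_recover(gpu);
	return DRM_GPU_SCHED_STAT_NOMINAL;
}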
Well, the question is: how do you *reliably* detect that there is still forward progress?
My understanding is that that's impossible, since the internals of command submissions are only really understood by the userspace that submits them.
Right, but we can still try to do our best in the kernel to mitigate the situation.
I think for now amdgpu will implement something like checking whether the HW still makes progress after a timeout, but with only a limited number of retries until we say that's it and reset anyway.
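Something along these lines (all names made up, not actual amdgpu code):

#include <linux/types.h>

#define MY_MAX_TIMEOUT_EXTENSIONS	3	/* made-up cap */

struct my_ring {
	u64 last_pos;			/* HW position at the last timeout */
	unsigned int extensions;	/* extensions granted so far */
};

u64 my_ring_read_position(struct my_ring *ring);	/* hypothetical */

/* Called from the timeout handler: extend while the HW advances, but only
 * a limited number of times, then reset anyway.
 */
static bool my_ring_should_reset(struct my_ring *ring)
{
	u64 pos = my_ring_read_position(ring);

	if (pos != ring->last_pos &&
	    ring->extensions < MY_MAX_TIMEOUT_EXTENSIONS) {
		ring->last_pos = pos;
		ring->extensions++;
		return false;
	}

	ring->extensions = 0;
	return true;
}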
I think the long-term solution can only be fully-fledged GPU scheduling with preemption. That's also why we don't need such a timeout mechanism for userspace processes on the CPU: the scheduler simply interrupts them and lets someone else run.
Yeah absolutely.
My hope would be that in the mid-term future we'd get firmware rings that can be preempted through a firmware call for all major hardware. Then a huge share of our problems would disappear.
At least on AMD HW, preemption is actually horribly unreliable as well.
Userspace basically needs to cooperate and provide a buffer into which the state is saved on preemption.
With the current situation, IDK either. My impression so far is that letting the drivers and driver programmers decide is the least bad choice.
Yeah, agree. It's the least evil thing we can do.
But I now have a plan for how to proceed :)
Thanks for the input, Christian.
P.