On 11/25/25 18:02, Lucas Stach wrote:
I agree that distinguishing the use case that way is not ideal. However, who has the knowledge of how the hardware is being used by customers / users, if not the driver?
Well, the end user.
Maybe we should move the whole timeout topic into the DRM layer or the scheduler component.
Something like a 2 second default (which BTW is the default on Windows as well), which can be overridden on a global, per-device, or per-queue-name basis.
And a 10 second maximum with only a warning that a non-default timeout is in use; everything above 10 seconds taints the kernel and should really only be used for testing/debugging.
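A minimal sketch of how that policy could look at the driver level, assuming a hypothetical module parameter (the names and the exact override plumbing for global/per-device/per-queue are made up for illustration):

/* Hypothetical sketch of the proposed policy: 2 second default, warn on
 * any override, taint the kernel above a 10 second cap. Parameter and
 * function names are made up.
 */
#include <linux/device.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/module.h>

static unsigned int job_timeout_ms = 2000;	/* proposed default */
module_param(job_timeout_ms, uint, 0444);
MODULE_PARM_DESC(job_timeout_ms, "job timeout in msec (default 2000)");

static long foo_job_timeout_jiffies(struct device *dev)
{
	if (job_timeout_ms != 2000)
		dev_warn(dev, "non-default job timeout of %u ms in use\n",
			 job_timeout_ms);

	if (job_timeout_ms > 10000) {
		dev_warn(dev, "job timeout above 10s, for testing/debugging only\n");
		add_taint(TAINT_USER, LOCKDEP_STILL_OK);
	}

	return msecs_to_jiffies(job_timeout_ms);
}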
The question really is: what do you want to do after you hit the (lowered) timeout? Users get grumpy if you block things for 10 seconds, but they get equally if not more grumpy when you kick out a valid workload that just happens to need a lot of GPU time.
Yeah, exactly that summarizes the problem pretty well.
Fences are only defined to signal eventually, with no real concept of a timeout. IMO all timeouts waiting for fences should be long enough to only be considered a last resort. You may want to give the user some indication of a failed fence wait instead of stalling indefinitely, but you really only want to do this after quite a long timeout, not in the sense of "Sorry, I ran out of patience after 2 seconds".
Sure memory management depends on fences making forward progress, but mm also depends on scheduled writeback making forward progress. You don't kick out writeback requests after an arbitrary timeout just because the backing storage happens to be loaded heavily.
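A minimal sketch of what "warn but keep waiting" could look like on the driver side (purely illustrative; the 60 second threshold and the helper name are made up):

/* Illustrative sketch: wait with a generous timeout, warn the user once
 * if it expires, then keep waiting for the eventual signal instead of
 * declaring the fence dead.
 */
#include <linux/dma-fence.h>
#include <linux/jiffies.h>
#include <linux/printk.h>
#include <linux/sched.h>

static int foo_wait_fence_patiently(struct dma_fence *fence)
{
	signed long ret = dma_fence_wait_timeout(fence, true, 60 * HZ);

	if (ret == 0) {
		pr_warn("fence %llu:%llu not signaled after 60s, still waiting\n",
			fence->context, fence->seqno);
		ret = dma_fence_wait_timeout(fence, true, MAX_SCHEDULE_TIMEOUT);
	}

	return ret < 0 ? ret : 0;
}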
This BTW is also why etnaviv has always had a quite short timeout of 500ms, with the option to extend the timeout when the GPU is still making progress. We don't ever want to shoot down valid workloads (we have some that need a few seconds to upload textures, etc. on our wimpy GPU), but you also don't want to wait multiple seconds before you detect a real GPU hang.
That is a really good point. We considered that as well, but then abandoned the idea, see below for the background.
What we could also do is set a flag on the fence when a process is killed and then wait for that fence to signal so that it can clean up. Going to prototype that.
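A rough sketch of that idea, assuming a driver-private flag bit on the fence (bits from DMA_FENCE_FLAG_USER_BITS upwards are reserved for drivers; the hook and all names here are made up):

#include <linux/bitops.h>
#include <linux/dma-fence.h>

/* Driver-private bit on dma_fence::flags marking that the owning
 * process was killed.
 */
#define FOO_FENCE_FLAG_OWNER_KILLED	DMA_FENCE_FLAG_USER_BITS

/* Hypothetical hook called when the owning process goes away: mark the
 * fence, then block until it signals so cleanup only runs afterwards.
 */
static void foo_fence_owner_killed(struct dma_fence *fence)
{
	set_bit(FOO_FENCE_FLAG_OWNER_KILLED, &fence->flags);
	dma_fence_wait(fence, false);
}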
So we use the short scheduler timeout to check in on the GPU and see if it is still making progress (for graphics workloads by looking at the frontend position within the command buffer and the current primitive ID). If we can deduce that the GPU is stuck, we do the usual reset/recovery dance within a reasonable reaction time, acceptable to users hitting a real GPU hang. But if the GPU is making progress we will grant an unlimited number of timeout extensions with no global timeout at all, only fulfilling the eventual-signaling guarantee of the fence.
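Roughly, the progress check boils down to something like this (a simplified sketch of the pattern, not the actual etnaviv code; the register accessors, struct fields and the surrounding scheduler plumbing are placeholders):

#include <linux/types.h>

struct foo_gpu {
	u32 hangcheck_fe_addr;	/* frontend position at last timeout */
	u32 hangcheck_prim_id;	/* primitive ID at last timeout */
	/* ... */
};

/* Placeholder register accessors, assumed to exist elsewhere. */
u32 foo_read_fe_address(struct foo_gpu *gpu);
u32 foo_read_primitive_id(struct foo_gpu *gpu);

/* Called from the scheduler timeout handler. Returns true if the GPU
 * moved since the last timeout, in which case the driver extends the
 * timeout instead of resetting; otherwise it does the reset/recovery
 * dance.
 */
static bool foo_gpu_made_progress(struct foo_gpu *gpu)
{
	u32 fe_addr = foo_read_fe_address(gpu);
	u32 prim_id = foo_read_primitive_id(gpu);

	if (fe_addr == gpu->hangcheck_fe_addr &&
	    prim_id == gpu->hangcheck_prim_id)
		return false;	/* stuck: trigger reset/recovery */

	gpu->hangcheck_fe_addr = fe_addr;
	gpu->hangcheck_prim_id = prim_id;
	return true;		/* progress: grant another timeout period */
}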
Well, the question is how do you *reliably* detect that there is still forward progress?
I mean, with the DMA engines we can trivially submit work which copies petabytes and needs hours or even a day to complete.
Without a global timeout that is a really nice denial-of-service attack against the system if you don't catch it.
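To put a rough number on it (made-up figures, just to illustrate the scale):

/* Back-of-the-envelope: even at a generous 100 GB/s, a single 1 PB copy
 * keeps the engine legitimately busy for hours.
 */
#include <stdio.h>

int main(void)
{
	const double bytes = 1e15;		/* 1 PB */
	const double bandwidth = 100e9;		/* 100 GB/s */
	double seconds = bytes / bandwidth;	/* 10000 s */

	printf("1 PB at 100 GB/s: %.1f hours\n", seconds / 3600.0);
	return 0;
}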
Thanks, Christian.
Regards, Lucas