On 11/25/25 18:02, Lucas Stach wrote:
I agree that distinguishing the use case that way is not ideal. However, who has the knowledge of how the hardware is being used by customers / users, if not the driver?
Well, the end user.
Maybe we should move the whole timeout topic into the DRM layer or the scheduler component.
Something like a 2 second default (which BTW is the default on Windows as well), which can be overridden on a global, per-device, or per-queue-name basis.
And a 10 second maximum with only a warning that a non-default timeout is in use; everything above 10 seconds taints the kernel and should really only be used for testing/debugging.
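A minimal sketch of how that policy could look at the driver level, assuming a hypothetical module parameter (the names and the exact override plumbing for global/per-device/per-queue are made up for illustration):

/* Hypothetical sketch of the proposed policy: 2 second default, warn on
 * any override, taint the kernel above a 10 second cap. Parameter and
 * function names are made up.
 */
#include <linux/device.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/module.h>

static unsigned int job_timeout_ms = 2000;	/* proposed default */
module_param(job_timeout_ms, uint, 0444);
MODULE_PARM_DESC(job_timeout_ms, "job timeout in msec (default 2000)");

static long foo_job_timeout_jiffies(struct device *dev)
{
	if (job_timeout_ms != 2000)
		dev_warn(dev, "non-default job timeout of %u ms in use\n",
			 job_timeout_ms);

	if (job_timeout_ms > 10000) {
		dev_warn(dev, "job timeout above 10s, for testing/debugging only\n");
		add_taint(TAINT_USER, LOCKDEP_STILL_OK);
	}

	return msecs_to_jiffies(job_timeout_ms);
}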
The question really is: what do you want to do after you hit the (lowered) timeout? Users get grumpy if you block things for 10 seconds, but they get equally if not more grumpy when you kick out a valid workload that just happens to need a lot of GPU time.
Yeah, exactly that summarizes the problem pretty well.
Fences are only defined to signal eventually, with no real concept of a timeout. IMO all timeouts waiting for fences should be long enough to only be considered a last resort. You may want to give the user some indication of a failed fence wait instead of stalling indefinitely, but you really only want to do this after quite a long timeout, not in the sense of "Sorry, I ran out of patience after 2 seconds".
Sure memory management depends on fences making forward progress, but mm also depends on scheduled writeback making forward progress. You don't kick out writeback requests after an arbitrary timeout just because the backing storage happens to be loaded heavily.
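A minimal sketch of what "warn but keep waiting" could look like on the driver side (purely illustrative; the 60 second threshold and the helper name are made up):

/* Illustrative sketch: wait with a generous timeout, warn the user once
 * if it expires, then keep waiting for the eventual signal instead of
 * declaring the fence dead.
 */
#include <linux/dma-fence.h>
#include <linux/jiffies.h>
#include <linux/printk.h>
#include <linux/sched.h>

static int foo_wait_fence_patiently(struct dma_fence *fence)
{
	signed long ret = dma_fence_wait_timeout(fence, true, 60 * HZ);

	if (ret == 0) {
		pr_warn("fence %llu:%llu not signaled after 60s, still waiting\n",
			fence->context, fence->seqno);
		ret = dma_fence_wait_timeout(fence, true, MAX_SCHEDULE_TIMEOUT);
	}

	return ret < 0 ? ret : 0;
}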
This BTW is also why etnaviv has always had a quite short timeout of 500ms, with the option to extend the timeout when the GPU is still making progress. We don't ever want to shoot down valid workloads (we have some that need a few seconds to upload textures, etc. on our wimpy GPU), but you also don't want to wait multiple seconds before you detect a real GPU hang.
That is a really good point. We considered that as well, but then abandoned the idea, see below for the background.
What we could also do is set a flag on the fence when a process is killed and then wait for that fence to signal so that it can clean up. Going to prototype that.
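A rough sketch of that idea, assuming a driver-private flag bit on the fence (bits from DMA_FENCE_FLAG_USER_BITS upwards are reserved for drivers; the hook and all names here are made up):

#include <linux/bitops.h>
#include <linux/dma-fence.h>

/* Driver-private bit on dma_fence::flags marking that the owning
 * process was killed.
 */
#define FOO_FENCE_FLAG_OWNER_KILLED	DMA_FENCE_FLAG_USER_BITS

/* Hypothetical hook called when the owning process goes away: mark the
 * fence, then block until it signals so cleanup only runs afterwards.
 */
static void foo_fence_owner_killed(struct dma_fence *fence)
{
	set_bit(FOO_FENCE_FLAG_OWNER_KILLED, &fence->flags);
	dma_fence_wait(fence, false);
}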
So we use the short scheduler timeout to check in on the GPU and see if it is still making progress (for graphics workloads by looking at the frontend position within the command buffer and the current primitive ID). If we can deduce that the GPU is stuck, we do the usual reset/recovery dance within a reasonable reaction time, acceptable to users hitting a real GPU hang. But if the GPU is making progress we will grant an unlimited number of timeout extensions with no global timeout at all, only fulfilling the eventual-signaling guarantee of the fence.
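Roughly, the progress check boils down to something like this (a simplified sketch of the pattern, not the actual etnaviv code; the register accessors, struct fields and the surrounding scheduler plumbing are placeholders):

#include <linux/types.h>

struct foo_gpu {
	u32 hangcheck_fe_addr;	/* frontend position at last timeout */
	u32 hangcheck_prim_id;	/* primitive ID at last timeout */
	/* ... */
};

/* Placeholder register accessors, assumed to exist elsewhere. */
u32 foo_read_fe_address(struct foo_gpu *gpu);
u32 foo_read_primitive_id(struct foo_gpu *gpu);

/* Called from the scheduler timeout handler. Returns true if the GPU
 * moved since the last timeout, in which case the driver extends the
 * timeout instead of resetting; otherwise it does the reset/recovery
 * dance.
 */
static bool foo_gpu_made_progress(struct foo_gpu *gpu)
{
	u32 fe_addr = foo_read_fe_address(gpu);
	u32 prim_id = foo_read_primitive_id(gpu);

	if (fe_addr == gpu->hangcheck_fe_addr &&
	    prim_id == gpu->hangcheck_prim_id)
		return false;	/* stuck: trigger reset/recovery */

	gpu->hangcheck_fe_addr = fe_addr;
	gpu->hangcheck_prim_id = prim_id;
	return true;		/* progress: grant another timeout period */
}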
Well, the question is how do you *reliably* detect that there is still forward progress?
I mean, with the DMA engines we can trivially submit work which copies petabytes and needs hours or even a day to complete.
Without a global timeout that is a really nice denial-of-service attack against the system if you don't catch it.
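To put a rough number on it (made-up figures, just to illustrate the scale):

/* Back-of-the-envelope: even at a generous 100 GB/s, a single 1 PB copy
 * keeps the engine legitimately busy for hours.
 */
#include <stdio.h>

int main(void)
{
	const double bytes = 1e15;		/* 1 PB */
	const double bandwidth = 100e9;		/* 100 GB/s */
	double seconds = bytes / bandwidth;	/* 10000 s */

	printf("1 PB at 100 GB/s: %.1f hours\n", seconds / 3600.0);
	return 0;
}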
Thanks, Christian.
Regards, Lucas