On Wednesday, 2025-11-26 at 16:44 +0100, Philipp Stanner wrote:
On Wed, 2025-11-26 at 16:03 +0100, Christian König wrote:
On 11/26/25 13:37, Philipp Stanner wrote:
On Wed, 2025-11-26 at 13:31 +0100, Christian König wrote:
[…]
Well, the question is how do you *reliably* detect that there is still forward progress?
My understanding is that that's impossible, since the internals of command submissions are only really understood by userspace, which submits them.
Right, but we can still try to do our best in the kernel to mitigate the situation.
I think for now amdgpu will implement something like checking whether the HW still makes progress after a timeout, but with only a limited number of retries until we say that's it and reset anyway.
Oh oh, isn't that our dear hang_limit? :)
Not really. The hang limit is the limit on how many times a hanging submit might be retried.
Limiting the number of timeout extensions is more of a safety net against workloads which might appear to the kernel driver to make progress but in reality are stuck. After all, the kernel driver can only have limited knowledge of the GPU state, and any progress check will have limited precision, with false positives/negatives being a part of reality we have to deal with.
We agree that you can never really know whether userspace just submitted a while(true) job, don't we? Even if some GPU register still indicates "progress".
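To make the safety-net idea concrete, here is a minimal sketch in plain C. It is not amdgpu or drm_sched code: `job_watch`, `job_timed_out()` and `MAX_TIMEOUT_EXTENSIONS` are hypothetical names, the limit of 3 is an arbitrary assumption, and "progress" is reduced to an opaque counter that the driver samples at each timeout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limit; the thread does not name a concrete value. */
#define MAX_TIMEOUT_EXTENSIONS 3

enum timeout_action {
	TIMEOUT_EXTEND,	/* re-arm the timer and keep waiting */
	TIMEOUT_RESET,	/* give up and go through reset/recovery */
};

/* Hypothetical per-job bookkeeping for the timeout handler. */
struct job_watch {
	uint64_t last_progress;		/* marker sampled at the previous timeout */
	unsigned int extensions;	/* extensions granted so far */
};

/*
 * Called whenever the job timer fires. progress_now is whatever
 * hardware-dependent marker the driver can sample at runtime.
 */
static enum timeout_action job_timed_out(struct job_watch *w,
					 uint64_t progress_now)
{
	bool progressed = progress_now != w->last_progress;

	w->last_progress = progress_now;

	/* No visible progress since the last check: treat as a hang. */
	if (!progressed)
		return TIMEOUT_RESET;

	/*
	 * Apparent progress, but the check may be a false positive.
	 * Cap the number of extensions so a stuck job cannot stall
	 * the queue forever.
	 */
	if (w->extensions >= MAX_TIMEOUT_EXTENSIONS)
		return TIMEOUT_RESET;

	w->extensions++;
	return TIMEOUT_EXTEND;
}
```

Note that the extension counter here is independent of any retry count for resubmitting a hung job, which is the distinction drawn above.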
Yeah, this really depends on the hardware and on what you can read at runtime.
For etnaviv we define "progress" as the command frontend moving towards the end of the command buffer. As a single draw call in valid workloads can blow through our timeout, we also use debug registers to look at the current primitive ID within a draw call. If userspace submits a workload that requires more than 500ms per primitive to finish, we consider this an invalid workload and go through the reset/recovery motions.
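As a rough illustration of that two-level check (not the actual etnaviv code), the comparison could be sketched as below. `hw_snapshot` and `made_progress()` are hypothetical names, and the sketch assumes the timeout handler samples the frontend address and the primitive ID once per 500ms timeout period. It also simplifies "moving towards the end of the command buffer" to "the address changed".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the state sampled when the timeout fires. */
struct hw_snapshot {
	uint32_t fe_addr;	/* command frontend position in the buffer */
	uint32_t prim_id;	/* current primitive ID within a draw call */
};

/*
 * Returns true if the GPU advanced since the previous sample, and
 * stores the new sample for the next comparison.
 */
static bool made_progress(struct hw_snapshot *last, struct hw_snapshot now)
{
	/*
	 * Either the frontend moved on to other commands, or it is
	 * still inside the same draw call but working on a different
	 * primitive.
	 */
	bool progressed = now.fe_addr != last->fe_addr ||
			  now.prim_id != last->prim_id;

	*last = now;
	return progressed;
}
```

With a handler running every 500ms, a false return means a single primitive took longer than one timeout period, which is the case treated as an invalid workload.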
Regards, Lucas