Hi everybody,
we have documented here https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#dma-fence-cro... that dma_fence objects must signal in a reasonable amount of time, but at the same time note that drivers might have a different idea of what reasonable means.
Recently I realized that this is actually not a good idea. The background is that the wall-clock timeout means that, for example, the OOM killer might have to wait for this timeout before it can terminate a process and reclaim its memory. And this is just one example of how general kernel features might depend on it.
Some drivers and fence implementations used 10 seconds, and that raised complaints from end users. So at least amdgpu recently switched to 2 seconds, which triggered an internal discussion about it.
This patch set now adds a define to the dma_fence header which gives 2 seconds as a reasonable amount of time. SW-sync is modified to always taint the kernel (since it doesn't have a timeout), VGEM is switched over to the new define, and the scheduler gets a warning and taints the kernel if a driver uses a timeout longer than that.
I don't have much intention of actually committing the patches (maybe except the SW-sync one), but the question is whether 2 seconds is reasonable?
Regards, Christian.
Add a define implementations can use as a reasonable maximum signaling timeout. Document that implementations should taint the kernel if this timeout is exceeded through config options.
Tainting the kernel is important for bug reports, to detect that end users might be using a problematic configuration.
Signed-off-by: Christian König christian.koenig@amd.com
---
 include/linux/dma-fence.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 64639e104110..b31dfa501c84 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -28,6 +28,20 @@ struct dma_fence_ops;
 struct dma_fence_cb;
 struct seq_file;
 
+/**
+ * define DMA_FENCE_MAX_REASONABLE_TIMEOUT - max reasonable signaling timeout
+ *
+ * The dma_fence object has a deep interdependency with core memory
+ * management; for a detailed explanation see section DMA Fences under
+ * Documentation/driver-api/dma-buf.rst.
+ *
+ * Because of this, all dma_fence implementations must guarantee that each
+ * fence completes in a finite time. This define gives a reasonable value for
+ * the timeout to use. It is possible to use a longer timeout in an
+ * implementation, but that should taint the kernel.
+ */
+#define DMA_FENCE_MAX_REASONABLE_TIMEOUT (2*HZ)
+
 /**
  * struct dma_fence - software synchronization primitive
  * @refcount: refcount for this fence
On Thu, 2025-11-20 at 15:41 +0100, Christian König wrote:
Add a define implementations can use as reasonable maximum signaling timeout. Document that if this timeout is exceeded by config options implementations should taint the kernel.
Tainting the kernel is important for bug reports to detect that end users might be using a problematic configuration.
Signed-off-by: Christian König christian.koenig@amd.com
 include/linux/dma-fence.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 64639e104110..b31dfa501c84 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -28,6 +28,20 @@ struct dma_fence_ops;
 struct dma_fence_cb;
 struct seq_file;
+/**
+ * define DMA_FENCE_MAX_REASONABLE_TIMEOUT - max reasonable signaling timeout
+ *
+ * The dma_fence object has a deep inter dependency with core memory
+ * management, for a detailed explanation see section DMA Fences under
+ * Documentation/driver-api/dma-buf.rst.
+ *
+ * Because of this all dma_fence implementations must guarantee that each fence
+ * completes in a finite time. This define here now gives a reasonable value for
+ * the timeout to use. It is possible to use a longer timeout in an
+ * implementation but that should taint the kernel.
+ */
+#define DMA_FENCE_MAX_REASONABLE_TIMEOUT (2*HZ)
HZ can change depending on the config. Is that really a good choice? I could see racy situations arising in some configs vs others
P.
On 11/25/25 08:55, Philipp Stanner wrote:
On Thu, 2025-11-20 at 15:41 +0100, Christian König wrote:
Add a define implementations can use as reasonable maximum signaling timeout. Document that if this timeout is exceeded by config options implementations should taint the kernel.
Tainting the kernel is important for bug reports to detect that end users might be using a problematic configuration.
Signed-off-by: Christian König christian.koenig@amd.com
 include/linux/dma-fence.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 64639e104110..b31dfa501c84 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -28,6 +28,20 @@ struct dma_fence_ops;
 struct dma_fence_cb;
 struct seq_file;
+/**
+ * define DMA_FENCE_MAX_REASONABLE_TIMEOUT - max reasonable signaling timeout
+ *
+ * The dma_fence object has a deep inter dependency with core memory
+ * management, for a detailed explanation see section DMA Fences under
+ * Documentation/driver-api/dma-buf.rst.
+ *
+ * Because of this all dma_fence implementations must guarantee that each fence
+ * completes in a finite time. This define here now gives a reasonable value for
+ * the timeout to use. It is possible to use a longer timeout in an
+ * implementation but that should taint the kernel.
+ */
+#define DMA_FENCE_MAX_REASONABLE_TIMEOUT (2*HZ)
HZ can change depending on the config. Is that really a good choice? I could see racy situations arising in some configs vs others
2*HZ is always two seconds expressed in number of jiffies; I can use msecs_to_jiffies(2000) to make that more obvious.
The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY which is currently just 1 second.
The real question is what is the maximum amount of time we can wait for the HW before we should trigger a timeout?
Some AMD internal team is pushing for 10 seconds, but that also means that, for example, we wait 10 seconds for the OOM killer to do something. That sounds way too long.
Regards, Christian.
P.
On Tue, 2025-11-25 at 09:03 +0100, Christian König wrote:
On 11/25/25 08:55, Philipp Stanner wrote:
+/**
+ * define DMA_FENCE_MAX_REASONABLE_TIMEOUT - max reasonable signaling timeout
+ *
+ * The dma_fence object has a deep inter dependency with core memory
+ * management, for a detailed explanation see section DMA Fences under
+ * Documentation/driver-api/dma-buf.rst.
+ *
+ * Because of this all dma_fence implementations must guarantee that each fence
+ * completes in a finite time. This define here now gives a reasonable value for
+ * the timeout to use. It is possible to use a longer timeout in an
+ * implementation but that should taint the kernel.
+ */
+#define DMA_FENCE_MAX_REASONABLE_TIMEOUT (2*HZ)
HZ can change depending on the config. Is that really a good choice? I could see racy situations arising in some configs vs others
2*HZ is always two seconds expressed in number of jiffies, I can use msecs_to_jiffies(2000) to make that more obvious.
On AMD64 maybe. What about the other architectures?
The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY which is currently just 1 second.
The real question is what is the maximum amount of time we can wait for the HW before we should trigger a timeout?
That's a question only the drivers can answer, which is why I like to think that setting global constants constraining all parties is not the right thing to do.
What is even your motivation? What problem does this solve? Is the OOM killer currently hanging for anyone? Can you link a bug report?
Some AMD internal team is pushing for 10 seconds, but that also means that, for example, we wait 10 seconds for the OOM killer to do something. That sounds way too long.
Nouveau has timeout = 10 seconds. AFAIK we've never seen bugs because of that. Have you seen some?
P.
On 11/25/25 09:13, Philipp Stanner wrote:
On Tue, 2025-11-25 at 09:03 +0100, Christian König wrote:
On 11/25/25 08:55, Philipp Stanner wrote:
+/**
+ * define DMA_FENCE_MAX_REASONABLE_TIMEOUT - max reasonable signaling timeout
+ *
+ * The dma_fence object has a deep inter dependency with core memory
+ * management, for a detailed explanation see section DMA Fences under
+ * Documentation/driver-api/dma-buf.rst.
+ *
+ * Because of this all dma_fence implementations must guarantee that each fence
+ * completes in a finite time. This define here now gives a reasonable value for
+ * the timeout to use. It is possible to use a longer timeout in an
+ * implementation but that should taint the kernel.
+ */
+#define DMA_FENCE_MAX_REASONABLE_TIMEOUT (2*HZ)
HZ can change depending on the config. Is that really a good choice? I could see racy situations arising in some configs vs others
2*HZ is always two seconds expressed in number of jiffies, I can use msecs_to_jiffies(2000) to make that more obvious.
On AMD64 maybe. What about the other architectures?
HZ is defined as jiffies per second. So even if it changes to 10, 100 or 1000 depending on the architecture, 2*HZ is always two seconds expressed in jiffies.
The HZ define is actually there to make it architecture independent.
The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY which is currently just 1 second.
The real question is what is the maximum amount of time we can wait for the HW before we should trigger a timeout?
That's a question only the drivers can answer, which is why I like to think that setting global constants constraining all parties is not the right thing to do.
Exactly, that's the reason why I bring this up. I think that letting drivers be in charge of timeouts is the wrong approach.
See, the reason why we have the timeout (and documented that it is a must-have) is that both core memory management as well as desktop responsiveness depend on it.
What is even your motivation? What problem does this solve? Is the OOM killer currently hanging for anyone? Can you link a bug report?
I'm not sure if we have an external bug report (we have an internal one), but for amdgpu there were customer complaints that 10 seconds is too long.
So we changed it to 2 seconds for amdgpu, and now there are complaints from internal AMD teams that 2 seconds is too short.
While working on that I realized that the timeout is actually not driver dependent at all.
What can maybe be argued is that a desktop system should have a shorter timeout than some server, but that one driver needs a different timeout than another doesn't really make sense to me.
I mean, what is actually HW dependent about the requirement that I need a responsive desktop system?
Some AMD internal team is pushing for 10 seconds, but that also means that, for example, we wait 10 seconds for the OOM killer to do something. That sounds way too long.
Nouveau has timeout = 10 seconds. AFAIK we've never seen bugs because of that. Have you seen some?
Thanks for that info. And to answer the question, yes certainly.
Regards, Christian.
P.
On Tue, 2025-11-25 at 09:48 +0100, Christian König wrote:
On 11/25/25 09:13, Philipp Stanner wrote:
On Tue, 2025-11-25 at 09:03 +0100, Christian König wrote:
On 11/25/25 08:55, Philipp Stanner wrote:
[…]
HZ can change depending on the config. Is that really a good choice? I could see racy situations arising in some configs vs others
2*HZ is always two seconds expressed in number of jiffies, I can use msecs_to_jiffies(2000) to make that more obvious.
On AMD64 maybe. What about the other architectures?
HZ is defined as jiffies per second. So even if it changes to 10, 100 or 1000 depending on the architecture, 2*HZ is always two seconds expressed in jiffies.
The HZ define is actually there to make it architecture independent.
<german English> Again what learned </german English>
Although the amount of documentation for such a central feature is a bit thin. Anyways. msecs_to_jiffies() is more readable, yes. Many drivers prefer it, too
The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY which is currently just 1 second.
The real question is what is the maximum amount of time we can wait for the HW before we should trigger a timeout?
That's a question only the drivers can answer, which is why I like to think that setting global constants constraining all parties is not the right thing to do.
Exactly, that's the reason why I bring this up. I think that letting drivers be in charge of timeouts is the wrong approach.
See, the reason why we have the timeout (and documented that it is a must-have) is that both core memory management as well as desktop responsiveness depend on it.
Good and well, but then patch 4 becomes even more problematic:
So we'd just have drivers fire warnings, and then they would still have the freedom to set timeouts for drm/sched, as long as those timeouts are smaller than your new global constant.
Why then not remove drm/sched's timeout parameter API completely and always use your maximum value internally in drm/sched? Or maybe truncate it with a warning?
"Maximum timeout parameter exceeded, truncating to %ld.\n"
I suppose some drivers want even higher responsiveness than those 2 seconds.
I do believe that more of the driver folks should be made aware of this intended change.
What is even your motivation? What problem does this solve? Is the OOM killer currently hanging for anyone? Can you link a bug report?
I'm not sure if we have an external bug report (we have an internal one), but for amdgpu there were customer complaints that 10 seconds is too long.
So we changed it to 2 seconds for amdgpu, and now there are complaints from internal AMD teams that 2 seconds is too short.
While working on that I realized that the timeout is actually not driver dependent at all.
What can maybe be argued is that a desktop system should have a shorter timeout than some server, but that one driver needs a different timeout than another doesn't really make sense to me.
I mean, what is actually HW dependent about the requirement that I need a responsive desktop system?
I suppose some drivers are indeed only used for server hardware. And for compute you might not care about responsiveness as long as your result drops off at some point. But there's cloud gaming, too.
I agree that distinguishing the use case that way is not ideal. However, who has the knowledge of how the hardware is being used by customers / users, if not the driver?
P.
On 11/25/25 11:56, Philipp Stanner wrote:
The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY which is currently just 1 second.
The real question is what is the maximum amount of time we can wait for the HW before we should trigger a timeout?
That's a question only the drivers can answer, which is why I like to think that setting global constants constraining all parties is not the right thing to do.
Exactly, that's the reason why I bring this up. I think that letting drivers be in charge of timeouts is the wrong approach.
See, the reason why we have the timeout (and documented that it is a must-have) is that both core memory management as well as desktop responsiveness depend on it.
Good and well, but then patch 4 becomes even more problematic:
So we'd just have drivers fire warnings, and then they would still have the freedom to set timeouts for drm/sched, as long as those timeouts are smaller than your new global constant.
Why then not remove drm/sched's timeout parameter API completely and always use your maximum value internally in drm/sched? Or maybe truncate it with a warning?
I have considered that as well, but then thought that we should at least give end users the possibility to override the timeout while still tainting the kernel so that we know about this in bug reports, core dumps etc...
"Maximum timeout parameter exceeded, truncating to %ld.\n"
I suppose some drivers want even higher responsiveness than those 2 seconds.
As far as I know some medical use cases for example have timeouts like 100-200ms. But again that is the use case and not the driver.
I do believe that more of the driver folks should be made aware of this intended change.
I have no real intention of actually pushing those patches, at least not as they are. I just wanted to kick off some discussion.
What is even your motivation? What problem does this solve? Is the OOM killer currently hanging for anyone? Can you link a bug report?
I'm not sure if we have an external bug report (we have an internal one), but for amdgpu there were customer complaints that 10 seconds is too long.
So we changed it to 2 seconds for amdgpu, and now there are complaints from internal AMD teams that 2 seconds is too short.
While working on that I realized that the timeout is actually not driver dependent at all.
What can maybe be argued is that a desktop system should have a shorter timeout than some server, but that one driver needs a different timeout than another doesn't really make sense to me.
I mean, what is actually HW dependent about the requirement that I need a responsive desktop system?
I suppose some drivers are indeed only used for server hardware. And for compute you might not care about responsiveness as long as your result drops off at some point. But there's cloud gaming, too..
Good point with the cloud gaming.
I agree that distinguishing the use case that way is not ideal. However, who has the knowledge of how the hardware is being used by customers / users, if not the driver?
Well the end user.
Maybe we should move the whole timeout topic into the DRM layer or the scheduler component.
Something like a 2 second default (which BTW is the default on Windows as well), which can be overridden on a global, per-device or per-queue-name basis.
And a 10 second maximum with only a warning that a non-default timeout is used; everything above 10 seconds taints the kernel and should really only be used for testing/debugging.
Thoughts?
Regards, Christian.
P.
+Cc Michel
On Tue, 2025-11-25 at 15:26 +0100, Christian König wrote:
On 11/25/25 11:56, Philipp Stanner wrote:
The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY which is currently just 1 second.
The real question is what is the maximum amount of time we can wait for the HW before we should trigger a timeout?
That's a question only the drivers can answer, which is why I like to think that setting global constants constraining all parties is not the right thing to do.
Exactly, that's the reason why I bring this up. I think that letting drivers be in charge of timeouts is the wrong approach.
See, the reason why we have the timeout (and documented that it is a must-have) is that both core memory management as well as desktop responsiveness depend on it.
Good and well, but then patch 4 becomes even more problematic:
So we'd just have drivers fire warnings, and then they would still have the freedom to set timeouts for drm/sched, as long as those timeouts are smaller than your new global constant.
Why then not remove drm/sched's timeout parameter API completely and always use your maximum value internally in drm/sched? Or maybe truncate it with a warning?
I have considered that as well, but then thought that we should at least give end users the possibility to override the timeout while still tainting the kernel so that we know about this in bug reports, core dumps etc...
"Maximum timeout parameter exceeded, truncating to %ld.\n"
I suppose some drivers want even higher responsiveness than those 2 seconds.
As far as I know some medical use cases for example have timeouts like 100-200ms. But again that is the use case and not the driver.
I do believe that more of the driver folks should be made aware of this intended change.
I have no real intention of actually pushing those patches, at least not as they are. I just wanted to kick off some discussion.
Can you then please use --rfc when creating such patches in the future? That way you won't cause my heart rate to increase, searching for immediate danger :D
What is even your motivation? What problem does this solve? Is the OOM killer currently hanging for anyone? Can you link a bug report?
I'm not sure if we have an external bug report (we have an internal one), but for amdgpu there were customer complaints that 10 seconds is too long.
So we changed it to 2 seconds for amdgpu, and now there are complaints from internal AMD teams that 2 seconds is too short.
While working on that I realized that the timeout is actually not driver dependent at all.
What can maybe be argued is that a desktop system should have a shorter timeout than some server, but that one driver needs a different timeout than another doesn't really make sense to me.
I mean, what is actually HW dependent about the requirement that I need a responsive desktop system?
I suppose some drivers are indeed only used for server hardware. And for compute you might not care about responsiveness as long as your result drops off at some point. But there's cloud gaming, too..
Good point with the cloud gaming.
I agree that distinguishing the use case that way is not ideal. However, who has the knowledge of how the hardware is being used by customers / users, if not the driver?
Well the end user.
Maybe we should move the whole timeout topic into the DRM layer or the scheduler component.
Who's the "user"? The entire system? One process sitting on top of its ioctl and file descriptor?
That question plays into answering how and where timeouts should be configured.
One might ask whether a kernel parameter would then be the right way to configure it. I'm not very experienced with the desires of userspace.
I summon Michel Dänzer to share his wisdom!
Something like 2 seconds default (which BTW is the default on Windows as well), which can be overridden on a global, per device, per queue name basis.
I mean, the drivers can already set it per device. It seems to me that what you actually want is finer control?
For Nouveau with its firmware scheduler, having a timeout at all just doesn't make much sense anyways.
* If a fw ring hangs, it hangs, and a shorter timeout will just have your app crash sooner.
* If it's laggy and slow, it's laggy and slow, but with a high timeout at least still usable.
* And if it's compute and slow, you at least get your results at some point.
But having a lower timeout wouldn't really repair anything, or am I mistaken?
And a 10 second maximum with only a warning that a non-default timeout is used; everything above 10 seconds taints the kernel and should really only be used for testing/debugging.
Thoughts?
The most important thing for me regarding your RFC is that we don't add shiny warnings by declaring driver behavior invalid that was operational for years.
The most conservative way would be to send patches to the respective drivers, setting their timeouts to the new desired defaults, and then adding warnings so that future drivers become aware.
P.
Am Dienstag, dem 25.11.2025 um 15:26 +0100 schrieb Christian König:
On 11/25/25 11:56, Philipp Stanner wrote:
The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY which is currently just 1 second.
The real question is what is the maximum amount of time we can wait for the HW before we should trigger a timeout?
That's a question only the drivers can answer, which is why I like to think that setting global constants constraining all parties is not the right thing to do.
Exactly, that's the reason why I bring this up. I think that letting drivers be in charge of timeouts is the wrong approach.
See, the reason why we have the timeout (and documented that it is a must-have) is that both core memory management as well as desktop responsiveness depend on it.
Good and well, but then patch 4 becomes even more problematic:
So we'd just have drivers fire warnings, and then they would still have the freedom to set timeouts for drm/sched, as long as those timeouts are smaller than your new global constant.
Why then not remove drm/sched's timeout parameter API completely and always use your maximum value internally in drm/sched? Or maybe truncate it with a warning?
I have considered that as well, but then thought that we should at least give end users the possibility to override the timeout while still tainting the kernel so that we know about this in bug reports, core dumps etc...
"Maximum timeout parameter exceeded, truncating to %ld.\n"
I suppose some drivers want even higher responsiveness than those 2 seconds.
As far as I know some medical use cases for example have timeouts like 100-200ms. But again that is the use case and not the driver.
I do believe that more of the driver folks should be made aware of this intended change.
I have no real intention of actually pushing those patches, at least not as they are. I just wanted to kick off some discussion.
What is even your motivation? What problem does this solve? Is the OOM killer currently hanging for anyone? Can you link a bug report?
I'm not sure if we have an external bug report (we have an internal one), but for amdgpu there were customer complaints that 10 seconds is too long.
So we changed it to 2 seconds for amdgpu, and now there are complaints from internal AMD teams that 2 seconds is too short.
While working on that I realized that the timeout is actually not driver dependent at all.
What can maybe be argued is that a desktop system should have a shorter timeout than some server, but that one driver needs a different timeout than another doesn't really make sense to me.
I mean, what is actually HW dependent about the requirement that I need a responsive desktop system?
I suppose some drivers are indeed only used for server hardware. And for compute you might not care about responsiveness as long as your result drops off at some point. But there's cloud gaming, too..
Good point with the cloud gaming.
I agree that distinguishing the use case that way is not ideal. However, who has the knowledge of how the hardware is being used by customers / users, if not the driver?
Well the end user.
Maybe we should move the whole timeout topic into the DRM layer or the scheduler component.
Something like 2 seconds default (which BTW is the default on Windows as well), which can be overridden on a global, per device, per queue name basis.
And a 10 second maximum with only a warning that a non-default timeout is used; everything above 10 seconds taints the kernel and should really only be used for testing/debugging.
The question really is what you want to do after you hit the (lowered) timeout? Users get grumpy if you block things for 10 seconds, but they get equally if not more grumpy when you kick out a valid workload that just happens to need a lot of GPU time.
Fences are only defined to signal eventually, with no real concept of a timeout. IMO all timeouts waiting for fences should be long enough to only be considered last resort. You may want to give the user some indication of a failed fence wait instead of stalling indefinitely, but you really only want to do this after a quite long timeout, not in a sense of "Sorry, I ran out of patience after 2 seconds".
Sure memory management depends on fences making forward progress, but mm also depends on scheduled writeback making forward progress. You don't kick out writeback requests after an arbitrary timeout just because the backing storage happens to be loaded heavily.
This BTW is also why etnaviv has always had a quite short timeout of 500ms, with the option to extend the timeout when the GPU is still making progress. We don't ever want to shoot down valid workloads (we have some that need a few seconds to upload textures, etc on our wimpy GPU), but you also don't want to wait multiple seconds until you detect a real GPU hang. So we use the short scheduler timeout to check in on the GPU and see if it is still making progress (for graphics workloads by looking at the frontend position within the command buffer and current primitive ID). If we can deduce that the GPU is stuck we do the usual reset/recovery dance within a reasonable reaction time, acceptable to users hitting a real GPU hang. But if the GPU is making progress we will give an infinite number of timeout extensions with no global timeout at all, only fulfilling the eventual signaling guarantee of the fence.
Regards, Lucas
On 11/25/25 18:02, Lucas Stach wrote:
I agree that distinguishing the use case that way is not ideal. However, who has the knowledge of how the hardware is being used by customers / users, if not the driver?
Well the end user.
Maybe we should move the whole timeout topic into the DRM layer or the scheduler component.
Something like 2 seconds default (which BTW is the default on Windows as well), which can be overridden on a global, per device, per queue name basis.
And a 10 second maximum with only a warning that a non-default timeout is used; everything above 10 seconds taints the kernel and should really only be used for testing/debugging.
The question really is what you want to do after you hit the (lowered) timeout? Users get grumpy if you block things for 10 seconds, but they get equally if not more grumpy when you kick out a valid workload that just happens to need a lot of GPU time.
Yeah, exactly that summarizes the problem pretty well.
Fences are only defined to signal eventually, with no real concept of a timeout. IMO all timeouts waiting for fences should be long enough to only be considered last resort. You may want to give the user some indication of a failed fence wait instead of stalling indefinitely, but you really only want to do this after a quite long timeout, not in a sense of "Sorry, I ran out of patience after 2 seconds".
Sure memory management depends on fences making forward progress, but mm also depends on scheduled writeback making forward progress. You don't kick out writeback requests after an arbitrary timeout just because the backing storage happens to be loaded heavily.
This BTW is also why etnaviv has always had a quite short timeout of 500ms, with the option to extend the timeout when the GPU is still making progress. We don't ever want to shoot down valid workloads (we have some that need a few seconds to upload textures, etc on our wimpy GPU), but you also don't want to wait multiple seconds until you detect a real GPU hang.
That is a really good point. We considered that as well, but then abandoned the idea, see below for the background.
What we could also do is setting a flag on the fence when a process is killed and then waiting for that fence to signal so that it can clean up. Going to prototype that.
So we use the short scheduler timeout to check in on the GPU and see if it is still making progress (for graphics workloads by looking at the frontend position within the command buffer and current primitive ID). If we can deduce that the GPU is stuck we do the usual reset/recovery dance within a reasonable reaction time, acceptable to users hitting a real GPU hang. But if the GPU is making progress we will give an infinite number of timeout extensions with no global timeout at all, only fulfilling the eventual signaling guarantee of the fence.
Well, the question is how do you detect *reliably* that there is still forward progress?
I mean, with the DMA engines we can trivially submit work which copies petabytes and needs hours or even a day to complete.
Without a global timeout that is a really nice denial-of-service attack against the system if you don't catch it.
Thanks, Christian.
Regards, Lucas
On Wed, 2025-11-26 at 13:31 +0100, Christian König wrote:
On 11/25/25 18:02, Lucas Stach wrote:
I agree that distinguishing the use case that way is not ideal. However, who has the knowledge of how the hardware is being used by customers / users, if not the driver?
Well the end user.
Maybe we should move the whole timeout topic into the DRM layer or the scheduler component.
Something like 2 seconds default (which BTW is the default on Windows as well), which can be overridden on a global, per device, per queue name basis.
And 10 seconds maximum with only a warning that a not default timeout is used and everything above 10 seconds taints the kernel and should really only be used for testing/debugging.
The question really is what you want to do after you hit the (lowered) timeout? Users get grumpy if you block things for 10 seconds, but they get equally grumpy, if not more so, when you kick out a valid workload that just happens to need a lot of GPU time.
Yeah, exactly that summarizes the problem pretty well.
Fences are only defined to signal eventually, with no real concept of a timeout. IMO all timeouts waiting for fences should be long enough to only be considered last resort. You may want to give the user some indication of a failed fence wait instead of stalling indefinitely, but you really only want to do this after a quite long timeout, not in a sense of "Sorry, I ran out of patience after 2 seconds".
Sure memory management depends on fences making forward progress, but mm also depends on scheduled writeback making forward progress. You don't kick out writeback requests after an arbitrary timeout just because the backing storage happens to be loaded heavily.
This BTW is also why etnaviv has always had a quite short timeout of 500ms, with the option to extend the timeout when the GPU is still making progress. We don't ever want to shoot down valid workloads (we have some that need a few seconds to upload textures, etc on our wimpy GPU), but you also don't want to wait multiple seconds until you detect a real GPU hang.
That is a really good point. We considered that as well, but then abandoned the idea, see below for the background.
What we could also do is setting a flag on the fence when a process is killed and then waiting for that fence to signal so that it can clean up. Going to prototype that.
So we use the short scheduler timeout to check in on the GPU and see if it is still making progress (for graphics workloads by looking at the frontend position within the command buffer and current primitive ID). If we can deduce that the GPU is stuck we do the usual reset/recovery dance within a reasonable reaction time, acceptable to users hitting a real GPU hang. But if the GPU is making progress we will give an infinite number of timeout extensions with no global timeout at all, only fulfilling the eventual signaling guarantee of the fence.
Well the question is how do you *reliably* detect that there is still forward progress?
My understanding is that that's impossible since the internals of command submissions are only really understood by userspace, who submits them.
I think the long-term solution can only be fully fledged GPU scheduling with preemption. That's why we don't need such a timeout mechanism for userspace processes: the scheduler simply interrupts and lets someone else run.
My hope would be that in the mid-term future we'd get firmware rings that can be preempted through a firmware call for all major hardware. Then a huge share of our problems would disappear.
With the current situation, IDK either. My impression so far is that letting the drivers and driver programmers decide is the least bad choice.
P.
On 11/26/25 13:37, Philipp Stanner wrote:
On Wed, 2025-11-26 at 13:31 +0100, Christian König wrote:
On 11/25/25 18:02, Lucas Stach wrote:
I agree that distinguishing the use case that way is not ideal. However, who has the knowledge of how the hardware is being used by customers / users, if not the driver?
Well the end user.
Maybe we should move the whole timeout topic into the DRM layer or the scheduler component.
Something like 2 seconds default (which BTW is the default on Windows as well), which can be overridden on a global, per device, per queue name basis.
And 10 seconds maximum with only a warning that a not default timeout is used and everything above 10 seconds taints the kernel and should really only be used for testing/debugging.
The question really is what you want to do after you hit the (lowered) timeout? Users get grumpy if you block things for 10 seconds, but they get equally grumpy, if not more so, when you kick out a valid workload that just happens to need a lot of GPU time.
Yeah, exactly that summarizes the problem pretty well.
Fences are only defined to signal eventually, with no real concept of a timeout. IMO all timeouts waiting for fences should be long enough to only be considered last resort. You may want to give the user some indication of a failed fence wait instead of stalling indefinitely, but you really only want to do this after a quite long timeout, not in a sense of "Sorry, I ran out of patience after 2 seconds".
Sure memory management depends on fences making forward progress, but mm also depends on scheduled writeback making forward progress. You don't kick out writeback requests after an arbitrary timeout just because the backing storage happens to be loaded heavily.
This BTW is also why etnaviv has always had a quite short timeout of 500ms, with the option to extend the timeout when the GPU is still making progress. We don't ever want to shoot down valid workloads (we have some that need a few seconds to upload textures, etc on our wimpy GPU), but you also don't want to wait multiple seconds until you detect a real GPU hang.
That is a really good point. We considered that as well, but then abandoned the idea, see below for the background.
What we could also do is setting a flag on the fence when a process is killed and then waiting for that fence to signal so that it can clean up. Going to prototype that.
So we use the short scheduler timeout to check in on the GPU and see if it is still making progress (for graphics workloads by looking at the frontend position within the command buffer and current primitive ID). If we can deduce that the GPU is stuck we do the usual reset/recovery dance within a reasonable reaction time, acceptable to users hitting a real GPU hang. But if the GPU is making progress we will give an infinite number of timeout extensions with no global timeout at all, only fulfilling the eventual signaling guarantee of the fence.
Well the question is how do you *reliably* detect that there is still forward progress?
My understanding is that that's impossible since the internals of command submissions are only really understood by userspace, who submits them.
Right, but we can still try to do our best in the kernel to mitigate the situation.
I think for now amdgpu will implement something like checking if the HW still makes progress after a timeout but only a limited number of re-tries until we say that's it and reset anyway.
I think the long-term solution can only be fully fledged GPU scheduling with preemption. That's why we don't need such a timeout mechanism for userspace processes: the scheduler simply interrupts and lets someone else run.
Yeah absolutely.
My hope would be that in the mid-term future we'd get firmware rings that can be preempted through a firmware call for all major hardware. Then a huge share of our problems would disappear.
At least on AMD HW pre-emption is actually horribly unreliable as well.
Userspace basically needs to co-operate and provide a buffer where the state on a pre-emption is saved into.
With the current situation, IDK either. My impression so far is that letting the drivers and driver programmers decide is the least bad choice.
Yeah, agree. It's the least evil thing we can do.
But I now have a plan how to proceed :)
Thanks for the input, Christian.
P.
On Wed, 2025-11-26 at 16:03 +0100, Christian König wrote:
On 11/26/25 13:37, Philipp Stanner wrote:
On Wed, 2025-11-26 at 13:31 +0100, Christian König wrote:
[…]
Well the question is how do you *reliably* detect that there is still forward progress?
My understanding is that that's impossible since the internals of command submissions are only really understood by userspace, who submits them.
Right, but we can still try to do our best in the kernel to mitigate the situation.
I think for now amdgpu will implement something like checking if the HW still makes progress after a timeout but only a limited number of re-tries until we say that's it and reset anyway.
Oh oh, isn't that our dear hang_limit? :)
We agree that you can never really know whether userspace just submitted a while(true) job, don't we? Even if some GPU register still indicates "progress".
I think the long-term solution can only be fully fledged GPU scheduling with preemption. That's why we don't need such a timeout mechanism for userspace processes: the scheduler simply interrupts and lets someone else run.
Yeah absolutely.
My hope would be that in the mid-term future we'd get firmware rings that can be preempted through a firmware call for all major hardware. Then a huge share of our problems would disappear.
At least on AMD HW pre-emption is actually horribly unreliable as well.
Do you mean new GPUs with firmware scheduling, or what is "HW pre-emption"?
With firmware interfaces, my hope would be that you could simply tell
stop_running_ring(nr_of_ring)
// time slice for someone else
start_running_ring(nr_of_ring)
Thereby getting real scheduling and all that. And eliminating many other problems we know well from drm/sched.
Userspace basically needs to co-operate and provide a buffer where the state on a pre-emption is saved into.
That's uncool. With CPU preemption all that is done automatically via the process's pages.
P.
On Wednesday, 2025-11-26 at 16:44 +0100, Philipp Stanner wrote:
On Wed, 2025-11-26 at 16:03 +0100, Christian König wrote:
On 11/26/25 13:37, Philipp Stanner wrote:
On Wed, 2025-11-26 at 13:31 +0100, Christian König wrote:
[…]
Well the question is how do you *reliably* detect that there is still forward progress?
My understanding is that that's impossible since the internals of command submissions are only really understood by userspace, who submits them.
Right, but we can still try to do our best in the kernel to mitigate the situation.
I think for now amdgpu will implement something like checking if the HW still makes progress after a timeout but only a limited number of re-tries until we say that's it and reset anyway.
Oh oh, isn't that our dear hang_limit? :)
Not really. The hang limit is the limit on how many times a hanging submit might be retried.
Limiting the number of timeout extensions is more of a safety net against workloads which might appear to make progress to the kernel driver but in reality are stuck. After all, the kernel driver can only have limited knowledge of the GPU state and any progress check will have limited precision, with false positives/negatives being a part of reality we have to deal with.
We agree that you can never really know whether userspace just submitted a while(true) job, don't we? Even if some GPU register still indicates "progress".
Yea, this is really hardware dependent on what you can read at runtime.
For etnaviv we define "progress" as the command frontend moving towards the end of the command buffer. As a single draw call in valid workloads can blow through our timeout we also use debug registers to look at the current primitive ID within a draw call. If userspace submits a workload that requires more than 500ms per primitive to finish we consider this an invalid workload and go through the reset/recovery motions.
Regards, Lucas
On Wednesday, 2025-11-26 at 16:44 +0100, Philipp Stanner wrote:
On Wed, 2025-11-26 at 16:03 +0100, Christian König wrote:
[...]
My hope would be that in the mid-term future we'd get firmware rings that can be preempted through a firmware call for all major hardware. Then a huge share of our problems would disappear.
At least on AMD HW pre-emption is actually horribly unreliable as well.
Do you mean new GPUs with firmware scheduling, or what is "HW pre-emption"?
With firmware interfaces, my hope would be that you could simply tell
stop_running_ring(nr_of_ring)
// time slice for someone else
start_running_ring(nr_of_ring)
Thereby getting real scheduling and all that. And eliminating many other problems we know well from drm/sched.
It doesn't really matter if you have firmware scheduling or not for preemption to be a hard problem on GPUs. CPUs have limited software visible state that needs to be saved/restored on a context switch and even there people start complaining now that they need to context switch the AVX512 register set.
GPUs have megabytes of software visible state. Which needs to be saved/restored on the context switch if you want fine grained preemption with low preemption latency. There might be points in the command execution where you can ignore most of that state, but reaching those points can have basically unbounded latency. So either you can reliably save/restore lots of state or you are limited to very coarse grained preemption with all the usual issues of timeouts and DoS vectors. I'm not totally up to speed with the current state across all relevant GPUs, but until recently NVidia was the only vendor to have real reliable fine-grained preemption.
Regards, Lucas
On 11/26/25 17:11, Lucas Stach wrote:
On Wednesday, 2025-11-26 at 16:44 +0100, Philipp Stanner wrote:
On Wed, 2025-11-26 at 16:03 +0100, Christian König wrote:
[...]
My hope would be that in the mid-term future we'd get firmware rings that can be preempted through a firmware call for all major hardware. Then a huge share of our problems would disappear.
At least on AMD HW pre-emption is actually horribly unreliable as well.
Do you mean new GPUs with firmware scheduling, or what is "HW pre-emption"?
With firmware interfaces, my hope would be that you could simply tell
stop_running_ring(nr_of_ring)
// time slice for someone else
start_running_ring(nr_of_ring)
Thereby getting real scheduling and all that. And eliminating many other problems we know well from drm/sched.
It doesn't really matter if you have firmware scheduling or not for preemption to be a hard problem on GPUs. CPUs have limited software visible state that needs to be saved/restored on a context switch and even there people start complaining now that they need to context switch the AVX512 register set.
Yeah, that has been discussed for the last 20 years or so when the first MMX extension came out.
GPUs have megabytes of software visible state. Which needs to be saved/restored on the context switch if you want fine grained preemption with low preemption latency. There might be points in the command execution where you can ignore most of that state, but reaching those points can have basically unbounded latency. So either you can reliably save/restore lots of state or you are limited to very coarse grained preemption with all the usual issues of timeouts and DoS vectors. I'm not totally up to speed with the current state across all relevant GPUs, but until recently NVidia was the only vendor to have real reliable fine-grained preemption.
Completely agree. You won't believe how often that is a topic in discussions.
AMD has Compute Wave Save Restore now on newer HW, but both the reliability and performance are unfortunately questionable at best.
Regards, Christian.
Regards, Lucas
The SW-sync functionality should only be used for testing and debugging since it is inherently unsafe.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/sw_sync.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
index 3c20f1d31cf5..6f09d13be6b6 100644
--- a/drivers/dma-buf/sw_sync.c
+++ b/drivers/dma-buf/sw_sync.c
@@ -8,6 +8,7 @@
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/uaccess.h>
+#include <linux/panic.h>
 #include <linux/slab.h>
 #include <linux/sync_file.h>
@@ -349,6 +350,9 @@ static long sw_sync_ioctl_create_fence(struct sync_timeline *obj,
 	struct sync_file *sync_file;
 	struct sw_sync_create_fence_data data;
+	/* SW sync fences are inherently unsafe and can deadlock the kernel */
+	add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
+
 	if (fd < 0)
 		return fd;
Hi Christian,
On Thu, 20 Nov 2025 at 20:30, Christian König ckoenig.leichtzumerken@gmail.com wrote:
The SW-sync functionality should only be used for testing and debugging since it is inherently unsafe.
Thank you for this patch, LGTM.
Please feel free to add: Acked-by: Sumit Semwal sumit.semwal@linaro.org
Best, Sumit.
Signed-off-by: Christian König christian.koenig@amd.com
 drivers/dma-buf/sw_sync.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
index 3c20f1d31cf5..6f09d13be6b6 100644
--- a/drivers/dma-buf/sw_sync.c
+++ b/drivers/dma-buf/sw_sync.c
@@ -8,6 +8,7 @@
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/uaccess.h>
+#include <linux/panic.h>
 #include <linux/slab.h>
 #include <linux/sync_file.h>
@@ -349,6 +350,9 @@ static long sw_sync_ioctl_create_fence(struct sync_timeline *obj,
 	struct sync_file *sync_file;
 	struct sw_sync_create_fence_data data;
+	/* SW sync fences are inherently unsafe and can deadlock the kernel */
+	add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
+
 	if (fd < 0)
 		return fd;
--
2.43.0
Instead of 10 seconds just use the reasonable maximum timeout defined by the dma_fence framework.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/vgem/vgem_fence.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/vgem/vgem_fence.c b/drivers/gpu/drm/vgem/vgem_fence.c
index 07db319c3d7f..1ca14b83479d 100644
--- a/drivers/gpu/drm/vgem/vgem_fence.c
+++ b/drivers/gpu/drm/vgem/vgem_fence.c
@@ -27,8 +27,6 @@
#include "vgem_drv.h"
-#define VGEM_FENCE_TIMEOUT (10*HZ)
-
 struct vgem_fence {
 	struct dma_fence base;
 	struct spinlock lock;
@@ -81,8 +79,11 @@ static struct dma_fence *vgem_fence_create(struct vgem_file *vfile,
timer_setup(&fence->timer, vgem_fence_timeout, TIMER_IRQSAFE);
-	/* We force the fence to expire within 10s to prevent driver hangs */
-	mod_timer(&fence->timer, jiffies + VGEM_FENCE_TIMEOUT);
+	/*
+	 * Force the fence to expire within a reasonable timeout to prevent
+	 * hangs inside the memory management.
+	 */
+	mod_timer(&fence->timer, jiffies + DMA_FENCE_MAX_REASONABLE_TIMEOUT);
 	return &fence->base;
 }
On Thu, 2025-11-20 at 15:41 +0100, Christian König wrote:
Instead of 10 seconds just use the reasonable maximum timeout defined by
It's not 10 "seconds", it's 10 "HZ"
P.
the dma_fence framework.
Signed-off-by: Christian König christian.koenig@amd.com
 drivers/gpu/drm/vgem/vgem_fence.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/vgem/vgem_fence.c b/drivers/gpu/drm/vgem/vgem_fence.c
index 07db319c3d7f..1ca14b83479d 100644
--- a/drivers/gpu/drm/vgem/vgem_fence.c
+++ b/drivers/gpu/drm/vgem/vgem_fence.c
@@ -27,8 +27,6 @@

 #include "vgem_drv.h"

-#define VGEM_FENCE_TIMEOUT (10*HZ)
Exceeding the recommended maximum timeout should be noted in logs and crash dumps.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 1d4f1b822e7b..88e24e140def 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1318,12 +1318,22 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 	sched->ops = args->ops;
 	sched->credit_limit = args->credit_limit;
 	sched->name = args->name;
-	sched->timeout = args->timeout;
 	sched->hang_limit = args->hang_limit;
 	sched->timeout_wq = args->timeout_wq ? args->timeout_wq : system_percpu_wq;
 	sched->score = args->score ? args->score : &sched->_score;
 	sched->dev = args->dev;
+	sched->timeout = args->timeout;
+	if (sched->timeout > DMA_FENCE_MAX_REASONABLE_TIMEOUT) {
+		dev_warn(sched->dev, "Timeout %ld exceeds the maximum recommended one!\n",
+			 sched->timeout);
+		/*
+		 * Make sure that exceeding the recommendation is noted in
+		 * logs and crash dumps.
+		 */
+		add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
+	}
+
 	if (args->num_rqs > DRM_SCHED_PRIORITY_COUNT) {
 		/* This is a gross violation--tell drivers what the problem is. */
+Cc Lyude, Danilo
On Thu, 2025-11-20 at 15:41 +0100, Christian König wrote:
Exceeding the recommended maximum timeout should be noted in logs and crash dumps.
Signed-off-by: Christian König christian.koenig@amd.com
 drivers/gpu/drm/scheduler/sched_main.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 1d4f1b822e7b..88e24e140def 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1318,12 +1318,22 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 	sched->ops = args->ops;
 	sched->credit_limit = args->credit_limit;
 	sched->name = args->name;
- sched->timeout = args->timeout;
 	sched->hang_limit = args->hang_limit;
 	sched->timeout_wq = args->timeout_wq ? args->timeout_wq : system_percpu_wq;
 	sched->score = args->score ? args->score : &sched->_score;
 	sched->dev = args->dev;
+	sched->timeout = args->timeout;
+	if (sched->timeout > DMA_FENCE_MAX_REASONABLE_TIMEOUT) {
+		dev_warn(sched->dev, "Timeout %ld exceeds the maximum recommended one!\n",
+			 sched->timeout);
+		/*
+		 * Make sure that exceeding the recommendation is noted in
+		 * logs and crash dumps.
+		 */
+		add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
+	}
I have to NACK this in its current form; it would cause a bunch of drivers to fire warnings despite there being absolutely nothing wrong with them in the past:
https://elixir.bootlin.com/linux/v6.18-rc6/source/drivers/gpu/drm/nouveau/no... https://elixir.bootlin.com/linux/v6.18-rc6/source/drivers/gpu/drm/lima/lima_...
I guess there are more.
Nouveau's current timeout is an astonishing 10 seconds, and AFAIK there has never been a problem with that. If you want to declare this behavior invalid, you need to discuss that with the Nouveau maintainers first.
It also isn't clear to me why dma_fence should be the one to define a timeout rule. I like to think that "must be signalled within a reasonable time" is as precise as it gets. As demonstrated by the drivers, there is just no objectively correct definition of "reasonable".
BTW your series doesn't make clear to me why you only touch very few components: there are many more users of dma_fence than just vgem and sched. What about the others?
P.