On Tue Sep 23, 2025 at 2:33 PM CEST, Christian König wrote:
On 23.09.25 14:08, Philipp Stanner wrote:
You know folks, situations like that are why we want to strongly discourage accessing another API's struct members directly. There is no API contract for them.
Indeed, please don't peek into API internals. If you need additional functionality, please send a patch adding a supported API for the component instead.
Drivers messing with component internals makes it impossible to maintain them in the long term.
And a proper API function rarely changes its interface; if it does, it's easy for the contributor to find the places where drivers need to be adjusted. If we were all following that rule, you wouldn't even have to bother with patches #1 and #2.
That said, I see two proper solutions for your problem:
A. amdgpu is the one stopping the entities anyways, isn't it? It knows which entities it has killed. So that information could be stored in struct amdgpu_vm.
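As a minimal sketch (the flag is invented; amdgpu would set it wherever it kills the entities, instead of peeking at drm_sched_entity.stopped):

	struct amdgpu_vm {
		/* ... existing members ... */

		/* Hypothetical: set by amdgpu when it kills the VM's entities. */
		bool entities_stopped;
	};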
No, it's the scheduler which decides when entities are stopped.
Can you please show me the code where the scheduler calls any of drm_sched_entity_fini(), drm_sched_entity_flush() or drm_sched_entity_destroy()?
Or are you referring to the broken hack in drm_sched_fini() (introduced by commit c61cdbdbffc1 ("drm/scheduler: Fix hang when sched_entity released")), where it is just ignored that we would need to take the entity lock as well, because that would inconveniently lead to lock inversion?
	spin_lock(&rq->lock);
	list_for_each_entry(s_entity, &rq->entities, list)
		/*
		 * Prevents reinsertion and marks job_queue as idle,
		 * it will be removed from the rq in drm_sched_entity_fini()
		 * eventually
		 */
		s_entity->stopped = true;
	spin_unlock(&rq->lock);
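To spell out the inversion: as far as I can see, the entity code everywhere else takes the entity lock first and the rq lock second (e.g. when an entity selects a new runqueue). Doing the above properly would require the opposite order. A sketch (the lock names are from the current code, the rest is illustrative):

	/* Established order elsewhere: entity lock outer, rq lock inner. */
	spin_lock(&entity->lock);
	spin_lock(&entity->rq->lock);
	/* ... */
	spin_unlock(&entity->rq->lock);
	spin_unlock(&entity->lock);

	/*
	 * What drm_sched_fini() would have to do to set ->stopped safely:
	 * rq lock outer, entity lock inner. That is the reverse order,
	 * i.e. a classic ABBA deadlock candidate.
	 */
	spin_lock(&rq->lock);
	list_for_each_entry(s_entity, &rq->entities, list) {
		spin_lock(&s_entity->lock);
		s_entity->stopped = true;
		spin_unlock(&s_entity->lock);
	}
	spin_unlock(&rq->lock);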
The description of the patch that introduced the hack says:
If scheduler is already stopped by the time sched_entity is released and entity's job_queue not empty I encountered a hang in drm_sched_entity_flush.
But this sounds to me as if amdgpu simply doesn't implement the correct shutdown ordering. Why don't nouveau, Xe and the other drivers have this problem? Why do we need to solve it in the scheduler instead?
Maybe there are reasonable answers to that. And assuming there are, it still isn't a justification for building on top of a broken workaround. :(
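For reference, the shutdown ordering I have in mind is roughly the following. This is just a sketch: drm_sched_entity_destroy() and drm_sched_fini() are the real API, while the driver structure and function around them are invented for illustration:

	static void mydrv_device_fini(struct mydrv_device *dev)
	{
		/*
		 * Flushes outstanding jobs and removes the entity from its
		 * runqueue, so the scheduler no longer sees it.
		 */
		drm_sched_entity_destroy(&dev->entity);

		/* Only tear down the scheduler once no entity uses it anymore. */
		drm_sched_fini(&dev->sched);
	}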