On Tue Sep 23, 2025 at 2:33 PM CEST, Christian König wrote:
On 23.09.25 14:08, Philipp Stanner wrote:
You know folks, situations like that are why we want to strongly discourage accessing another API's struct members directly. There is no API contract for them.
Indeed, please don't peek into API internals. If you need additional functionality, please send a patch adding a supported API for the component instead.
Drivers messing with component internals makes it impossible to maintain them in the long term.
And a proper API function rarely changes its interface; if it does, it's easy for the contributor to find the places where drivers need to be adjusted. If we were all following that rule, you wouldn't even have to bother with patches #1 and #2.
That said, I see two proper solutions for your problem:
A. amdgpu is the one stopping the entities anyways, isn't it? It knows which entities it has killed. So that information could be stored in struct amdgpu_vm.
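As a minimal sketch (the flag is invented; amdgpu would set it wherever it kills the entities, instead of peeking at drm_sched_entity.stopped):

	struct amdgpu_vm {
		/* ... existing members ... */

		/* Hypothetical: set by amdgpu when it kills the VM's entities. */
		bool entities_stopped;
	};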
No, it's the scheduler which decides when entities are stopped.
Can you please show me the code where the scheduler calls any of drm_sched_entity_fini(), drm_sched_entity_flush() or drm_sched_entity_destroy()?
Or are you referring to the broken hack in drm_sched_fini() (introduced by commit c61cdbdbffc1 ("drm/scheduler: Fix hang when sched_entity released")), where it is just ignored that we would need to take the entity lock as well, because that would inconveniently lead to lock inversion?
	spin_lock(&rq->lock);
	list_for_each_entry(s_entity, &rq->entities, list)
		/*
		 * Prevents reinsertion and marks job_queue as idle,
		 * it will be removed from the rq in drm_sched_entity_fini()
		 * eventually
		 */
		s_entity->stopped = true;
	spin_unlock(&rq->lock);
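To spell out the inversion: as far as I can see, the entity code everywhere else takes the entity lock first and the rq lock second (e.g. when an entity selects a new runqueue). Doing the above properly would require the opposite order. A sketch (the lock names are from the current code, the rest is illustrative):

	/* Established order elsewhere: entity lock outer, rq lock inner. */
	spin_lock(&entity->lock);
	spin_lock(&entity->rq->lock);
	/* ... */
	spin_unlock(&entity->rq->lock);
	spin_unlock(&entity->lock);

	/*
	 * What drm_sched_fini() would have to do to set ->stopped safely:
	 * rq lock outer, entity lock inner. That is the reverse order,
	 * i.e. a classic ABBA deadlock candidate.
	 */
	spin_lock(&rq->lock);
	list_for_each_entry(s_entity, &rq->entities, list) {
		spin_lock(&s_entity->lock);
		s_entity->stopped = true;
		spin_unlock(&s_entity->lock);
	}
	spin_unlock(&rq->lock);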
The description of the patch that introduced the hack says:
If scheduler is already stopped by the time sched_entity is released and entity's job_queue not empty I encountered a hang in drm_sched_entity_flush.
But this sounds to me as if amdgpu simply doesn't implement the correct shutdown ordering. Why don't nouveau, Xe and the other drivers have this problem? Why do we need to solve it in the scheduler instead?
Maybe there are reasonable answers to that. And assuming there are, it still isn't a justification for building on top of a broken workaround. :(
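For reference, the shutdown ordering I have in mind is roughly the following. This is just a sketch: drm_sched_entity_destroy() and drm_sched_fini() are the real API, while the driver structure and function around them are invented for illustration:

	static void mydrv_device_fini(struct mydrv_device *dev)
	{
		/*
		 * Flushes outstanding jobs and removes the entity from its
		 * runqueue, so the scheduler no longer sees it.
		 */
		drm_sched_entity_destroy(&dev->entity);

		/* Only tear down the scheduler once no entity uses it anymore. */
		drm_sched_fini(&dev->sched);
	}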