On 23.09.25 15:10, Philipp Stanner wrote:
On Tue, 2025-09-23 at 14:33 +0200, Christian König wrote:
On 23.09.25 14:08, Philipp Stanner wrote:
On Mon, 2025-09-22 at 22:50 +0200, Jules Maselbas wrote:
On Mon Sep 22, 2025 at 7:39 PM CEST, Christian König wrote:
On 22.09.25 17:30, Philipp Stanner wrote:
On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote: > From: Tvrtko Ursulin tvrtko.ursulin@igalia.com > > commit d42a254633c773921884a19e8a1a0f53a31150c3 upstream. > > In FIFO mode (which is the default), both drm_sched_entity_push_job() and > drm_sched_rq_update_fifo(), where the latter calls the former, are > currently taking and releasing the same entity->rq_lock. > > We can avoid that design inelegance, and also have a miniscule > efficiency improvement on the submit from idle path, by introducing a new > drm_sched_rq_update_fifo_locked() helper and pulling up the lock taking to > its callers. > > v2: > * Remove drm_sched_rq_update_fifo() altogether. (Christian) > > v3: > * Improved commit message. (Philipp) > > Signed-off-by: Tvrtko Ursulin tvrtko.ursulin@igalia.com > Cc: Christian König christian.koenig@amd.com > Cc: Alex Deucher alexander.deucher@amd.com > Cc: Luben Tuikov ltuikov89@gmail.com > Cc: Matthew Brost matthew.brost@intel.com > Cc: Philipp Stanner pstanner@redhat.com > Reviewed-by: Christian König christian.koenig@amd.com > Signed-off-by: Philipp Stanner pstanner@redhat.com > Link: https://patchwork.freedesktop.org/patch/msgid/20241016122013.7857-2-tursulin... > (cherry picked from commit d42a254633c773921884a19e8a1a0f53a31150c3) > Signed-off-by: Jules Maselbas jmaselbas@zdiv.net
Am I interpreting this mail correctly: you want to get this patch into stable?
Why? It doesn't fix a bug.
Patch #3 in this series depends on the other two, but I agree that isn't a good idea.
Yes patch #3 fixes a freeze in amdgpu
We should just adjust patch #3 to apply on the older kernel as well instead of backporting patches #1 and #2.
I initially modified patch #3 to use .rq_lock instead of .lock, but i didn't felt very confident with this modification. Should i sent a new version with a modified patch #3 ? If so, how the change should be reflected in the commit message ? (I initially ask #kernelnewbies but ended pulling the two other patches)
You know folks, situations like that are why we want to strongly discourage accessing another API's struct members directly. There is no API contract for them.
And a proper API function rarely changes its interface, and if it does, it's easy to find for the contributor where drivers need to be adjusted. If we were all following that rule, you wouldn't even have to bother with patches #1 and #2.
That said, I see two proper solutions for your problem:
A. amdgpu is the one stopping the entities anyways, isn't it? It knows which entities it has killed. So that information could be stored in struct amdgpu_vm.
No, it's the scheduler which decides when entities are stopped.
The scheduler sets the stopped-flag, but that effectively only happens when you either flush() or fini() the entity. OR if you run into that drm_sched_fini() race.
Otherwise we would need to re-invent the flush logic for every driver again.
Let's ask differently: Does amdgpu check here whether drm_sched_entity_fini() or drm_sched_entity_flush() have been called on those entities already?
No and it should not.
drm_entity_flush() can be called many times (and even concurrently) on the same entity.
Only when the scheduler sees that it is called by the last submitter of jobs *and* because this submitter was terminated by a SIGKILL then the entity is killed as well.
Background is that the file flush callback this is used with is basically called all the time. And as we now have found once more also because userspace forgotten to set CLOEXEC.
B. Add an API: drm_sched_entity_is_stopped(). There's also drm_sched_entity_is_idle(), but I guess that won't serve your purpose?
drm_sched_entity_is_stopped() should do it. drm_sched_entity_is_idle() is something different and should potentially even not be exported to drivers in the first place.
Fine by me.
And btw, as we're at it: @Christian: Danilo and I recently asked about whether entities can still outlive their scheduler in amdgpu?
That should have been fixed by now. This happened only on hot-unplug and that was re-designed quite a bit.
That seems to be the reason why that race-"fix" in drm_sched_fini() was added, which is the only other place that can mark an entity as stopped, except for the proper place: drm_sched_entity_kill().
That is potentially still good to have.
That's why we left it for now and just added a FIXME, because there's not really any benefit in potentially blowing up drivers by removing it (well, technically blowing up drivers like that would reveal significant lifetime and, thus, design issues. But it wouldn't be "nice").
Still, it's a clear sign of (undocumented…) scheduler lifetimes being violated :(
Yeah, I know :(
Regards, Christian.
P.
Regards, Christian.
P.
Best, Jules