From: Christian König <christian.koenig@amd.com>
[ Upstream commit d7ddcf921e7d0d8ebe82e89635bc9dc26ba9540d ]
Gang submission means that the kernel driver guarantees that multiple submissions are executed on the HW at the same time on different engines.
The background is that those submissions depend on each other, and none of them can finish on its own.
SRIOV now uses world switch to preempt submissions on the engines to allow sharing the HW resources between multiple VFs.
The problem is that the SRIOV world switch can't know about such interdependencies and will cause a timeout if it waits for a partially running gang submission.
To conclude, SRIOV and gang submissions are fundamentally incompatible at the moment. For now, just disable them.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES – Rejecting gang submissions on SR-IOV VFs prevents real GPU hangs/timeouts without touching any other paths.
**Why Backport**
- SR-IOV world switching can preempt only part of a gang submission, so one engine waits indefinitely on the others and the VF times out; the new guard rejects those multi-entity submissions up front. The fix extends the existing `!p->gang_size` check in `amdgpu_cs_pass1()` so that it also returns `-EINVAL` when `p->gang_size > 1` on a VF (`drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:273`).
- `amdgpu_sriov_vf()` already exists and is widely used; the change is contained and takes effect only in the broken configuration.
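The condition described above can be modelled outside the kernel as a small, self-contained C sketch. This is only an illustration of the truth table, not the driver code: `model_check_gang_size()` and its two parameters are hypothetical stand-ins for `amdgpu_sriov_vf(p->adev)` and `p->gang_size`.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical user-space model of the patched check in amdgpu_cs_pass1().
 * is_sriov_vf stands in for amdgpu_sriov_vf(p->adev), gang_size for
 * p->gang_size. Neither name exists in the driver in this form.
 */
static int model_check_gang_size(bool is_sriov_vf, unsigned int gang_size)
{
	/* An empty submission is rejected everywhere, as before the patch. */
	if (!gang_size)
		return -EINVAL;

	/* New with the patch: more than one entity is rejected on a VF. */
	if (is_sriov_vf && gang_size > 1)
		return -EINVAL;

	return 0;
}
```

Under this model, a submission with a gang size of 1 is accepted on both bare metal and a VF, a gang size of 2 is accepted on bare metal only, and a gang size of 0 is rejected in both cases.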
**Risk**
- Behavioural change is limited to SR-IOV VFs; bare-metal and PF paths stay identical.
- Users that attempted gang submit on a VF now get a clean `-EINVAL` instead of a GPU hang. That feature never functioned correctly in this mode, so the regression risk is minimal compared to the current failure mode.
**Next Steps**
- Backport the guard; no additional prerequisites are needed. Consider also backporting the accompanying IDS flag patch so user space can detect gang-submit availability.
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index d541e214a18c8..1ce1fd0c87a57 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -286,7 +286,7 @@ static int amdgpu_cs_pass1(struct amdgpu_cs_parser *p,
 		}
 	}
 
-	if (!p->gang_size) {
+	if (!p->gang_size || (amdgpu_sriov_vf(p->adev) && p->gang_size > 1)) {
 		ret = -EINVAL;
 		goto free_all_kdata;
 	}