Docs for struct dma_resv are fairly clear:
"A reservation object can have attached one exclusive fence (normally associated with write operations) or N shared fences (read operations)."
https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html#reservation-obj...
Furthermore, a review across all of upstream.
First off, render drivers and how they set implicit fences:
- nouveau follows this contract, see in validate_fini_no_ticket()
nouveau_bo_fence(nvbo, fence, !!b->write_domains);
and that last boolean controls whether the exclusive or shared fence slot is used.
- radeon follows this contract by setting
p->relocs[i].tv.num_shared = !r->write_domain;
in radeon_cs_parser_relocs(), which ensures that the call to ttm_eu_fence_buffer_objects() in radeon_cs_parser_fini() will do the right thing.
- vmwgfx seems to follow this contract with the shotgun approach of always setting ttm_val_buf->num_shared = 0, which means ttm_eu_fence_buffer_objects() will only use the exclusive slot.
- etnaviv follows this contract, as can be trivially seen by looking at submit_attach_object_fences()
- i915 is a bit of a convoluted maze with multiple paths leading to i915_vma_move_to_active(), which sets the exclusive flag if EXEC_OBJECT_WRITE is set. This can either come in as a buffer flag in softpin mode, or through the write_domain when using relocations. It follows this contract.
- lima follows this contract, see lima_gem_submit() which sets the exclusive fence when the LIMA_SUBMIT_BO_WRITE flag is set for that bo
- msm follows this contract, see msm_gpu_submit() which sets the exclusive flag when the MSM_SUBMIT_BO_WRITE is set for that buffer
- panfrost follows this contract with the shotgun approach of just always setting the exclusive fence, see panfrost_attach_object_fences(). Benefits of a single engine I guess
- v3d follows this contract with the same shotgun approach in v3d_attach_fences_and_unlock_reservation(), but it has at least an XXX comment that maybe this should be improved
- vc4 uses the same shotgun approach of always setting an exclusive fence, see vc4_update_bo_seqnos()
- vgem also follows this contract, see vgem_fence_attach_ioctl() and the VGEM_FENCE_WRITE. This is used in some igts to validate prime sharing with i915.ko without the need of a 2nd gpu
- virtio follows this contract, again with the shotgun approach of always setting an exclusive fence, see virtio_gpu_array_add_fence()
This covers the setting of the exclusive fences when writing.
Synchronizing against the exclusive fence is a lot more tricky, and I only spot checked a few:
- i915 does it, with the optional EXEC_OBJECT_ASYNC to skip all implicit dependencies (which is used by vulkan)
- etnaviv does this. Implicit dependencies are collected in submit_fence_sync(), again with an opt-out flag ETNA_SUBMIT_NO_IMPLICIT. These are then picked up in etnaviv_sched_dependency which is the drm_sched_backend_ops->dependency callback.
- vc4 seems to not do much here, maybe it gets away with it by not having a scheduler and only a single engine. Since all newer broadcom chips than the OG vc4 use v3d for rendering, which follows this contract, the impact of this issue is fairly small.
- v3d does this using the drm_gem_fence_array_add_implicit() helper, which its drm_sched_backend_ops->dependency callback v3d_job_dependency() then picks up.
- panfrost is nice here and tracks the implicit fences in panfrost_job->implicit_fences, which again the drm_sched_backend_ops->dependency callback panfrost_job_dependency() picks up. It is mildly questionable though since it only picks up exclusive fences in panfrost_acquire_object_fences(), but not buggy in practice because it also always sets the exclusive fence. It should pick up both sets of fences, just in case there's ever going to be a 2nd gpu in a SoC with a mali gpu. Or maybe a mali SoC with a pcie port and a real gpu, which might actually happen eventually. A bug, but easy to fix. Should probably use the drm_gem_fence_array_add_implicit() helper.
- lima is nice and easy, uses drm_gem_fence_array_add_implicit() and the same schema as v3d.
- msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT, but because it doesn't use the drm/scheduler it handles fences from the wrong context with a synchronous dma_fence_wait. See submit_fence_sync() leading to msm_gem_sync_object(). Investing into a scheduler might be a good idea.
- all the remaining drivers are ttm based, where I hope they do appropriately obey implicit fences already. I didn't do the full audit there because a) not following the contract would confuse ttm quite badly and b) reading non-standard scheduler and submit code which isn't based on drm/scheduler is a pain.
Onwards to the display side.
- Any driver using the drm_gem_plane_helper_prepare_fb() helper will handle this correctly. Overwhelmingly most drivers get this right, except a few that totally don't. I'll follow up with a patch to make this helper the default and avoid a bunch of bugs.
- I didn't audit the ttm drivers, but given that dma_resv started there I hope they get this right.
In conclusion this IS the contract, both as documented and overwhelmingly implemented, specifically as implemented by all render drivers except amdgpu.
Amdgpu tried to fix this already in
commit 049aca4363d8af87cab8d53de5401602db3b9999
Author: Christian König <christian.koenig@amd.com>
Date:   Wed Sep 19 16:54:35 2018 +0200

    drm/amdgpu: fix using shared fence for exported BOs v2
but this fix falls short in a number of areas:
- It's racy, by the time the buffer is shared it might be too late. To make sure there's definitely never a problem we need to set the fences correctly for any buffer that's potentially exportable.
- It's breaking uapi: dma-buf fds support poll(), which differentiates between the exclusive and the shared fences, and which was introduced in
commit 9b495a5887994a6d74d5c261d012083a92b94738
Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Date:   Tue Jul 1 12:57:43 2014 +0200

    dma-buf: add poll support, v3
- Christian König wants to nack new uapi building further on this dma_resv contract because it breaks amdgpu, quoting
"Yeah, and that is exactly the reason why I will NAK this uAPI change.
"This doesn't works for amdgpu at all for the reasons outlined above."
https://lore.kernel.org/dri-devel/f2eb6751-2f82-9b23-f57e-548de5b729de@gmail...
Rejecting new development because your own driver is broken and violates established cross driver contracts and uapi is really not how upstream works.
Now this patch will have a severe performance impact on anything that runs on multiple engines. So we can't just merge it outright, but need a bit of a plan:
- amdgpu needs a proper uapi for handling implicit fencing. The funny thing is that to do it correctly, implicit fencing must be treated as a very strange IPC mechanism for transporting fences, where both setting the fence and the dependency intercepts must be handled explicitly. Current best practice is a per-bo flag to indicate writes, and a per-bo flag to skip implicit fencing in the CS ioctl as a new chunk.
- Since amdgpu has been shipping with broken behaviour we need an opt-out flag from the butchered implicit fencing model to enable the proper explicitly-managed implicit fencing model.
- for kernel memory fences due to bo moves at least the i915 idea is to use ttm_bo->moving. amdgpu probably needs the same.
- since the current p2p dma-buf interface assumes the kernel memory fence is in the exclusive dma_resv fence slot we need to add a new fence slot for kernel fences, which must never be ignored. Since currently only amdgpu supports this there's no real problem here yet, until amdgpu gains a NO_IMPLICIT CS flag.
- New userspace needs to ship in enough desktop distros so that users won't notice the perf impact. I think we can ignore LTS distros who upgrade their kernels but not their mesa3d snapshot.
- Then when this is all in place we can merge this patch here.
What is not a solution to this problem here is trying to make the dma_resv rules in the kernel more clever. The fundamental issue here is that the amdgpu CS uapi is the least expressive one across all drivers (only equalled by panfrost, which has an actual excuse) by not allowing any userspace control over how implicit sync is conducted.
Until this is fixed it's completely pointless to make the kernel more clever to improve amdgpu, because all we're doing is papering over this uapi design issue. amdgpu needs to attain the status quo established by other drivers first, once that's achieved we can tackle the remaining issues in a consistent way across drivers.
Cc: mesa-dev@lists.freedesktop.org
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Rob Clark <robdclark@chromium.org>
Cc: Kristian H. Kristensen <hoegsberg@google.com>
Cc: Michel Dänzer <michel@daenzer.net>
Cc: Daniel Stone <daniels@collabora.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Deepak R Varma <mh12gx2825@gmail.com>
Cc: Chen Li <chenli@uniontech.com>
Cc: Kevin Wang <kevin1.wang@amd.com>
Cc: Dennis Li <Dennis.Li@amd.com>
Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: linaro-mm-sig@lists.linaro.org
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 88a24a0b5691..cc8426e1e8a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 	amdgpu_bo_list_for_each_entry(e, p->bo_list) {
 		struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
 
-		/* Make sure we use the exclusive slot for shared BOs */
-		if (bo->prime_shared_count)
+		/* Make sure we use the exclusive slot for all potentially shared BOs */
+		if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
 			e->tv.num_shared = 0;
 		e->bo_va = amdgpu_vm_bo_find(vm, bo);
 	}
More consistency and prep work for the next patch.
Aside: I wonder whether we shouldn't just move this entire xarray business into the scheduler so that not everyone has to reinvent the same wheels. Cc'ing some scheduler people for this too.
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 41 ++++++++---------
 drivers/gpu/drm/panfrost/panfrost_job.c | 61 ++++++++++---------------
 drivers/gpu/drm/panfrost/panfrost_job.h |  8 ++--
 3 files changed, 46 insertions(+), 64 deletions(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index ca07098a6141..7977b4752b5c 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -137,12 +137,6 @@ panfrost_lookup_bos(struct drm_device *dev,
 	if (!job->bo_count)
 		return 0;
 
-	job->implicit_fences = kvmalloc_array(job->bo_count,
-				  sizeof(struct dma_fence *),
-				  GFP_KERNEL | __GFP_ZERO);
-	if (!job->implicit_fences)
-		return -ENOMEM;
-
 	ret = drm_gem_objects_lookup(file_priv,
 				     (void __user *)(uintptr_t)args->bo_handles,
 				     job->bo_count, &job->bos);
@@ -173,7 +167,7 @@ panfrost_lookup_bos(struct drm_device *dev,
 }
 
 /**
- * panfrost_copy_in_sync() - Sets up job->in_fences[] with the sync objects
+ * panfrost_copy_in_sync() - Sets up job->deps with the sync objects
  * referenced by the job.
  * @dev: DRM device
  * @file_priv: DRM file for this fd
@@ -193,22 +187,14 @@ panfrost_copy_in_sync(struct drm_device *dev,
 {
 	u32 *handles;
 	int ret = 0;
-	int i;
+	int i, in_fence_count;
 
-	job->in_fence_count = args->in_sync_count;
+	in_fence_count = args->in_sync_count;
 
-	if (!job->in_fence_count)
+	if (!in_fence_count)
 		return 0;
 
-	job->in_fences = kvmalloc_array(job->in_fence_count,
-					sizeof(struct dma_fence *),
-					GFP_KERNEL | __GFP_ZERO);
-	if (!job->in_fences) {
-		DRM_DEBUG("Failed to allocate job in fences\n");
-		return -ENOMEM;
-	}
-
-	handles = kvmalloc_array(job->in_fence_count, sizeof(u32), GFP_KERNEL);
+	handles = kvmalloc_array(in_fence_count, sizeof(u32), GFP_KERNEL);
 	if (!handles) {
 		ret = -ENOMEM;
 		DRM_DEBUG("Failed to allocate incoming syncobj handles\n");
@@ -217,16 +203,23 @@ panfrost_copy_in_sync(struct drm_device *dev,
 
 	if (copy_from_user(handles,
 			   (void __user *)(uintptr_t)args->in_syncs,
-			   job->in_fence_count * sizeof(u32))) {
+			   in_fence_count * sizeof(u32))) {
 		ret = -EFAULT;
 		DRM_DEBUG("Failed to copy in syncobj handles\n");
 		goto fail;
 	}
 
-	for (i = 0; i < job->in_fence_count; i++) {
+	for (i = 0; i < in_fence_count; i++) {
+		struct dma_fence *fence;
+
 		ret = drm_syncobj_find_fence(file_priv, handles[i], 0, 0,
-					     &job->in_fences[i]);
-		if (ret == -EINVAL)
+					     &fence);
+		if (ret)
+			goto fail;
+
+		ret = drm_gem_fence_array_add(&job->deps, fence);
+
+		if (ret)
 			goto fail;
 	}
 
@@ -264,6 +257,8 @@ static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
 
 	kref_init(&job->refcount);
 
+	xa_init_flags(&job->deps, XA_FLAGS_ALLOC);
+
 	job->pfdev = pfdev;
 	job->jc = args->jc;
 	job->requirements = args->requirements;
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index f5d39ee14ab5..707d912ff64a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -196,14 +196,21 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 	job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
 }
 
-static void panfrost_acquire_object_fences(struct drm_gem_object **bos,
-					   int bo_count,
-					   struct dma_fence **implicit_fences)
+static int panfrost_acquire_object_fences(struct drm_gem_object **bos,
+					  int bo_count,
+					  struct xarray *deps)
 {
-	int i;
+	int i, ret;
 
-	for (i = 0; i < bo_count; i++)
-		implicit_fences[i] = dma_resv_get_excl_rcu(bos[i]->resv);
+	for (i = 0; i < bo_count; i++) {
+		struct dma_fence *fence = dma_resv_get_excl_rcu(bos[i]->resv);
+
+		ret = drm_gem_fence_array_add(deps, fence);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
 }
 
 static void panfrost_attach_object_fences(struct drm_gem_object **bos,
@@ -236,8 +243,10 @@ int panfrost_job_push(struct panfrost_job *job)
 
 	kref_get(&job->refcount); /* put by scheduler job completion */
 
-	panfrost_acquire_object_fences(job->bos, job->bo_count,
-				       job->implicit_fences);
+	ret = panfrost_acquire_object_fences(job->bos, job->bo_count,
+					     &job->deps);
+	if (ret)
+		goto unlock;
 
 	drm_sched_entity_push_job(&job->base, entity);
 
@@ -254,18 +263,15 @@ static void panfrost_job_cleanup(struct kref *ref)
 {
 	struct panfrost_job *job = container_of(ref, struct panfrost_job,
 						refcount);
+	struct dma_fence *fence;
+	unsigned long index;
 	unsigned int i;
 
-	if (job->in_fences) {
-		for (i = 0; i < job->in_fence_count; i++)
-			dma_fence_put(job->in_fences[i]);
-		kvfree(job->in_fences);
-	}
-	if (job->implicit_fences) {
-		for (i = 0; i < job->bo_count; i++)
-			dma_fence_put(job->implicit_fences[i]);
-		kvfree(job->implicit_fences);
+	xa_for_each(&job->deps, index, fence) {
+		dma_fence_put(fence);
 	}
+	xa_destroy(&job->deps);
+
 	dma_fence_put(job->done_fence);
 	dma_fence_put(job->render_done_fence);
 
@@ -308,26 +314,9 @@ static struct dma_fence *panfrost_job_dependency(struct drm_sched_job *sched_job
 						 struct drm_sched_entity *s_entity)
 {
 	struct panfrost_job *job = to_panfrost_job(sched_job);
-	struct dma_fence *fence;
-	unsigned int i;
 
-	/* Explicit fences */
-	for (i = 0; i < job->in_fence_count; i++) {
-		if (job->in_fences[i]) {
-			fence = job->in_fences[i];
-			job->in_fences[i] = NULL;
-			return fence;
-		}
-	}
-
-	/* Implicit fences, max. one per BO */
-	for (i = 0; i < job->bo_count; i++) {
-		if (job->implicit_fences[i]) {
-			fence = job->implicit_fences[i];
-			job->implicit_fences[i] = NULL;
-			return fence;
-		}
-	}
+	if (!xa_empty(&job->deps))
+		return xa_erase(&job->deps, job->last_dep++);
 
 	return NULL;
 }
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.h b/drivers/gpu/drm/panfrost/panfrost_job.h
index bbd3ba97ff67..82306a03b57e 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.h
+++ b/drivers/gpu/drm/panfrost/panfrost_job.h
@@ -19,9 +19,9 @@ struct panfrost_job {
 	struct panfrost_device *pfdev;
 	struct panfrost_file_priv *file_priv;
 
-	/* Optional fences userspace can pass in for the job to depend on. */
-	struct dma_fence **in_fences;
-	u32 in_fence_count;
+	/* Contains both explicit and implicit fences */
+	struct xarray deps;
+	unsigned long last_dep;
 
 	/* Fence to be signaled by IRQ handler when the job is complete. */
 	struct dma_fence *done_fence;
@@ -30,8 +30,6 @@ struct panfrost_job {
 	__u32 requirements;
 	__u32 flush_id;
 
-	/* Exclusive fences we have taken from the BOs to wait for */
-	struct dma_fence **implicit_fences;
 	struct panfrost_gem_mapping **mappings;
 	struct drm_gem_object **bos;
 	u32 bo_count;
Currently this has no practical relevance I think, because there aren't many who can pull off a setup with panfrost and another gpu in the same system. But the rules are that if you're setting an exclusive fence, indicating a gpu write access in the implicit fencing system, then you need to wait for all fences, not just the previous exclusive fence.
panfrost against itself has no problem, because it always sets the exclusive fence (but that's probably something that will need to be fixed for vulkan and/or multi-engine gpus, or you'll suffer badly). Also no problem with that against display.
With the prep work done to switch over to the dependency helpers this is now a oneliner.
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 707d912ff64a..619d6104040c 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -203,9 +203,8 @@ static int panfrost_acquire_object_fences(struct drm_gem_object **bos,
 	int i, ret;
 
 	for (i = 0; i < bo_count; i++) {
-		struct dma_fence *fence = dma_resv_get_excl_rcu(bos[i]->resv);
-
-		ret = drm_gem_fence_array_add(deps, fence);
+		/* panfrost always uses write mode in its current uapi */
+		ret = drm_gem_fence_array_add_implicit(deps, bos[i], true);
 		if (ret)
 			return ret;
 	}
Hi,
On Fri, 21 May 2021 at 10:10, Daniel Vetter daniel.vetter@ffwll.ch wrote:
Currently this has no practial relevance I think because there's not many who can pull off a setup with panfrost and another gpu in the same system. But the rules are that if you're setting an exclusive fence, indicating a gpu write access in the implicit fencing system, then you need to wait for all fences, not just the previous exclusive fence.
panfrost against itself has no problem, because it always sets the exclusive fence (but that's probably something that will need to be fixed for vulkan and/or multi-engine gpus, or you'll suffer badly). Also no problem with that against display.
Yeah, the 'second-generation Valhall' GPUs coming later this year / early next year are starting to get pretty weird. Firmware-mediated job scheduling out of multiple queues, userspace having direct access to the queues and can do inter-queue synchronisation (at least I think so), etc. For bonus points, synchronisation is based on $addr = $val to signal and $addr == $val to wait, with a separate fence primitive as well.
Obviously Arm should be part of this conversation here, but I guess we'll have to wait for a while yet to see how everything's shaken out with this new gen, and hope that whatever's been designed upstream in the meantime is actually vaguely compatible ...
Cheers, Daniel
Am 21.05.21 um 14:22 schrieb Daniel Stone:
Hi,
On Fri, 21 May 2021 at 10:10, Daniel Vetter daniel.vetter@ffwll.ch wrote:
Currently this has no practial relevance I think because there's not many who can pull off a setup with panfrost and another gpu in the same system. But the rules are that if you're setting an exclusive fence, indicating a gpu write access in the implicit fencing system, then you need to wait for all fences, not just the previous exclusive fence.
panfrost against itself has no problem, because it always sets the exclusive fence (but that's probably something that will need to be fixed for vulkan and/or multi-engine gpus, or you'll suffer badly). Also no problem with that against display.
Yeah, the 'second-generation Valhall' GPUs coming later this year / early next year are starting to get pretty weird. Firmware-mediated job scheduling out of multiple queues, userspace having direct access to the queues and can do inter-queue synchronisation (at least I think so), etc. For bonus points, synchronisation is based on $addr = $val to signal and $addr == $val to wait, with a separate fence primitive as well.
Well that sounds familiar :)
Obviously Arm should be part of this conversation here, but I guess we'll have to wait for a while yet to see how everything's shaken out with this new gen, and hope that whatever's been designed upstream in the meantime is actually vaguely compatible ...
Yeah, going to keep you in CC when we start to code and review user fences.
Cheers, Christian.
Cheers, Daniel
On Fri, 21 May 2021 at 13:28, Christian König christian.koenig@amd.com wrote:
Am 21.05.21 um 14:22 schrieb Daniel Stone:
Yeah, the 'second-generation Valhall' GPUs coming later this year / early next year are starting to get pretty weird. Firmware-mediated job scheduling out of multiple queues, userspace having direct access to the queues and can do inter-queue synchronisation (at least I think so), etc. For bonus points, synchronisation is based on $addr = $val to signal and $addr == $val to wait, with a separate fence primitive as well.
Well that sounds familiar :)
I laughed when I first saw it, because it was better than crying I guess.
If you're curious, the interface definitions are in the csf/ directory in the 'Bifrost kernel driver' r30p0 download you can get from the Arm developer site. Unfortunately the exact semantics aren't completely clear.
Obviously Arm should be part of this conversation here, but I guess we'll have to wait for a while yet to see how everything's shaken out with this new gen, and hope that whatever's been designed upstream in the meantime is actually vaguely compatible ...
Yeah, going to keep you in CC when we start to code and review user fences.
Awesome, thanks Christian. Appreciate it. :)
Cheers, Daniel
Am 21.05.21 um 14:54 schrieb Daniel Stone:
On Fri, 21 May 2021 at 13:28, Christian König christian.koenig@amd.com wrote:
Am 21.05.21 um 14:22 schrieb Daniel Stone:
Yeah, the 'second-generation Valhall' GPUs coming later this year / early next year are starting to get pretty weird. Firmware-mediated job scheduling out of multiple queues, userspace having direct access to the queues and can do inter-queue synchronisation (at least I think so), etc. For bonus points, synchronisation is based on $addr = $val to signal and $addr == $val to wait, with a separate fence primitive as well.
Well that sounds familiar :)
I laughed when I first saw it, because it was better than crying I guess.
In Germany we say "Ich freue mich drauf wie auf Zahnschmerzen" ("I'm looking forward to it like to a toothache").
If you're curious, the interface definitions are in the csf/ directory in the 'Bifrost kernel driver' r30p0 download you can get from the Arm developer site. Unfortunately the exact semantics aren't completely clear.
Well it is actually relatively simple. Take a look at the timeline semaphores from Vulkan, everybody is basically implementing the same semantics now.
When you queued up a bunch of commands on your hardware, the first one will write value 1 to a 64bit memory location, the second one will write value 2, the third value 3 and so on. After writing the value the hardware raises an interrupt to signal everybody interested.
In other words pretty standard memory fence behavior.
When you now have a second queue which depends on work of the first one you look at the memory location and do a compare. If you depend on the third submission you just wait for the value to be >3 and are done.
Regards, Christian.
Obviously Arm should be part of this conversation here, but I guess we'll have to wait for a while yet to see how everything's shaken out with this new gen, and hope that whatever's been designed upstream in the meantime is actually vaguely compatible ...
Yeah, going to keep you in CC when we start to code and review user fences.
Awesome, thanks Christian. Appreciate it. :)
Cheers, Daniel
On Fri, 21 May 2021 at 14:09, Christian König christian.koenig@amd.com wrote:
Am 21.05.21 um 14:54 schrieb Daniel Stone:
If you're curious, the interface definitions are in the csf/ directory in the 'Bifrost kernel driver' r30p0 download you can get from the Arm developer site. Unfortunately the exact semantics aren't completely clear.
Well it is actually relatively simple. Take a look at the timeline semaphores from Vulkan, everybody is basically implementing the same semantics now.
When you queued up a bunch of commands on your hardware, the first one will write value 1 to a 64bit memory location, the second one will write value 2, the third value 3 and so on. After writing the value the hardware raises an interrupt to signal everybody interested.
In other words pretty standard memory fence behavior.
When you now have a second queue which depends on work of the first one you look at the memory location and do a compare. If you depend on the third submission you just wait for the value to be >3 and are done.
Right, it clearly follows the timeline semaphore semantics, I just meant that it's not clear how it works at a lower level wrt the synchronisation and signaling. The simplest possible interpretation is that wait_addrval blocks infinitely before kick-cmdbuf, but that seems painful with only 32 queues. And the same for fences, which are a binary signal. I guess we'll find out. My tooth hurts.
Cheers, Daniel
On Fri, May 21, 2021 at 11:10 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
Docs for struct dma_resv are fairly clear:
"A reservation object can have attached one exclusive fence (normally associated with write operations) or N shared fences (read operations)."
https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html#reservation-obj...
Furthermore a review across all of upstream.
First of render drivers and how they set implicit fences:
nouveau follows this contract, see in validate_fini_no_ticket()
nouveau_bo_fence(nvbo, fence, !!b->write_domains);
and that last boolean controls whether the exclusive or shared fence slot is used.
radeon follows this contract by setting
p->relocs[i].tv.num_shared = !r->write_domain;
in radeon_cs_parser_relocs(), which ensures that the call to ttm_eu_fence_buffer_objects() in radeon_cs_parser_fini() will do the right thing.
vmwgfx seems to follow this contract with the shotgun approach of always setting ttm_val_buf->num_shared = 0, which means ttm_eu_fence_buffer_objects() will only use the exclusive slot.
etnaviv follows this contract, as can be trivially seen by looking at submit_attach_object_fences()
i915 is a bit a convoluted maze with multiple paths leading to i915_vma_move_to_active(). Which sets the exclusive flag if EXEC_OBJECT_WRITE is set. This can either come as a buffer flag for softpin mode, or through the write_domain when using relocations. It follows this contract.
lima follows this contract, see lima_gem_submit() which sets the exclusive fence when the LIMA_SUBMIT_BO_WRITE flag is set for that bo
msm follows this contract, see msm_gpu_submit() which sets the exclusive flag when the MSM_SUBMIT_BO_WRITE is set for that buffer
panfrost follows this contract with the shotgun approach of just always setting the exclusive fence, see panfrost_attach_object_fences(). Benefits of a single engine I guess
v3d follows this contract with the same shotgun approach in v3d_attach_fences_and_unlock_reservation(), but it has at least an XXX comment that maybe this should be improved
v4c uses the same shotgun approach of always setting an exclusive fence, see vc4_update_bo_seqnos()
vgem also follows this contract, see vgem_fence_attach_ioctl() and the VGEM_FENCE_WRITE. This is used in some igts to validate prime sharing with i915.ko without the need of a 2nd gpu
vritio follows this contract again with the shotgun approach of always setting an exclusive fence, see virtio_gpu_array_add_fence()
This covers the setting of the exclusive fences when writing.
Synchronizing against the exclusive fence is a lot more tricky, and I only spot checked a few:
i915 does it, with the optional EXEC_OBJECT_ASYNC to skip all implicit dependencies (which is used by vulkan)
etnaviv does this. Implicit dependencies are collected in submit_fence_sync(), again with an opt-out flag ETNA_SUBMIT_NO_IMPLICIT. These are then picked up in etnaviv_sched_dependency which is the drm_sched_backend_ops->dependency callback.
v4c seems to not do much here, maybe gets away with it by not having a scheduler and only a single engine. Since all newer broadcom chips than the OG vc4 use v3d for rendering, which follows this contract, the impact of this issue is fairly small.
v3d does this using the drm_gem_fence_array_add_implicit() helper, which then it's drm_sched_backend_ops->dependency callback v3d_job_dependency() picks up.
panfrost is nice here and tracks the implicit fences in panfrost_job->implicit_fences, which again the drm_sched_backend_ops->dependency callback panfrost_job_dependency() picks up. It is mildly questionable though since it only picks up exclusive fences in panfrost_acquire_object_fences(), but not buggy in practice because it also always sets the exclusive fence. It should pick up both sets of fences, just in case there's ever going to be a 2nd gpu in a SoC with a mali gpu. Or maybe a mali SoC with a pcie port and a real gpu, which might actually happen eventually. A bug, but easy to fix. Should probably use the drm_gem_fence_array_add_implicit() helper.
lima is nice an easy, uses drm_gem_fence_array_add_implicit() and the same schema as v3d.
msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT, but because it doesn't use the drm/scheduler it handles fences from the wrong context with a synchronous dma_fence_wait. See submit_fence_sync() leading to msm_gem_sync_object(). Investing into a scheduler might be a good idea.
all the remaining drivers are ttm based, where I hope they do appropriately obey implicit fences already. I didn't do the full audit there because a) not follow the contract would confuse ttm quite well and b) reading non-standard scheduler and submit code which isn't based on drm/scheduler is a pain.
Onwards to the display side.
Any driver using the drm_gem_plane_helper_prepare_fb() helper will correctly. Overwhelmingly most drivers get this right, except a few totally dont. I'll follow up with a patch to make this the default and avoid a bunch of bugs.
I didn't audit the ttm drivers, but given that dma_resv started there I hope they get this right.
In conclusion this IS the contract, both as documented and overwhelmingly implemented, specically as implemented by all render drivers except amdgpu.
Amdgpu tried to fix this already in
commit 049aca4363d8af87cab8d53de5401602db3b9999
Author: Christian König <christian.koenig@amd.com>
Date:   Wed Sep 19 16:54:35 2018 +0200
drm/amdgpu: fix using shared fence for exported BOs v2
but this fix falls short on a number of areas:
It's racy: by the time the buffer is shared it might be too late. To make sure there's definitely never a problem we need to set the fences correctly for any buffer that's potentially exportable.
It's breaking uapi: dma-buf fds support poll(), which differentiates between read and write access and was introduced in
commit 9b495a5887994a6d74d5c261d012083a92b94738
Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Date:   Tue Jul 1 12:57:43 2014 +0200

    dma-buf: add poll support, v3
Christian König wants to nack new uapi building further on this dma_resv contract because it breaks amdgpu, quoting
"Yeah, and that is exactly the reason why I will NAK this uAPI change.
"This doesn't works for amdgpu at all for the reasons outlined above."
https://lore.kernel.org/dri-devel/f2eb6751-2f82-9b23-f57e-548de5b729de@gmail...
Rejecting new development because your own driver is broken and violates established cross driver contracts and uapi is really not how upstream works.
Now this patch will have a severe performance impact on anything that runs on multiple engines. So we can't just merge it outright, but need a bit of a plan:
amdgpu needs a proper uapi for handling implicit fencing. The funny thing is that to do it correctly, implicit fencing must be treated as a very strange IPC mechanism for transporting fences, where both setting the fence and the dependency intercepts must be handled explicitly. Current best practice is a per-bo flag to indicate writes, and a per-bo flag to skip implicit fencing in the CS ioctl as a new chunk.
Since amdgpu has been shipping with broken behaviour we need an opt-out flag from the butchered implicit fencing model to enable the proper, explicitly managed implicit fencing model.
For kernel memory fences due to bo moves, at least the i915 idea is to use ttm_bo->moving; amdgpu probably needs the same.
since the current p2p dma-buf interface assumes the kernel memory fence is in the exclusive dma_resv fence slot we need to add a new fence slot for kernel fences, which must never be ignored. Since currently only amdgpu supports this there's no real problem here yet, until amdgpu gains a NO_IMPLICIT CS flag.
New userspace needs to ship in enough desktop distros so that users won't notice the perf impact. I think we can ignore LTS distros who upgrade their kernels but not their mesa3d snapshot.
Then when this is all in place we can merge this patch here.
What is not a solution to this problem is trying to make the dma_resv rules in the kernel more clever. The fundamental issue here is that the amdgpu CS uapi is the least expressive one across all drivers (only equalled by panfrost, which has an actual excuse) by not allowing any userspace control over how implicit sync is conducted.
Until this is fixed it's completely pointless to make the kernel more clever to improve amdgpu, because all we're doing is papering over this uapi design issue. amdgpu needs to attain the status quo established by other drivers first; once that's achieved we can tackle the remaining issues in a consistent way across drivers.
Cc: mesa-dev@lists.freedesktop.org
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Rob Clark <robdclark@chromium.org>
Cc: Kristian H. Kristensen <hoegsberg@google.com>
Cc: Michel Dänzer <michel@daenzer.net>
Cc: Daniel Stone <daniels@collabora.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Deepak R Varma <mh12gx2825@gmail.com>
Cc: Chen Li <chenli@uniontech.com>
Cc: Kevin Wang <kevin1.wang@amd.com>
Cc: Dennis Li <Dennis.Li@amd.com>
Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: linaro-mm-sig@lists.linaro.org
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 88a24a0b5691..cc8426e1e8a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 	amdgpu_bo_list_for_each_entry(e, p->bo_list) {
 		struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);

-		/* Make sure we use the exclusive slot for shared BOs */
-		if (bo->prime_shared_count)
+		/* Make sure we use the exclusive slot for all potentially shared BOs */
+		if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
 			e->tv.num_shared = 0;
I think it also makes sense to skip this with AMDGPU_GEM_CREATE_EXPLICIT_SYNC? It can be shared but I don't think anyone expects implicit sync to happen with those.
e->bo_va = amdgpu_vm_bo_find(vm, bo); }
-- 2.31.0
On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:
On Fri, May 21, 2021 at 11:10 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
[SNIP]
vgem also follows this contract, see vgem_fence_attach_ioctl() and the VGEM_FENCE_WRITE flag. This is used in some igts to validate prime sharing with i915.ko without needing a 2nd gpu.
virtio follows this contract, again with the shotgun approach of always setting an exclusive fence, see virtio_gpu_array_add_fence().
This covers the setting of the exclusive fences when writing.
Synchronizing against the exclusive fence is a lot more tricky, and I only spot checked a few:
i915 does it, with the optional EXEC_OBJECT_ASYNC to skip all implicit dependencies (which is used by vulkan)
etnaviv does this. Implicit dependencies are collected in submit_fence_sync(), again with an opt-out flag ETNA_SUBMIT_NO_IMPLICIT. These are then picked up in etnaviv_sched_dependency which is the drm_sched_backend_ops->dependency callback.
vc4 seems to not do much here; maybe it gets away with it by not having a scheduler and only a single engine. Since all newer broadcom chips than the OG vc4 use v3d for rendering, which follows this contract, the impact of this issue is fairly small.
[SNIP]
I think it also makes sense to skip this with AMDGPU_GEM_CREATE_EXPLICIT_SYNC? It can be shared but I don't think anyone expects implicit sync to happen with those.
Ah yes, I missed this entirely. So the "no implicit flag" is already there, and the _only_ thing that's missing really is a way to fish out the implicit fences, and set them.
https://lore.kernel.org/dri-devel/20210520190007.534046-1-jason@jlekstrand.n...
So I think all that's really needed in radv is not setting RADEON_FLAG_IMPLICIT_SYNC for winsys buffers when Jason's dma-buf ioctl are present (means you need to do some import/export and keep the fd around for winsys buffers, but shouldn't be too bad), and then control the implicit fences entirely explicitly like vk expects.
Are you bored enough to type this up for radv? I'll give Jason's kernel stuff another review meanwhile. -Daniel
On Fri, May 21, 2021 at 4:37 PM Daniel Vetter daniel@ffwll.ch wrote:
On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:
On Fri, May 21, 2021 at 11:10 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
[SNIP]
So I think all that's really needed in radv is not setting RADEON_FLAG_IMPLICIT_SYNC for winsys buffers when Jason's dma-buf ioctl are present (means you need to do some import/export and keep the fd around for winsys buffers, but shouldn't be too bad), and then control the implicit fences entirely explicitly like vk expects.
That is the part I'm less sure about. This is a BO wide flag so we are also disabling implicit sync in the compositor. If the compositor only does read stuff that is ok, as the inserted exclusive fence will work for that. But as I learned recently the app provided buffer may end up being written to by the X server, which opens a whole can of potential problems if implicit sync gets disabled between Xserver operations on the app provided buffer. Hence setting that on the WSI buffer is a whole new can of potential problems and hence I've said a submission based flag would be preferred.
I can certainly try it out though.
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Fri, May 21, 2021 at 05:00:46PM +0200, Bas Nieuwenhuizen wrote:
On Fri, May 21, 2021 at 4:37 PM Daniel Vetter daniel@ffwll.ch wrote:
On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:
On Fri, May 21, 2021 at 11:10 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
[SNIP]
That is the part I'm less sure about. This is a BO wide flag so we are also disabling implicit sync in the compositor. If the compositor only does read stuff that is ok, as the inserted exclusive fence will work for that. But as I learned recently the app provided buffer may end up being written to by the X server, which opens a whole can of potential problems if implicit sync gets disabled between Xserver operations on the app provided buffer. Hence setting that on the WSI buffer is a whole new can of potential problems and hence I've said a submission based flag would be preferred.
I can certainly try it out though.
Hm yeah that's the wrong flag. We need a flag on the drm_file which the explicit userspace sets. And which is valid only for itself.
There's a nice flags field when creating a ctx, but it's not validated and there's already a comment that we have to filter out garbage priority, so that's no use. I'll whip up something entirely untested just as a draft. -Daniel
Am 21.05.21 um 17:16 schrieb Daniel Vetter:
On Fri, May 21, 2021 at 05:00:46PM +0200, Bas Nieuwenhuizen wrote:
On Fri, May 21, 2021 at 4:37 PM Daniel Vetter daniel@ffwll.ch wrote:
On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:
On Fri, May 21, 2021 at 11:10 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
[SNIP]
Hm yeah that's the wrong flag. We need a flag on the drm_file which the explicit userspace sets. And which is valid only for itself.
There's a nice flags field when creating a ctx, but it's not validated and there's already a comment that we have to filter out garbage priority, so that's no use. I'll whip up something entirely untested just as a draft.
We could provide an IOCTL for the BO to change the flag.
But could we first figure out the semantics we want to use here?
Cause I'm pretty sure we don't actually need those changes at all and as said before I'm certainly NAKing things which break existing use cases.
Regards, Christian.
On Fri, May 21, 2021 at 8:08 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 21.05.21 um 17:16 schrieb Daniel Vetter:
On Fri, May 21, 2021 at 05:00:46PM +0200, Bas Nieuwenhuizen wrote:
On Fri, May 21, 2021 at 4:37 PM Daniel Vetter daniel@ffwll.ch wrote:
On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:
On Fri, May 21, 2021 at 11:10 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
[SNIP]
Hm yeah that's the wrong flag. We need a flag on the drm_file which the explicit userspace sets. And which is valid only for itself.
There's a nice flags field when creating a ctx, but it's not validated and there's already a comment that we have to filter out garbage priority, so that's no use. I'll whip up something entirely untested just as a draft.
We could provide an IOCTL for the BO to change the flag.
That's not the semantics we need.
But could we first figure out the semantics we want to use here?
Cause I'm pretty sure we don't actually need those changes at all and as said before I'm certainly NAKing things which break existing use cases.
Please read how other drivers do this and at least _try_ to understand it. I'm really losing my patience here with you NAKing patches you're not even understanding (or did you actually read and fully understand the entire story I typed up here, and your NAK is on the entire thing?). There's not much useful conversation to be had with that approach. And with drivers I mean kernel + userspace here.
That's the other frustrating part: you're trying to fix this purely in the kernel. This is exactly one of those issues why we require open source userspace, so that we can fix the issues correctly across the entire stack. And meanwhile you're steadfastly refusing to even look at the userspace side of the picture.
Also I thought through your tlb issue: why are you even putting these tlb flush fences into the shared dma_resv slots? If you store them somewhere else in the amdgpu private part, the oversync issue goes away - in your ttm bo move callback, you can just make your bo copy job depend on them too (you have to anyway) - even for p2p there's not an issue here, because you have the ->move_notify callback, and can then lift the tlb flush fences from your private place to the shared slots so the exporter can see them.
The kernel move fences otoh are a bit more nasty to wring through the p2p dma-buf interface. That one probably needs something new. -Daniel
Am 21.05.21 um 20:31 schrieb Daniel Vetter:
[SNIP]
We could provide an IOCTL for the BO to change the flag.
That's not the semantics we need.
But could we first figure out the semantics we want to use here?
Cause I'm pretty sure we don't actually need those changes at all and as said before I'm certainly NAKing things which break existing use cases.
Please read how other drivers do this and at least _try_ to understand it. I'm really losing my patience here with you NAKing patches you're not even understanding (or did you actually read and fully understand the entire story I typed up here, and your NAK is on the entire thing?). There's not much useful conversation to be had with that approach. And with drivers I mean kernel + userspace here.
Well to be honest I did fully read that, but I was just too emotionally attached to answer more appropriately in that moment.
And I'm sorry that I react emotionally to that, but it is really frustrating that I'm not able to convince you that we have a major problem which affects all drivers and not just amdgpu.
Regarding the reason why I'm NAKing this particular patch: you are breaking existing uAPI for RADV with it. And as a maintainer of the driver I simply have no other choice than saying halt, stop, we can't do it like this.
I'm perfectly aware that I have some holes in my understanding of how ANV or other Vulkan/OpenGL stacks work. But you should probably also admit that you have some holes in how amdgpu works, or otherwise I can't imagine why you'd suggest a patch which simply breaks RADV.
I mean we are working together for years now and I think you know me pretty well, do you really think I scream bloody hell we can't do this without a good reason?
So let's stop throwing half-baked solutions at each other and discuss what we can do to solve the different problems we are both seeing here.
That's the other frustrating part: you're trying to fix this purely in the kernel. This is exactly one of those issues why we require open source userspace, so that we can fix the issues correctly across the entire stack. And meanwhile you're steadfastly refusing to even look at the userspace side of the picture.
Well I do fully understand the userspace side of the picture for the AMD stack. I just don't think we should give userspace that much control over the fences in the dma_resv object without untangling them from resource management.
And RADV is exercising exclusive sync for amdgpu already. You can submit to the GFX, Compute and SDMA queues in Vulkan and those currently won't over-synchronize.
When you then send a texture generated by multiple engines to the compositor, the kernel will correctly insert waits for all submissions of the other process.
So this already works for RADV and completely without the IOCTL Jason proposed. IIRC we also have unit tests which exercised that feature for the video decoding use case long before RADV even existed.
And yes, I have to admit that I haven't thought about interaction with other drivers when I came up with this, because the rules of that interaction weren't clear to me at that time.
Also, I thought through your tlb issue: why are you even putting these tlb flush fences into the shared dma_resv slots? If you store them somewhere else in the amdgpu private part, the oversync issue goes away:
- in your ttm bo move callback, you can just make your bo copy job depend on them too (you have to anyway)
- even for p2p there's not an issue here, because you have the ->move_notify callback, and can then lift the tlb flush fences from your private place to the shared slots so the exporter can see them.
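[Editor's note] The scheme suggested above can be sketched in a few lines. This is a hedged userspace model, not the real kernel API: struct fence, struct resv and the function names are illustrative stand-ins, assuming the TLB flush fence lives in an amdgpu-private field until a dynamic p2p export makes it visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for dma_fence/dma_resv; not the real kernel API. */
struct fence { bool signaled; };

#define MAX_SHARED 8
struct resv {
	struct fence *shared[MAX_SHARED];
	size_t num_shared;
};

struct amdgpu_bo {
	struct resv resv;	 /* what implicit sync sees */
	struct fence *tlb_flush; /* driver-private: invisible to importers */
};

/* The bo move job must still depend on the private fence
 * (it has to anyway), so correctness is kept. */
static struct fence *move_dependency(struct amdgpu_bo *bo)
{
	return bo->tlb_flush;
}

/* On ->move_notify for a p2p export, lift the private fence into
 * the shared slots so the exporter/importer can see it. */
static void move_notify(struct amdgpu_bo *bo)
{
	if (bo->tlb_flush && bo->resv.num_shared < MAX_SHARED)
		bo->resv.shared[bo->resv.num_shared++] = bo->tlb_flush;
}
```

Until move_notify() runs, command submissions that only look at bo->resv never wait on the TLB flush fence, which is exactly why the oversync disappears.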
Because adding a shared fence requires that this shared fence signals after the exclusive fence. And this is a perfect example to explain why this is so problematic and also why we currently stumble over it only in amdgpu.
In TTM we have a feature which allows evictions to be pipelined, without waiting for the evicting DMA operation. Without that, drivers would stall waiting for their allocations to finish whenever we need to allocate memory.
For certain use cases this gives you a ~20% fps increase under memory pressure, so it is a really important feature.
This works by adding the fence of the last eviction DMA operation to BOs when their backing store is newly allocated. That's what the ttm_bo_add_move_fence() function you stumbled over is good for: https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo...
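[Editor's note] As a rough model of what that function does (heavily simplified from the real ttm_bo_add_move_fence(); the memory-manager and fence types here are illustrative stand-ins, not the TTM API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fence { bool signaled; };

#define MAX_SHARED 8
struct resv {
	struct fence *shared[MAX_SHARED];
	size_t num_shared;
};

struct bo { struct resv resv; };

/* Per memory domain: the fence of the last pipelined eviction copy
 * that freed up space in this domain. */
struct mem_manager { struct fence *move_fence; };

/* Pipelined eviction: instead of stalling until the eviction copy
 * has finished, attach its fence to whichever BO gets the freed
 * space, so that BO's users are ordered after the copy. */
static void add_move_fence(struct mem_manager *man, struct bo *bo)
{
	struct fence *f = man->move_fence;

	if (f && !f->signaled && bo->resv.num_shared < MAX_SHARED)
		bo->resv.shared[bo->resv.num_shared++] = f;
}
```

Note how the eviction fence lands in a shared slot without any matching exclusive fence, which is where the tension with the "shared fences signal after the exclusive one" assumption comes from.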
Now the problem is that it is possible the application is terminated before it can complete its command submission. But since resource management only waits for the shared fences when there are some, there is a chance that we free up memory while it is still in use.
Because of this we have some rather crude workarounds in amdgpu. For example, IIRC we manually wait for any potential exclusive fence before freeing memory.
We could enable this feature for radeon and nouveau as well with a one-line change. But that would mean we would need to maintain the workarounds for the shortcomings of the dma_resv object design in those drivers as well.
To summarize I think that adding an unbound fence to protect an object is a perfectly valid operation for resource management, but this is restricted by the needs of implicit sync at the moment.
The kernel move fences otoh are a bit more nasty to wring through the p2p dma-buf interface. That one probably needs something new.
Well, the p2p interface is my least concern.
Adding the move fence means that you need to touch every place we do CS or page flip since you now have something which is parallel to the explicit sync fence.
Otherwise having the move fence separately wouldn't make much sense in the first place if we always set it together with the exclusive fence.
Best regards and sorry for getting on your nerves so much, Christian.
-Daniel
Regards, Christian.
-Daniel
Are you bored enough to type this up for radv? I'll give Jason's kernel stuff another review meanwhile. -Daniel
> e->bo_va = amdgpu_vm_bo_find(vm, bo);
> }
> --
> 2.31.0
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Hi Christian,
On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
Am 21.05.21 um 20:31 schrieb Daniel Vetter:
[SNIP]
We could provide an IOCTL for the BO to change the flag.
That's not the semantics we need.
But could we first figure out the semantics we want to use here?
Cause I'm pretty sure we don't actually need those changes at all and as said before I'm certainly NAKing things which break existing use cases.
Please read how other drivers do this and at least _try_ to understand it. I'm really losing my patience here with you NAKing patches you're not even understanding (or did you actually read and fully understand the entire story I typed up here, and is your NAK on the entire thing?). There's not much useful conversation to be had with that approach. And with drivers I mean kernel + userspace here.
Well, to be honest, I did fully read that, but I was just too emotionally attached to answer more appropriately in that moment.
And I'm sorry that I react emotionally to that, but it is really frustrating that I'm not able to convince you that we have a major problem which affects all drivers and not just amdgpu.
Regarding the reason why I'm NAKing this particular patch: you are breaking existing uAPI for RADV with it. And as a maintainer of the driver I simply have no other choice than saying: halt, stop, we can't do it like this.
I'm perfectly aware that I have some holes in my understanding of how ANV or other Vulkan/OpenGL stacks work. But you should probably also admit that you have some holes in how amdgpu works, or otherwise I can't imagine why you would suggest a patch which simply breaks RADV.
I mean we have been working together for years now and I think you know me pretty well; do you really think I scream bloody hell we can't do this without a good reason?
So let's stop throwing half-baked solutions at each other and discuss what we can do to solve the different problems we are both seeing here.
Well, this was meant to be a goalpost/semantics discussion starter. Yes, the patch breaks performance (but not correctness) for amdgpu, but it also contains my suggestion for how to fix that issue (in text form at least).
Plus a plan how to roll it out so that anyone who cares doesn't hit the perf issues this patch can cause.
Also the overall series is really meant as a subsystem wide assessment of the status quo. Aside from a correctness issue Lucas spotted in my panfrost patches no substantial input from others on this yet unfortunately. I need to poke more people I think.
Anyway since the plan as a text didn't stick I'm typing up now something more detailed in form of amdgpu patches. Maybe Bas can do the radv conversion too.
It won't be complete by far either (I'm not working for amd after all ...); I'll leave out the opengl/media side entirely. But if this works for radv, it should be a useful blueprint for gl/media too (with some changes in the interfaces, but not really the exposed semantics).
That's the other frustrating part: you're trying to fix this purely in the kernel. This is exactly one of these issues why we require open source userspace, so that we can fix the issues correctly across the entire stack. And meanwhile you're steadfastly refusing to even look at the userspace side of the picture.
Well I do fully understand the userspace side of the picture for the AMD stack. I just don't think we should give userspace that much control over the fences in the dma_resv object without untangling them from resource management.
And RADV is exercising exclusive sync for amdgpu already. You can submit to the GFX, Compute and SDMA queues in Vulkan and those currently won't over-synchronize.
When you then send a texture generated by multiple engines to the compositor, the kernel will correctly insert waits for all submissions of the other process.
So this already works for RADV and completely without the IOCTL Jason proposed. IIRC we also have unit tests which exercised that feature for the video decoding use case long before RADV even existed.
Yeah multiple engines on the producer side works fine with the current scheme you have (if we ignore for now that the way amdgpu uses dma_resv is different from other drivers by not setting the exclusive slots for producers).
Where it breaks down is you have overlapping reads once a frame is generated, on either side. E.g.
- compositors read the buffer as consumer
- but also the producer reads the buffer again (maybe a reference frame for media, or maybe for some post-processing like motion blur or whatever).
Then your current scheme oversyncs really badly, while other drivers don't have this issue. Fixing this for amdgpu will have quite a big impact on how dma_resv will be used by amdgpu, and that's why I think before we've looked at this it doesn't make sense to look at changing dma_resv in big ways.
I do think the AMDGPU_SYNC_NE_OWNER trick is pretty neat, and should help with some very non-explicit userspace (gl, maybe also media but at least ours is internally moving to a more explicit model I think) when you split the workload over multiple engines. For vulkan I think the right model is AMDGPU_SYNC_EXPLICIT, plus something like what Jason has.
Except there's going to be quite some twist, but I think that's best explained in patches since text clearly doesn't work well.
And yes, I have to admit that I haven't thought about interaction with other drivers when I came up with this, because the rules of that interaction weren't clear to me at that time.
Ok I think we're at least distilling the core of our disagreement here, that's some progress:
- my take: to make implicit sync work well and avoid oversync issues, userspace needs more control here, and currently amdgpu lacks these uapis so it needs to get those added.
- your take: please let's not give so much control to userspace
Tbh I'm not yet fully understanding your concern here, but I do agree that there's a lot of things that need to be carefully audited here, at least in amdgpu. From my driver-wide audit I do think that in general playing clever tricks is ok, since drivers treat the exclusive fence as just a special type of shared fence and don't e.g. ignore the shared fences if an exclusive one is present. For the generic helper version of this see drm_gem_fence_array_add_implicit(), but afaiui amdgpu works the same (or at least similarly enough), as do other drivers playing funny games.
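[Editor's note] For reference, the semantics described here (wait on the exclusive fence for reads, on the exclusive plus all shared fences for writes) can be modeled as follows. This is a sketch of drm_gem_fence_array_add_implicit()'s behavior with illustrative stand-in types, not the real API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fence { bool signaled; };

#define MAX_SHARED 8
struct resv {
	struct fence *excl;
	struct fence *shared[MAX_SHARED];
	size_t num_shared;
};

/* Gather the implicit-sync dependencies for a job: a reader waits
 * only on the exclusive fence, a writer waits on the exclusive
 * fence AND every shared fence -- the shared fences are not
 * ignored just because an exclusive one is present. */
static size_t gather_implicit_deps(const struct resv *r, bool write,
				   struct fence **out)
{
	size_t n = 0;

	if (r->excl)
		out[n++] = r->excl;
	if (write)
		for (size_t i = 0; i < r->num_shared; i++)
			out[n++] = r->shared[i];
	return n;
}
```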
Also, I thought through your tlb issue: why are you even putting these tlb flush fences into the shared dma_resv slots? If you store them somewhere else in the amdgpu private part, the oversync issue goes away:
- in your ttm bo move callback, you can just make your bo copy job depend on them too (you have to anyway)
- even for p2p there's not an issue here, because you have the ->move_notify callback, and can then lift the tlb flush fences from your private place to the shared slots so the exporter can see them.
Because adding a shared fence requires that this shared fence signals after the exclusive fence. And this is a perfect example to explain why this is so problematic and also why we currently stumble over it only in amdgpu.
So I also have vague recollections that we agreed this is required, but I don't think it's the case. I'm at least pretty sure we already have multiple drivers which violate this (none of them using TTM yet).
In TTM we have a feature which allows evictions to be pipelined, without waiting for the evicting DMA operation. Without that, drivers would stall waiting for their allocations to finish whenever we need to allocate memory.
For certain use cases this gives you a ~20% fps increase under memory pressure, so it is a really important feature.
Yeah that's something I'm banging my head against right now a bit for my amdgpu demo patch series.
This works by adding the fence of the last eviction DMA operation to BOs when their backing store is newly allocated. That's what the ttm_bo_add_move_fence() function you stumbled over is good for: https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo...
Now the problem is that it is possible the application is terminated before it can complete its command submission. But since resource management only waits for the shared fences when there are some, there is a chance that we free up memory while it is still in use.
Hm, where is this code? It would help in the audit that I wanted to do this week. If you look at most other places like drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't treat the shared fences as special and always also include the exclusive one.
Because of this we have some rather crude workarounds in amdgpu. For example, IIRC we manually wait for any potential exclusive fence before freeing memory.
We could enable this feature for radeon and nouveau as well with a one-line change. But that would mean we would need to maintain the workarounds for the shortcomings of the dma_resv object design in those drivers as well.
To summarize I think that adding an unbound fence to protect an object is a perfectly valid operation for resource management, but this is restricted by the needs of implicit sync at the moment.
Hm, how do unbound fences (what do you mean with this exactly?) come into play here now? I think you lost me in the last 1-2 paragraphs; before that I think I followed.
The kernel move fences otoh are a bit more nasty to wring through the p2p dma-buf interface. That one probably needs something new.
Well, the p2p interface is my least concern.
Adding the move fence means that you need to touch every place we do CS or page flip since you now have something which is parallel to the explicit sync fence.
Otherwise having the move fence separately wouldn't make much sense in the first place if we always set it together with the exclusive fence.
Yeah it's a bunch of work. But for i915 this is the model we have, so we have to do it anyway, so I have really good excuse here to do that ttm audit.
Best regards and sorry for getting on your nerves so much,
tbh I've also been rather thinly nerved on this. One side was that I spent the last 1+ years having an eerily similar discussion with the i915-gem team about how a single driver can't just have totally different rules for core stuff like dma_resv/fence/locks, and unfortunately that entire story went really, horribly wrong :-/ So I'm very much "oh noes pls not again".
But also the long w/e here helped, yay :-)
I think there's a few questions here that we can ping-pong a bit more, but I think for the next meaningful round I need to get this draft set of patches into shape and audit more code. Hopefully early next week I'll have something which isn't too full of holes and should help in moving forward on this discussion.
Cheers, Daniel
Hi Daniel,
Am 25.05.21 um 15:05 schrieb Daniel Vetter:
Hi Christian,
On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
Am 21.05.21 um 20:31 schrieb Daniel Vetter:
[SNIP]
We could provide an IOCTL for the BO to change the flag.
That's not the semantics we need.
But could we first figure out the semantics we want to use here?
Cause I'm pretty sure we don't actually need those changes at all and as said before I'm certainly NAKing things which break existing use cases.
Please read how other drivers do this and at least _try_ to understand it. I'm really losing my patience here with you NAKing patches you're not even understanding (or did you actually read and fully understand the entire story I typed up here, and is your NAK on the entire thing?). There's not much useful conversation to be had with that approach. And with drivers I mean kernel + userspace here.
Well, to be honest, I did fully read that, but I was just too emotionally attached to answer more appropriately in that moment.
And I'm sorry that I react emotionally to that, but it is really frustrating that I'm not able to convince you that we have a major problem which affects all drivers and not just amdgpu.
Regarding the reason why I'm NAKing this particular patch: you are breaking existing uAPI for RADV with it. And as a maintainer of the driver I simply have no other choice than saying: halt, stop, we can't do it like this.
I'm perfectly aware that I have some holes in my understanding of how ANV or other Vulkan/OpenGL stacks work. But you should probably also admit that you have some holes in how amdgpu works, or otherwise I can't imagine why you would suggest a patch which simply breaks RADV.
I mean we have been working together for years now and I think you know me pretty well; do you really think I scream bloody hell we can't do this without a good reason?
So let's stop throwing half-baked solutions at each other and discuss what we can do to solve the different problems we are both seeing here.
Well, this was meant to be a goalpost/semantics discussion starter. Yes, the patch breaks performance (but not correctness) for amdgpu, but it also contains my suggestion for how to fix that issue (in text form at least).
Well, as far as I can see this really breaks uAPI; we have unit tests exercising this.
But see more on this below.
Plus a plan how to roll it out so that anyone who cares doesn't hit the perf issues this patch can cause.
Also the overall series is really meant as a subsystem wide assessment of the status quo. Aside from a correctness issue Lucas spotted in my panfrost patches no substantial input from others on this yet unfortunately. I need to poke more people I think.
Anyway since the plan as a text didn't stick I'm typing up now something more detailed in form of amdgpu patches. Maybe Bas can do the radv conversion too.
It won't be complete by far either (I'm not working for amd after all ...); I'll leave out the opengl/media side entirely. But if this works for radv, it should be a useful blueprint for gl/media too (with some changes in the interfaces, but not really the exposed semantics).
Yeah, but to make my point clear once more: I can't allow any patch in that would change amdgpu's existing uAPI behavior.
What we can talk about is changing the behavior by adding flags to the fpriv, and/or stuffing the CS fence by default into the exclusive slot.
For the latter I think we could do something like using a dma_fence_chain for the exclusive fence in amdgpu. This way we would have the same semantics in the CS and still support the epoll and Jason's new import IOCTL.
But the semantics of the amdgpu CS interface (no serialization within the same process, always serialize when you see some different process) must stay the same, or otherwise I'll have quite a bunch of angry end users.
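[Editor's note] The dma_fence_chain idea can be sketched like this, as a heavily simplified userspace model of the kernel's struct dma_fence_chain (all names here are illustrative): each new CS fence is linked on top of whatever sits in the exclusive slot, so waiting on the chain head still covers every earlier submission.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct fence { bool signaled; };

/* One chain link: a fence plus everything that came before it. */
struct fence_chain {
	struct fence *fence;
	struct fence_chain *prev;
};

/* Stuff a new CS fence into the "exclusive" chain without dropping
 * the previous contents.  Sketch: no OOM handling. */
static struct fence_chain *chain_add(struct fence_chain *head,
				     struct fence *f)
{
	struct fence_chain *link = malloc(sizeof(*link));

	link->fence = f;
	link->prev = head;
	return link;
}

/* The chain head only counts as signaled once every link is. */
static bool chain_signaled(const struct fence_chain *head)
{
	for (; head; head = head->prev)
		if (!head->fence->signaled)
			return false;
	return true;
}
```

This is why the exclusive slot could hold the CS fence by default without throwing away fences that importers like epoll or Jason's import IOCTL still need to see.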
That's the other frustrating part: you're trying to fix this purely in the kernel. This is exactly one of these issues why we require open source userspace, so that we can fix the issues correctly across the entire stack. And meanwhile you're steadfastly refusing to even look at the userspace side of the picture.
Well I do fully understand the userspace side of the picture for the AMD stack. I just don't think we should give userspace that much control over the fences in the dma_resv object without untangling them from resource management.
And RADV is exercising exclusive sync for amdgpu already. You can submit to the GFX, Compute and SDMA queues in Vulkan and those currently won't over-synchronize.
When you then send a texture generated by multiple engines to the compositor, the kernel will correctly insert waits for all submissions of the other process.
So this already works for RADV and completely without the IOCTL Jason proposed. IIRC we also have unit tests which exercised that feature for the video decoding use case long before RADV even existed.
Yeah multiple engines on the producer side works fine with the current scheme you have (if we ignore for now that the way amdgpu uses dma_resv is different from other drivers by not setting the exclusive slots for producers).
Where it breaks down is you have overlapping reads once a frame is generated, on either side. E.g.
- compositors read the buffer as consumer
- but also the producer reads the buffer again (maybe a reference frame for media, or maybe for some post-processing like motion blur or whatever).
Then your current scheme oversyncs really badly, while other drivers don't have this issue.
No, that is correct behavior. This was added intentionally and we have use cases which rely on it.
The concept is that when two processes access the same buffer object they should serialize, no matter what.
You can opt-out of that by setting the EXPLICIT_SYNC flag on the BO.
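[Editor's note] That rule (serialize between processes by default, never within one, with a per-BO opt-out) can be written down in a few lines. This is a hedged sketch of amdgpu_sync_resv()'s owner check; the enum values echo the AMDGPU_SYNC_* modes mentioned in this thread, but the function itself is illustrative:

```c
#include <assert.h>
#include <stdbool.h>

enum sync_mode {
	SYNC_ALWAYS,	/* wait on everything */
	SYNC_NE_OWNER,	/* wait only on fences from a different owner */
	SYNC_EXPLICIT,	/* implicit sync off, userspace handles it */
};

/* Should a new submission wait on a fence already attached to the
 * BO?  Same-process submissions run concurrently, different
 * processes serialize, unless the BO opted out via EXPLICIT_SYNC. */
static bool need_implicit_sync(enum sync_mode mode,
			       const void *fence_owner,
			       const void *submit_owner,
			       bool bo_explicit_sync)
{
	if (bo_explicit_sync || mode == SYNC_EXPLICIT)
		return false;
	if (mode == SYNC_NE_OWNER)
		return fence_owner != submit_owner;
	return true;
}
```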
Fixing this for amdgpu will have quite a big impact on how dma_resv will be used by amdgpu, and that's why I think before we've looked at this it doesn't make sense to look at changing dma_resv in big ways.
I do think the AMDGPU_SYNC_NE_OWNER trick is pretty neat, and should help with some very non-explicit userspace (gl, maybe also media but at least ours is internally moving to a more explicit model I think) when you split the workload over multiple engines. For vulkan I think the right model is AMDGPU_SYNC_EXPLICIT, plus something like what Jason has.
Except there's going to be quite some twist, but I think that's best explained in patches since text clearly doesn't work well.
Yeah, I'm also wondering how we can ever merge the two approaches back together. I'm not sure that this is possible.
And yes, I have to admit that I haven't thought about interaction with other drivers when I came up with this, because the rules of that interaction weren't clear to me at that time.
Ok I think we're at least distilling the core of our disagreement here, that's some progress:
- my take: to make implicit sync work well and avoid oversync issues, userspace needs more control here, and currently amdgpu lacks these uapis so it needs to get those added.
- your take: please let's not give so much control to userspace
Tbh I'm not yet fully understanding your concern here, but I do agree that there's a lot of things that need to be carefully audited here, at least in amdgpu. From my driver-wide audit I do think that in general playing clever tricks is ok, since drivers treat the exclusive fence as just a special type of shared fence and don't e.g. ignore the shared fences if an exclusive one is present. For the generic helper version of this see drm_gem_fence_array_add_implicit(), but afaiui amdgpu works the same (or at least similarly enough), as do other drivers playing funny games.
Well, do you mean the other way around? I.e. that you can *not* ignore the exclusive fence when shared ones are present?
As far as I know nouveau is the only driver left using that (maybe i915 somewhere, but not sure) and it would really help if we could remove that.
Also, I thought through your tlb issue: why are you even putting these tlb flush fences into the shared dma_resv slots? If you store them somewhere else in the amdgpu private part, the oversync issue goes away:
- in your ttm bo move callback, you can just make your bo copy job depend on them too (you have to anyway)
- even for p2p there's not an issue here, because you have the ->move_notify callback, and can then lift the tlb flush fences from your private place to the shared slots so the exporter can see them.
Because adding a shared fence requires that this shared fence signals after the exclusive fence. And this is a perfect example to explain why this is so problematic and also why we currently stumble over it only in amdgpu.
So I also have vague recollections that we agreed this is required, but I don't think it's the case. I'm at least pretty sure we already have multiple drivers which violate this (none of them using TTM yet).
Yeah, I know. Basically everybody who adds more than a CS fence to the shared slots will ignore this.
As said above if we drop this (which would be great) at least nouveau would need to be fixed.
In TTM we have a feature which allows evictions to be pipelined, without waiting for the evicting DMA operation. Without that, drivers would stall waiting for their allocations to finish whenever we need to allocate memory.
For certain use cases this gives you a ~20% fps increase under memory pressure, so it is a really important feature.
Yeah that's something I'm banging my head against right now a bit for my amdgpu demo patch series.
This works by adding the fence of the last eviction DMA operation to BOs when their backing store is newly allocated. That's what the ttm_bo_add_move_fence() function you stumbled over is good for: https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo...
Now the problem is that it is possible the application is terminated before it can complete its command submission. But since resource management only waits for the shared fences when there are some, there is a chance that we free up memory while it is still in use.
Hm, where is this code? It would help in the audit that I wanted to do this week. If you look at most other places like drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't treat the shared fences as special and always also include the exclusive one.
See amdgpu_gem_object_close():
    ...
    fence = dma_resv_get_excl(bo->tbo.base.resv);
    if (fence) {
        amdgpu_bo_fence(bo, fence, true);
        fence = NULL;
    }
    ...
We explicitly added that because resource management of some other driver was going totally bananas without that.
But I'm not sure which one that was. Maybe dig a bit in the git and mailing history of that.
Because of this we have some rather crude workarounds in amdgpu. For example, IIRC we manually wait for any potential exclusive fence before freeing memory.
We could enable this feature for radeon and nouveau as well with a one-line change. But that would mean we would need to maintain the workarounds for the shortcomings of the dma_resv object design in those drivers as well.
To summarize I think that adding an unbound fence to protect an object is a perfectly valid operation for resource management, but this is restricted by the needs of implicit sync at the moment.
Hm, how do unbound fences (what do you mean with this exactly?) come into play here now? I think you lost me in the last 1-2 paragraphs; before that I think I followed.
Unbound as in not waiting for the exclusive fence.
The kernel move fences otoh are a bit more nasty to wring through the p2p dma-buf interface. That one probably needs something new.
Well, the p2p interface is my least concern.
Adding the move fence means that you need to touch every place we do CS or page flip since you now have something which is parallel to the explicit sync fence.
Otherwise having the move fence separately wouldn't make much sense in the first place if we always set it together with the exclusive fence.
Yeah it's a bunch of work. But for i915 this is the model we have, so we have to do it anyway, so I have really good excuse here to do that ttm audit.
Best regards and sorry for getting on your nerves so much,
tbh I've also been rather thinly nerved on this. One side was that I spent the last 1+ years having an eerily similar discussion with the i915-gem team about how a single driver can't just have totally different rules for core stuff like dma_resv/fence/locks, and unfortunately that entire story went really, horribly wrong :-/ So I'm very much "oh noes pls not again".
Ditto. I mean I didn't come up with the approach for amdgpu out of nowhere, but rather because we had some specific use cases for it.
Back in the R6xx days some guys were putting a semaphore at the end of each BO and serializing engine access to the BO using that. E.g. see how radeon works internally with the semaphore block.
This is basically the same concept implemented with the dma_resv object instead.
But also the long w/e here helped, yay :-)
Yeah, indeed.
Cheers, Christian.
I think there's a few questions here that we can ping-pong a bit more, but I think for the next meaningful round I need to get this draft set of patches into shape and audit more code. Hopefully early next week I'll have something which isn't too full of holes and should help in moving forward on this discussion.
Cheers, Daniel
On Tue, May 25, 2021 at 5:05 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Hi Daniel,
Am 25.05.21 um 15:05 schrieb Daniel Vetter:
Hi Christian,
On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
Am 21.05.21 um 20:31 schrieb Daniel Vetter: This works by adding the fence of the last eviction DMA operation to BOs when their backing store is newly allocated. That's what the ttm_bo_add_move_fence() function you stumbled over is good for: https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo...
Now the problem is that it is possible the application is terminated before it can complete its command submission. But since resource management only waits for the shared fences when there are some, there is a chance that we free up memory while it is still in use.
Hm, where is this code? It would help in the audit that I wanted to do this week. If you look at most other places like drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't treat the shared fences as special and always also include the exclusive one.
See amdgpu_gem_object_close():
    ...
    fence = dma_resv_get_excl(bo->tbo.base.resv);
    if (fence) {
        amdgpu_bo_fence(bo, fence, true);
        fence = NULL;
    }
    ...
We explicitly added that because resource management of some other driver was going totally bananas without that.
But I'm not sure which one that was. Maybe dig a bit in the git and mailing history of that.
Hm I looked and it's
commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb
Author: Christian König <christian.koenig@amd.com>
Date:   Thu Mar 12 12:03:34 2020 +0100
drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4
That sounded more like amdgpu itself needing this, not another driver?
But looking at amdgpu_vm_bo_update_mapping() it seems to pick the right fencing mode for gpu pte clearing, so I'm really not sure what the bug was that you worked around here? The implementation boils down to amdgpu_sync_resv(), which syncs against the exclusive fence, always. And there's nothing else that I could find in public history at least, no references to bug reports or anything. I think you need to dig internally, because as-is I'm not seeing the problem here.
Or am I missing something here? -Daniel
Am 25.05.21 um 17:23 schrieb Daniel Vetter:
On Tue, May 25, 2021 at 5:05 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Hi Daniel,
Am 25.05.21 um 15:05 schrieb Daniel Vetter:
Hi Christian,
On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
Am 21.05.21 um 20:31 schrieb Daniel Vetter: This works by adding the fence of the last eviction DMA operation to BOs when their backing store is newly allocated. That's what the ttm_bo_add_move_fence() function you stumbled over is good for: https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo...
Now the problem is that it is possible the application is terminated before it can complete its command submission. But since resource management only waits for the shared fences when there are some, there is a chance that we free up memory while it is still in use.
Hm, where is this code? It would help in the audit that I wanted to do this week. If you look at most other places like drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't treat the shared fences as special and always also include the exclusive one.
See amdgpu_gem_object_close():
    ...
    fence = dma_resv_get_excl(bo->tbo.base.resv);
    if (fence) {
        amdgpu_bo_fence(bo, fence, true);
        fence = NULL;
    }
    ...
We explicitly added that because resource management of some other driver was going totally bananas without that.
But I'm not sure which one that was. Maybe dig a bit in the git and mailing history of that.
Hm I looked and it's
commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb
Author: Christian König <christian.koenig@amd.com>
Date:   Thu Mar 12 12:03:34 2020 +0100
drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4
That sounded more like amdgpu itself needing this, not another driver?
No, that patch was just a follow up moving the functionality around.
But looking at amdgpu_vm_bo_update_mapping() it seems to pick the right fencing mode for gpu pte clearing, so I'm really not sure what the bug was that you worked around here? The implementation boils down to amdgpu_sync_resv(), which syncs against the exclusive fence, always. And there's nothing else that I could find in public history at least, no references to bug reports or anything. I think you need to dig internally, because as-is I'm not seeing the problem here.
Or am I missing something here?
See the code here for example: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouve...
Nouveau assumes that when a shared fence is present it doesn't need to wait for the exclusive one because the shared are always supposed to finish after the exclusive one.
But for page table unmap fences that isn't true and we ran into a really nasty and hard to reproduce bug because of this.
I think it would be much more defensive if we could say that we always wait for the exclusive fence and fix the use case in nouveau and double check if somebody else does stuff like that as well.
Christian.
-Daniel
On Wed, May 26, 2021 at 3:32 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 25.05.21 um 17:23 schrieb Daniel Vetter:
On Tue, May 25, 2021 at 5:05 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Hi Daniel,
Am 25.05.21 um 15:05 schrieb Daniel Vetter:
Hi Christian,
On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
Am 21.05.21 um 20:31 schrieb Daniel Vetter: This works by adding the fence of the last eviction DMA operation to BOs when their backing store is newly allocated. That's what the ttm_bo_add_move_fence() function you stumbled over is good for: https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo...
Now the problem is it is possible that the application is terminated before it can complete its command submission. But since resource management only waits for the shared fences when there are some, there is a chance that we free up memory while it is still in use.
Hm, where is this code? It would help in my audit that I wanted to do this week. If you look at most other places, like drm_gem_fence_array_add_implicit() which I mentioned earlier, then we don't treat the shared fences as special and always also include the exclusive one.
See amdgpu_gem_object_close():
	...
	fence = dma_resv_get_excl(bo->tbo.base.resv);
	if (fence) {
		amdgpu_bo_fence(bo, fence, true);
		fence = NULL;
	}
	...
We explicitly added that because resource management of some other driver was going totally bananas without that.
But I'm not sure which one that was. Maybe dig a bit in the git and mailing history of that.
Hm I looked and it's
commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb Author: Christian König christian.koenig@amd.com Date: Thu Mar 12 12:03:34 2020 +0100
drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4
That sounded more like amdgpu itself needing this, not another driver?
No, that patch was just a follow up moving the functionality around.
That patch added the "add exclusive fence to shared slots before amdgpu_vm_clear_freed() call", which I thought was at least part of your fix.
But looking at amdgpu_vm_bo_update_mapping() it seems to pick the right fencing mode for gpu pte clearing, so I'm really not sure what the bug was that you worked around here? The implementation boils down to amdgpu_sync_resv(), which syncs to the exclusive fence, always. And there's nothing else that I could find in public history at least, no references to bug reports or anything. I think you need to dig internally, because as-is I'm not seeing the problem here.
Or am I missing something here?
See the code here for example: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouve...
Nouveau assumes that when a shared fence is present it doesn't need to wait for the exclusive one because the shared are always supposed to finish after the exclusive one.
But for page table unmap fences that isn't true and we ran into a really nasty and hard to reproduce bug because of this.
I think it would be much more defensive if we could say that we always wait for the exclusive fence and fix the use case in nouveau and double check if somebody else does stuff like that as well.
Yeah most other drivers do the defensive thing here. I think it would be good to standardize on that. I'll add that to my list and do more auditing. -Daniel
Am 21.05.21 um 11:09 schrieb Daniel Vetter:
Docs for struct dma_resv are fairly clear:
"A reservation object can have attached one exclusive fence (normally associated with write operations) or N shared fences (read operations)."
https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html#reservation-obj...
Furthermore, a review across all of upstream.
First off, render drivers and how they set implicit fences:
- nouveau follows this contract, see validate_fini_no_ticket()

	nouveau_bo_fence(nvbo, fence, !!b->write_domains);

  and that last boolean controls whether the exclusive or shared fence slot is used.

- radeon follows this contract by setting

	p->relocs[i].tv.num_shared = !r->write_domain;

  in radeon_cs_parser_relocs(), which ensures that the call to ttm_eu_fence_buffer_objects() in radeon_cs_parser_fini() will do the right thing.

- vmwgfx seems to follow this contract with the shotgun approach of always setting ttm_val_buf->num_shared = 0, which means ttm_eu_fence_buffer_objects() will only use the exclusive slot.

- etnaviv follows this contract, as can be trivially seen by looking at submit_attach_object_fences().

- i915 is a bit of a convoluted maze with multiple paths leading to i915_vma_move_to_active(), which sets the exclusive flag if EXEC_OBJECT_WRITE is set. This can either come as a buffer flag for softpin mode, or through the write_domain when using relocations. It follows this contract.

- lima follows this contract, see lima_gem_submit(), which sets the exclusive fence when the LIMA_SUBMIT_BO_WRITE flag is set for that bo.

- msm follows this contract, see msm_gpu_submit(), which sets the exclusive flag when MSM_SUBMIT_BO_WRITE is set for that buffer.

- panfrost follows this contract with the shotgun approach of just always setting the exclusive fence, see panfrost_attach_object_fences(). Benefits of a single engine, I guess.

- v3d follows this contract with the same shotgun approach in v3d_attach_fences_and_unlock_reservation(), but it at least has an XXX comment that maybe this should be improved.

- vc4 uses the same shotgun approach of always setting an exclusive fence, see vc4_update_bo_seqnos().

- vgem also follows this contract, see vgem_fence_attach_ioctl() and VGEM_FENCE_WRITE. This is used in some igts to validate prime sharing with i915.ko without the need of a 2nd gpu.

- virtio follows this contract, again with the shotgun approach of always setting an exclusive fence, see virtio_gpu_array_add_fence().
This covers the setting of the exclusive fences when writing.
Synchronizing against the exclusive fence is a lot more tricky, and I only spot checked a few:
- i915 does it, with the optional EXEC_OBJECT_ASYNC to skip all implicit dependencies (which is used by vulkan).

- etnaviv does this. Implicit dependencies are collected in submit_fence_sync(), again with an opt-out flag ETNA_SUBMIT_NO_IMPLICIT. These are then picked up in etnaviv_sched_dependency(), which is the drm_sched_backend_ops->dependency callback.

- vc4 seems to not do much here, maybe it gets away with it by not having a scheduler and only a single engine. Since all newer broadcom chips than the OG vc4 use v3d for rendering, which follows this contract, the impact of this issue is fairly small.

- v3d does this using the drm_gem_fence_array_add_implicit() helper, which its drm_sched_backend_ops->dependency callback v3d_job_dependency() then picks up.

- panfrost is nice here and tracks the implicit fences in panfrost_job->implicit_fences, which again the drm_sched_backend_ops->dependency callback panfrost_job_dependency() picks up. It is mildly questionable though, since it only picks up exclusive fences in panfrost_acquire_object_fences(), but not buggy in practice because it also always sets the exclusive fence. It should pick up both sets of fences, just in case there's ever going to be a 2nd gpu in a SoC with a mali gpu. Or maybe a mali SoC with a pcie port and a real gpu, which might actually happen eventually. A bug, but easy to fix. Should probably use the drm_gem_fence_array_add_implicit() helper.

- lima is nice and easy, uses drm_gem_fence_array_add_implicit() and the same scheme as v3d.

- msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT, but because it doesn't use the drm/scheduler it handles fences from the wrong context with a synchronous dma_fence_wait. See submit_fence_sync() leading to msm_gem_sync_object(). Investing into a scheduler might be a good idea.

- all the remaining drivers are ttm based, where I hope they obey implicit fences appropriately already. I didn't do the full audit there because a) not following the contract would confuse ttm quite well and b) reading non-standard scheduler and submit code which isn't based on drm/scheduler is a pain.
Onwards to the display side.
Any driver using the drm_gem_plane_helper_prepare_fb() helper will handle this correctly. Overwhelmingly most drivers get this right, except a few that totally don't. I'll follow up with a patch to make this the default and avoid a bunch of bugs.
I didn't audit the ttm drivers, but given that dma_resv started there I hope they get this right.
In conclusion this IS the contract, both as documented and overwhelmingly implemented, specifically as implemented by all render drivers except amdgpu.
Amdgpu tried to fix this already in
commit 049aca4363d8af87cab8d53de5401602db3b9999 Author: Christian König christian.koenig@amd.com Date: Wed Sep 19 16:54:35 2018 +0200
drm/amdgpu: fix using shared fence for exported BOs v2
but this fix falls short in a number of areas:

- It's racy, by the time the buffer is shared it might be too late. To make sure there's definitely never a problem we need to set the fences correctly for any buffer that's potentially exportable.

- It's breaking uapi, dma-buf fds support poll() and differentiate between read and write access, which was introduced in
commit 9b495a5887994a6d74d5c261d012083a92b94738 Author: Maarten Lankhorst maarten.lankhorst@canonical.com Date: Tue Jul 1 12:57:43 2014 +0200
dma-buf: add poll support, v3
Christian König wants to nack new uapi building further on this dma_resv contract because it breaks amdgpu, quoting
"Yeah, and that is exactly the reason why I will NAK this uAPI change.
"This doesn't works for amdgpu at all for the reasons outlined above."
https://lore.kernel.org/...
Rejecting new development because your own driver is broken and violates established cross driver contracts and uapi is really not how upstream works.
Now this patch will have a severe performance impact on anything that runs on multiple engines. So we can't just merge it outright, but need a bit of a plan:

- amdgpu needs a proper uapi for handling implicit fencing. The funny thing is that to do it correctly, implicit fencing must be treated as a very strange IPC mechanism for transporting fences, where both setting the fence and dependency intercepts must be handled explicitly. Current best practice is a per-bo flag to indicate writes, and a per-bo flag to skip implicit fencing in the CS ioctl as a new chunk.

- Since amdgpu has been shipping with broken behaviour we need an opt-out flag from the butchered implicit fencing model to enable the proper explicit implicit fencing model.

- for kernel memory fences due to bo moves at least the i915 idea is to use ttm_bo->moving. amdgpu probably needs the same.

- since the current p2p dma-buf interface assumes the kernel memory fence is in the exclusive dma_resv fence slot we need to add a new fence slot for kernel fences, which must never be ignored. Since currently only amdgpu supports this there's no real problem here yet, until amdgpu gains a NO_IMPLICIT CS flag.

- New userspace needs to ship in enough desktop distros so that users won't notice the perf impact. I think we can ignore LTS distros who upgrade their kernels but not their mesa3d snapshot.

Then when this is all in place we can merge this patch here.
What is not a solution to this problem here is trying to make the dma_resv rules in the kernel more clever. The fundamental issue here is that the amdgpu CS uapi is the least expressive one across all drivers (only equalled by panfrost, which has an actual excuse) by not allowing any userspace control over how implicit sync is conducted.
Until this is fixed it's completely pointless to make the kernel more clever to improve amdgpu, because all we're doing is papering over this uapi design issue. amdgpu needs to attain the status quo established by other drivers first, once that's achieved we can tackle the remaining issues in a consistent way across drivers.
Cc: mesa-dev@lists.freedesktop.org Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl Cc: Dave Airlie airlied@gmail.com Cc: Rob Clark robdclark@chromium.org Cc: Kristian H. Kristensen hoegsberg@google.com Cc: Michel Dänzer michel@daenzer.net Cc: Daniel Stone daniels@collabora.com Cc: Sumit Semwal sumit.semwal@linaro.org Cc: "Christian König" christian.koenig@amd.com Cc: Alex Deucher alexander.deucher@amd.com Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Deepak R Varma mh12gx2825@gmail.com Cc: Chen Li chenli@uniontech.com Cc: Kevin Wang kevin1.wang@amd.com Cc: Dennis Li Dennis.Li@amd.com Cc: Luben Tuikov luben.tuikov@amd.com Cc: linaro-mm-sig@lists.linaro.org Signed-off-by: Daniel Vetter daniel.vetter@intel.com
And as explained before this is a general NAK.
I'm not discussing this further until we have fixed the dma_resv rules for implicit synchronization since this will just result in every command submission serializing all accesses to BOs which is certainly not what we want.
Regards, Christian.
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 88a24a0b5691..cc8426e1e8a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 	amdgpu_bo_list_for_each_entry(e, p->bo_list) {
 		struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
 
-		/* Make sure we use the exclusive slot for shared BOs */
-		if (bo->prime_shared_count)
+		/* Make sure we use the exclusive slot for all potentially shared BOs */
+		if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
 			e->tv.num_shared = 0;
 		e->bo_va = amdgpu_vm_bo_find(vm, bo);
 	}
On Fri, May 21, 2021 at 2:10 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
- msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT, but because it doesn't use the drm/scheduler it handles fences from the wrong context with a synchronous dma_fence_wait. See submit_fence_sync() leading to msm_gem_sync_object(). Investing into a scheduler might be a good idea.
Yeah, drm/scheduler is (along with a lot of other things) on the TODO list. But this isn't quite as bad as it sounds because userspace uses a u_queue thread to call the submit ioctl rather than blocking the driver. (It also offloads some other work from the driver thread, like submit merging to reduce # of ioctls. Coincidentally that arrangement was a step towards preparing userspace for some hypothetical non-ioctl uapi ;-))
OTOH it would be good to move blocking until the system can free enough pages to repin bo's out of the ioctl path to better handle some memory pressure corner cases without having to be interruptable over a lot more of the submit path.. Running chrome+android on devices without a lot of memory is fun..
BR, -R
On Fri, May 21, 2021 at 07:58:57AM -0700, Rob Clark wrote:
On Fri, May 21, 2021 at 2:10 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
- msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT, but because it doesn't use the drm/scheduler it handles fences from the wrong context with a synchronous dma_fence_wait. See submit_fence_sync() leading to msm_gem_sync_object(). Investing into a scheduler might be a good idea.
Yeah, drm/scheduler is (along with a lot of other things) on the TODO list. But this isn't quite as bad as it sounds because userspace uses a u_queue thread to call the submit ioctl rather than blocking the driver. (It also offloads some other work from the driver thread, like submit merging to reduce # of ioctls. Coincidentally that arrangement was a step towards preparing userspace for some hypothetical non-ioctl uapi ;-))
You're also holding a pile of locks, which I expect will become painful at the latest with multi-engine buffer sharing. If you push this to the scheduler then the locks aren't held. Or maybe I've misread the flow, it's all become a bit of a blur after all these drivers :-)
OTOH it would be good to move blocking until the system can free enough pages to repin bo's out of the ioctl path to better handle some memory pressure corner cases without having to be interruptable over a lot more of the submit path.. Running chrome+android on devices without a lot of memory is fun..
Uh that one needs the userspace thread. Or entirely different semantics of your ioctl, because you're not allowed to allocate memory once any dma_fence is visible. So offloading the entire pinning to a submit thread is no-go. -Daniel