Hi everybody,
since I've got positive feedback from Daniel I continued working on this approach.
A few issues are still open: 1. Daniel suggested that I make the invalidate_mappings callback a parameter of dma_buf_attach().
This approach unfortunately won't work because when the attachment is created the importer is not necessarily ready to handle invalidation events.
E.g. in the amdgpu example we first need to set up the imported GEM/TTM objects and install them in the attachment.
My solution is to introduce a separate function to grab the locks and set the callback (a rough sketch of that flow follows below); this function could then be used to pin the buffer later on if that turns out to be necessary after all.
2. With my example setup this currently results in a ping/pong situation because the exporter prefers a VRAM placement while the importer prefers a GTT placement.
This results in quite a performance drop, but it can be fixed by a simple Mesa patch which allows shared BOs to be placed in both VRAM and GTT.
Question is what should we do in the meantime? Accept the performance drop or only allow unpinned sharing with new Mesa?
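Coming back to point 1, here is a rough sketch of the importer-side flow I have in mind. Everything prefixed with my_ is invented purely for illustration; only dma_buf_attach() and the new dma_buf_set_invalidate_callback() are real, and my_bo_invalidate_mappings stands for whatever callback the importer registers (a possible implementation is sketched further down in the thread):

/* Hypothetical importer setup; my_* names are illustrative only. */
struct my_bo {
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
	bool sgt_invalid;
};

static int my_importer_attach(struct my_bo *bo, struct dma_buf *dma_buf,
			      struct device *dev)
{
	struct dma_buf_attachment *attach;

	attach = dma_buf_attach(dma_buf, dev);
	if (IS_ERR(attach))
		return PTR_ERR(attach);

	/* First set up the importer-side object ... */
	bo->attach = attach;
	attach->priv = bo;

	/* ... and only then allow invalidation events to reach it. */
	dma_buf_set_invalidate_callback(attach, my_bo_invalidate_mappings);
	return 0;
}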
Please review and comment, Christian.
Each importer can now provide an invalidate_mappings callback.
This allows the exporter to provide the mappings without the need to pin the backing store.
v2: don't try to invalidate mappings when the callback is NULL, lock the reservation obj while using the attachments, add helper to set the callback
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/dma-buf.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dma-buf.h   | 38 ++++++++++++++++++++++++++++++
 2 files changed, 98 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index d78d5fc173dc..ed2b3559ba25 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -572,7 +572,9 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
 		if (ret)
 			goto err_attach;
 	}
+	reservation_object_lock(dmabuf->resv, NULL);
 	list_add(&attach->node, &dmabuf->attachments);
+	reservation_object_unlock(dmabuf->resv);
 	mutex_unlock(&dmabuf->lock);

 	return attach;
@@ -598,7 +600,9 @@ void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach)
 		return;
 	mutex_lock(&dmabuf->lock);
+	reservation_object_lock(dmabuf->resv, NULL);
 	list_del(&attach->node);
+	reservation_object_unlock(dmabuf->resv);
 	if (dmabuf->ops->detach)
 		dmabuf->ops->detach(dmabuf, attach);
@@ -632,10 +636,23 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
 	if (WARN_ON(!attach || !attach->dmabuf))
 		return ERR_PTR(-EINVAL);
+	/*
+	 * Mapping a DMA-buf can trigger its invalidation, prevent sending this
+	 * event to the caller by temporarily removing this attachment from the
+	 * list.
+	 */
+	if (attach->invalidate_mappings) {
+		reservation_object_assert_held(attach->dmabuf->resv);
+		list_del(&attach->node);
+	}
+
 	sg_table = attach->dmabuf->ops->map_dma_buf(attach, direction);
 	if (!sg_table)
 		sg_table = ERR_PTR(-ENOMEM);
+	if (attach->invalidate_mappings)
+		list_add(&attach->node, &attach->dmabuf->attachments);
+
 	return sg_table;
 }
 EXPORT_SYMBOL_GPL(dma_buf_map_attachment);
@@ -656,6 +673,9 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
 {
 	might_sleep();
+	if (attach->invalidate_mappings)
+		reservation_object_assert_held(attach->dmabuf->resv);
+
 	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
 		return;
@@ -664,6 +684,44 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
 }
 EXPORT_SYMBOL_GPL(dma_buf_unmap_attachment);
+/**
+ * dma_buf_set_invalidate_callback - set the invalidate_mappings callback
+ *
+ * @attach:	[in]	attachment where to set the callback
+ * @cb:		[in]	the callback to set
+ *
+ * Makes sure to take the appropriate locks when updating the invalidate
+ * mappings callback.
+ */
+void dma_buf_set_invalidate_callback(struct dma_buf_attachment *attach,
+				     void (*cb)(struct dma_buf_attachment *))
+{
+	reservation_object_lock(attach->dmabuf->resv, NULL);
+	attach->invalidate_mappings = cb;
+	reservation_object_unlock(attach->dmabuf->resv);
+}
+EXPORT_SYMBOL_GPL(dma_buf_set_invalidate_callback);
+
+/**
+ * dma_buf_invalidate_mappings - invalidate all mappings of this dma_buf
+ *
+ * @dmabuf:	[in]	buffer which mappings should be invalidated
+ *
+ * Informs all attachments that they need to destroy and recreate all their
+ * mappings.
+ */
+void dma_buf_invalidate_mappings(struct dma_buf *dmabuf)
+{
+	struct dma_buf_attachment *attach;
+
+	reservation_object_assert_held(dmabuf->resv);
+
+	list_for_each_entry(attach, &dmabuf->attachments, node)
+		if (attach->invalidate_mappings)
+			attach->invalidate_mappings(attach);
+}
+EXPORT_SYMBOL_GPL(dma_buf_invalidate_mappings);
+
 /**
  * DOC: cpu access
  *
@@ -1121,10 +1179,12 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
 		seq_puts(s, "\tAttached Devices:\n");
 		attach_count = 0;
+		reservation_object_lock(buf_obj->resv, NULL);
 		list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
 			seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
 			attach_count++;
 		}
+		reservation_object_unlock(buf_obj->resv);
 		seq_printf(s, "Total %d devices attached\n\n", attach_count);

diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 085db2fee2d7..70c65fcfe1e3 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -91,6 +91,18 @@ struct dma_buf_ops {
 	 */
 	void (*detach)(struct dma_buf *, struct dma_buf_attachment *);
+	/**
+	 * @supports_mapping_invalidation:
+	 *
+	 * True for exporters which support unpinned DMA-buf operation using
+	 * the reservation lock.
+	 *
+	 * When attachment->invalidate_mappings is set the @map_dma_buf and
+	 * @unmap_dma_buf callbacks can be called with the reservation lock
+	 * held.
+	 */
+	bool supports_mapping_invalidation;
+
 	/**
 	 * @map_dma_buf:
 	 *
@@ -326,6 +338,29 @@ struct dma_buf_attachment {
 	struct device *dev;
 	struct list_head node;
 	void *priv;
+
+	/**
+	 * @invalidate_mappings:
+	 *
+	 * Optional callback provided by the importer of the attachment which
+	 * must be set before mappings are created.
+	 *
+	 * If provided the exporter can avoid pinning the backing store while
+	 * mappings exist.
+	 *
+	 * The function is called with the lock of the reservation object
+	 * associated with the dma_buf held and the mapping function must be
+	 * called with this lock held as well. This makes sure that no mapping
+	 * is created concurrently with an ongoing invalidation.
+	 *
+	 * After the callback all existing mappings are still valid until all
+	 * fences in the dma_buf's reservation object are signaled, but they
+	 * should be destroyed by the importer as soon as possible.
+	 *
+	 * New mappings can be created immediately, but can't be used before
+	 * the exclusive fence in the dma_buf's reservation object is signaled.
+	 */
+	void (*invalidate_mappings)(struct dma_buf_attachment *attach);
 };
 /**
@@ -391,6 +426,9 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *,
 					enum dma_data_direction);
 void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *,
 				enum dma_data_direction);
+void dma_buf_set_invalidate_callback(struct dma_buf_attachment *attach,
+				     void (*cb)(struct dma_buf_attachment *));
+void dma_buf_invalidate_mappings(struct dma_buf *dma_buf);
 int dma_buf_begin_cpu_access(struct dma_buf *dma_buf,
 			     enum dma_data_direction dir);
 int dma_buf_end_cpu_access(struct dma_buf *dma_buf,
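For completeness, a rough sketch of how an exporter could use the new function before moving a buffer. struct my_buffer and my_buffer_schedule_move() are made-up placeholders; only dma_buf_invalidate_mappings() and the reservation object calls come from the patch:

/* Hypothetical exporter-side eviction path, for illustration only. */
static void my_exporter_evict(struct my_buffer *buf)
{
	reservation_object_lock(buf->dmabuf->resv, NULL);

	/* Tell all importers that their mappings are about to go away. */
	dma_buf_invalidate_mappings(buf->dmabuf);

	/*
	 * Pipelined move: the copy waits for the existing fences and its own
	 * fence is installed as the new exclusive fence before the lock is
	 * dropped again.
	 */
	my_buffer_schedule_move(buf);

	reservation_object_unlock(buf->dmabuf->resv);
}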
Quoting Christian König (2018-03-16 13:20:45)
@@ -326,6 +338,29 @@ struct dma_buf_attachment { struct device *dev; struct list_head node; void *priv;
/**
* @invalidate_mappings:
*
* Optional callback provided by the importer of the attachment which
* must be set before mappings are created.
*
* If provided the exporter can avoid pinning the backing store while
* mappings exists.
Hmm, no I don't think it avoids the pinning issue entirely. As it stands, the importer doesn't have a page refcount and so they all rely on the exporter keeping the dmabuf pages pinned while attached. What can happen is that given the invalidate cb, the importers can revoke their attachments, letting the exporter recover the pages/sg, and then start again from scratch.
That also neatly answers what happens if not all importers provide an invalidate cb, or fail, the dmabuf remains pinned and the exporter must retreat.
*
* The function is called with the lock of the reservation object
* associated with the dma_buf held and the mapping function must be
* called with this lock held as well. This makes sure that no mapping
* is created concurrently with an ongoing invalidation.
*
* After the callback all existing mappings are still valid until all
* fences in the dma_bufs reservation object are signaled, but should be
* destroyed by the importer as soon as possible.
*
* New mappings can be created immediately, but can't be used before the
* exclusive fence in the dma_bufs reservation object is signaled.
*/
void (*invalidate_mappings)(struct dma_buf_attachment *attach);
The intent is that the invalidate is synchronous and immediate, while locked? We are looking at recursing back into the dma_buf functions to remove each attachment from the invalidate cb (as well as waiting for dma), won't that cause some nasty issues? -Chris
On 16.03.2018 at 14:51, Chris Wilson wrote:
Quoting Christian König (2018-03-16 13:20:45)
@@ -326,6 +338,29 @@ struct dma_buf_attachment { struct device *dev; struct list_head node; void *priv;
/**
* @invalidate_mappings:
*
* Optional callback provided by the importer of the attachment which
* must be set before mappings are created.
*
* If provided the exporter can avoid pinning the backing store while
* mappings exists.
Hmm, no I don't think it avoids the pinning issue entirely. As it stands, the importer doesn't have a page refcount and so they all rely on the exporter keeping the dmabuf pages pinned while attached. What can happen is that given the invalidate cb, the importers can revoke their attachments, letting the exporter recover the pages/sg, and then start again from scratch.
Yes, exactly. The wording is just not 100% precise and I haven't found something better so far.
That also neatly answers what happens if not all importers provide an invalidate cb, or fail, the dmabuf remains pinned and the exporter must retreat.
Yes, exactly as well. As soon as at least one importer says "I can't do this", we must fallback to the old behavior.
*
* The function is called with the lock of the reservation object
* associated with the dma_buf held and the mapping function must be
* called with this lock held as well. This makes sure that no mapping
* is created concurrently with an ongoing invalidation.
*
* After the callback all existing mappings are still valid until all
* fences in the dma_bufs reservation object are signaled, but should be
* destroyed by the importer as soon as possible.
*
* New mappings can be created immediately, but can't be used before the
* exclusive fence in the dma_bufs reservation object is signaled.
*/
void (*invalidate_mappings)(struct dma_buf_attachment *attach);
The intent is that the invalidate is synchronous and immediate, while locked? We are looking at recursing back into the dma_buf functions to remove each attachment from the invalidate cb (as well as waiting for dma), won't that cause some nasty issues?
No, with this idea invalidation is asynchronous. I already discussed that with Daniel as well, and yes, Daniel also already pointed out that I need to better document this.
When the exporter calls invalidate_mappings() it just means that all importers can no longer use their sg tables for new submissions, old ones stay active.
The existing sg tables are guaranteed to stay valid until all fences in the reservation object have signaled, and the importer should also delay its call to dma_buf_unmap_attachment() until all the fences have signaled.
When the importer has new work to do, e.g. wants to attach a new fence to the reservation object, it must grab a new sg table for that. The importer also needs to make sure that all new work touching the dma-buf doesn't start before the exclusive fence in the reservation object signals.
This allows for full grown pipelining, e.g. the exporter can say I need to move the buffer for some operation. Then let the move operation wait for all existing fences in the reservation object and install the fence of the move operation as exclusive fence.
The importer can then immediately grab a new sg table for the new location of the buffer and use it to prepare the next operation.
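Put into (hypothetical) importer code, that flow could look roughly like this; everything prefixed with my_ is invented for illustration, the struct my_bo fields match the attach sketch earlier in the thread:

/* Sketch of an importer following the rules described above. */
static void my_bo_invalidate_mappings(struct dma_buf_attachment *attach)
{
	struct my_bo *bo = attach->priv;

	/*
	 * Called with the reservation lock held: only mark the mapping stale
	 * here, the existing sg table stays valid until the fences in the
	 * reservation object have signaled.
	 */
	bo->sgt_invalid = true;
}

/* Called with the reservation lock held while preparing new work. */
static int my_bo_prepare_submission(struct my_bo *bo)
{
	if (bo->sgt_invalid) {
		/* Defer the unmap until the old fences have signaled. */
		my_bo_queue_delayed_unmap(bo, bo->sgt);

		/* Grab a new mapping for the buffer's new location. */
		bo->sgt = dma_buf_map_attachment(bo->attach,
						 DMA_BIDIRECTIONAL);
		if (IS_ERR(bo->sgt))
			return PTR_ERR(bo->sgt);
		bo->sgt_invalid = false;
	}

	/*
	 * New work must not start before the exclusive fence in the
	 * reservation object has signaled.
	 */
	return my_job_wait_for_exclusive_fence(bo);
}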
Regards, Christian.
-Chris
Quoting Christian König (2018-03-16 14:22:32) [snip, probably lost too much context]
This allows for full grown pipelining, e.g. the exporter can say I need to move the buffer for some operation. Then let the move operation wait for all existing fences in the reservation object and install the fence of the move operation as exclusive fence.
Ok, the situation I have in mind is the non-pipelined case: revoking dma-buf for mmu_invalidate_range or shrink_slab. I would need a completion event that can be waited on the cpu for all the invalidate callbacks. (Essentially an atomic_t counter plus struct completion; a lighter version of dma_fence, I wonder where I've seen that before ;)
Even so, it basically means passing a fence object down to the async callbacks for them to signal when they are complete. Just to handle the non-pipelined version. :| -Chris
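The counter-plus-completion Chris mentions could be sketched like this; it is not an existing API, just the idea made concrete:

/* Minimal sketch of a "lighter dma_fence" for CPU-side waiting. */
struct invalidate_completion {
	atomic_t pending;
	struct completion done;
};

static void invalidate_completion_init(struct invalidate_completion *c,
				       int importers)
{
	atomic_set(&c->pending, importers);
	init_completion(&c->done);
}

/* Each importer calls this once its invalidation has finished. */
static void invalidate_completion_signal(struct invalidate_completion *c)
{
	if (atomic_dec_and_test(&c->pending))
		complete(&c->done);
}

/* The exporter blocks here before actually releasing the pages. */
static void invalidate_completion_wait(struct invalidate_completion *c)
{
	wait_for_completion(&c->done);
}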
On 19.03.2018 at 16:53, Chris Wilson wrote:
Quoting Christian König (2018-03-16 14:22:32) [snip, probably lost too must context]
This allows for full grown pipelining, e.g. the exporter can say I need to move the buffer for some operation. Then let the move operation wait for all existing fences in the reservation object and install the fence of the move operation as exclusive fence.
Ok, the situation I have in mind is the non-pipelined case: revoking dma-buf for mmu_invalidate_range or shrink_slab. I would need a completion event that can be waited on the cpu for all the invalidate callbacks. (Essentially an atomic_t counter plus struct completion; a lighter version of dma_fence, I wonder where I've seen that before ;)
Actually that is harmless.
When you need to unmap a DMA-buf because of mmu_invalidate_range or shrink_slab you need to wait for its reservation object anyway.
This needs to be done to make sure that the backing memory is now idle; it doesn't matter whether the jobs were submitted by DMA-buf importers or your own driver.
The sg tables pointing to the now released memory might live a bit longer, but that is unproblematic and actually intended.
When we would try to destroy the sg tables in an mmu_invalidate_range or shrink_slab callback we would run into a lockdep horror.
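In code, the waiting described here boils down to something like the following hypothetical notifier callback; the my_* lookup helper is made up, only reservation_object_wait_timeout_rcu() is an existing helper:

/* Hypothetical mmu_notifier callback, for illustration only. */
static void my_mn_invalidate_range_start(struct mmu_notifier *mn,
					 struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end)
{
	struct my_bo *bo = my_mn_find_bo(mn, start, end); /* made-up lookup */
	long r;

	if (!bo)
		return;

	/*
	 * Wait for all fences, no matter whether they were added by DMA-buf
	 * importers or by our own driver; afterwards the backing memory is
	 * idle even if some sg tables still exist for a while.
	 */
	r = reservation_object_wait_timeout_rcu(bo->resv, true, false,
						MAX_SCHEDULE_TIMEOUT);
	if (r <= 0)
		pr_err("failed to wait for bo fences (%ld)\n", r);
}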
Regards, Christian.
Even so, it basically means passing a fence object down to the async callbacks for them to signal when they are complete. Just to handle the non-pipelined version. :| -Chris
On Mon, Mar 19, 2018 at 5:23 PM, Christian König ckoenig.leichtzumerken@gmail.com wrote:
On 19.03.2018 at 16:53, Chris Wilson wrote:
Quoting Christian König (2018-03-16 14:22:32) [snip, probably lost too must context]
This allows for full grown pipelining, e.g. the exporter can say I need to move the buffer for some operation. Then let the move operation wait for all existing fences in the reservation object and install the fence of the move operation as exclusive fence.
Ok, the situation I have in mind is the non-pipelined case: revoking dma-buf for mmu_invalidate_range or shrink_slab. I would need a completion event that can be waited on the cpu for all the invalidate callbacks. (Essentially an atomic_t counter plus struct completion; a lighter version of dma_fence, I wonder where I've seen that before ;)
Actually that is harmless.
When you need to unmap a DMA-buf because of mmu_invalidate_range or shrink_slab you need to wait for it's reservation object anyway.
reservation_object only prevents adding new fences, you still have to wait for all the current ones to signal. Also, we have dma-access without fences in i915. "I hold the reservation_object" does not imply you can just go and nuke the backing storage.
This needs to be done to make sure that the backing memory is now idle, it doesn't matter if the jobs where submitted by DMA-buf importers or your own driver.
The sg tables pointing to the now released memory might live a bit longer, but that is unproblematic and actually intended.
I think that's very problematic. One reason for an IOMMU is that you have device access isolation, and a broken device can't access memory it shouldn't be able to access. From that security-in-depth point of view it's not cool that there's some sg tables hanging around still that a broken GPU could use. And let's not pretend hw is perfect, especially GPUs :-)
When we would try to destroy the sg tables in an mmu_invalidate_range or shrink_slab callback we would run into a lockdep horror.
So I'm no expert on this, but I think this is exactly what we're doing in i915. Kinda no other way to actually free the memory without throwing all the nice isolation aspects of an IOMMU into the wind. Can you please paste the lockdeps you've seen with amdgpu when trying to do that? -Daniel
Regards, Christian.
Even so, it basically means passing a fence object down to the async callbacks for them to signal when they are complete. Just to handle the non-pipelined version. :| -Chris
On 20.03.2018 at 08:44, Daniel Vetter wrote:
On Mon, Mar 19, 2018 at 5:23 PM, Christian König ckoenig.leichtzumerken@gmail.com wrote:
On 19.03.2018 at 16:53, Chris Wilson wrote:
Quoting Christian König (2018-03-16 14:22:32) [snip, probably lost too must context]
This allows for full grown pipelining, e.g. the exporter can say I need to move the buffer for some operation. Then let the move operation wait for all existing fences in the reservation object and install the fence of the move operation as exclusive fence.
Ok, the situation I have in mind is the non-pipelined case: revoking dma-buf for mmu_invalidate_range or shrink_slab. I would need a completion event that can be waited on the cpu for all the invalidate callbacks. (Essentially an atomic_t counter plus struct completion; a lighter version of dma_fence, I wonder where I've seen that before ;)
Actually that is harmless.
When you need to unmap a DMA-buf because of mmu_invalidate_range or shrink_slab you need to wait for it's reservation object anyway.
reservation_object only prevents adding new fences, you still have to wait for all the current ones to signal. Also, we have dma-access without fences in i915. "I hold the reservation_object" does not imply you can just go and nuke the backing storage.
I was not talking about taking the lock, but rather using reservation_object_wait_timeout_rcu().
To be more precise you actually can't take the reservation object lock in an mmu_invalidate_range callback and you can only trylock it in a shrink_slab callback.
This needs to be done to make sure that the backing memory is now idle, it doesn't matter if the jobs where submitted by DMA-buf importers or your own driver.
The sg tables pointing to the now released memory might live a bit longer, but that is unproblematic and actually intended.
I think that's very problematic. One reason for an IOMMU is that you have device access isolation, and a broken device can't access memory it shouldn't be able to access. From that security-in-depth point of view it's not cool that there's some sg tables hanging around still that a broken GPU could use. And let's not pretend hw is perfect, especially GPUs :-)
I completely agree on that, but there is unfortunately no other way.
See you simply can't take a reservation object lock in an mmu or slab callback, you can only trylock them.
For example it would require changing all allocations done while holding any reservation lock to GFP_NOIO.
When we would try to destroy the sg tables in an mmu_invalidate_range or shrink_slab callback we would run into a lockdep horror.
So I'm no expert on this, but I think this is exactly what we're doing in i915. Kinda no other way to actually free the memory without throwing all the nice isolation aspects of an IOMMU into the wind. Can you please paste the lockdeps you've seen with amdgpu when trying to do that?
Taking a quick look at i915 I can definitely say that what you guys do here is actually quite buggy.
For coherent usage you need to install some lock to prevent concurrent get_user_pages(), command submission and invalidate_range_start/invalidate_range_end from the MMU notifier.
Otherwise you can't guarantee that you are actually accessing the right page in the case of a fork() or mprotect().
Felix and I hammered for quite some time on amdgpu until all of this was handled correctly, see drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c.
I can try to gather the lockdep splat from my mail history, but it essentially took us multiple years to get rid of all of them.
Regards, Christian.
-Daniel
Regards, Christian.
Even so, it basically means passing a fence object down to the async callbacks for them to signal when they are complete. Just to handle the non-pipelined version. :| -Chris
On Tue, Mar 20, 2018 at 11:54:18AM +0100, Christian König wrote:
On 20.03.2018 at 08:44, Daniel Vetter wrote:
On Mon, Mar 19, 2018 at 5:23 PM, Christian König ckoenig.leichtzumerken@gmail.com wrote:
On 19.03.2018 at 16:53, Chris Wilson wrote:
Quoting Christian König (2018-03-16 14:22:32) [snip, probably lost too must context]
This allows for full grown pipelining, e.g. the exporter can say I need to move the buffer for some operation. Then let the move operation wait for all existing fences in the reservation object and install the fence of the move operation as exclusive fence.
Ok, the situation I have in mind is the non-pipelined case: revoking dma-buf for mmu_invalidate_range or shrink_slab. I would need a completion event that can be waited on the cpu for all the invalidate callbacks. (Essentially an atomic_t counter plus struct completion; a lighter version of dma_fence, I wonder where I've seen that before ;)
Actually that is harmless.
When you need to unmap a DMA-buf because of mmu_invalidate_range or shrink_slab you need to wait for it's reservation object anyway.
reservation_object only prevents adding new fences, you still have to wait for all the current ones to signal. Also, we have dma-access without fences in i915. "I hold the reservation_object" does not imply you can just go and nuke the backing storage.
I was not talking about taking the lock, but rather using reservation_object_wait_timeout_rcu().
To be more precise you actually can't take the reservation object lock in an mmu_invalidate_range callback and you can only trylock it in a shrink_slab callback.
Ah ok, and yes agreed. Kinda. See below.
This needs to be done to make sure that the backing memory is now idle, it doesn't matter if the jobs where submitted by DMA-buf importers or your own driver.
The sg tables pointing to the now released memory might live a bit longer, but that is unproblematic and actually intended.
I think that's very problematic. One reason for an IOMMU is that you have device access isolation, and a broken device can't access memory it shouldn't be able to access. From that security-in-depth point of view it's not cool that there's some sg tables hanging around still that a broken GPU could use. And let's not pretend hw is perfect, especially GPUs :-)
I completely agree on that, but there is unfortunately no other way.
See you simply can't take a reservation object lock in an mmu or slab callback, you can only trylock them.
For example it would require changing all allocations done while holding any reservation lock to GFP_NOIO.
Yeah mmu and slab can only trylock, and they need to skip to the next object when the trylock fails. But once you have the lock you imo should be able to clean up the entire mess still. We definitely do that for the i915 shrinkers, and I don't see how going to importers through the ->invalidate_mappings callback changes anything with that.
For the in-driver reservation path (CS) having a slow-path that grabs a temporary reference, drops the vram lock and then locks the reservation normally (using the acquire context used already for the entire CS) is a bit tricky, but totally feasible. Ttm doesn't do that though.
So there's completely feasible ways to make sure the sg list is all properly released, all DMA gone and the IOMMU mappings torn down. Anything else is just a bit shoddy device driver programming imo.
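The trylock-and-skip pattern described above, as a sketch; this is illustrative only and not lifted from i915, all my_* names are invented:

/* Illustrative shrinker scan callback; my_* names are invented. */
static unsigned long my_shrinker_scan(struct shrinker *shrinker,
				      struct shrink_control *sc)
{
	struct my_device *mdev = container_of(shrinker, struct my_device,
					      shrinker);
	struct my_bo *bo, *tmp;
	unsigned long freed = 0;

	list_for_each_entry_safe(bo, tmp, &mdev->shrink_list, shrink_link) {
		/* Only trylock here, skip to the next object on contention. */
		if (!reservation_object_trylock(bo->resv))
			continue;

		/*
		 * With the lock held the mappings can be torn down for real:
		 * invalidate importers, unmap the sg table, drop the IOMMU
		 * mappings and release the pages.
		 */
		freed += my_bo_purge(bo);

		reservation_object_unlock(bo->resv);

		if (freed >= sc->nr_to_scan)
			break;
	}

	return freed;
}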
When we would try to destroy the sg tables in an mmu_invalidate_range or shrink_slab callback we would run into a lockdep horror.
So I'm no expert on this, but I think this is exactly what we're doing in i915. Kinda no other way to actually free the memory without throwing all the nice isolation aspects of an IOMMU into the wind. Can you please paste the lockdeps you've seen with amdgpu when trying to do that?
Taking a quick look at i915 I can definitely say that this is actually quite buggy what you guys do here.
Note that there are 2 paths for i915 userptr. One is the mmu notifier, the other one is the root-only hack we have for dubious reasons (or that I really don't see the point in myself).
For coherent usage you need to install some lock to prevent concurrent get_user_pages(), command submission and invalidate_range_start/invalidate_range_end from the MMU notifier.
Otherwise you can't guarantee that you are actually accessing the right page in the case of a fork() or mprotect().
Yeah doing that with a full lock will create endless amounts of issues, but I don't see why we need that. Userspace racing stuff with itself gets to keep all the pieces. This is like racing DIRECT_IO against mprotect and fork.
Leaking the IOMMU mappings otoh means rogue userspace could do a bunch of stray writes (I don't see any code in amdgpu_mn.c to unmap at least the gpu side PTEs to make stuff inaccessible) and wreck the core kernel's book-keeping.
In i915 we guarantee that we call set_page_dirty/mark_page_accessed only after all the mappings are really gone (both GPU PTEs and sg mapping), guaranteeing that any stray writes from either the GPU or IOMMU will result in faults (except bugs in the IOMMU, but can't have it all, "IOMMU actually works" is an assumption behind device isolation).
Felix and I hammered for quite some time on amdgpu until all of this was handled correctly, see drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c.
Maybe we should have more shared code in this, it seems to be a source of endless amounts of fun ...
I can try to gather the lockdep splat from my mail history, but it essentially took us multiple years to get rid of all of them.
I'm very much interested in specifically the splat that makes it impossible for you folks to remove the sg mappings. That one sounds bad. And would essentially make mmu_notifiers useless for their primary use case, which is handling virtual machines where you really have to make sure the IOMMU mapping is gone before you claim it's gone, because there's no 2nd level of device checks (like GPU PTEs, or command checker) catching stray writes.
Cheers, Daniel
Regards, Christian.
-Daniel
Regards, Christian.
Even so, it basically means passing a fence object down to the async callbacks for them to signal when they are complete. Just to handle the non-pipelined version. :| -Chris
On 20.03.2018 at 15:08, Daniel Vetter wrote:
[SNIP] For the in-driver reservation path (CS) having a slow-path that grabs a temporary reference, drops the vram lock and then locks the reservation normally (using the acquire context used already for the entire CS) is a bit tricky, but totally feasible. Ttm doesn't do that though.
That is exactly what we do in amdgpu as well; it's just neither very efficient nor reliable to retry getting the right pages for a submission over and over again.
[SNIP] Note that there are 2 paths for i915 userptr. One is the mmu notifier, the other one is the root-only hack we have for dubious reasons (or that I really don't see the point in myself).
Well I'm referring to i915_gem_userptr.c, if that isn't what you are exposing then just feel free to ignore this whole discussion.
For coherent usage you need to install some lock to prevent concurrent get_user_pages(), command submission and invalidate_range_start/invalidate_range_end from the MMU notifier.
Otherwise you can't guarantee that you are actually accessing the right page in the case of a fork() or mprotect().
Yeah doing that with a full lock will create endless amounts of issues, but I don't see why we need that. Userspace racing stuff with itself gets to keep all the pieces. This is like racing DIRECT_IO against mprotect and fork.
First of all I strongly disagree on that. A thread calling fork() because it wants to run a command is not something we can forbid just because we have a gfx stack loaded. That the video driver is not capable of handling that correctly is certainly not the problem of userspace.
Second it's not only userspace racing here, you can get into this kind of issues just because of transparent huge page support where the background daemon tries to reallocate the page tables into bigger chunks.
And if I'm not completely mistaken you can also open up quite a bunch of security problems if you suddenly access the wrong page.
Leaking the IOMMU mappings otoh means rogue userspace could do a bunch of stray writes (I don't see anywhere code in amdgpu_mn.c to unmap at least the gpu side PTEs to make stuff inaccessible) and wreak the core kernel's book-keeping.
In i915 we guarantee that we call set_page_dirty/mark_page_accessed only after all the mappings are really gone (both GPU PTEs and sg mapping), guaranteeing that any stray writes from either the GPU or IOMMU will result in faults (except bugs in the IOMMU, but can't have it all, "IOMMU actually works" is an assumption behind device isolation).
Well exactly that's the point, the handling in i915 looks incorrect to me. You need to call set_page_dirty/mark_page_accessed way before the mapping is destroyed.
To be more precise for userptrs it must be called from the invalidate_range_start, but i915 seems to delegate everything into a background worker to avoid the locking problems.
Felix and I hammered for quite some time on amdgpu until all of this was handled correctly, see drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c.
Maybe we should have more shared code in this, it seems to be a source of endless amounts of fun ...
I can try to gather the lockdep splat from my mail history, but it essentially took us multiple years to get rid of all of them.
I'm very much interested in specifically the splat that makes it impossible for you folks to remove the sg mappings. That one sounds bad. And would essentially make mmu_notifiers useless for their primary use case, which is handling virtual machines where you really have to make sure the IOMMU mapping is gone before you claim it's gone, because there's no 2nd level of device checks (like GPU PTEs, or command checker) catching stray writes.
Well to be more precise the problem is not that we can't destroy the sg table, but rather that we can't grab the locks to do so.
See when you need to destroy the sg table you usually need to grab the same lock you grabbed when you created it.
And all locks taken while in an MMU notifier can only depend on memory allocation with GFP_NOIO, which is not really feasible for gfx drivers.
Regards, Christian.
On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
On 20.03.2018 at 15:08, Daniel Vetter wrote:
[SNIP] For the in-driver reservation path (CS) having a slow-path that grabs a temporary reference, drops the vram lock and then locks the reservation normally (using the acquire context used already for the entire CS) is a bit tricky, but totally feasible. Ttm doesn't do that though.
That is exactly what we do in amdgpu as well, it's just not very efficient nor reliable to retry getting the right pages for a submission over and over again.
[SNIP] Note that there are 2 paths for i915 userptr. One is the mmu notifier, the other one is the root-only hack we have for dubious reasons (or that I really don't see the point in myself).
Well I'm referring to i915_gem_userptr.c, if that isn't what you are exposing then just feel free to ignore this whole discussion.
They're both in i915_gem_userptr.c, somewhat interleaved. Would be interesting if you could show what you think is going wrong in there compared to amdgpu_mn.c.
For coherent usage you need to install some lock to prevent concurrent get_user_pages(), command submission and invalidate_range_start/invalidate_range_end from the MMU notifier.
Otherwise you can't guarantee that you are actually accessing the right page in the case of a fork() or mprotect().
Yeah doing that with a full lock will create endless amounts of issues, but I don't see why we need that. Userspace racing stuff with itself gets to keep all the pieces. This is like racing DIRECT_IO against mprotect and fork.
First of all I strongly disagree on that. A thread calling fork() because it wants to run a command is not something we can forbid just because we have a gfx stack loaded. That the video driver is not capable of handling that correct is certainly not the problem of userspace.
Second it's not only userspace racing here, you can get into this kind of issues just because of transparent huge page support where the background daemon tries to reallocate the page tables into bigger chunks.
And if I'm not completely mistaken you can also open up quite a bunch of security problems if you suddenly access the wrong page.
I get a feeling we're talking past each other here. Can you perhaps explain what exactly the race is you're seeing? The i915 userptr code is fairly convoluted and pushes a lot of stuff to workers (but then syncs with those workers again later on), so it's easily possible you've overlooked one of these lines that might guarantee already what you think needs to be guaranteed. We're definitely not aiming to allow userspace to write to random pages all over.
Leaking the IOMMU mappings otoh means rogue userspace could do a bunch of stray writes (I don't see anywhere code in amdgpu_mn.c to unmap at least the gpu side PTEs to make stuff inaccessible) and wreak the core kernel's book-keeping.
In i915 we guarantee that we call set_page_dirty/mark_page_accessed only after all the mappings are really gone (both GPU PTEs and sg mapping), guaranteeing that any stray writes from either the GPU or IOMMU will result in faults (except bugs in the IOMMU, but can't have it all, "IOMMU actually works" is an assumption behind device isolation).
Well exactly that's the point, the handling in i915 looks incorrect to me. You need to call set_page_dirty/mark_page_accessed way before the mapping is destroyed.
To be more precise for userptrs it must be called from the invalidate_range_start, but i915 seems to delegate everything into a background worker to avoid the locking problems.
Yeah, and at the end of the function there's a flush_work to make sure the worker has caught up.
The set_page_dirty is also there, but hidden very deep in the call chain as part of the vma unmapping and backing storage unpinning. But I do think we guarantee what you expect needs to happen.
Felix and I hammered for quite some time on amdgpu until all of this was handled correctly, see drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c.
Maybe we should have more shared code in this, it seems to be a source of endless amounts of fun ...
I can try to gather the lockdep splat from my mail history, but it essentially took us multiple years to get rid of all of them.
I'm very much interested in specifically the splat that makes it impossible for you folks to remove the sg mappings. That one sounds bad. And would essentially make mmu_notifiers useless for their primary use case, which is handling virtual machines where you really have to make sure the IOMMU mapping is gone before you claim it's gone, because there's no 2nd level of device checks (like GPU PTEs, or command checker) catching stray writes.
Well to be more precise the problem is not that we can't destroy the sg table, but rather that we can't grab the locks to do so.
See when you need to destroy the sg table you usually need to grab the same lock you grabbed when you created it.
And all locks taken while in an MMU notifier can only depend on memory allocation with GFP_NOIO, which is not really feasible for gfx drivers.
I know. i915 gem has tons of fallbacks and retry loops (we restart the entire CS if needed), and i915 userptr pushes the entire get_user_pages dance off into a worker if the fastpath doesn't succeed and we run out of memory or hit contended locks. We also have obscene amounts of __GFP_NORETRY and __GFP_NOWARN all over the place to make sure the core mm code doesn't do something we don't want it do to do in the fastpaths (because there's really no point in spending lots of time trying to make memory available if we have a slowpath fallback with much less constraints).
We're also not limiting ourselves to GFP_NOIO, but instead have a recursion detection&handling in our own shrinker callback to avoid these deadlocks.
It's definitely not easy, and there's lots of horrible code, but it is possible. Simply not releasing the sg mappings once you no longer own the memory isn't an option imo. -Daniel
On 21.03.2018 at 09:18, Daniel Vetter wrote:
[SNIP] They're both in i915_gem_userptr.c, somewhat interleaved. Would be interesting if you could show what you think is going wrong in there compared to amdgpu_mn.c.
i915 implements only one callback:
static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
	.invalidate_range_start = i915_gem_userptr_mn_invalidate_range_start,
};
For correct operation you always need to implement invalidate_range_end as well and add some lock/completion work. Otherwise get_user_pages() can again grab a reference to the wrong page.
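The lock/completion work referred to here could be sketched like this; a simplified take loosely modelled on what amdgpu_mn.c does, with recursion handling and range tracking omitted and all my_* names invented:

/* Simplified sketch of serializing invalidation against submission. */
struct my_mn {
	struct mmu_notifier mn;
	struct rw_semaphore lock;	/* excludes CS vs. invalidation */
};

static void my_userptr_mn_invalidate_range_start(struct mmu_notifier *mn,
						 struct mm_struct *mm,
						 unsigned long start,
						 unsigned long end)
{
	struct my_mn *mmn = container_of(mn, struct my_mn, mn);

	/* Excludes command submission until the range operation ends. */
	down_read(&mmn->lock);
	my_mn_mark_userptrs_invalid(mmn, start, end); /* made-up helper */
}

static void my_userptr_mn_invalidate_range_end(struct mmu_notifier *mn,
					       struct mm_struct *mm,
					       unsigned long start,
					       unsigned long end)
{
	struct my_mn *mmn = container_of(mn, struct my_mn, mn);

	up_read(&mmn->lock);
}

Command submission would take mmn->lock in write mode (compare the amdgpu_mn_lock() snippet quoted at the end of this thread) around its final check that the get_user_pages() result is still valid.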
The next problem seems to be that cancel_userptr() doesn't prevent any new command submission. E.g.
i915_gem_object_wait(obj, I915_WAIT_ALL, MAX_SCHEDULE_TIMEOUT, NULL);
What prevents new command submissions from using the GEM object directly after you finished waiting here?
I get a feeling we're talking past each another here.
Yeah, agree. On top of that, I don't know the i915 code very well.
Can you perhaps explain what exactly the race is you're seeing? The i915 userptr code is fairly convoluted and pushes a lot of stuff to workers (but then syncs with those workers again later on), so easily possible you've overlooked one of these lines that might guarantee already what you think needs to be guaranteed. We're definitely not aiming to allow userspace to allow writing to random pages all over.
You don't read/write to random pages; there is still a reference to the page, so it can't be reused until you are done.
The problem is rather that you can't guarantee that you write to the page which is mapped into the process at that location. E.g. the CPU and the GPU might see two different things.
Leaking the IOMMU mappings otoh means rogue userspace could do a bunch of stray writes (I don't see anywhere code in amdgpu_mn.c to unmap at least the gpu side PTEs to make stuff inaccessible) and wreak the core kernel's book-keeping.
In i915 we guarantee that we call set_page_dirty/mark_page_accessed only after all the mappings are really gone (both GPU PTEs and sg mapping), guaranteeing that any stray writes from either the GPU or IOMMU will result in faults (except bugs in the IOMMU, but can't have it all, "IOMMU actually works" is an assumption behind device isolation).
Well exactly that's the point, the handling in i915 looks incorrect to me. You need to call set_page_dirty/mark_page_accessed way before the mapping is destroyed.
To be more precise for userptrs it must be called from the invalidate_range_start, but i915 seems to delegate everything into a background worker to avoid the locking problems.
Yeah, and at the end of the function there's a flush_work to make sure the worker has caught up.
Ah, yes haven't seen that.
But then grabbing the obj->base.dev->struct_mutex lock in cancel_userptr() is rather evil. You just silenced lockdep because you offloaded that into a work item.
So no matter how you put it i915 is clearly doing something wrong here :)
I know. i915 gem has tons of fallbacks and retry loops (we restart the entire CS if needed), and i915 userptr pushes the entire get_user_pages dance off into a worker if the fastpath doesn't succeed and we run out of memory or hit contended locks. We also have obscene amounts of __GFP_NORETRY and __GFP_NOWARN all over the place to make sure the core mm code doesn't do something we don't want it do to do in the fastpaths (because there's really no point in spending lots of time trying to make memory available if we have a slowpath fallback with much less constraints).
Well I haven't audited the code, but I'm pretty sure that this just mitigates the problem and silences lockdep instead of really fixing the issue.
We're also not limiting ourselves to GFP_NOIO, but instead have a recursion detection&handling in our own shrinker callback to avoid these deadlocks.
Which if you ask me is absolutely horrible. I mean the comment in linux/mutex.h sums it up pretty well:
 * This function should not be used, _ever_. It is purely for hysterical GEM
 * raisins, and once those are gone this will be removed.
Regards, Christian.
On Wed, Mar 21, 2018 at 10:34:05AM +0100, Christian König wrote:
On 21.03.2018 at 09:18, Daniel Vetter wrote:
[SNIP] They're both in i915_gem_userptr.c, somewhat interleaved. Would be interesting if you could show what you think is going wrong in there compared to amdgpu_mn.c.
i915 implements only one callback:
static const struct mmu_notifier_ops i915_gem_userptr_notifier = { .invalidate_range_start = i915_gem_userptr_mn_invalidate_range_start, };
For correct operation you always need to implement invalidate_range_end as well and add some lock/completion work Otherwise get_user_pages() can again grab the reference to the wrong page.
Is this really a problem? I figured that if a mmu_notifier invalidation is going on, a get_user_pages on that mm from anywhere else (whether i915 or anyone really) will serialize with the ongoing invalidate? If that's not the case, then really any get_user_pages is racy, including all the DIRECT_IO ones.
The next problem seems to be that cancel_userptr() doesn't prevent any new command submission. E.g.
i915_gem_object_wait(obj, I915_WAIT_ALL, MAX_SCHEDULE_TIMEOUT, NULL);
What prevents new command submissions to use the GEM object directly after you finished waiting here?
I get a feeling we're talking past each another here.
Yeah, agree. Additional to that I don't know the i915 code very well.
Can you perhaps explain what exactly the race is you're seeing? The i915 userptr code is fairly convoluted and pushes a lot of stuff to workers (but then syncs with those workers again later on), so easily possible you've overlooked one of these lines that might guarantee already what you think needs to be guaranteed. We're definitely not aiming to allow userspace to allow writing to random pages all over.
You not read/write to random pages, there still is a reference to the page. So that the page can't be reused until you are done.
The problem is rather that you can't guarantee that you write to the page which is mapped into the process at that location. E.g. the CPU and the GPU might see two different things.
Leaking the IOMMU mappings otoh means rogue userspace could do a bunch of stray writes (I don't see anywhere code in amdgpu_mn.c to unmap at least the gpu side PTEs to make stuff inaccessible) and wreak the core kernel's book-keeping.
In i915 we guarantee that we call set_page_dirty/mark_page_accessed only after all the mappings are really gone (both GPU PTEs and sg mapping), guaranteeing that any stray writes from either the GPU or IOMMU will result in faults (except bugs in the IOMMU, but can't have it all, "IOMMU actually works" is an assumption behind device isolation).
Well exactly that's the point, the handling in i915 looks incorrect to me. You need to call set_page_dirty/mark_page_accessed way before the mapping is destroyed.
To be more precise for userptrs it must be called from the invalidate_range_start, but i915 seems to delegate everything into a background worker to avoid the locking problems.
Yeah, and at the end of the function there's a flush_work to make sure the worker has caught up.
Ah, yes haven't seen that.
But then grabbing the obj->base.dev->struct_mutex lock in cancel_userptr() is rather evil. You just silenced lockdep because you offloaded that into a work item.
So no matter how you put it i915 is clearly doing something wrong here :)
tbh I'm not entirely clear on the reasons why this works, but cross-release lockdep catches these things, and it did not complain.
On a high-level we make sure that mm locks needed by get_user_pages do _not_ nest within dev->struct_mutex. We have massive back-off slowpaths to do anything that could fault outside of our own main gem locking.
That was (at least in the past) a major difference with amdgpu, which essentially has none of these paths. That would trivially deadlock with your own gem mmap fault handler, so you had (maybe that changed) a dumb retry loop, which did shut up lockdep but didn't fix any of the locking inversions.
So yeah, grabbing dev->struct_mutex is in principle totally fine while holding all kinds of struct mm/vma locks. I'm not entirely clear why we punt the actual unmapping to the worker though, maybe simply to not have a constrained stack.
I know. i915 gem has tons of fallbacks and retry loops (we restart the entire CS if needed), and i915 userptr pushes the entire get_user_pages dance off into a worker if the fastpath doesn't succeed and we run out of memory or hit contended locks. We also have obscene amounts of __GFP_NORETRY and __GFP_NOWARN all over the place to make sure the core mm code doesn't do something we don't want it do to do in the fastpaths (because there's really no point in spending lots of time trying to make memory available if we have a slowpath fallback with much less constraints).
Well I haven't audited the code, but I'm pretty sure that just mitigates the problem and silenced lockdep instead of really fixing the issue.
This is re: your statement that you can't unmap sg tables from the shrinker. We can, because we've actually untangled the locking dependencies so that you can fully operate on gem objects from within mm/vma locks. Maybe code has changed, but last time I looked at radeon/ttm a while back that was totally not the case, and if you don't do all this work then yes you'll deadlock.
Doesn't mean it's impossible, because we've done it :-)
We're also not limiting ourselves to GFP_NOIO, but instead have a recursion detection&handling in our own shrinker callback to avoid these deadlocks.
Which if you ask me is absolutely horrible. I mean the comment in linux/mutex.h sums it up pretty well:
* This function should not be used, _ever_. It is purely for hysterical GEM * raisins, and once those are gone this will be removed.
Well, it actually gets the job done. We'd need to at least get to per-object locking, and probably even then we'd need to rewrite the code a lot. But please note that this here is only to avoid the GFP_NOIO constraint, all the other bits I clarified around why we don't actually have circular locking (because the entire hierarchy is inverted for us) still hold even if you would only trylock here.
Aside: Given that yesterday a bunch of folks complained on #dri-devel that amdgpu prematurely OOMs compared to i915, and that we've switched from a simple trylock to this nastiness to be able to recover from more low memory situation it's maybe not such a silly idea. Horrible, but not silly because actually necessary. -Daniel
On 22.03.2018 at 08:14, Daniel Vetter wrote:
On Wed, Mar 21, 2018 at 10:34:05AM +0100, Christian König wrote:
On 21.03.2018 at 09:18, Daniel Vetter wrote:
[SNIP]
For correct operation you always need to implement invalidate_range_end as well and add some lock/completion work Otherwise get_user_pages() can again grab the reference to the wrong page.
Is this really a problem?
Yes, and quite a big one.
I figured that if a mmu_notifier invalidation is going on, a get_user_pages on that mm from anywhere else (whether i915 or anyone really) will serialize with the ongoing invalidate?
No, that isn't correct. Jerome can probably explain that better than I can.
If that's not the case, then really any get_user_pages is racy, including all the DIRECT_IO ones.
The key point here is that get_user_pages() grabs a reference to the page. So what you get is a bunch of pages which were mapped at that location at a specific point in time.
There is no guarantee that after get_user_pages() returns you still have the same pages mapped at that location; you only guarantee that the pages are not reused for something else.
That is perfectly sufficient for a task like DIRECT_IO where you can only have block or network I/O, but unfortunately not really for GPUs, where you crunch numbers, write the results back to pages and actually count on the CPU seeing them in the right place.
[SNIP] So no matter how you put it i915 is clearly doing something wrong here :)
tbh I'm not entirely clear on the reasons why this works, but cross-release lockdep catches these things, and it did not complain. On a high-level we make sure that mm locks needed by get_user_pages do _not_ nest within dev->struct_mutex. We have massive back-off slowpaths to do anything that could fault outside of our own main gem locking.
I'm pretty sure that this doesn't work as intended and just hides the real problem.
That was (at least in the past) a major difference with amdgpu, which essentially has none of these paths. That would trivially deadlock with your own gem mmap fault handler, so you had (maybe that changed) a dumb retry loop, which did shut up lockdep but didn't fix any of the locking inversions.
Any lock you grab in an MMU callback can't even be held when you call kmalloc() or get_free_page() (without GFP_NOIO).
Even simple things like drm_vm_open() violate that by using GFP_KERNEL. So I can 100% assure you that what you do here is not correct.
So yeah, grabbing dev->struct_mutex is in principle totally fine while holding all kinds of struct mm/vma locks. I'm not entirely clear why we punt the actual unmapping to the worker though, maybe simply to not have a constrained stack.
I strongly disagree on that. As far as I can see what TTM does looks actually like the right approach to the problem.
This is re: your statement that you can't unamp sg tables from the shrinker. We can, because we've actually untangled the locking depencies so that you can fully operate on gem objects from within mm/vma locks. Maybe code has changed, but last time I looked at radeon/ttm a while back that was totally not the case, and if you don't do all this work then yes you'll deadlock.
Doen't mean it's not impossible, because we've done it :-)
And I'm pretty sure you didn't do it correctly :D
Well, it actually gets the job done. We'd need to at least get to per-object locking, and probably even then we'd need to rewrite the code a lot. But please note that this here is only to avoid the GFP_NOIO constraint, all the other bits I clarified around why we don't actually have circular locking (because the entire hierarchy is inverted for us) still hold even if you would only trylock here.
Well, you reversed your allocation and mmap_sem lock ordering, which avoids the lock inversion during page faults, but it doesn't help you at all with the MMU notifier and shrinker because then there are a lot more locks involved.
Regards, Christian.
Aside: Given that yesterday a bunch of folks complained on #dri-devel that amdgpu prematurely OOMs compared to i915, and that we've switched from a simple trylock to this nastiness to be able to recover from more low memory situation it's maybe not such a silly idea. Horrible, but not silly because actually necessary. -Daniel
On Thu, Mar 22, 2018 at 10:37:55AM +0100, Christian König wrote:
On 22.03.2018 at 08:14, Daniel Vetter wrote:
On Wed, Mar 21, 2018 at 10:34:05AM +0100, Christian König wrote:
On 21.03.2018 at 09:18, Daniel Vetter wrote:
[SNIP]
For correct operation you always need to implement invalidate_range_end as well and add some lock/completion work Otherwise get_user_pages() can again grab the reference to the wrong page.
Is this really a problem?
Yes, and quite a big one.
I figured that if a mmu_notifier invalidation is going on, a get_user_pages on that mm from anywhere else (whether i915 or anyone really) will serialize with the ongoing invalidate?
No, that isn't correct. Jerome can probably better explain that than I do.
If that's not the case, then really any get_user_pages is racy, including all the DIRECT_IO ones.
The key point here is that get_user_pages() grabs a reference to the page. So what you get is a bunch of pages which where mapped at that location at a specific point in time.
There is no guarantee that after get_user_pages() return you still have the same pages mapped at that point, you only guarantee that the pages are not reused for something else.
That is perfectly sufficient for a task like DIRECT_IO where you can only have block or network I/O, but unfortunately not really for GPUs where you crunch of results, write them back to pages and actually count on that the CPU sees the result in the right place.
Hm ok, I'll chat with Jerome about this. I thought we have epic amounts of userptr tests, including thrashing the mappings vs gpu activity, so I'm somewhat surprised that this hasn't blown up yet.
[SNIP] So no matter how you put it i915 is clearly doing something wrong here :)
tbh I'm not entirely clear on the reasons why this works, but cross-release lockdep catches these things, and it did not complain. On a high-level we make sure that mm locks needed by get_user_pages do _not_ nest within dev->struct_mutex. We have massive back-off slowpaths to do anything that could fault outside of our own main gem locking.
I'm pretty sure that this doesn't work as intended and just hides the real problem.
That was (at least in the past) a major difference with amdgpu, which essentially has none of these paths. That would trivially deadlock with your own gem mmap fault handler, so you had (maybe that changed) a dumb retry loop, which did shut up lockdep but didn't fix any of the locking inversions.
Any lock you grab in an MMU callback can't be even held when you call kmalloc() or get_free_page() (without GFP_NOIO).
Even simple things like drm_vm_open() violate that by using GFP_KERNEL. So I can 100% ensure you that what you do here is not correct.
drm_vm_open isn't used by modern drivers anymore. We have validated the locking with the cross-release stuff for a few weeks, and it didn't catch anything. So I'm not worried that the locking is busted; only the mmu notifier vs. get_user_pages races concern me.
So yeah, grabbing dev->struct_mutex is in principle totally fine while holding all kinds of struct mm/vma locks. I'm not entirely clear why we punt the actual unmapping to the worker though, maybe simply to not have a constrained stack.
I strongly disagree on that. As far as I can see what TTM does looks actually like the right approach to the problem.
This is re: your statement that you can't unamp sg tables from the shrinker. We can, because we've actually untangled the locking depencies so that you can fully operate on gem objects from within mm/vma locks. Maybe code has changed, but last time I looked at radeon/ttm a while back that was totally not the case, and if you don't do all this work then yes you'll deadlock.
Doesn't mean it's impossible, because we've done it :-)
And I'm pretty sure you didn't do it correctly :D
Well, it actually gets the job done. We'd need to at least get to per-object locking, and probably even then we'd need to rewrite the code a lot. But please note that this here is only to avoid the GFP_NOIO constraint, all the other bits I clarified around why we don't actually have circular locking (because the entire hierarchy is inverted for us) still hold even if you would only trylock here.
Well, you reversed your allocation and mmap_sem lock ordering, which avoids the lock inversion during page faults, but it doesn't help you at all with the MMU notifier and shrinker because there are a lot more locks involved there.
I think we're handling all those locks correctly too. At least the drm_vm_open one is not the locking inversion you're looking for (since we fixed that, iirc even as part of userptr). Which one is it then?
And I disagree that leaking IOMMU mappings around just because we can't untangle the locking rules is good engineering. Those IOMMU mappings need to go, and KVM can pull it off. So should we. -Daniel
Regards, Christian.
Aside: Given that yesterday a bunch of folks complained on #dri-devel that amdgpu prematurely OOMs compared to i915, and that we've switched from a simple trylock to this nastiness to be able to recover from more low memory situations, it's maybe not such a silly idea. Horrible, but not silly, because actually necessary. -Daniel
On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
[SNIP] For the in-driver reservation path (CS) having a slow-path that grabs a temporary reference, drops the vram lock and then locks the reservation normally (using the acquire context used already for the entire CS) is a bit tricky, but totally feasible. Ttm doesn't do that though.
That is exactly what we do in amdgpu as well, it's just not very efficient nor reliable to retry getting the right pages for a submission over and over again.
Out of curiosity, where's that code? I did read the ttm eviction code way back, and that one definitely didn't do that. Would be interesting to update my understanding. -Daniel
Am 21.03.2018 um 09:28 schrieb Daniel Vetter:
On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
[SNIP] For the in-driver reservation path (CS) having a slow-path that grabs a temporary reference, drops the vram lock and then locks the reservation normally (using the acquire context used already for the entire CS) is a bit tricky, but totally feasible. Ttm doesn't do that though.
That is exactly what we do in amdgpu as well, it's just not very efficient nor reliable to retry getting the right pages for a submission over and over again.
Out of curiosity, where's that code? I did read the ttm eviction code way back, and that one definitely didn't do that. Would be interesting to update my understanding.
That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance with grabbing, releasing and regrabbing locks in a loop.
Then in amdgpu_cs_submit() we grab a lock preventing page table updates and check if all pages are still the ones we want to have:

    amdgpu_mn_lock(p->mn);
    if (p->bo_list) {
        for (i = p->bo_list->first_userptr;
             i < p->bo_list->num_entries; ++i) {
            struct amdgpu_bo *bo = p->bo_list->array[i].robj;

            if (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) {
                amdgpu_mn_unlock(p->mn);
                return -ERESTARTSYS;
            }
        }
    }
If anything changed on the page tables we restart the whole IOCTL using -ERESTARTSYS and try again.
Regards, Christian.
-Daniel
On Wed, Mar 21, 2018 at 12:54:20PM +0100, Christian König wrote:
Am 21.03.2018 um 09:28 schrieb Daniel Vetter:
On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
[SNIP] For the in-driver reservation path (CS) having a slow-path that grabs a temporary reference, drops the vram lock and then locks the reservation normally (using the acquire context used already for the entire CS) is a bit tricky, but totally feasible. Ttm doesn't do that though.
That is exactly what we do in amdgpu as well, it's just not very efficient nor reliable to retry getting the right pages for a submission over and over again.
Out of curiosity, where's that code? I did read the ttm eviction code way back, and that one definitely didn't do that. Would be interesting to update my understanding.
That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance with grabbing, releasing and regrabbing locks in a loop.
Then in amdgpu_cs_submit() we grab a lock preventing page table updates and check if all pages are still the ones we want to have:

    amdgpu_mn_lock(p->mn);
    if (p->bo_list) {
        for (i = p->bo_list->first_userptr;
             i < p->bo_list->num_entries; ++i) {
            struct amdgpu_bo *bo = p->bo_list->array[i].robj;

            if (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) {
                amdgpu_mn_unlock(p->mn);
                return -ERESTARTSYS;
            }
        }
    }
If anything changed on the page tables we restart the whole IOCTL using -ERESTARTSYS and try again.
I'm not talking about userptr here, but general bo eviction. Sorry for the confusion.
The reason I'm dragging all the general bo management into this discussion is because we do seem to have a fairly fundamental difference in how that's done, with resulting consequences for the locking hierarchy.
And if this invalidate_mapping stuff should work, together with userptr and everything else, I think we're required to agree on how this is all supposed to nest, and how exactly we should back off for the other side that needs to break the locking circle.
That aside, I don't entirely understand why you need to restart so much. I figured that get_user_pages is ordered correctly against mmu invalidations, but I get the impression you think that's not the case. How does that happen? -Daniel
Am 22.03.2018 um 08:18 schrieb Daniel Vetter:
On Wed, Mar 21, 2018 at 12:54:20PM +0100, Christian König wrote:
Am 21.03.2018 um 09:28 schrieb Daniel Vetter:
On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
[SNIP] For the in-driver reservation path (CS) having a slow-path that grabs a temporary reference, drops the vram lock and then locks the reservation normally (using the acquire context used already for the entire CS) is a bit tricky, but totally feasible. Ttm doesn't do that though.
That is exactly what we do in amdgpu as well, it's just not very efficient nor reliable to retry getting the right pages for a submission over and over again.
Out of curiosity, where's that code? I did read the ttm eviction code way back, and that one definitely didn't do that. Would be interesting to update my understanding.
That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance with grabbing, releasing and regrabbing locks in a loop.
Then in amdgpu_cs_submit() we grab a lock preventing page table updates and check if all pages are still the ones we want to have:

    amdgpu_mn_lock(p->mn);
    if (p->bo_list) {
        for (i = p->bo_list->first_userptr;
             i < p->bo_list->num_entries; ++i) {
            struct amdgpu_bo *bo = p->bo_list->array[i].robj;

            if (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) {
                amdgpu_mn_unlock(p->mn);
                return -ERESTARTSYS;
            }
        }
    }
If anything changed on the page tables we restart the whole IOCTL using -ERESTARTSYS and try again.
I'm not talking about userptr here, but general bo eviction. Sorry for the confusion.
The reason I'm dragging all the general bo management into this discussion is because we do seem to have a fairly fundamental difference in how that's done, with resulting consequences for the locking hierarchy.
And if this invalidate_mapping stuff should work, together with userptr and everything else, I think we're required to agree on how this is all supposed to nest, and how exactly we should back off for the other side that needs to break the locking circle.
That aside, I don't entirely understand why you need to restart so much. I figured that get_user_pages is ordered correctly against mmu invalidations, but I get the impression you think that's not the case. How does that happen?
Correct. I've had the same assumption, but both Jerome as well as our internal tests proved me wrong on that.
The key takeaway from that was that you can't take any locks in either the MMU notifier or the shrinker that you also take while calling kmalloc() (or, simpler speaking, get_user_pages()).
Additionally, in the MMU or shrinker callback all kinds of different locks might be held, so you basically can't assume that you can do things like recursive page table walks or call dma_unmap on anything.
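To spell the inversion out (an illustrative sketch only, with a hypothetical driver lock; nothing here is from amdgpu or i915):

    #include <linux/mutex.h>
    #include <linux/slab.h>

    static DEFINE_MUTEX(mydev_lock);	/* hypothetical per-device lock */

    /* Path A: command submission ioctl */
    static int mydev_submit(size_t size)
    {
    	void *tmp;

    	mutex_lock(&mydev_lock);
    	tmp = kmalloc(size, GFP_KERNEL);	/* can enter direct reclaim */
    	/* ... build and submit the job ... */
    	kfree(tmp);
    	mutex_unlock(&mydev_lock);
    	return 0;
    }

    /* Path B: shrinker or mmu_notifier callback, possibly reached from that kmalloc() */
    static void mydev_shrink_or_invalidate(void)
    {
    	mutex_lock(&mydev_lock);	/* A -> reclaim -> B: deadlock */
    	/* ... tear down mappings ... */
    	mutex_unlock(&mydev_lock);
    }

Using GFP_NOIO (or grabbing everything before taking the lock) is what breaks this cycle, which is exactly the constraint being described.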
Thinking a moment about that it actually seems to make perfect sense.
So it doesn't matter what ordering you established between mmap_sem and your buffer or allocation lock, it will simply be incorrect with respect to other locks in the system anyway.
The only solution to that problem we have found is the dance we have in amdgpu now:
1. Inside invalidate_range_start you grab a recursive read side lock.
2. Wait for all GPU operations on those pages to finish.
3. Then mark all pages as dirty and accessed.
4. Inside invalidate_range_end you release your recursive read side lock again.
Then during command submission you do the following:
1. Take the locks and figure out all the userptr BOs which need new pages.
2. Drop the locks again and call get_user_pages().
3. Install the new page arrays and reiterate with #1.
4. As soon as everybody has pages, update your page tables and prepare the command submission.
5. Take the write side of our recursive lock.
6. Check if all pages are still valid; if not, restart the whole IOCTL.
7. Submit the job to the hardware.
8. Drop the write side of our recursive lock.
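Roughly, in code, the two sides of that dance look like this (a condensed sketch, not the actual amdgpu implementation; the plain rwsem and all mydev_* helpers are hypothetical stand-ins, amdgpu uses its own recursive variant):

    static DECLARE_RWSEM(mydev_notifier_lock);

    /* MMU notifier side: steps 1-4 above */
    static void mydev_invalidate_range_start(struct mmu_notifier *mn,
    					     struct mm_struct *mm,
    					     unsigned long start,
    					     unsigned long end)
    {
    	down_read(&mydev_notifier_lock);	/* held until _range_end */
    	mydev_wait_for_gpu_access(start, end);	/* wait for fences on affected BOs */
    	mydev_mark_pages_dirty_and_accessed(start, end);
    }

    static void mydev_invalidate_range_end(struct mmu_notifier *mn,
    					   struct mm_struct *mm,
    					   unsigned long start,
    					   unsigned long end)
    {
    	up_read(&mydev_notifier_lock);
    }

    /* Command submission side: steps 1-8 above */
    static int mydev_cs_ioctl(void)
    {
    retry:
    	mydev_collect_userptr_bos();	/* take the locks, find BOs needing pages */
    	mydev_get_user_pages();		/* locks dropped around get_user_pages() */
    	/* ... update page tables, prepare the submission ... */

    	down_write(&mydev_notifier_lock);
    	if (!mydev_pages_still_valid()) {
    		up_write(&mydev_notifier_lock);
    		goto retry;		/* amdgpu instead returns -ERESTARTSYS */
    	}
    	mydev_submit_to_hw();
    	up_write(&mydev_notifier_lock);
    	return 0;
    }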
Regards, Christian.
-Daniel
On Thu, Mar 22, 2018 at 10:58:55AM +0100, Christian König wrote:
Am 22.03.2018 um 08:18 schrieb Daniel Vetter:
On Wed, Mar 21, 2018 at 12:54:20PM +0100, Christian König wrote:
Am 21.03.2018 um 09:28 schrieb Daniel Vetter:
On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
[SNIP] For the in-driver reservation path (CS) having a slow-path that grabs a temporary reference, drops the vram lock and then locks the reservation normally (using the acquire context used already for the entire CS) is a bit tricky, but totally feasible. Ttm doesn't do that though.
That is exactly what we do in amdgpu as well, it's just not very efficient nor reliable to retry getting the right pages for a submission over and over again.
Out of curiosity, where's that code? I did read the ttm eviction code way back, and that one definitely didn't do that. Would be interesting to update my understanding.
That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance with grabbing, releasing and regrabbing locks in a loop.
Then in amdgpu_cs_submit() we grab a lock preventing page table updates and check if all pages are still the ones we want to have:

    amdgpu_mn_lock(p->mn);
    if (p->bo_list) {
        for (i = p->bo_list->first_userptr;
             i < p->bo_list->num_entries; ++i) {
            struct amdgpu_bo *bo = p->bo_list->array[i].robj;

            if (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) {
                amdgpu_mn_unlock(p->mn);
                return -ERESTARTSYS;
            }
        }
    }
If anything changed on the page tables we restart the whole IOCTL using -ERESTARTSYS and try again.
I'm not talking about userptr here, but general bo eviction. Sorry for the confusion.
The reason I'm dragging all the general bo management into this discussion is because we do seem to have a fairly fundamental difference in how that's done, with resulting consequences for the locking hierarchy.
And if this invalidate_mapping stuff should work, together with userptr and everything else, I think we're required to agree on how this is all supposed to nest, and how exactly we should back off for the other side that needs to break the locking circle.
That aside, I don't entirely understand why you need to restart so much. I figured that get_user_pages is ordered correctly against mmu invalidations, but I get the impression you think that's not the case. How does that happen?
Correct. I've had the same assumption, but both Jerome as well as our internal tests proved me wrong on that.
The key takeaway from that was that you can't take any locks in either the MMU notifier or the shrinker that you also take while calling kmalloc() (or, simpler speaking, get_user_pages()).
Additionally, in the MMU or shrinker callback all kinds of different locks might be held, so you basically can't assume that you can do things like recursive page table walks or call dma_unmap on anything.
That sounds like a design bug in mmu_notifiers, since it would render them useless for KVM. And they were developed for that originally. I think I'll chat with Jerome to understand this, since it's all rather confusing.
Thinking a moment about that, it actually seems to make perfect sense. So it doesn't matter what ordering you established between mmap_sem and your buffer or allocation lock, it will simply be incorrect with respect to other locks in the system anyway.
Hm, doesn't make sense to me, at least from a locking inversion pov. I thought the only locks that are definitely part of the mmu_notifier callback contexts are the mm locks. We're definitely fine with those, and kmalloc didn't bite us yet. Direct reclaim is really limited in what it can accomplish for good reasons.
So the only thing I'm worried about here is that we have a race between the invalidate and the gup calls, and I think fixing that race will make the locking more fun. But again, if that means you can't dma_unmap stuff then that feels like mmu notifiers are broken by design. -Daniel
The only solution to that problem we have found is the dance we have in amdgpu now:
1. Inside invalidate_range_start you grab a recursive read side lock.
2. Wait for all GPU operations on those pages to finish.
3. Then mark all pages as dirty and accessed.
4. Inside invalidate_range_end you release your recursive read side lock again.
Then during command submission you do the following:
1. Take the locks and figure out all the userptr BOs which need new pages.
2. Drop the locks again and call get_user_pages().
3. Install the new page arrays and reiterate with #1.
4. As soon as everybody has pages, update your page tables and prepare the command submission.
5. Take the write side of our recursive lock.
6. Check if all pages are still valid; if not, restart the whole IOCTL.
7. Submit the job to the hardware.
8. Drop the write side of our recursive lock.
Regards, Christian.
-Daniel
On Mon, Mar 26, 2018 at 10:01:21AM +0200, Daniel Vetter wrote:
On Thu, Mar 22, 2018 at 10:58:55AM +0100, Christian König wrote:
Am 22.03.2018 um 08:18 schrieb Daniel Vetter:
On Wed, Mar 21, 2018 at 12:54:20PM +0100, Christian König wrote:
Am 21.03.2018 um 09:28 schrieb Daniel Vetter:
On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
[SNIP] For the in-driver reservation path (CS) having a slow-path that grabs a temporary reference, drops the vram lock and then locks the reservation normally (using the acquire context used already for the entire CS) is a bit tricky, but totally feasible. Ttm doesn't do that though.
That is exactly what we do in amdgpu as well, it's just not very efficient nor reliable to retry getting the right pages for a submission over and over again.
Out of curiosity, where's that code? I did read the ttm eviction code way back, and that one definitely didn't do that. Would be interesting to update my understanding.
That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance with grabbing, releasing and regrabbing locks in a loop.
Then in amdgpu_cs_submit() we grab a lock preventing page table updates and check if all pages are still the ones we want to have:

    amdgpu_mn_lock(p->mn);
    if (p->bo_list) {
        for (i = p->bo_list->first_userptr;
             i < p->bo_list->num_entries; ++i) {
            struct amdgpu_bo *bo = p->bo_list->array[i].robj;

            if (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) {
                amdgpu_mn_unlock(p->mn);
                return -ERESTARTSYS;
            }
        }
    }
If anything changed on the page tables we restart the whole IOCTL using -ERESTARTSYS and try again.
I'm not talking about userptr here, but general bo eviction. Sorry for the confusion.
The reason I'm dragging all the general bo management into this discussion is because we do seem to have a fairly fundamental difference in how that's done, with resulting consequences for the locking hierarchy.
And if this invalidate_mapping stuff should work, together with userptr and everything else, I think we're required to agree on how this is all supposed to nest, and how exactly we should back off for the other side that needs to break the locking circle.
That aside, I don't entirely understand why you need to restart so much. I figured that get_user_pages is ordered correctly against mmu invalidations, but I get the impression you think that's not the case. How does that happen?
Correct. I've had the same assumption, but both Jerome as well as our internal tests proved me wrong on that.
The key takeaway from that was that you can't take any locks in either the MMU notifier or the shrinker that you also take while calling kmalloc() (or, simpler speaking, get_user_pages()).
Additionally, in the MMU or shrinker callback all kinds of different locks might be held, so you basically can't assume that you can do things like recursive page table walks or call dma_unmap on anything.
That sounds like a design bug in mmu_notifiers, since it would render them useless for KVM. And they were developed for that originally. I think I'll chat with Jerome to understand this, since it's all rather confusing.
Doing dma_unmap() during the mmu_notifier callback should be fine, it was the last time I checked. However there is no formal contract that it is ok to do so. But I want to revisit the DMA API towards something where the device driver gets control rather than delegating. The high level ideas are:
- the device driver allocates DMA space, i.e. a range of physical bus addresses
- the device driver completely manages that range in whichever way it wants, under its own locking
- the DMA API provides a call to flush the tlb/cache for a range of physical bus addresses, with the explicit contract that it may only take short-lived spinlocks to do so and may not allocate or free memory while doing so (i.e. it should only be a bunch of register writes)
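Nothing like this exists yet; purely as a hypothetical sketch of the interface shape being proposed (all names invented):

    /* Driver-owned window of bus addresses, managed entirely by the driver. */
    struct dma_bus_range;

    struct dma_bus_range *dma_bus_range_alloc(struct device *dev, size_t size);
    void dma_bus_range_free(struct dma_bus_range *range);

    /*
     * Flush IOTLB/caches for part of the window. Contract: may only take
     * short-lived spinlocks and never allocates or frees memory, so it is
     * safe to call from mmu_notifier and shrinker callbacks.
     */
    void dma_bus_range_flush(struct dma_bus_range *range,
    			     dma_addr_t offset, size_t size);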
I haven't had time to prototype this, but I floated the idea a while back and I am hoping to discuss it some more sooner rather than later (LSF/MM is coming soon).
So from a safety point of view I would argue that you must be doing dma_unmap() from the invalidate_range_start() callback. But it all depends on how confident you are in your hardware MMU and the IOMMU (IOMMUs usually have a small cache, so unless there is a lot of PCIe traffic, all pending PCIe transactions should post quicker than it takes the CPU to finish your invalidate_range_start() function).
So for now I would say it is OK to not dma_unmap().
Thinking a moment about that, it actually seems to make perfect sense. So it doesn't matter what ordering you established between mmap_sem and your buffer or allocation lock, it will simply be incorrect with respect to other locks in the system anyway.
Hm, doesn't make sense to me, at least from a locking inversion pov. I thought the only locks that are definitely part of the mmu_notifier callback contexts are the mm locks. We're definitely fine with those, and kmalloc didn't bite us yet. Direct reclaim is really limited in what it can accomplish for good reasons.
So the only thing I'm worried about here is that we have a race between the invalidate and the gup calls, and I think fixing that race will make the locking more fun. But again, if that means you can't dma_unmap stuff then that feels like mmu notifiers are broken by design. -Daniel
The only solution to that problem we have found is the dance we have in amdgpu now:
1. Inside invalidate_range_start you grab a recursive read side lock.
2. Wait for all GPU operations on those pages to finish.
3. Then mark all pages as dirty and accessed.
4. Inside invalidate_range_end you release your recursive read side lock again.
Then during command submission you do the following:
1. Take the locks and figure out all the userptr BOs which need new pages.
2. Drop the locks again and call get_user_pages().
3. Install the new page arrays and reiterate with #1.
4. As soon as everybody has pages, update your page tables and prepare the command submission.
5. Take the write side of our recursive lock.
6. Check if all pages are still valid; if not, restart the whole IOCTL.
7. Submit the job to the hardware.
8. Drop the write side of our recursive lock.
A slightly better solution is using an atomic counter:

    driver_range_start() {
        atomic_inc(&mydev->notifier_count);
        lock(mydev->bo_uptr); // protecting bo list/tree
        for_each_uptr_bo() {
            invalidate_bo(bo, start, end);
        }
        unlock(mydev->bo_uptr);
    }

    driver_range_end() {
        atomic_inc(&mydev->notifier_seq);
        atomic_dec(&mydev->notifier_count);
    }

    driver_bo_gup_pages(bo, mydev) {
        unsigned curseq;

        do {
            wait_on_mmu_notifier_count_to_zero();
            curseq = atomic_read(&mydev->notifier_seq);
            for (i = 0; i < npages; i++) {
                pages[i] = gup(addr + (i << PAGE_SHIFT));
            }
            lock(mydev->bo_uptr); // protecting bo list/tree
            if (curseq == atomic_read(&mydev->notifier_seq) &&
                !atomic_read(mydev->notifier_count)) {
                add_uptr_bo(bo, pages);
                bo->seq = curseq;
                unlock(mydev->bo_uptr);
                return 0;
            }
            unlock(mydev->bo_uptr);
            put_all_pages(pages);
        } while (1);
    }
Maybe use KVM as a real example of how to do this (see virt/kvm/kvm_main.c)
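For reference, the KVM pattern being pointed to looks roughly like this (heavily condensed and paraphrased from virt/kvm/kvm_main.c and the x86 MMU; SRCU, TLB flushing and error handling omitted):

    static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
    						struct mm_struct *mm,
    						unsigned long start,
    						unsigned long end)
    {
    	struct kvm *kvm = mmu_notifier_to_kvm(mn);

    	spin_lock(&kvm->mmu_lock);
    	kvm->mmu_notifier_count++;
    	kvm_unmap_hva_range(kvm, start, end);
    	spin_unlock(&kvm->mmu_lock);
    }

    static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
    						struct mm_struct *mm,
    						unsigned long start,
    						unsigned long end)
    {
    	struct kvm *kvm = mmu_notifier_to_kvm(mn);

    	spin_lock(&kvm->mmu_lock);
    	kvm->mmu_notifier_seq++;
    	kvm->mmu_notifier_count--;
    	spin_unlock(&kvm->mmu_lock);
    }

    /* In the page fault path: sample the seq, do the sleeping work, recheck. */
    	mmu_seq = kvm->mmu_notifier_seq;
    	smp_rmb();
    	pfn = gfn_to_pfn(kvm, gfn);	/* may sleep, uses get_user_pages() */

    	spin_lock(&kvm->mmu_lock);
    	if (mmu_notifier_retry(kvm, mmu_seq))	/* count != 0 or seq changed */
    		goto retry;
    	/* ... install the mapping ... */
    	spin_unlock(&kvm->mmu_lock);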
Note however that my endgame is a bit simpler: convert amd and intel to use the HMM CPU page table snapshot tool (and no, you do not need page faults for that, you can use that feature independently).
So it would be:

    driver_bo_hmm_invalidate() {
        lock(mydev->bo_uptr); // protecting bo list/tree
        for_each_uptr_bo() {
            invalidate_bo(bo, start, end);
        }
        unlock(mydev->bo_uptr);
    }

    driver_bo_gup_pages(bo, mydev) {
        do {
            hmm_vma_fault(range, bo->start, bo->end, bo->pfns);
            lock(mydev->bo_uptr);
            if (hmm_vma_range_done(range)) {
                add_uptr_bo(bo, pfns);
            }
            unlock(mydev->bo_uptr);
        } while (1);
    }
I would like to see drivers using the same code, as it means one place to fix issues. Doing the above conversion for the amd or radeon kernel driver has been on my TODO list for a long time. I am pushing it up my todo list; hopefully in the next few weeks I can send an RFC so people can have a real sense of how it can look.
Cheers, Jérôme
Am 26.03.2018 um 17:42 schrieb Jerome Glisse:
On Mon, Mar 26, 2018 at 10:01:21AM +0200, Daniel Vetter wrote:
On Thu, Mar 22, 2018 at 10:58:55AM +0100, Christian König wrote:
Am 22.03.2018 um 08:18 schrieb Daniel Vetter: [SNIP] The key takeaway from that was that you can't take any locks in either the MMU notifier or the shrinker that you also take while calling kmalloc() (or, simpler speaking, get_user_pages()).
Additionally, in the MMU or shrinker callback all kinds of different locks might be held, so you basically can't assume that you can do things like recursive page table walks or call dma_unmap on anything.
That sounds like a design bug in mmu_notifiers, since it would render them useless for KVM. And they were developed for that originally. I think I'll chat with Jerome to understand this, since it's all rather confusing.
Doing dma_unmap() during mmu_notifier callback should be fine, it was last time i check. However there is no formal contract that it is ok to do so.
As I said before dma_unmap() isn't the real problem here.
The issue is more that you can't take a lock in the MMU notifier which you would also take while allocating memory without GFP_NOIO.
That makes it rather tricky to do any command submission, e.g. you need to grab all the pages/memory/resources beforehand, then make sure that you don't have an MMU notifier running concurrently, and only then do the submission.
If any of the prerequisites isn't fulfilled we need to restart the operation.
[SNIP] A slightly better solution is using atomic counter: driver_range_start() { atomic_inc(&mydev->notifier_count);
...
Yeah, that is exactly what amdgpu is doing now. Sorry if my description didn't make that clear.
I would like to see drivers using the same code, as it means one place to fix issues. Doing the above conversion for the amd or radeon kernel driver has been on my TODO list for a long time. I am pushing it up my todo list; hopefully in the next few weeks I can send an RFC so people can have a real sense of how it can look.
Certainly a good idea, but I think we might want to keep that separate from HMM.
TTM really suffered from feature overload, e.g. trying to do everything in a single subsystem. And it would be rather nice to have coherent userptr handling for GPUs as a separate feature.
Regards, Christian.
On Tue, Mar 27, 2018 at 09:35:17AM +0200, Christian König wrote:
Am 26.03.2018 um 17:42 schrieb Jerome Glisse:
On Mon, Mar 26, 2018 at 10:01:21AM +0200, Daniel Vetter wrote:
On Thu, Mar 22, 2018 at 10:58:55AM +0100, Christian König wrote:
Am 22.03.2018 um 08:18 schrieb Daniel Vetter: [SNIP] The key takeaway from that was that you can't take any locks in either the MMU notifier or the shrinker that you also take while calling kmalloc() (or, simpler speaking, get_user_pages()).
Additionally, in the MMU or shrinker callback all kinds of different locks might be held, so you basically can't assume that you can do things like recursive page table walks or call dma_unmap on anything.
That sounds like a design bug in mmu_notifiers, since it would render them useless for KVM. And they were developed for that originally. I think I'll chat with Jerome to understand this, since it's all rather confusing.
Doing dma_unmap() during mmu_notifier callback should be fine, it was last time i check. However there is no formal contract that it is ok to do so.
As I said before dma_unmap() isn't the real problem here.
The issue is more that you can't take a lock in the MMU notifier which you would also take while allocating memory without GFP_NOIO.
That makes it rather tricky to do any command submission, e.g. you need to grab all the pages/memory/resources beforehand, then make sure that you don't have an MMU notifier running concurrently, and only then do the submission.
If any of the prerequisites isn't fulfilled we need to restart the operation.
Yeah, we're hitting all that epic amount of fun now, after a chat with Jerome yesterday. I guess we'll figure out what we're coming up with.
[SNIP] A slightly better solution is using atomic counter: driver_range_start() { atomic_inc(&mydev->notifier_count);
...
Yeah, that is exactly what amdgpu is doing now. Sorry if my description didn't make that clear.
I would like to see drivers using the same code, as it means one place to fix issues. Doing the above conversion for the amd or radeon kernel driver has been on my TODO list for a long time. I am pushing it up my todo list; hopefully in the next few weeks I can send an RFC so people can have a real sense of how it can look.
Certainly a good idea, but I think we might want to keep that separate from HMM.
TTM really suffered from feature overload, e.g. trying to do everything in a single subsystem. And it would be rather nice to have coherent userptr handling for GPUs as a separate feature.
TTM suffered from being a midlayer imo, not from doing too much. HMM is apparently structured like a toolbox (despite its documentation claiming otherwise), so you can pick&choose freely. -Daniel
Am 27.03.2018 um 09:53 schrieb Daniel Vetter:
[SNIP]
[SNIP] A slightly better solution is using atomic counter: driver_range_start() { atomic_inc(&mydev->notifier_count);
...
Yeah, that is exactly what amdgpu is doing now. Sorry if my description didn't make that clear.
I would like to see drivers using the same code, as it means one place to fix issues. Doing the above conversion for the amd or radeon kernel driver has been on my TODO list for a long time. I am pushing it up my todo list; hopefully in the next few weeks I can send an RFC so people can have a real sense of how it can look.
Certainly a good idea, but I think we might want to keep that separate from HMM.
TTM really suffered from feature overload, e.g. trying to do everything in a single subsystem. And it would be rather nice to have coherent userptr handling for GPUs as a separate feature.
TTM suffered from being a midlayer imo, not from doing too much.
Yeah, completely agree.
midlayers work as long as you concentrate on doing exactly one thing in your midlayer. E.g. in the case of TTM the callback for BO move handling is well justified.
Only all the stuff around it, like the address space handling etc., is really badly designed and should be separated (which is exactly what DRM MM did, but TTM still uses it in the wrong way).
HMM is apparently structured like a toolbox (despite its documentation claiming otherwise), so you can pick&choose freely.
That sounds good, but I would still have a better feeling if userptr handling would be separated. That avoids mangling things together again.
Christian.
-Daniel
On Tue, Mar 27, 2018 at 10:06:04AM +0200, Christian König wrote:
Am 27.03.2018 um 09:53 schrieb Daniel Vetter:
[SNIP]
[SNIP] A slightly better solution is using atomic counter: driver_range_start() { atomic_inc(&mydev->notifier_count);
...
Yeah, that is exactly what amdgpu is doing now. Sorry if my description didn't make that clear.
I would like to see drivers using the same code, as it means one place to fix issues. Doing the above conversion for the amd or radeon kernel driver has been on my TODO list for a long time. I am pushing it up my todo list; hopefully in the next few weeks I can send an RFC so people can have a real sense of how it can look.
Certainly a good idea, but I think we might want to keep that separate from HMM.
TTM really suffered from feature overload, e.g. trying to do everything in a single subsystem. And it would be rather nice to have coherent userptr handling for GPUs as a separate feature.
TTM suffered from being a midlayer imo, not from doing too much.
Yeah, completely agree.
midlayers work as long as you concentrate on doing exactly one thing in your midlayer. E.g. in the case of TTM the callback for BO move handling is well justified.
Only all the stuff around it, like the address space handling etc., is really badly designed and should be separated (which is exactly what DRM MM did, but TTM still uses it in the wrong way).
Yeah, the address space allocator part of ttm really is backwards and makes quick driver hacks and heuristics for better allocation schemes really hard to add. Same for tuning how/what exactly you evict.
HMM is apparently structured like a toolbox (despite its documentation claiming otherwise), so you can pick&choose freely.
That sounds good, but I would still have a better feeling if userptr handling would be separated. That avoids mangling things together again.
Jerome said he wants to do at least one prototype conversion of one of the "I can't fault" userptr implementation over to the suitable subset of HMM helpers. I guess we'll see once he's submitting the patches, but it sounded exactly like what the doctor ordered :-) -Daniel
On Fri, Mar 16, 2018 at 02:20:45PM +0100, Christian König wrote:
Each importer can now provide an invalidate_mappings callback.
This allows the exporter to provide the mappings without the need to pin the backing store.
v2: don't try to invalidate mappings when the callback is NULL, lock the reservation obj while using the attachments, add helper to set the callback
Signed-off-by: Christian König christian.koenig@amd.com
Replying here to avoid thread split, but not entirely related.
I thought some more about the lockdep splat discussion, and specifically that amdgpu needs the reservations for the vm objects when doing a gpu reset.
Since they're in the same ww_class as all other dma-buf reservations I'm pretty sure lockdep will complain, at least when cross-release lockdep and cross-release annotations for dma_fence are merged.
And as long as there's some case where amdgpu needs both the vm object reservation and other reservations (CS?) then we must have them in the same class, and in that case the deadlock is real. It'll require an impressive set of circumstances most likely (the minimal number of threads we generally needed to actually hit the cross-release stuff was 4+ or something nuts like that, all doing something else), but it'll be real.
Otoh I think the invalidate stuff here doesn't actually make this worse, so we can bang our heads against the gpu reset problem at leisure :-)
This stuff here has much more potential to interact badly with core mm paths, and anything related to that (which ime also means all the cpu hotplug machinery, which includes suspend/resume, and anything related to the vfs because someone always manages to drag sysfs into the picture).
It's going to be fun times.
Cheers, Daniel
drivers/dma-buf/dma-buf.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++ include/linux/dma-buf.h | 38 ++++++++++++++++++++++++++++++ 2 files changed, 98 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index d78d5fc173dc..ed2b3559ba25 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -572,7 +572,9 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
 		if (ret)
 			goto err_attach;
 	}
+	reservation_object_lock(dmabuf->resv, NULL);
 	list_add(&attach->node, &dmabuf->attachments);
+	reservation_object_unlock(dmabuf->resv);
 
 	mutex_unlock(&dmabuf->lock);
 	return attach;
@@ -598,7 +600,9 @@ void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach)
 		return;
 
 	mutex_lock(&dmabuf->lock);
+	reservation_object_lock(dmabuf->resv, NULL);
 	list_del(&attach->node);
+	reservation_object_unlock(dmabuf->resv);
 	if (dmabuf->ops->detach)
 		dmabuf->ops->detach(dmabuf, attach);
 
@@ -632,10 +636,23 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
 	if (WARN_ON(!attach || !attach->dmabuf))
 		return ERR_PTR(-EINVAL);
 
+	/*
+	 * Mapping a DMA-buf can trigger its invalidation, prevent sending this
+	 * event to the caller by temporary removing this attachment from the
+	 * list.
+	 */
+	if (attach->invalidate_mappings) {
+		reservation_object_assert_held(attach->dmabuf->resv);
+		list_del(&attach->node);
+	}
+
 	sg_table = attach->dmabuf->ops->map_dma_buf(attach, direction);
 	if (!sg_table)
 		sg_table = ERR_PTR(-ENOMEM);
 
+	if (attach->invalidate_mappings)
+		list_add(&attach->node, &attach->dmabuf->attachments);
+
 	return sg_table;
 }
 EXPORT_SYMBOL_GPL(dma_buf_map_attachment);
@@ -656,6 +673,9 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
 {
 	might_sleep();
 
+	if (attach->invalidate_mappings)
+		reservation_object_assert_held(attach->dmabuf->resv);
+
 	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
 		return;
 
@@ -664,6 +684,44 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
 }
 EXPORT_SYMBOL_GPL(dma_buf_unmap_attachment);
 
+/**
+ * dma_buf_set_invalidate_callback - set the invalidate_mappings callback
+ *
+ * @attach:	[in]	attachment where to set the callback
+ * @cb:		[in]	the callback to set
+ *
+ * Makes sure to take the appropriate locks when updating the invalidate
+ * mappings callback.
+ */
+void dma_buf_set_invalidate_callback(struct dma_buf_attachment *attach,
+				     void (*cb)(struct dma_buf_attachment *))
+{
+	reservation_object_lock(attach->dmabuf->resv, NULL);
+	attach->invalidate_mappings = cb;
+	reservation_object_unlock(attach->dmabuf->resv);
+}
+EXPORT_SYMBOL_GPL(dma_buf_set_invalidate_callback);
+
+/**
+ * dma_buf_invalidate_mappings - invalidate all mappings of this dma_buf
+ *
+ * @dmabuf:	[in]	buffer which mappings should be invalidated
+ *
+ * Informs all attachmenst that they need to destroy and recreated all their
+ * mappings.
+ */
+void dma_buf_invalidate_mappings(struct dma_buf *dmabuf)
+{
+	struct dma_buf_attachment *attach;
+
+	reservation_object_assert_held(dmabuf->resv);
+
+	list_for_each_entry(attach, &dmabuf->attachments, node)
+		if (attach->invalidate_mappings)
+			attach->invalidate_mappings(attach);
+}
+EXPORT_SYMBOL_GPL(dma_buf_invalidate_mappings);
+
 /**
  * DOC: cpu access
  *
@@ -1121,10 +1179,12 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
 		seq_puts(s, "\tAttached Devices:\n");
 		attach_count = 0;
 
+		reservation_object_lock(buf_obj->resv, NULL);
 		list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
 			seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
 			attach_count++;
 		}
+		reservation_object_unlock(buf_obj->resv);
 
 		seq_printf(s, "Total %d devices attached\n\n",
 			   attach_count);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 085db2fee2d7..70c65fcfe1e3 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -91,6 +91,18 @@ struct dma_buf_ops {
 	 */
 	void (*detach)(struct dma_buf *, struct dma_buf_attachment *);
 
+	/**
+	 * @supports_mapping_invalidation:
+	 *
+	 * True for exporters which supports unpinned DMA-buf operation using
+	 * the reservation lock.
+	 *
+	 * When attachment->invalidate_mappings is set the @map_dma_buf and
+	 * @unmap_dma_buf callbacks can be called with the reservation lock
+	 * held.
+	 */
+	bool supports_mapping_invalidation;
+
 	/**
 	 * @map_dma_buf:
 	 *
@@ -326,6 +338,29 @@ struct dma_buf_attachment {
 	struct device *dev;
 	struct list_head node;
 	void *priv;
+
+	/**
+	 * @invalidate_mappings:
+	 *
+	 * Optional callback provided by the importer of the attachment which
+	 * must be set before mappings are created.
+	 *
+	 * If provided the exporter can avoid pinning the backing store while
+	 * mappings exists.
+	 *
+	 * The function is called with the lock of the reservation object
+	 * associated with the dma_buf held and the mapping function must be
+	 * called with this lock held as well. This makes sure that no mapping
+	 * is created concurrently with an ongoing invalidation.
+	 *
+	 * After the callback all existing mappings are still valid until all
+	 * fences in the dma_bufs reservation object are signaled, but should be
+	 * destroyed by the importer as soon as possible.
+	 *
+	 * New mappings can be created immediately, but can't be used before the
+	 * exclusive fence in the dma_bufs reservation object is signaled.
+	 */
+	void (*invalidate_mappings)(struct dma_buf_attachment *attach);
 };
 
 /**
@@ -391,6 +426,9 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *,
 					enum dma_data_direction);
 void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *,
 				enum dma_data_direction);
+void dma_buf_set_invalidate_callback(struct dma_buf_attachment *attach,
+				     void (*cb)(struct dma_buf_attachment *));
+void dma_buf_invalidate_mappings(struct dma_buf *dma_buf);
 int dma_buf_begin_cpu_access(struct dma_buf *dma_buf,
 			     enum dma_data_direction dir);
 int dma_buf_end_cpu_access(struct dma_buf *dma_buf,
--
2.14.1
Make sure the transferred BO is never destroyed before the transfer BO.
Signed-off-by: Christian König christian.koenig@amd.com --- drivers/gpu/drm/ttm/ttm_bo_util.c | 50 +++++++++++++++++++++++---------------- 1 file changed, 30 insertions(+), 20 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c index 1f730b3f18e5..1ee20558ee31 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_util.c +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c @@ -39,6 +39,11 @@ #include <linux/module.h> #include <linux/reservation.h>
+struct ttm_transfer_obj { + struct ttm_buffer_object base; + struct ttm_buffer_object *bo; +}; + void ttm_bo_free_old_node(struct ttm_buffer_object *bo) { ttm_bo_mem_put(bo, &bo->mem); @@ -435,7 +440,11 @@ EXPORT_SYMBOL(ttm_bo_move_memcpy);
static void ttm_transfered_destroy(struct ttm_buffer_object *bo) { - kfree(bo); + struct ttm_transfer_obj *fbo; + + fbo = container_of(bo, struct ttm_transfer_obj, base); + ttm_bo_unref(&fbo->bo); + kfree(fbo); }
/** @@ -456,14 +465,15 @@ static void ttm_transfered_destroy(struct ttm_buffer_object *bo) static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo, struct ttm_buffer_object **new_obj) { - struct ttm_buffer_object *fbo; + struct ttm_transfer_obj *fbo; int ret;
fbo = kmalloc(sizeof(*fbo), GFP_KERNEL); if (!fbo) return -ENOMEM;
- *fbo = *bo; + fbo->base = *bo; + fbo->bo = ttm_bo_reference(bo);
/** * Fix up members that we shouldn't copy directly: @@ -471,25 +481,25 @@ static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo, */
atomic_inc(&bo->bdev->glob->bo_count); - INIT_LIST_HEAD(&fbo->ddestroy); - INIT_LIST_HEAD(&fbo->lru); - INIT_LIST_HEAD(&fbo->swap); - INIT_LIST_HEAD(&fbo->io_reserve_lru); - mutex_init(&fbo->wu_mutex); - fbo->moving = NULL; - drm_vma_node_reset(&fbo->vma_node); - atomic_set(&fbo->cpu_writers, 0); - - kref_init(&fbo->list_kref); - kref_init(&fbo->kref); - fbo->destroy = &ttm_transfered_destroy; - fbo->acc_size = 0; - fbo->resv = &fbo->ttm_resv; - reservation_object_init(fbo->resv); - ret = reservation_object_trylock(fbo->resv); + INIT_LIST_HEAD(&fbo->base.ddestroy); + INIT_LIST_HEAD(&fbo->base.lru); + INIT_LIST_HEAD(&fbo->base.swap); + INIT_LIST_HEAD(&fbo->base.io_reserve_lru); + mutex_init(&fbo->base.wu_mutex); + fbo->base.moving = NULL; + drm_vma_node_reset(&fbo->base.vma_node); + atomic_set(&fbo->base.cpu_writers, 0); + + kref_init(&fbo->base.list_kref); + kref_init(&fbo->base.kref); + fbo->base.destroy = &ttm_transfered_destroy; + fbo->base.acc_size = 0; + fbo->base.resv = &fbo->base.ttm_resv; + reservation_object_init(fbo->base.resv); + ret = reservation_object_trylock(fbo->base.resv); WARN_ON(!ret);
- *new_obj = fbo; + *new_obj = &fbo->base; return 0; }
Pipeline removal of the BO's backing store when no placement is given during validation.
Signed-off-by: Christian König christian.koenig@amd.com --- drivers/gpu/drm/ttm/ttm_bo.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 98e06f8bf23b..17e821f01d0a 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -1078,6 +1078,18 @@ int ttm_bo_validate(struct ttm_buffer_object *bo, uint32_t new_flags;
reservation_object_assert_held(bo->resv); + + /* + * Remove the backing store if no placement is given. + */ + if (!placement->num_placement && !placement->num_busy_placement) { + ret = ttm_bo_pipeline_gutting(bo); + if (ret) + return ret; + + return ttm_tt_create(bo, false); + } + /* * Check whether we need to move buffer. */
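An importer can then drop the backing store of a BO simply by validating it with an empty placement, which is what the amdgpu invalidate callback in one of the following patches does:

    struct ttm_operation_ctx ctx = { false, false };
    struct ttm_placement placement = {};	/* no regular and no busy placements */
    int r;

    r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
    if (r)
    	DRM_ERROR("Failed to drop the backing store (%d)\n", r);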
Instead of relying on the DRM functions just implement our own export functions. This adds support for taking care of unpinned DMA-buf.
v2: fix unintended recursion, remove debugging leftovers
Signed-off-by: Christian König christian.koenig@amd.com --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 - drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 - drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c | 110 ++++++++++++++++++----------- 4 files changed, 73 insertions(+), 44 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 96bcdb97e7e2..909dc9764a22 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -371,7 +371,6 @@ int amdgpu_gem_object_open(struct drm_gem_object *obj, void amdgpu_gem_object_close(struct drm_gem_object *obj, struct drm_file *file_priv); unsigned long amdgpu_gem_timeout(uint64_t timeout_ns); -struct sg_table *amdgpu_gem_prime_get_sg_table(struct drm_gem_object *obj); struct drm_gem_object * amdgpu_gem_prime_import_sg_table(struct drm_device *dev, struct dma_buf_attachment *attach, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index e6709362994a..e32dcdf8d3db 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -886,7 +886,6 @@ static struct drm_driver kms_driver = { .gem_prime_export = amdgpu_gem_prime_export, .gem_prime_import = amdgpu_gem_prime_import, .gem_prime_res_obj = amdgpu_gem_prime_res_obj, - .gem_prime_get_sg_table = amdgpu_gem_prime_get_sg_table, .gem_prime_import_sg_table = amdgpu_gem_prime_import_sg_table, .gem_prime_vmap = amdgpu_gem_prime_vmap, .gem_prime_vunmap = amdgpu_gem_prime_vunmap, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 48e0115d4f76..18483a79dcdd 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -31,6 +31,7 @@ */ #include <linux/list.h> #include <linux/slab.h> +#include <linux/dma-buf.h> #include <drm/drmP.h> #include <drm/amdgpu_drm.h> #include <drm/drm_cache.h> @@ -931,6 +932,10 @@ void amdgpu_bo_move_notify(struct ttm_buffer_object *bo,
amdgpu_bo_kunmap(abo);
+ if (abo->gem_base.dma_buf && !abo->gem_base.import_attach && + bo->mem.mem_type != TTM_PL_SYSTEM) + dma_buf_invalidate_mappings(abo->gem_base.dma_buf); + /* remember the eviction */ if (evict) atomic64_inc(&adev->num_evictions); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c index 1c9991738477..7b9edbc938eb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c @@ -32,14 +32,6 @@
static const struct dma_buf_ops amdgpu_dmabuf_ops;
-struct sg_table *amdgpu_gem_prime_get_sg_table(struct drm_gem_object *obj) -{ - struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); - int npages = bo->tbo.num_pages; - - return drm_prime_pages_to_sg(bo->tbo.ttm->pages, npages); -} - void *amdgpu_gem_prime_vmap(struct drm_gem_object *obj) { struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); @@ -126,22 +118,21 @@ amdgpu_gem_prime_import_sg_table(struct drm_device *dev, return ERR_PTR(ret); }
-static int amdgpu_gem_map_attach(struct dma_buf *dma_buf, - struct device *target_dev, - struct dma_buf_attachment *attach) +static struct sg_table * +amdgpu_gem_map_dma_buf(struct dma_buf_attachment *attach, + enum dma_data_direction dir) { + struct dma_buf *dma_buf = attach->dmabuf; struct drm_gem_object *obj = dma_buf->priv; struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); + struct sg_table *sgt; long r;
- r = drm_gem_map_attach(dma_buf, target_dev, attach); - if (r) - return r; - - r = amdgpu_bo_reserve(bo, false); - if (unlikely(r != 0)) - goto error_detach; - + if (!attach->invalidate_mappings) { + r = amdgpu_bo_reserve(bo, false); + if (unlikely(r != 0)) + return ERR_PTR(r); + }
if (dma_buf->ops != &amdgpu_dmabuf_ops) { /* @@ -157,41 +148,77 @@ static int amdgpu_gem_map_attach(struct dma_buf *dma_buf, } }
- /* pin buffer into GTT */ - r = amdgpu_bo_pin(bo, AMDGPU_GEM_DOMAIN_GTT, NULL); - if (r) + if (!attach->invalidate_mappings) { + /* pin buffer into GTT */ + r = amdgpu_bo_pin(bo, AMDGPU_GEM_DOMAIN_GTT, NULL); + if (r) + goto error_unreserve; + + } else { + /* move buffer into GTT */ + struct ttm_operation_ctx ctx = { false, false }; + + amdgpu_ttm_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_GTT); + r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx); + if (r) + goto error_unreserve; + } + + sgt = drm_prime_pages_to_sg(bo->tbo.ttm->pages, bo->tbo.num_pages); + if (IS_ERR(sgt)) { + r = PTR_ERR(sgt); goto error_unreserve; + } + + if (!dma_map_sg_attrs(attach->dev, sgt->sgl, sgt->nents, dir, + DMA_ATTR_SKIP_CPU_SYNC)) + goto error_free;
if (dma_buf->ops != &amdgpu_dmabuf_ops) bo->prime_shared_count++;
+ if (!attach->invalidate_mappings) + amdgpu_bo_unreserve(bo); + + return sgt; + +error_free: + sg_free_table(sgt); + kfree(sgt); + r = -ENOMEM; + error_unreserve: - amdgpu_bo_unreserve(bo); + if (!attach->invalidate_mappings) + amdgpu_bo_unreserve(bo);
-error_detach: - if (r) - drm_gem_map_detach(dma_buf, attach); - return r; + return ERR_PTR(r); }
-static void amdgpu_gem_map_detach(struct dma_buf *dma_buf, - struct dma_buf_attachment *attach) +static void amdgpu_gem_unmap_dma_buf(struct dma_buf_attachment *attach, + struct sg_table *sgt, + enum dma_data_direction dir) { + struct dma_buf *dma_buf = attach->dmabuf; struct drm_gem_object *obj = dma_buf->priv; struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); - int ret = 0;
- ret = amdgpu_bo_reserve(bo, true); - if (unlikely(ret != 0)) - goto error; + if (!attach->invalidate_mappings) { + amdgpu_bo_reserve(bo, true); + amdgpu_bo_unpin(bo); + }
- amdgpu_bo_unpin(bo); - if (dma_buf->ops != &amdgpu_dmabuf_ops && bo->prime_shared_count) + if (dma_buf->ops != &amdgpu_dmabuf_ops && + bo->prime_shared_count) bo->prime_shared_count--; - amdgpu_bo_unreserve(bo);
-error: - drm_gem_map_detach(dma_buf, attach); + if (!attach->invalidate_mappings) + amdgpu_bo_unreserve(bo); + + if (sgt) { + dma_unmap_sg(attach->dev, sgt->sgl, sgt->nents, dir); + sg_free_table(sgt); + kfree(sgt); + } }
struct reservation_object *amdgpu_gem_prime_res_obj(struct drm_gem_object *obj) @@ -230,10 +257,9 @@ static int amdgpu_gem_begin_cpu_access(struct dma_buf *dma_buf, }
static const struct dma_buf_ops amdgpu_dmabuf_ops = { - .attach = amdgpu_gem_map_attach, - .detach = amdgpu_gem_map_detach, - .map_dma_buf = drm_gem_map_dma_buf, - .unmap_dma_buf = drm_gem_unmap_dma_buf, + .supports_mapping_invalidation = true, + .map_dma_buf = amdgpu_gem_map_dma_buf, + .unmap_dma_buf = amdgpu_gem_unmap_dma_buf, .release = drm_gem_dmabuf_release, .begin_cpu_access = amdgpu_gem_begin_cpu_access, .map = drm_gem_dmabuf_kmap,
Instead of relying on the DRM functions just implement our own import functions. This adds support for taking care of unpinned DMA-buf.
v2: enable for all exporters, not just amdgpu, fix invalidation handling, lock reservation object while setting callback
Signed-off-by: Christian König christian.koenig@amd.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c | 44 ++++++++++++++++++++++++++++++- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 32 ++++++++++++++++++---- 2 files changed, 70 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c index 7b9edbc938eb..cb6ec9b7844a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c @@ -291,10 +291,26 @@ struct dma_buf *amdgpu_gem_prime_export(struct drm_device *dev, return buf; }
+static void +amdgpu_gem_prime_invalidate_mappings(struct dma_buf_attachment *attach) +{ + struct ttm_operation_ctx ctx = { false, false }; + struct drm_gem_object *obj = attach->priv; + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); + struct ttm_placement placement = {}; + int r; + + r = ttm_bo_validate(&bo->tbo, &placement, &ctx); + if (r) + DRM_ERROR("Failed to unmap DMA-buf import (%d))\n", r); +} + struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, struct dma_buf *dma_buf) { + struct dma_buf_attachment *attach; struct drm_gem_object *obj; + int ret;
if (dma_buf->ops == &amdgpu_dmabuf_ops) { obj = dma_buf->priv; @@ -308,5 +324,31 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, } }
- return drm_gem_prime_import(dev, dma_buf); + if (!dma_buf->ops->supports_mapping_invalidation) + return drm_gem_prime_import(dev, dma_buf); + + attach = dma_buf_attach(dma_buf, dev->dev); + if (IS_ERR(attach)) + return ERR_CAST(attach); + + get_dma_buf(dma_buf); + + obj = amdgpu_gem_prime_import_sg_table(dev, attach, NULL); + if (IS_ERR(obj)) { + ret = PTR_ERR(obj); + goto fail_detach; + } + + obj->import_attach = attach; + attach->priv = obj; + + dma_buf_set_invalidate_callback(attach, + amdgpu_gem_prime_invalidate_mappings); + return obj; + +fail_detach: + dma_buf_detach(dma_buf, attach); + dma_buf_put(dma_buf); + + return ERR_PTR(ret); } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index 291dd3d600cd..aeead0281e92 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -43,6 +43,7 @@ #include <linux/pagemap.h> #include <linux/debugfs.h> #include <linux/iommu.h> +#include <linux/dma-buf.h> #include "amdgpu.h" #include "amdgpu_object.h" #include "amdgpu_trace.h" @@ -685,6 +686,7 @@ struct amdgpu_ttm_gup_task_list {
struct amdgpu_ttm_tt { struct ttm_dma_tt ttm; + struct amdgpu_bo *bo; u64 offset; uint64_t userptr; struct mm_struct *usermm; @@ -993,6 +995,7 @@ static struct ttm_tt *amdgpu_ttm_tt_create(struct ttm_buffer_object *bo, return NULL; } gtt->ttm.ttm.func = &amdgpu_backend_func; + gtt->bo = ttm_to_amdgpu_bo(bo); if (ttm_sg_tt_init(>t->ttm, bo, page_flags)) { kfree(gtt); return NULL; @@ -1005,7 +1008,6 @@ static int amdgpu_ttm_tt_populate(struct ttm_tt *ttm, { struct amdgpu_device *adev = amdgpu_ttm_adev(ttm->bdev); struct amdgpu_ttm_tt *gtt = (void *)ttm; - bool slave = !!(ttm->page_flags & TTM_PAGE_FLAG_SG);
if (gtt && gtt->userptr) { ttm->sg = kzalloc(sizeof(struct sg_table), GFP_KERNEL); @@ -1017,7 +1019,19 @@ static int amdgpu_ttm_tt_populate(struct ttm_tt *ttm, return 0; }
- if (slave && ttm->sg) { + if (ttm->page_flags & TTM_PAGE_FLAG_SG) { + if (!ttm->sg) { + struct dma_buf_attachment *attach; + struct sg_table *sgt; + + attach = gtt->bo->gem_base.import_attach; + sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL); + if (IS_ERR(sgt)) + return PTR_ERR(sgt); + + ttm->sg = sgt; + } + drm_prime_sg_to_page_addr_arrays(ttm->sg, ttm->pages, gtt->ttm.dma_address, ttm->num_pages); @@ -1036,9 +1050,8 @@ static int amdgpu_ttm_tt_populate(struct ttm_tt *ttm,
static void amdgpu_ttm_tt_unpopulate(struct ttm_tt *ttm) { - struct amdgpu_device *adev; struct amdgpu_ttm_tt *gtt = (void *)ttm; - bool slave = !!(ttm->page_flags & TTM_PAGE_FLAG_SG); + struct amdgpu_device *adev;
if (gtt && gtt->userptr) { amdgpu_ttm_tt_set_user_pages(ttm, NULL); @@ -1047,7 +1060,16 @@ static void amdgpu_ttm_tt_unpopulate(struct ttm_tt *ttm) return; }
- if (slave) + if (ttm->sg && !gtt->bo->tbo.sg) { + struct dma_buf_attachment *attach; + + attach = gtt->bo->gem_base.import_attach; + dma_buf_unmap_attachment(attach, ttm->sg, DMA_BIDIRECTIONAL); + ttm->sg = NULL; + return; + } + + if (ttm->page_flags & TTM_PAGE_FLAG_SG) return;
adev = amdgpu_ttm_adev(ttm->bdev);
On Fri, Mar 16, 2018 at 02:20:44PM +0100, Christian König wrote:
Hi everybody,
since I've got positive feedback from Daniel I continued working on this approach.
A few issues are still open:
- Daniel suggested that I make the invalidate_mappings callback a parameter of dma_buf_attach().
This approach unfortunately won't work because when the attachment is created the importer is not necessarily ready to handle invalidation events.
Why do you have this constraint? This sounds a bit like inverted create/teardown sequence troubles, where you make an object "live" before the thing is fully set up.
Can't we fix this by creating the entire ttm scaffolding you'll need for a dma-buf upfront, and only once you have everything we grab the dma_buf attachment? At that point you really should be able to evict buffers again.
Not requiring invalidate_mapping to be set together with the attachment means we can't ever require importers to support it (e.g. to address your concern with the userspace dma-buf userptr magic).
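To make that concrete, the variant being argued for here would look something like this (purely hypothetical, not what the series implements):

    struct dma_buf_attachment *
    dma_buf_attach(struct dma_buf *dmabuf, struct device *dev,
    	       void (*invalidate_mappings)(struct dma_buf_attachment *));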
E.g. in the amdgpu example we first need to set up the imported GEM/TTM objects and install them in the attachment.
My solution is to introduce a separate function to grab the locks and set the callback, this function could then be used to pin the buffer later on if that turns out to be necessary after all.
- With my example setup this currently results in a ping/pong situation
because the exporter prefers a VRAM placement while the importer prefers a GTT placement.
This results in quite a performance drop, but can be fixed by a simple mesa patch which allows shared BOs to be placed in both VRAM and GTT.
Question is what should we do in the meantime? Accept the performance drop or only allow unpinned sharing with new Mesa?
Maybe the exporter should not try to move stuff back into VRAM as long as there's an active dma-buf? I mean it's really cool that it works, but maybe let's just do this for a tech demo :-)
Of course if it then runs out of TT then it could still try to move it back in. And "let's not move it when it's imported" is probably too stupid too, and will need to be improved again with more heuristics, but would at least get it off the ground.
Long term you might want to move perhaps once per 10 seconds or so, to get idle importers to detach. Adjust 10s to match whatever benchmark/workload you care about. -Daniel
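One possible shape for such a heuristic on the exporter side (a sketch only, under the assumption that the domain decision has a hook like this; under_memory_pressure() is invented):

    /* When choosing the domain for (re)validating an exported BO: */
    if (abo->gem_base.dma_buf && !abo->gem_base.import_attach &&
        !under_memory_pressure(adev))
    	domain &= ~AMDGPU_GEM_DOMAIN_VRAM;	/* stay in GTT while others map it */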