Changelog:
v5:
 * Documented the DMA-BUF expectations around DMA unmap.
 * Added wait support in VFIO for DMA unmap.
 * Reordered patches.
 * Improved commit messages to document even more.
v4: https://lore.kernel.org/all/20260121-dmabuf-revoke-v4-0-d311cbc8633d@nvidia....
 * Changed DMA_RESV_USAGE_KERNEL to DMA_RESV_USAGE_BOOKKEEP.
 * Made .invalidate_mapping() truly optional.
 * Added patch which renames dma_buf_move_notify() to be dma_buf_invalidate_mappings().
 * Restored dma_buf_attachment_is_dynamic() function.
v3: https://lore.kernel.org/all/20260120-dmabuf-revoke-v3-0-b7e0b07b8214@nvidia....
 * Used Jason's wording for commits and cover letter.
 * Removed IOMMUFD patch.
 * Renamed dma_buf_attachment_is_revoke() to be dma_buf_attach_revocable().
 * Added patch to remove CONFIG_DMABUF_MOVE_NOTIFY.
 * Added Reviewed-by tags.
 * Called dma_resv_wait_timeout() after dma_buf_move_notify() in VFIO.
 * Added dma_buf_attach_revocable() check to VFIO DMABUF attach function.
 * Slightly changed commit messages.
v2: https://patch.msgid.link/20260118-dmabuf-revoke-v2-0-a03bb27c0875@nvidia.com
 * Changed series to document the revoke semantics instead of implementing it.
v1: https://patch.msgid.link/20260111-dmabuf-revoke-v1-0-fb4bcc8c259b@nvidia.com
-------------------------------------------------------------------------
This series is based on the latest VFIO fix, which will be sent to Linus very soon.
https://lore.kernel.org/all/20260121-vfio-add-pin-v1-1-4e04916b17f1@nvidia.c...
Thanks
-------------------------------------------------------------------------
This series documents a dma-buf "revoke" mechanism: it allows a dma-buf exporter to explicitly invalidate ("kill") a shared buffer after it has been distributed to importers, so that further CPU and device access is prevented and importers reliably observe failure.
The change in this series is to properly document and use the existing core "revoked" state on the dma-buf object and a corresponding exporter-triggered revoke operation.
dma-buf has quietly allowed calling move_notify on pinned dma-bufs, even though legacy importers using dma_buf_attach() would simply ignore these calls.
The intention was that move_notify() would tell the importer to expedite its unmapping process, and once the importer is fully finished with DMA it would unmap the dma-buf, which finally signals that the importer is never going to touch the memory again. Importers that touch the memory past their unmap() call can trigger IOMMU errors, AER and beyond; however, read-and-discard access between move_notify() and unmap() is allowed.
Thus, we can define the exporter's revoke sequence for pinned dma-buf as:
	dma_resv_lock(dmabuf->resv, NULL);

	// Prevent new mappings from being established
	priv->revoked = true;

	// Tell all importers to eventually unmap
	dma_buf_invalidate_mappings(dmabuf);

	// Wait for any in-progress fences on the old mapping
	dma_resv_wait_timeout(dmabuf->resv, DMA_RESV_USAGE_BOOKKEEP, false,
			      MAX_SCHEDULE_TIMEOUT);
	dma_resv_unlock(dmabuf->resv);

	// Wait for all importers to complete unmap
	wait_for_completion(&priv->unmap_comp);
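For completeness, the importer side of this contract could look roughly like the sketch below. This is only an illustration: the my_importer structure, its fields (attach, sgt, dir, teardown_work) and the my_importer_*() helpers are hypothetical, not part of any driver in this series.

	static void my_importer_invalidate_mappings(struct dma_buf_attachment *attach)
	{
		struct my_importer *imp = attach->importer_priv;

		dma_resv_assert_held(attach->dmabuf->resv);

		/* Stop issuing new DMA; read-and-discard access may continue
		 * until the deferred unmap below has finished. */
		my_importer_stop_new_dma(imp);
		queue_work(system_wq, &imp->teardown_work);
	}

	static void my_importer_teardown_work(struct work_struct *work)
	{
		struct my_importer *imp = container_of(work, struct my_importer,
						       teardown_work);

		/* Quiesce the hardware, then unmap. The unmap is the signal
		 * to the exporter that this importer will never touch the
		 * memory again. */
		my_importer_quiesce_hw(imp);
		dma_buf_unmap_attachment_unlocked(imp->attach, imp->sgt, imp->dir);
	}

The exporter then completes priv->unmap_comp from its own unmap path once the last importer mapping is gone, which is what the VFIO patch later in this series does with a kref plus completion.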
However, dma-buf also supports importers that don't do anything on move_notify(), and will not unmap the buffer in bounded time.
Since such importers would cause the above sequence to hang, a new mechanism is needed to detect incompatible importers.
Introduce dma_buf_attach_revocable(), which, when it returns true, indicates that the above sequence is safe to use and will complete in kernel-only bounded time for this attachment.
Unfortunately dma_buf_attach_revocable() is going to fail for the popular RDMA pinned importer, which means we cannot introduce it to existing places using pinned move_notify() without potentially breaking existing userspace flows.
Existing exporters that only trigger this flow for RAS errors should not call dma_buf_attach_revocable() and will suffer an unbounded block on the final completion, hoping that userspace will notice the RAS event and clean things up. Without revoke support in the RDMA pinned importers, no other non-breaking option currently seems possible.
For new exporters, like VFIO and RDMA, that have userspace-triggered revoke events, the unbounded sleep would not be acceptable. They can call dma_buf_attach_revocable() and will not work with the RDMA pinned importer from day 0, preventing regressions.
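As an illustration, such a new exporter would reject incompatible importers at attach time; a minimal sketch follows (the exact signature of dma_buf_attach_revocable() and the chosen error code are assumptions here):

	static int my_exporter_attach(struct dma_buf *dmabuf,
				      struct dma_buf_attachment *attach)
	{
		/*
		 * Refuse importers that cannot guarantee a bounded-time unmap
		 * after dma_buf_invalidate_mappings(); only attachments for
		 * which dma_buf_attach_revocable() is true may proceed.
		 */
		if (!dma_buf_attach_revocable(attach))
			return -EOPNOTSUPP;

		return 0;
	}

Per the v3 changelog, the VFIO DMABUF attach function gains exactly this kind of check.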
In the process add documentation explaining the above details.
Thanks
Signed-off-by: Leon Romanovsky leonro@nvidia.com
---
Leon Romanovsky (8):
  dma-buf: Rename .move_notify() callback to a clearer identifier
  dma-buf: Rename dma_buf_move_notify() to dma_buf_invalidate_mappings()
  dma-buf: Always build with DMABUF_MOVE_NOTIFY
  vfio: Wait for dma-buf invalidation to complete
  dma-buf: Make .invalidate_mapping() truly optional
  dma-buf: Add dma_buf_attach_revocable()
  vfio: Permit VFIO to work with pinned importers
  iommufd: Add dma_buf_pin()
 drivers/dma-buf/Kconfig                     | 12 ----
 drivers/dma-buf/dma-buf.c                   | 69 +++++++++++++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 14 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  2 +-
 drivers/gpu/drm/amd/amdkfd/Kconfig          |  2 +-
 drivers/gpu/drm/virtio/virtgpu_prime.c      |  2 +-
 drivers/gpu/drm/xe/tests/xe_dma_buf.c       |  7 +--
 drivers/gpu/drm/xe/xe_bo.c                  |  2 +-
 drivers/gpu/drm/xe/xe_dma_buf.c             | 14 ++---
 drivers/infiniband/core/umem_dmabuf.c       | 13 -----
 drivers/infiniband/hw/mlx5/mr.c             |  2 +-
 drivers/iommu/iommufd/pages.c               | 11 +++-
 drivers/iommu/iommufd/selftest.c            |  2 +-
 drivers/vfio/pci/vfio_pci_dmabuf.c          | 90 +++++++++++++++++++++++------
 include/linux/dma-buf.h                     | 17 +++---
 15 files changed, 164 insertions(+), 95 deletions(-)
---
base-commit: 61ceaf236115f20f4fdd7cf60f883ada1063349a
change-id: 20251221-dmabuf-revoke-b90ef16e4236
Best regards, -- Leon Romanovsky leonro@nvidia.com
From: Leon Romanovsky leonro@nvidia.com
Rename the .move_notify() callback to .invalidate_mappings() to make its purpose explicit and highlight that it is responsible for invalidating existing mappings.
Suggested-by: Christian König christian.koenig@amd.com
Reviewed-by: Christian König christian.koenig@amd.com
Signed-off-by: Leon Romanovsky leonro@nvidia.com
---
 drivers/dma-buf/dma-buf.c                   | 6 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 4 ++--
 drivers/gpu/drm/virtio/virtgpu_prime.c      | 2 +-
 drivers/gpu/drm/xe/tests/xe_dma_buf.c       | 6 +++---
 drivers/gpu/drm/xe/xe_dma_buf.c             | 2 +-
 drivers/infiniband/core/umem_dmabuf.c       | 4 ++--
 drivers/infiniband/hw/mlx5/mr.c             | 2 +-
 drivers/iommu/iommufd/pages.c               | 2 +-
 include/linux/dma-buf.h                     | 6 +++---
 9 files changed, 17 insertions(+), 17 deletions(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index edaa9e4ee4ae..59cc647bf40e 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -948,7 +948,7 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, if (WARN_ON(!dmabuf || !dev)) return ERR_PTR(-EINVAL);
- if (WARN_ON(importer_ops && !importer_ops->move_notify)) + if (WARN_ON(importer_ops && !importer_ops->invalidate_mappings)) return ERR_PTR(-EINVAL);
attach = kzalloc(sizeof(*attach), GFP_KERNEL); @@ -1055,7 +1055,7 @@ EXPORT_SYMBOL_NS_GPL(dma_buf_pin, "DMA_BUF"); * * This unpins a buffer pinned by dma_buf_pin() and allows the exporter to move * any mapping of @attach again and inform the importer through - * &dma_buf_attach_ops.move_notify. + * &dma_buf_attach_ops.invalidate_mappings. */ void dma_buf_unpin(struct dma_buf_attachment *attach) { @@ -1262,7 +1262,7 @@ void dma_buf_move_notify(struct dma_buf *dmabuf)
list_for_each_entry(attach, &dmabuf->attachments, node) if (attach->importer_ops) - attach->importer_ops->move_notify(attach); + attach->importer_ops->invalidate_mappings(attach); } EXPORT_SYMBOL_NS_GPL(dma_buf_move_notify, "DMA_BUF");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index c1461317eb29..cd4944ceb047 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -438,7 +438,7 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) }
/** - * amdgpu_dma_buf_move_notify - &attach.move_notify implementation + * amdgpu_dma_buf_move_notify - &attach.invalidate_mappings implementation * * @attach: the DMA-buf attachment * @@ -509,7 +509,7 @@ amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
static const struct dma_buf_attach_ops amdgpu_dma_buf_attach_ops = { .allow_peer2peer = true, - .move_notify = amdgpu_dma_buf_move_notify + .invalidate_mappings = amdgpu_dma_buf_move_notify };
/** diff --git a/drivers/gpu/drm/virtio/virtgpu_prime.c b/drivers/gpu/drm/virtio/virtgpu_prime.c index ce49282198cb..19c78dd2ca77 100644 --- a/drivers/gpu/drm/virtio/virtgpu_prime.c +++ b/drivers/gpu/drm/virtio/virtgpu_prime.c @@ -288,7 +288,7 @@ static void virtgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
static const struct dma_buf_attach_ops virtgpu_dma_buf_attach_ops = { .allow_peer2peer = true, - .move_notify = virtgpu_dma_buf_move_notify + .invalidate_mappings = virtgpu_dma_buf_move_notify };
struct drm_gem_object *virtgpu_gem_prime_import(struct drm_device *dev, diff --git a/drivers/gpu/drm/xe/tests/xe_dma_buf.c b/drivers/gpu/drm/xe/tests/xe_dma_buf.c index 5df98de5ba3c..1f2cca5c2f81 100644 --- a/drivers/gpu/drm/xe/tests/xe_dma_buf.c +++ b/drivers/gpu/drm/xe/tests/xe_dma_buf.c @@ -23,7 +23,7 @@ static bool p2p_enabled(struct dma_buf_test_params *params) static bool is_dynamic(struct dma_buf_test_params *params) { return IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY) && params->attach_ops && - params->attach_ops->move_notify; + params->attach_ops->invalidate_mappings; }
static void check_residency(struct kunit *test, struct xe_bo *exported, @@ -60,7 +60,7 @@ static void check_residency(struct kunit *test, struct xe_bo *exported,
/* * Evict exporter. Evicting the exported bo will - * evict also the imported bo through the move_notify() functionality if + * evict also the imported bo through the invalidate_mappings() functionality if * importer is on a different device. If they're on the same device, * the exporter and the importer should be the same bo. */ @@ -198,7 +198,7 @@ static void xe_test_dmabuf_import_same_driver(struct xe_device *xe)
static const struct dma_buf_attach_ops nop2p_attach_ops = { .allow_peer2peer = false, - .move_notify = xe_dma_buf_move_notify + .invalidate_mappings = xe_dma_buf_move_notify };
/* diff --git a/drivers/gpu/drm/xe/xe_dma_buf.c b/drivers/gpu/drm/xe/xe_dma_buf.c index 7c74a31d4486..1b9cd043e517 100644 --- a/drivers/gpu/drm/xe/xe_dma_buf.c +++ b/drivers/gpu/drm/xe/xe_dma_buf.c @@ -287,7 +287,7 @@ static void xe_dma_buf_move_notify(struct dma_buf_attachment *attach)
static const struct dma_buf_attach_ops xe_dma_buf_attach_ops = { .allow_peer2peer = true, - .move_notify = xe_dma_buf_move_notify + .invalidate_mappings = xe_dma_buf_move_notify };
#if IS_ENABLED(CONFIG_DRM_XE_KUNIT_TEST) diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c index 0ec2e4120cc9..d77a739cfe7a 100644 --- a/drivers/infiniband/core/umem_dmabuf.c +++ b/drivers/infiniband/core/umem_dmabuf.c @@ -129,7 +129,7 @@ ib_umem_dmabuf_get_with_dma_device(struct ib_device *device, if (check_add_overflow(offset, (unsigned long)size, &end)) return ret;
- if (unlikely(!ops || !ops->move_notify)) + if (unlikely(!ops || !ops->invalidate_mappings)) return ret;
dmabuf = dma_buf_get(fd); @@ -195,7 +195,7 @@ ib_umem_dmabuf_unsupported_move_notify(struct dma_buf_attachment *attach)
static struct dma_buf_attach_ops ib_umem_dmabuf_attach_pinned_ops = { .allow_peer2peer = true, - .move_notify = ib_umem_dmabuf_unsupported_move_notify, + .invalidate_mappings = ib_umem_dmabuf_unsupported_move_notify, };
struct ib_umem_dmabuf * diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c index 325fa04cbe8a..97099d3b1688 100644 --- a/drivers/infiniband/hw/mlx5/mr.c +++ b/drivers/infiniband/hw/mlx5/mr.c @@ -1620,7 +1620,7 @@ static void mlx5_ib_dmabuf_invalidate_cb(struct dma_buf_attachment *attach)
static struct dma_buf_attach_ops mlx5_ib_dmabuf_attach_ops = { .allow_peer2peer = 1, - .move_notify = mlx5_ib_dmabuf_invalidate_cb, + .invalidate_mappings = mlx5_ib_dmabuf_invalidate_cb, };
static struct ib_mr * diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c index dbe51ecb9a20..76f900fa1687 100644 --- a/drivers/iommu/iommufd/pages.c +++ b/drivers/iommu/iommufd/pages.c @@ -1451,7 +1451,7 @@ static void iopt_revoke_notify(struct dma_buf_attachment *attach)
static struct dma_buf_attach_ops iopt_dmabuf_attach_revoke_ops = { .allow_peer2peer = true, - .move_notify = iopt_revoke_notify, + .invalidate_mappings = iopt_revoke_notify, };
/* diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 0bc492090237..1b397635c793 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -407,7 +407,7 @@ struct dma_buf { * through the device. * * - Dynamic importers should set fences for any access that they can't - * disable immediately from their &dma_buf_attach_ops.move_notify + * disable immediately from their &dma_buf_attach_ops.invalidate_mappings * callback. * * IMPORTANT: @@ -458,7 +458,7 @@ struct dma_buf_attach_ops { bool allow_peer2peer;
/** - * @move_notify: [optional] notification that the DMA-buf is moving + * @invalidate_mappings: [optional] notification that the DMA-buf is moving * * If this callback is provided the framework can avoid pinning the * backing store while mappings exists. @@ -475,7 +475,7 @@ struct dma_buf_attach_ops { * New mappings can be created after this callback returns, and will * point to the new location of the DMA-buf. */ - void (*move_notify)(struct dma_buf_attachment *attach); + void (*invalidate_mappings)(struct dma_buf_attachment *attach); };
/**
From: Leon Romanovsky leonro@nvidia.com
Along with renaming the .move_notify() callback, rename the corresponding dma-buf core function. This makes the expected behavior clear to exporters calling this function.
Signed-off-by: Leon Romanovsky leonro@nvidia.com
---
 drivers/dma-buf/dma-buf.c                  | 8 ++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 2 +-
 drivers/gpu/drm/xe/xe_bo.c                 | 2 +-
 drivers/iommu/iommufd/selftest.c           | 2 +-
 drivers/vfio/pci/vfio_pci_dmabuf.c         | 4 ++--
 include/linux/dma-buf.h                    | 2 +-
 6 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 59cc647bf40e..e12db540c413 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -912,7 +912,7 @@ dma_buf_pin_on_map(struct dma_buf_attachment *attach) * 3. Exporters must hold the dma-buf reservation lock when calling these * functions: * - * - dma_buf_move_notify() + * - dma_buf_invalidate_mappings() */
/** @@ -1247,14 +1247,14 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
/** - * dma_buf_move_notify - notify attachments that DMA-buf is moving + * dma_buf_invalidate_mappings - notify attachments that DMA-buf is moving * * @dmabuf: [in] buffer which is moving * * Informs all attachments that they need to destroy and recreate all their * mappings. */ -void dma_buf_move_notify(struct dma_buf *dmabuf) +void dma_buf_invalidate_mappings(struct dma_buf *dmabuf) { struct dma_buf_attachment *attach;
@@ -1264,7 +1264,7 @@ void dma_buf_move_notify(struct dma_buf *dmabuf) if (attach->importer_ops) attach->importer_ops->invalidate_mappings(attach); } -EXPORT_SYMBOL_NS_GPL(dma_buf_move_notify, "DMA_BUF"); +EXPORT_SYMBOL_NS_GPL(dma_buf_invalidate_mappings, "DMA_BUF");
/** * DOC: cpu access diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index e08f58de4b17..f73dc99d1887 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -1270,7 +1270,7 @@ void amdgpu_bo_move_notify(struct ttm_buffer_object *bo,
if (abo->tbo.base.dma_buf && !drm_gem_is_imported(&abo->tbo.base) && old_mem && old_mem->mem_type != TTM_PL_SYSTEM) - dma_buf_move_notify(abo->tbo.base.dma_buf); + dma_buf_invalidate_mappings(abo->tbo.base.dma_buf);
/* move_notify is called before move happens */ trace_amdgpu_bo_move(abo, new_mem ? new_mem->mem_type : -1, diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c index bf4ee976b680..7d02cd9a8501 100644 --- a/drivers/gpu/drm/xe/xe_bo.c +++ b/drivers/gpu/drm/xe/xe_bo.c @@ -819,7 +819,7 @@ static int xe_bo_move_notify(struct xe_bo *bo,
/* Don't call move_notify() for imported dma-bufs. */ if (ttm_bo->base.dma_buf && !ttm_bo->base.import_attach) - dma_buf_move_notify(ttm_bo->base.dma_buf); + dma_buf_invalidate_mappings(ttm_bo->base.dma_buf);
/* * TTM has already nuked the mmap for us (see ttm_bo_unmap_virtual), diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 550ff36dec3a..f60cbd5328cc 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -2081,7 +2081,7 @@ static int iommufd_test_dmabuf_revoke(struct iommufd_ucmd *ucmd, int fd, priv = dmabuf->priv; dma_resv_lock(dmabuf->resv, NULL); priv->revoked = revoked; - dma_buf_move_notify(dmabuf); + dma_buf_invalidate_mappings(dmabuf); dma_resv_unlock(dmabuf->resv);
err_put: diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index 4be4a85005cb..d8ceafabef48 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -332,7 +332,7 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) if (priv->revoked != revoked) { dma_resv_lock(priv->dmabuf->resv, NULL); priv->revoked = revoked; - dma_buf_move_notify(priv->dmabuf); + dma_buf_invalidate_mappings(priv->dmabuf); dma_resv_unlock(priv->dmabuf->resv); } fput(priv->dmabuf->file); @@ -353,7 +353,7 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) list_del_init(&priv->dmabufs_elm); priv->vdev = NULL; priv->revoked = true; - dma_buf_move_notify(priv->dmabuf); + dma_buf_invalidate_mappings(priv->dmabuf); dma_resv_unlock(priv->dmabuf->resv); vfio_device_put_registration(&vdev->vdev); fput(priv->dmabuf->file); diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 1b397635c793..d5c3ce2b3aa4 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -600,7 +600,7 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *, enum dma_data_direction); void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *, enum dma_data_direction); -void dma_buf_move_notify(struct dma_buf *dma_buf); +void dma_buf_invalidate_mappings(struct dma_buf *dma_buf); int dma_buf_begin_cpu_access(struct dma_buf *dma_buf, enum dma_data_direction dir); int dma_buf_end_cpu_access(struct dma_buf *dma_buf,
On 1/24/26 20:14, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
Along with renaming the .move_notify() callback, rename the corresponding dma-buf core function. This makes the expected behavior clear to exporters calling this function.
Signed-off-by: Leon Romanovsky leonro@nvidia.com
Reviewed-by: Christian König christian.koenig@amd.com
From: Leon Romanovsky leonro@nvidia.com
DMABUF_MOVE_NOTIFY was introduced in 2018 and has been marked as experimental and disabled by default ever since. Six years later, all new importers implement this callback.
It is therefore reasonable to drop CONFIG_DMABUF_MOVE_NOTIFY and always build DMABUF with support for it enabled.
Suggested-by: Christian König christian.koenig@amd.com
Signed-off-by: Leon Romanovsky leonro@nvidia.com
---
 drivers/dma-buf/Kconfig                     | 12 ------------
 drivers/dma-buf/dma-buf.c                   |  3 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 10 +++-------
 drivers/gpu/drm/amd/amdkfd/Kconfig          |  2 +-
 drivers/gpu/drm/xe/tests/xe_dma_buf.c       |  3 +--
 drivers/gpu/drm/xe/xe_dma_buf.c             | 12 ++++--------
 6 files changed, 10 insertions(+), 32 deletions(-)
diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig index b46eb8a552d7..84d5e9b24e20 100644 --- a/drivers/dma-buf/Kconfig +++ b/drivers/dma-buf/Kconfig @@ -40,18 +40,6 @@ config UDMABUF A driver to let userspace turn memfd regions into dma-bufs. Qemu can use this to create host dmabufs for guest framebuffers.
-config DMABUF_MOVE_NOTIFY - bool "Move notify between drivers (EXPERIMENTAL)" - default n - depends on DMA_SHARED_BUFFER - help - Don't pin buffers if the dynamic DMA-buf interface is available on - both the exporter as well as the importer. This fixes a security - problem where userspace is able to pin unrestricted amounts of memory - through DMA-buf. - This is marked experimental because we don't yet have a consistent - execution context and memory management between drivers. - config DMABUF_DEBUG bool "DMA-BUF debug checks" depends on DMA_SHARED_BUFFER diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index e12db540c413..cd68c1c0bfd7 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -847,8 +847,7 @@ static bool dma_buf_pin_on_map(struct dma_buf_attachment *attach) { return attach->dmabuf->ops->pin && - (!dma_buf_attachment_is_dynamic(attach) || - !IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY)); + !dma_buf_attachment_is_dynamic(attach); }
/** diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index cd4944ceb047..b7f85b8653cf 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -133,13 +133,9 @@ static int amdgpu_dma_buf_pin(struct dma_buf_attachment *attach) * notifiers are disabled, only allow pinning in VRAM when move * notiers are enabled. */ - if (!IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY)) { - domains &= ~AMDGPU_GEM_DOMAIN_VRAM; - } else { - list_for_each_entry(attach, &dmabuf->attachments, node) - if (!attach->peer2peer) - domains &= ~AMDGPU_GEM_DOMAIN_VRAM; - } + list_for_each_entry(attach, &dmabuf->attachments, node) + if (!attach->peer2peer) + domains &= ~AMDGPU_GEM_DOMAIN_VRAM;
if (domains & AMDGPU_GEM_DOMAIN_VRAM) bo->flags |= AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED; diff --git a/drivers/gpu/drm/amd/amdkfd/Kconfig b/drivers/gpu/drm/amd/amdkfd/Kconfig index 16e12c9913f9..a5d7467c2f34 100644 --- a/drivers/gpu/drm/amd/amdkfd/Kconfig +++ b/drivers/gpu/drm/amd/amdkfd/Kconfig @@ -27,7 +27,7 @@ config HSA_AMD_SVM
config HSA_AMD_P2P bool "HSA kernel driver support for peer-to-peer for AMD GPU devices" - depends on HSA_AMD && PCI_P2PDMA && DMABUF_MOVE_NOTIFY + depends on HSA_AMD && PCI_P2PDMA help Enable peer-to-peer (P2P) communication between AMD GPUs over the PCIe bus. This can improve performance of multi-GPU compute diff --git a/drivers/gpu/drm/xe/tests/xe_dma_buf.c b/drivers/gpu/drm/xe/tests/xe_dma_buf.c index 1f2cca5c2f81..c107687ef3c0 100644 --- a/drivers/gpu/drm/xe/tests/xe_dma_buf.c +++ b/drivers/gpu/drm/xe/tests/xe_dma_buf.c @@ -22,8 +22,7 @@ static bool p2p_enabled(struct dma_buf_test_params *params)
static bool is_dynamic(struct dma_buf_test_params *params) { - return IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY) && params->attach_ops && - params->attach_ops->invalidate_mappings; + return params->attach_ops && params->attach_ops->invalidate_mappings; }
static void check_residency(struct kunit *test, struct xe_bo *exported, diff --git a/drivers/gpu/drm/xe/xe_dma_buf.c b/drivers/gpu/drm/xe/xe_dma_buf.c index 1b9cd043e517..ea370cd373e9 100644 --- a/drivers/gpu/drm/xe/xe_dma_buf.c +++ b/drivers/gpu/drm/xe/xe_dma_buf.c @@ -56,14 +56,10 @@ static int xe_dma_buf_pin(struct dma_buf_attachment *attach) bool allow_vram = true; int ret;
- if (!IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY)) { - allow_vram = false; - } else { - list_for_each_entry(attach, &dmabuf->attachments, node) { - if (!attach->peer2peer) { - allow_vram = false; - break; - } + list_for_each_entry(attach, &dmabuf->attachments, node) { + if (!attach->peer2peer) { + allow_vram = false; + break; } }
On 1/24/26 20:14, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
DMABUF_MOVE_NOTIFY was introduced in 2018 and has been marked as experimental and disabled by default ever since. Six years later, all new importers implement this callback.
It is therefore reasonable to drop CONFIG_DMABUF_MOVE_NOTIFY and always build DMABUF with support for it enabled.
Suggested-by: Christian König christian.koenig@amd.com Signed-off-by: Leon Romanovsky leonro@nvidia.com
Reviewed-by: Christian König christian.koenig@amd.com
As long as nobody starts screaming in the last second or I encounter some other problem I'm going to push the first three patches to drm-misc-next now.
They are basically just cleanups we should have done a long time ago.
Regards, Christian.
On Tue, Jan 27, 2026 at 10:26:26AM +0100, Christian König wrote:
On 1/24/26 20:14, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
DMABUF_MOVE_NOTIFY was introduced in 2018 and has been marked as experimental and disabled by default ever since. Six years later, all new importers implement this callback.
It is therefore reasonable to drop CONFIG_DMABUF_MOVE_NOTIFY and always build DMABUF with support for it enabled.
Suggested-by: Christian König christian.koenig@amd.com Signed-off-by: Leon Romanovsky leonro@nvidia.com
Reviewed-by: Christian König christian.koenig@amd.com
As long as nobody starts screaming in the last second or I encounter some other problem I'm going to push the first three patches to drm-misc-next now.
How do you see the progress of the other patches? Can they be queued for that tree as well?
Thanks
They are basically just cleanups we should have done a long time ago.
Regards, Christian.
On 1/27/26 10:58, Leon Romanovsky wrote:
On Tue, Jan 27, 2026 at 10:26:26AM +0100, Christian König wrote:
On 1/24/26 20:14, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
DMABUF_MOVE_NOTIFY was introduced in 2018 and has been marked as experimental and disabled by default ever since. Six years later, all new importers implement this callback.
It is therefore reasonable to drop CONFIG_DMABUF_MOVE_NOTIFY and always build DMABUF with support for it enabled.
Suggested-by: Christian König christian.koenig@amd.com Signed-off-by: Leon Romanovsky leonro@nvidia.com
Reviewed-by: Christian König christian.koenig@amd.com
As long as nobody starts screaming in the last second or I encounter some other problem I'm going to push the first three patches to drm-misc-next now.
How do you see progress of other patches? Can they be queued for that tree as well?
I was hoping to get through them by the end of the week.
Just wanted to make sure that CI systems start to see the first three patches (which affect everybody), so that we get early feedback should we have missed something.
Regards, Christian.
Thanks
They are basically just cleanups we should have done a long time ago.
Regards, Christian.
On Tue, Jan 27, 2026 at 11:02:03AM +0100, Christian König wrote:
On 1/27/26 10:58, Leon Romanovsky wrote:
On Tue, Jan 27, 2026 at 10:26:26AM +0100, Christian König wrote:
On 1/24/26 20:14, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
DMABUF_MOVE_NOTIFY was introduced in 2018 and has been marked as experimental and disabled by default ever since. Six years later, all new importers implement this callback.
It is therefore reasonable to drop CONFIG_DMABUF_MOVE_NOTIFY and always build DMABUF with support for it enabled.
Suggested-by: Christian König christian.koenig@amd.com Signed-off-by: Leon Romanovsky leonro@nvidia.com
Reviewed-by: Christian König christian.koenig@amd.com
As long as nobody starts screaming in the last second or I encounter some other problem I'm going to push the first three patches to drm-misc-next now.
How do you see progress of other patches? Can they be queued for that tree as well?
I was hoping to get through them by the end of the week.
Just wanted to make sure that CI systems start to see the first three patches (who affect everybody), so that we get early feedback should we have missed something.
Perfect, I based my patches on these two commits:
61ceaf236115 (vfio/for-linus) vfio: Prevent from pinned DMABUF importers to attach to VFIO DMABUF
24d479d26b25 (tag: v6.19-rc6) Linux 6.19-rc6
Thanks
Regards, Christian.
Thanks
They are basically just cleanups we should have done a long time ago.
Regards, Christian.
On Tue, Jan 27, 2026 at 01:42:43PM +0200, Leon Romanovsky wrote:
On Tue, Jan 27, 2026 at 11:02:03AM +0100, Christian König wrote:
On 1/27/26 10:58, Leon Romanovsky wrote:
On Tue, Jan 27, 2026 at 10:26:26AM +0100, Christian König wrote:
On 1/24/26 20:14, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
DMABUF_MOVE_NOTIFY was introduced in 2018 and has been marked as experimental and disabled by default ever since. Six years later, all new importers implement this callback.
It is therefore reasonable to drop CONFIG_DMABUF_MOVE_NOTIFY and always build DMABUF with support for it enabled.
Suggested-by: Christian König christian.koenig@amd.com Signed-off-by: Leon Romanovsky leonro@nvidia.com
Reviewed-by: Christian König christian.koenig@amd.com
As long as nobody starts screaming in the last second or I encounter some other problem I'm going to push the first three patches to drm-misc-next now.
How do you see progress of other patches? Can they be queued for that tree as well?
I was hoping to get through them by the end of the week.
Just wanted to make sure that CI systems start to see the first three patches (who affect everybody), so that we get early feedback should we have missed something.
Perfect, I based my patches on these two commits: 61ceaf236115 (vfio/for-linus) vfio: Prevent from pinned DMABUF importers to attach to VFIO DMABUF 24d479d26b25 (tag: v6.19-rc6) Linux 6.19-rc6
The fix was merged https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
Thanks
Thanks
Regards, Christian.
Thanks
They are basically just cleanups we should have done a long time ago.
Regards, Christian.
On Tue, Jan 27, 2026 at 11:02:03AM +0100, Christian König wrote:
On 1/27/26 10:58, Leon Romanovsky wrote:
On Tue, Jan 27, 2026 at 10:26:26AM +0100, Christian König wrote:
On 1/24/26 20:14, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
DMABUF_MOVE_NOTIFY was introduced in 2018 and has been marked as experimental and disabled by default ever since. Six years later, all new importers implement this callback.
It is therefore reasonable to drop CONFIG_DMABUF_MOVE_NOTIFY and always build DMABUF with support for it enabled.
Suggested-by: Christian König christian.koenig@amd.com Signed-off-by: Leon Romanovsky leonro@nvidia.com
Reviewed-by: Christian König christian.koenig@amd.com
As long as nobody starts screaming in the last second or I encounter some other problem I'm going to push the first three patches to drm-misc-next now.
How do you see progress of other patches? Can they be queued for that tree as well?
I was hoping to get through them by the end of the week.
Christian,
Can we please merge the whole series now?
Thanks
From: Leon Romanovsky leonro@nvidia.com
dma-buf invalidation is handled asynchronously by the hardware, so VFIO must wait until all affected objects have been fully invalidated.
In addition, the dma-buf exporter is expecting that all importers unmap any buffers they previously mapped.
Fixes: 5d74781ebc86 ("vfio/pci: Add dma-buf export support for MMIO regions")
Signed-off-by: Leon Romanovsky leonro@nvidia.com
---
 drivers/vfio/pci/vfio_pci_dmabuf.c | 71 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 68 insertions(+), 3 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index d8ceafabef48..485515629fe4 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -17,6 +17,8 @@ struct vfio_pci_dma_buf { struct dma_buf_phys_vec *phys_vec; struct p2pdma_provider *provider; u32 nr_ranges; + struct kref kref; + struct completion comp; u8 revoked : 1; };
@@ -44,27 +46,46 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, return 0; }
+static void vfio_pci_dma_buf_done(struct kref *kref) +{ + struct vfio_pci_dma_buf *priv = + container_of(kref, struct vfio_pci_dma_buf, kref); + + complete(&priv->comp); +} + static struct sg_table * vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment, enum dma_data_direction dir) { struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv; + struct sg_table *ret;
dma_resv_assert_held(priv->dmabuf->resv);
if (priv->revoked) return ERR_PTR(-ENODEV);
- return dma_buf_phys_vec_to_sgt(attachment, priv->provider, - priv->phys_vec, priv->nr_ranges, - priv->size, dir); + ret = dma_buf_phys_vec_to_sgt(attachment, priv->provider, + priv->phys_vec, priv->nr_ranges, + priv->size, dir); + if (IS_ERR(ret)) + return ret; + + kref_get(&priv->kref); + return ret; }
static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment, struct sg_table *sgt, enum dma_data_direction dir) { + struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv; + + dma_resv_assert_held(priv->dmabuf->resv); + dma_buf_free_sgt(attachment, sgt, dir); + kref_put(&priv->kref, vfio_pci_dma_buf_done); }
static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) @@ -287,6 +308,9 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, goto err_dev_put; }
+ kref_init(&priv->kref); + init_completion(&priv->comp); + /* dma_buf_put() now frees priv */ INIT_LIST_HEAD(&priv->dmabufs_elm); down_write(&vdev->memory_lock); @@ -326,6 +350,8 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) lockdep_assert_held_write(&vdev->memory_lock);
list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { + unsigned long wait; + if (!get_file_active(&priv->dmabuf->file)) continue;
@@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) dma_resv_lock(priv->dmabuf->resv, NULL); priv->revoked = revoked; dma_buf_invalidate_mappings(priv->dmabuf); + dma_resv_wait_timeout(priv->dmabuf->resv, + DMA_RESV_USAGE_BOOKKEEP, false, + MAX_SCHEDULE_TIMEOUT); dma_resv_unlock(priv->dmabuf->resv); + if (revoked) { + kref_put(&priv->kref, vfio_pci_dma_buf_done); + /* Let's wait till all DMA unmap are completed. */ + wait = wait_for_completion_timeout( + &priv->comp, secs_to_jiffies(1)); + /* + * If you see this WARN_ON, it means that + * importer didn't call unmap in response to + * dma_buf_invalidate_mappings() which is not + * allowed. + */ + WARN(!wait, + "Timed out waiting for DMABUF unmap, importer has a broken invalidate_mapping()"); + } else { + /* + * Kref is initialize again, because when revoke + * was performed the reference counter was decreased + * to zero to trigger completion. + */ + kref_init(&priv->kref); + /* + * There is no need to wait as no mapping was + * performed when the previous status was + * priv->revoked == true. + */ + reinit_completion(&priv->comp); + } } fput(priv->dmabuf->file); } @@ -346,6 +402,8 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
down_write(&vdev->memory_lock); list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { + unsigned long wait; + if (!get_file_active(&priv->dmabuf->file)) continue;
@@ -354,7 +412,14 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) priv->vdev = NULL; priv->revoked = true; dma_buf_invalidate_mappings(priv->dmabuf); + dma_resv_wait_timeout(priv->dmabuf->resv, + DMA_RESV_USAGE_BOOKKEEP, false, + MAX_SCHEDULE_TIMEOUT); dma_resv_unlock(priv->dmabuf->resv); + kref_put(&priv->kref, vfio_pci_dma_buf_done); + wait = wait_for_completion_timeout(&priv->comp, + secs_to_jiffies(1)); + WARN_ON(!wait); vfio_device_put_registration(&vdev->vdev); fput(priv->dmabuf->file); }
On Sat, Jan 24, 2026 at 09:14:16PM +0200, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
dma-buf invalidation is handled asynchronously by the hardware, so VFIO must wait until all affected objects have been fully invalidated.
In addition, the dma-buf exporter is expecting that all importers unmap any buffers they previously mapped.
Fixes: 5d74781ebc86 ("vfio/pci: Add dma-buf export support for MMIO regions") Signed-off-by: Leon Romanovsky leonro@nvidia.com
drivers/vfio/pci/vfio_pci_dmabuf.c | 71 ++++++++++++++++++++++++++++++++++++-- 1 file changed, 68 insertions(+), 3 deletions(-)
@@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
		dma_resv_lock(priv->dmabuf->resv, NULL);
		priv->revoked = revoked;
		dma_buf_invalidate_mappings(priv->dmabuf);
		dma_resv_wait_timeout(priv->dmabuf->resv,
				      DMA_RESV_USAGE_BOOKKEEP, false,
				      MAX_SCHEDULE_TIMEOUT);
		dma_resv_unlock(priv->dmabuf->resv);
		if (revoked) {
			kref_put(&priv->kref, vfio_pci_dma_buf_done);
			/* Let's wait till all DMA unmap are completed. */
			wait = wait_for_completion_timeout(
				&priv->comp, secs_to_jiffies(1));
Is the 1-second constant sufficient for all hardware, or should the invalidate_mappings() contract require the callback to block until speculative reads are strictly fenced? I'm wondering about a case where a device's firmware has a high response latency, perhaps due to internal management tasks like error recovery or thermal management, and it exceeds the 1s timeout.
If the device is in the middle of a large DMA burst and the firmware is slow to flush the internal pipelines to a fully "quiesced" read-and-discard state, reclaiming the memory at exactly 1.001 seconds risks triggering platform-level faults.
Since we explicitly permit these speculative reads until unmap is complete, relying on a hardcoded timeout in the exporter seems to introduce a hardware-dependent race condition that could compromise system stability via IOMMU errors or AER faults.
Should the importer instead be required to guarantee that all speculative access has ceased before the invalidation call returns?
Thanks Praan
On Mon, Jan 26, 2026 at 08:53:57PM +0000, Pranjal Shrivastava wrote:
On Sat, Jan 24, 2026 at 09:14:16PM +0200, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
dma-buf invalidation is handled asynchronously by the hardware, so VFIO must wait until all affected objects have been fully invalidated.
In addition, the dma-buf exporter is expecting that all importers unmap any buffers they previously mapped.
Fixes: 5d74781ebc86 ("vfio/pci: Add dma-buf export support for MMIO regions") Signed-off-by: Leon Romanovsky leonro@nvidia.com
drivers/vfio/pci/vfio_pci_dmabuf.c | 71 ++++++++++++++++++++++++++++++++++++-- 1 file changed, 68 insertions(+), 3 deletions(-)
<...>
@@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
		dma_resv_lock(priv->dmabuf->resv, NULL);
		priv->revoked = revoked;
		dma_buf_invalidate_mappings(priv->dmabuf);
		dma_resv_wait_timeout(priv->dmabuf->resv,
				      DMA_RESV_USAGE_BOOKKEEP, false,
				      MAX_SCHEDULE_TIMEOUT);
		dma_resv_unlock(priv->dmabuf->resv);
		if (revoked) {
			kref_put(&priv->kref, vfio_pci_dma_buf_done);
			/* Let's wait till all DMA unmap are completed. */
			wait = wait_for_completion_timeout(
				&priv->comp, secs_to_jiffies(1));

Is the 1-second constant sufficient for all hardware, or should the invalidate_mappings() contract require the callback to block until speculative reads are strictly fenced? I'm wondering about a case where a device's firmware has a high response latency, perhaps due to internal management tasks like error recovery or thermal and it exceeds the 1s timeout.
If the device is in the middle of a large DMA burst and the firmware is slow to flush the internal pipelines to a fully "quiesced" read-and-discard state, reclaiming the memory at exactly 1.001 seconds risks triggering platform-level faults..
Since the wen explicitly permit these speculative reads until unmap is complete, relying on a hardcoded timeout in the exporter seems to introduce a hardware-dependent race condition that could compromise system stability via IOMMU errors or AER faults.
Should the importer instead be required to guarantee that all speculative access has ceased before the invalidation call returns?
It is guaranteed by the dma_resv_wait_timeout() call above. That call ensures that the hardware has completed all pending operations. The 1‑second delay is meant to catch cases where an in-kernel DMA unmap call is missing, which should not trigger any DMA activity at that point.
So yes, one second is more than sufficient.
Thanks
Thanks Praan
On Tue, Jan 27, 2026 at 10:58:35AM +0200, Leon Romanovsky wrote:
@@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
		dma_resv_lock(priv->dmabuf->resv, NULL);
		priv->revoked = revoked;
		dma_buf_invalidate_mappings(priv->dmabuf);
		dma_resv_wait_timeout(priv->dmabuf->resv,
				      DMA_RESV_USAGE_BOOKKEEP, false,
				      MAX_SCHEDULE_TIMEOUT);
		dma_resv_unlock(priv->dmabuf->resv);
		if (revoked) {
			kref_put(&priv->kref, vfio_pci_dma_buf_done);
			/* Let's wait till all DMA unmap are completed. */
			wait = wait_for_completion_timeout(
				&priv->comp, secs_to_jiffies(1));

Is the 1-second constant sufficient for all hardware, or should the invalidate_mappings() contract require the callback to block until speculative reads are strictly fenced? I'm wondering about a case where a device's firmware has a high response latency, perhaps due to internal management tasks like error recovery or thermal and it exceeds the 1s timeout.
If the device is in the middle of a large DMA burst and the firmware is slow to flush the internal pipelines to a fully "quiesced" read-and-discard state, reclaiming the memory at exactly 1.001 seconds risks triggering platform-level faults..
Since the wen explicitly permit these speculative reads until unmap is complete, relying on a hardcoded timeout in the exporter seems to introduce a hardware-dependent race condition that could compromise system stability via IOMMU errors or AER faults.
Should the importer instead be required to guarantee that all speculative access has ceased before the invalidation call returns?
It is guaranteed by the dma_resv_wait_timeout() call above. That call ensures that the hardware has completed all pending operations. The 1‑second delay is meant to catch cases where an in-kernel DMA unmap call is missing, which should not trigger any DMA activity at that point.
Christian may know actual examples, but my general feeling is he was worrying about drivers that have pushed the DMABUF to visibility on the GPU and the move notify & fences only shoot down some access. So it has to wait until the DMABUF is finally unmapped.
Pranjal's example should be covered by the driver adding a fence and then the unbounded fence wait will complete it.
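To sketch what that could look like (the my_ctx structure and my_hw_begin_quiesce() helper are made up, error handling elided): the importer's invalidate_mappings() publishes a fence covering its outstanding DMA, so the exporter's unbounded dma_resv_wait_timeout(..., DMA_RESV_USAGE_BOOKKEEP, ...) only returns once the hardware has actually quiesced:

	static void my_invalidate_mappings(struct dma_buf_attachment *attach)
	{
		struct my_ctx *ctx = attach->importer_priv;
		struct dma_fence *fence;

		dma_resv_assert_held(attach->dmabuf->resv);

		/* Begin quiescing the HW; the fence signals once all
		 * outstanding DMA has drained (hypothetical helper). */
		fence = my_hw_begin_quiesce(ctx);
		if (!dma_resv_reserve_fences(attach->dmabuf->resv, 1))
			dma_resv_add_fence(attach->dmabuf->resv, fence,
					   DMA_RESV_USAGE_BOOKKEEP);
		dma_fence_put(fence);
	}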
I think the open question here is if drivers that can't rely on their fences should return dma_buf_attach_revocable() = false ? It depends on how long they will leave the buffers mapped, and if it is "bounded time".
The converse is we want to detect bugs where drivers have wrongly set dma_buf_attach_revocable() = true and this turns into an infinite sleep, so the logging is necessary, IMHO.
At worst the code should sleep 1s, print, then keep sleeping..
Jason
From: Jason Gunthorpe jgg@ziepe.ca Sent: Wednesday, January 28, 2026 12:28 AM
On Tue, Jan 27, 2026 at 10:58:35AM +0200, Leon Romanovsky wrote:
@@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct
vfio_pci_core_device *vdev, bool revoked)
dma_resv_lock(priv->dmabuf->resv, NULL); priv->revoked = revoked; dma_buf_invalidate_mappings(priv->dmabuf);
dma_resv_wait_timeout(priv->dmabuf->resv,DMA_RESV_USAGE_BOOKKEEP,false,
MAX_SCHEDULE_TIMEOUT); dma_resv_unlock(priv->dmabuf->resv);if (revoked) {kref_put(&priv->kref,vfio_pci_dma_buf_done);
/* Let's wait till all DMA unmap arecompleted. */
wait = wait_for_completion_timeout(&priv->comp, secs_to_jiffies(1));Is the 1-second constant sufficient for all hardware, or should the invalidate_mappings() contract require the callback to block until speculative reads are strictly fenced? I'm wondering about a case where a device's firmware has a high response latency, perhaps due to internal management tasks like error recovery or thermal and it exceeds the 1s timeout.
If the device is in the middle of a large DMA burst and the firmware is slow to flush the internal pipelines to a fully "quiesced" read-and-discard state, reclaiming the memory at exactly 1.001 seconds risks triggering platform-level faults..
Since the wen explicitly permit these speculative reads until unmap is complete, relying on a hardcoded timeout in the exporter seems to introduce a hardware-dependent race condition that could compromise system stability via IOMMU errors or AER faults.
Should the importer instead be required to guarantee that all speculative access has ceased before the invalidation call returns?
It is guaranteed by the dma_resv_wait_timeout() call above. That call
ensures
that the hardware has completed all pending operations. The 1‑second
delay is
meant to catch cases where an in-kernel DMA unmap call is missing, which
should
not trigger any DMA activity at that point.
Christian may know actual examples, but my general feeling is he was worrying about drivers that have pushed the DMABUF to visibility on the GPU and the move notify & fences only shoot down some access. So it has to wait until the DMABUF is finally unmapped.
Pranjal's example should be covered by the driver adding a fence and then the unbounded fence wait will complete it.
Bear with me if it's an ignorant question.
The commit msg of patch6 says that VFIO doesn't tolerate unbounded wait, which is the reason behind the 2nd timeout wait here.
Then why is "the unbounded fence wait" not a problem in the same code path? the use of MAX_SCHEDULE_TIMEOUT imply a worst-case timeout in hundreds of years...
and it'd be helpful to put some words in the code based on what's discussed here.
On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@ziepe.ca Sent: Wednesday, January 28, 2026 12:28 AM
On Tue, Jan 27, 2026 at 10:58:35AM +0200, Leon Romanovsky wrote:
@@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct
vfio_pci_core_device *vdev, bool revoked)
dma_resv_lock(priv->dmabuf->resv, NULL); priv->revoked = revoked; dma_buf_invalidate_mappings(priv->dmabuf);
dma_resv_wait_timeout(priv->dmabuf->resv,DMA_RESV_USAGE_BOOKKEEP,false,
MAX_SCHEDULE_TIMEOUT); dma_resv_unlock(priv->dmabuf->resv);if (revoked) {kref_put(&priv->kref,vfio_pci_dma_buf_done);
/* Let's wait till all DMA unmap arecompleted. */
wait = wait_for_completion_timeout(&priv->comp, secs_to_jiffies(1));Is the 1-second constant sufficient for all hardware, or should the invalidate_mappings() contract require the callback to block until speculative reads are strictly fenced? I'm wondering about a case where a device's firmware has a high response latency, perhaps due to internal management tasks like error recovery or thermal and it exceeds the 1s timeout.
If the device is in the middle of a large DMA burst and the firmware is slow to flush the internal pipelines to a fully "quiesced" read-and-discard state, reclaiming the memory at exactly 1.001 seconds risks triggering platform-level faults..
Since the wen explicitly permit these speculative reads until unmap is complete, relying on a hardcoded timeout in the exporter seems to introduce a hardware-dependent race condition that could compromise system stability via IOMMU errors or AER faults.
Should the importer instead be required to guarantee that all speculative access has ceased before the invalidation call returns?
It is guaranteed by the dma_resv_wait_timeout() call above. That call
ensures
that the hardware has completed all pending operations. The 1‑second
delay is
meant to catch cases where an in-kernel DMA unmap call is missing, which
should
not trigger any DMA activity at that point.
Christian may know actual examples, but my general feeling is he was worrying about drivers that have pushed the DMABUF to visibility on the GPU and the move notify & fences only shoot down some access. So it has to wait until the DMABUF is finally unmapped.
Pranjal's example should be covered by the driver adding a fence and then the unbounded fence wait will complete it.
Bear me if it's an ignorant question.
The commit msg of patch6 says that VFIO doesn't tolerate unbounded wait, which is the reason behind the 2nd timeout wait here.
It is not accurate. A second timeout is present both in the description of patch 6 and in the VFIO implementation; the difference is that the timeout is enforced within VFIO.
Then why is "the unbounded fence wait" not a problem in the same code path? the use of MAX_SCHEDULE_TIMEOUT imply a worst-case timeout in hundreds of years...
"An unbounded fence wait" is a different class of wait. It indicates broken hardware that continues to issue DMA transactions even after it has been told to stop.
The second wait exists to catch software bugs or misuse, where the dma-buf importer has misrepresented its capabilities.
and it'd be helpful to put some words in the code based on what's discussed here.
We've documented as much as we can in dma_buf_attach_revocable() and dma_buf_invalidate_mappings(). Do you have any suggestions on what else should be added here?
Thanks
From: Leon Romanovsky leon@kernel.org Sent: Thursday, January 29, 2026 3:34 PM
On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@ziepe.ca Sent: Wednesday, January 28, 2026 12:28 AM
On Tue, Jan 27, 2026 at 10:58:35AM +0200, Leon Romanovsky wrote:
@@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct
vfio_pci_core_device *vdev, bool revoked)
dma_resv_lock(priv->dmabuf->resv, NULL); priv->revoked = revoked; dma_buf_invalidate_mappings(priv-dmabuf);
dma_resv_wait_timeout(priv->dmabuf->resv,DMA_RESV_USAGE_BOOKKEEP,
false,
MAX_SCHEDULE_TIMEOUT);
dma_resv_unlock(priv->dmabuf->resv);
if (revoked) {kref_put(&priv->kref,vfio_pci_dma_buf_done);
/* Let's wait till all DMA unmap arecompleted. */
wait = wait_for_completion_timeout(&priv->comp,secs_to_jiffies(1));
Is the 1-second constant sufficient for all hardware, or should the invalidate_mappings() contract require the callback to block until speculative reads are strictly fenced? I'm wondering about a case
where
a device's firmware has a high response latency, perhaps due to
internal
management tasks like error recovery or thermal and it exceeds the
1s
timeout.
If the device is in the middle of a large DMA burst and the firmware is slow to flush the internal pipelines to a fully "quiesced" read-and-discard state, reclaiming the memory at exactly 1.001
seconds
risks triggering platform-level faults..
Since the wen explicitly permit these speculative reads until unmap is complete, relying on a hardcoded timeout in the exporter seems to introduce a hardware-dependent race condition that could
compromise
system stability via IOMMU errors or AER faults.
Should the importer instead be required to guarantee that all speculative access has ceased before the invalidation call returns?
It is guaranteed by the dma_resv_wait_timeout() call above. That call
ensures
that the hardware has completed all pending operations. The 1‑second
delay is
meant to catch cases where an in-kernel DMA unmap call is missing,
which
should
not trigger any DMA activity at that point.
Christian may know actual examples, but my general feeling is he was worrying about drivers that have pushed the DMABUF to visibility on the GPU and the move notify & fences only shoot down some access. So it has to wait until the DMABUF is finally unmapped.
Pranjal's example should be covered by the driver adding a fence and then the unbounded fence wait will complete it.
Bear me if it's an ignorant question.
The commit msg of patch6 says that VFIO doesn't tolerate unbounded wait, which is the reason behind the 2nd timeout wait here.
It is not accurate. A second timeout is present both in the description of patch 6 and in VFIO implementation. The difference is that the timeout is enforced within VFIO.
Then why is "the unbounded fence wait" not a problem in the same code path? the use of MAX_SCHEDULE_TIMEOUT imply a worst-case timeout in hundreds of years...
"An unbounded fence wait" is a different class of wait. It indicates broken hardware that continues to issue DMA transactions even after it has been told to stop.
The second wait exists to catch software bugs or misuse, where the dma-buf importer has misrepresented its capabilities.
Okay I see.
and it'd be helpful to put some words in the code based on what's discussed here.
We've documented as much as we can in dma_buf_attach_revocable() and dma_buf_invalidate_mappings(). Do you have any suggestions on what else should be added here?
the selection of 1s?
then,
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Thu, Jan 29, 2026 at 08:13:18AM +0000, Tian, Kevin wrote:
From: Leon Romanovsky leon@kernel.org Sent: Thursday, January 29, 2026 3:34 PM
On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@ziepe.ca Sent: Wednesday, January 28, 2026 12:28 AM
On Tue, Jan 27, 2026 at 10:58:35AM +0200, Leon Romanovsky wrote:
> @@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct
vfio_pci_core_device *vdev, bool revoked)
> dma_resv_lock(priv->dmabuf->resv, NULL); > priv->revoked = revoked; > dma_buf_invalidate_mappings(priv-
dmabuf);
> + dma_resv_wait_timeout(priv->dmabuf->resv, > +
DMA_RESV_USAGE_BOOKKEEP,
false,
> +
MAX_SCHEDULE_TIMEOUT);
> dma_resv_unlock(priv->dmabuf->resv); > + if (revoked) { > + kref_put(&priv->kref,
vfio_pci_dma_buf_done);
> + /* Let's wait till all DMA unmap are
completed. */
> + wait = wait_for_completion_timeout( > + &priv->comp,
secs_to_jiffies(1));
Is the 1-second constant sufficient for all hardware, or should the invalidate_mappings() contract require the callback to block until speculative reads are strictly fenced? I'm wondering about a case
where
a device's firmware has a high response latency, perhaps due to
internal
management tasks like error recovery or thermal and it exceeds the
1s
timeout.
If the device is in the middle of a large DMA burst and the firmware is slow to flush the internal pipelines to a fully "quiesced" read-and-discard state, reclaiming the memory at exactly 1.001
seconds
risks triggering platform-level faults..
Since the wen explicitly permit these speculative reads until unmap is complete, relying on a hardcoded timeout in the exporter seems to introduce a hardware-dependent race condition that could
compromise
system stability via IOMMU errors or AER faults.
Should the importer instead be required to guarantee that all speculative access has ceased before the invalidation call returns?
It is guaranteed by the dma_resv_wait_timeout() call above. That call
ensures
that the hardware has completed all pending operations. The 1‑second
delay is
meant to catch cases where an in-kernel DMA unmap call is missing,
which
should
not trigger any DMA activity at that point.
Christian may know actual examples, but my general feeling is he was worrying about drivers that have pushed the DMABUF to visibility on the GPU and the move notify & fences only shoot down some access. So it has to wait until the DMABUF is finally unmapped.
Pranjal's example should be covered by the driver adding a fence and then the unbounded fence wait will complete it.
Bear me if it's an ignorant question.
The commit msg of patch6 says that VFIO doesn't tolerate unbounded wait, which is the reason behind the 2nd timeout wait here.
It is not accurate. A second timeout is present both in the description of patch 6 and in VFIO implementation. The difference is that the timeout is enforced within VFIO.
Then why is "the unbounded fence wait" not a problem in the same code path? the use of MAX_SCHEDULE_TIMEOUT imply a worst-case timeout in hundreds of years...
"An unbounded fence wait" is a different class of wait. It indicates broken hardware that continues to issue DMA transactions even after it has been told to stop.
The second wait exists to catch software bugs or misuse, where the dma-buf importer has misrepresented its capabilities.
Okay I see.
and it'd be helpful to put some words in the code based on what's discussed here.
We've documented as much as we can in dma_buf_attach_revocable() and dma_buf_invalidate_mappings(). Do you have any suggestions on what else should be added here?
the selection of 1s?
It is indirectly written in the description of the WARN_ON(), but let's add more. What about the following?
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 93795ad2e025..948ba75288c6 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -357,7 +357,13 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
 		dma_resv_unlock(priv->dmabuf->resv);
 		if (revoked) {
 			kref_put(&priv->kref, vfio_pci_dma_buf_done);
-			/* Let's wait till all DMA unmap are completed. */
+			/*
+			 * Let's wait for 1 second till all DMA unmap
+			 * are completed. It is supposed to catch dma-buf
+			 * importers which lied about their support
+			 * of dmabuf revoke. See dma_buf_invalidate_mappings()
+			 * for the expected behaviour,
+			 */
 			wait = wait_for_completion_timeout(
 				&priv->comp, secs_to_jiffies(1));
 			/*
then,
Reviewed-by: Kevin Tian kevin.tian@intel.com
Thanks
On Thu, 29 Jan 2026 10:41:56 +0200 Leon Romanovsky leon@kernel.org wrote:
On Thu, Jan 29, 2026 at 08:13:18AM +0000, Tian, Kevin wrote:
From: Leon Romanovsky leon@kernel.org Sent: Thursday, January 29, 2026 3:34 PM
On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@ziepe.ca Sent: Wednesday, January 28, 2026 12:28 AM
On Tue, Jan 27, 2026 at 10:58:35AM +0200, Leon Romanovsky wrote:
> > @@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct
vfio_pci_core_device *vdev, bool revoked)
> > dma_resv_lock(priv->dmabuf->resv, NULL); > > priv->revoked = revoked; > > dma_buf_invalidate_mappings(priv-
dmabuf);
> > + dma_resv_wait_timeout(priv->dmabuf->resv, > > +
DMA_RESV_USAGE_BOOKKEEP,
false,
> > +
MAX_SCHEDULE_TIMEOUT);
> > dma_resv_unlock(priv->dmabuf->resv); > > + if (revoked) { > > + kref_put(&priv->kref,
vfio_pci_dma_buf_done);
> > + /* Let's wait till all DMA unmap are
completed. */
> > + wait = wait_for_completion_timeout( > > + &priv->comp,
secs_to_jiffies(1));
> > Is the 1-second constant sufficient for all hardware, or should the > invalidate_mappings() contract require the callback to block until > speculative reads are strictly fenced? I'm wondering about a case
where
> a device's firmware has a high response latency, perhaps due to
internal
> management tasks like error recovery or thermal and it exceeds the
1s
> timeout. > > If the device is in the middle of a large DMA burst and the firmware is > slow to flush the internal pipelines to a fully "quiesced" > read-and-discard state, reclaiming the memory at exactly 1.001
seconds
> risks triggering platform-level faults.. > > Since the wen explicitly permit these speculative reads until unmap is > complete, relying on a hardcoded timeout in the exporter seems to > introduce a hardware-dependent race condition that could
compromise
> system stability via IOMMU errors or AER faults. > > Should the importer instead be required to guarantee that all > speculative access has ceased before the invalidation call returns?
It is guaranteed by the dma_resv_wait_timeout() call above. That call
ensures
that the hardware has completed all pending operations. The 1‑second
delay is
meant to catch cases where an in-kernel DMA unmap call is missing,
which
should
not trigger any DMA activity at that point.
Christian may know actual examples, but my general feeling is he was worrying about drivers that have pushed the DMABUF to visibility on the GPU and the move notify & fences only shoot down some access. So it has to wait until the DMABUF is finally unmapped.
Pranjal's example should be covered by the driver adding a fence and then the unbounded fence wait will complete it.
Bear me if it's an ignorant question.
The commit msg of patch6 says that VFIO doesn't tolerate unbounded wait, which is the reason behind the 2nd timeout wait here.
It is not accurate. A second timeout is present both in the description of patch 6 and in VFIO implementation. The difference is that the timeout is enforced within VFIO.
Then why is "the unbounded fence wait" not a problem in the same code path? the use of MAX_SCHEDULE_TIMEOUT imply a worst-case timeout in hundreds of years...
"An unbounded fence wait" is a different class of wait. It indicates broken hardware that continues to issue DMA transactions even after it has been told to stop.
The second wait exists to catch software bugs or misuse, where the dma-buf importer has misrepresented its capabilities.
Okay I see.
and it'd be helpful to put some words in the code based on what's discussed here.
We've documented as much as we can in dma_buf_attach_revocable() and dma_buf_invalidate_mappings(). Do you have any suggestions on what else should be added here?
the selection of 1s?
It is indirectly written in description of WARN_ON(), but let's add more. What about the following?
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index 93795ad2e025..948ba75288c6 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -357,7 +357,13 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) dma_resv_unlock(priv->dmabuf->resv); if (revoked) { kref_put(&priv->kref, vfio_pci_dma_buf_done);
/* Let's wait till all DMA unmap are completed. */
/** Let's wait for 1 second till all DMA unmap* are completed. It is supposed to catch dma-buf* importers which lied about their support* of dmabuf revoke. See dma_buf_invalidate_mappings()* for the expected behaviour,*/ wait = wait_for_completion_timeout( &priv->comp, secs_to_jiffies(1)); /*then,
Reviewed-by: Kevin Tian kevin.tian@intel.com
Reviewed-by: Alex Williamson alex@shazbot.org
From: Leon Romanovsky leon@kernel.org Sent: Thursday, January 29, 2026 4:42 PM
On Thu, Jan 29, 2026 at 08:13:18AM +0000, Tian, Kevin wrote:
From: Leon Romanovsky leon@kernel.org Sent: Thursday, January 29, 2026 3:34 PM
On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@ziepe.ca Sent: Wednesday, January 28, 2026 12:28 AM
On Tue, Jan 27, 2026 at 10:58:35AM +0200, Leon Romanovsky wrote:
> > @@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct
vfio_pci_core_device *vdev, bool revoked)
> > dma_resv_lock(priv->dmabuf->resv, NULL); > > priv->revoked = revoked; > > dma_buf_invalidate_mappings(priv-
dmabuf);
> > + dma_resv_wait_timeout(priv->dmabuf->resv, > > +
DMA_RESV_USAGE_BOOKKEEP,
false,
> > +
MAX_SCHEDULE_TIMEOUT);
> > dma_resv_unlock(priv->dmabuf->resv); > > + if (revoked) { > > + kref_put(&priv->kref,
vfio_pci_dma_buf_done);
> > + /* Let's wait till all DMA unmap are
completed. */
> > + wait = wait_for_completion_timeout( > > + &priv->comp,
secs_to_jiffies(1));
> > Is the 1-second constant sufficient for all hardware, or should the > invalidate_mappings() contract require the callback to block until > speculative reads are strictly fenced? I'm wondering about a case
where
> a device's firmware has a high response latency, perhaps due to
internal
> management tasks like error recovery or thermal and it exceeds
the
1s
> timeout. > > If the device is in the middle of a large DMA burst and the
firmware is
> slow to flush the internal pipelines to a fully "quiesced" > read-and-discard state, reclaiming the memory at exactly 1.001
seconds
> risks triggering platform-level faults.. > > Since the wen explicitly permit these speculative reads until
unmap is
> complete, relying on a hardcoded timeout in the exporter seems
to
> introduce a hardware-dependent race condition that could
compromise
> system stability via IOMMU errors or AER faults. > > Should the importer instead be required to guarantee that all > speculative access has ceased before the invalidation call returns?
It is guaranteed by the dma_resv_wait_timeout() call above. That
call
ensures
that the hardware has completed all pending operations. The
1‑second
delay is
meant to catch cases where an in-kernel DMA unmap call is missing,
which
should
not trigger any DMA activity at that point.
Christian may know actual examples, but my general feeling is he was worrying about drivers that have pushed the DMABUF to visibility on the GPU and the move notify & fences only shoot down some access.
So
it has to wait until the DMABUF is finally unmapped.
Pranjal's example should be covered by the driver adding a fence and then the unbounded fence wait will complete it.
Bear me if it's an ignorant question.
The commit msg of patch6 says that VFIO doesn't tolerate unbounded wait, which is the reason behind the 2nd timeout wait here.
It is not accurate. A second timeout is present both in the description of patch 6 and in VFIO implementation. The difference is that the timeout is enforced within VFIO.
Then why is "the unbounded fence wait" not a problem in the same code path? the use of MAX_SCHEDULE_TIMEOUT imply a worst-case timeout in hundreds of years...
"An unbounded fence wait" is a different class of wait. It indicates broken hardware that continues to issue DMA transactions even after it has been told to stop.
The second wait exists to catch software bugs or misuse, where the dma-
buf
importer has misrepresented its capabilities.
Okay I see.
and it'd be helpful to put some words in the code based on what's discussed here.
We've documented as much as we can in dma_buf_attach_revocable()
and
dma_buf_invalidate_mappings(). Do you have any suggestions on what
else
should be added here?
the selection of 1s?
It is indirectly written in description of WARN_ON(), but let's add more. What about the following?
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index 93795ad2e025..948ba75288c6 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -357,7 +357,13 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) dma_resv_unlock(priv->dmabuf->resv); if (revoked) { kref_put(&priv->kref, vfio_pci_dma_buf_done);
/* Let's wait till all DMA unmap are completed. */
			/*
			 * Let's wait for 1 second till all DMA unmap
			 * are completed. It is supposed to catch dma-buf
			 * importers which lied about their support
			 * of dmabuf revoke. See dma_buf_invalidate_mappings()
			 * for the expected behaviour,
			 */
			wait = wait_for_completion_timeout(
				&priv->comp, secs_to_jiffies(1));
			/*
looks good. Just replace the trailing "," with "."
On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
Bear me if it's an ignorant question.
The commit msg of patch6 says that VFIO doesn't tolerate unbounded wait, which is the reason behind the 2nd timeout wait here.
As far as I understand dmabuf design a fence wait should complete eventually under kernel control, because these sleeps are sprinkled all around the kernel today.
I suspect that is not actually true for every HW, probably something like "shader programs can run forever technically".
We can argue if those cases should not report revocable either, but at least this will work "correctly" even if it takes a huge amount of time.
I wouldn't mind seeing a shorter timeout and print on the fence too just in case.
Jason
From: Jason Gunthorpe jgg@ziepe.ca Sent: Thursday, January 29, 2026 10:59 PM
On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
Bear me if it's an ignorant question.
The commit msg of patch6 says that VFIO doesn't tolerate unbounded wait, which is the reason behind the 2nd timeout wait here.
As far as I understand dmabuf design a fence wait should complete eventually under kernel control, because these sleeps are sprinkled all around the kernel today.
I suspect that is not actually true for every HW, probably something like "shader programs can run forever technically".
We can argue if those cases should not report revocable either, but at least this will work "correctly" even if it takes a huge amount of time.
good to know those background.
I wouldn't mind seeing a shorter timeout and print on the fence too just in case.
either way is OK. It's not difficult to figure out a long wait anyway. 😊
On Fri, 30 Jan 2026 03:12:02 +0000 "Tian, Kevin" kevin.tian@intel.com wrote:
From: Jason Gunthorpe jgg@ziepe.ca Sent: Thursday, January 29, 2026 10:59 PM
On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
Bear me if it's an ignorant question.
The commit msg of patch6 says that VFIO doesn't tolerate unbounded wait, which is the reason behind the 2nd timeout wait here.
As far as I understand dmabuf design a fence wait should complete eventually under kernel control, because these sleeps are sprinkled all around the kernel today.
I suspect that is not actually true for every HW, probably something like "shader programs can run forever technically".
We can argue if those cases should not report revocable either, but at least this will work "correctly" even if it takes a huge amount of time.
good to know those background.
I wouldn't mind seeing a shorter timeout and print on the fence too just in case.
either way is OK. It's not difficult to figure out a long wait anyway. 😊
Please don't use Outlook when replying to patches - or ensure that it is properly configured to only send plain text, which I don't think is possible.

If you look at the source of this message, it is not plain text:
Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64
Your message content is:
PiBGcm9tOiBKYXNvbiBHdW50aG9ycGUgPGpnZ0B6aWVwZS5jYT4NCj4gU2VudDogVGh1cnNkYXks IEphbnVhcnkgMjksIDIwMjYgMTA6NTkgUE0NCj4gDQo+IE9uIFRodSwgSmFuIDI5LCAyMDI2IGF0 IDA3OjA2OjM3QU0gKzAwMDAsIFRpYW4sIEtldmluIHdyb3RlOg0KPiA+IEJlYXIgbWUgaWYgaXQn cyBhbiBpZ25vcmFudCBxdWVzdGlvbi4NCj4gPg0KPiA+IFRoZSBjb21taXQgbXNnIG9mIHBhdGNo NiBzYXlzIHRoYXQgVkZJTyBkb2Vzbid0IHRvbGVyYXRlIHVuYm91bmRlZA0KPiA+IHdhaXQsIHdo aWNoIGlzIHRoZSByZWFzb24gYmVoaW5kIHRoZSAybmQgdGltZW91dCB3YWl0IGhlcmUuDQo+IA0K PiBBcyBmYXIgYXMgSSB1bmRlcnN0YW5kIGRtYWJ1ZiBkZXNpZ24gYSBmZW5jZSB3YWl0IHNob3Vs ZCBjb21wbGV0ZQ0KPiBldmVudHVhbGx5IHVuZGVyIGtlcm5lbCBjb250cm9sLCBiZWNhdXNlIHRo ZXNlIHNsZWVwcyBhcmUNCj4gc3ByaW5rbGVkIGFsbCBhcm91bmQgdGhlIGtlcm5lbCB0b2RheS4N Cj4gDQo+IEkgc3VzcGVjdCB0aGF0IGlzIG5vdCBhY3R1YWxseSB0cnVlIGZvciBldmVyeSBIVywg cHJvYmFibHkgc29tZXRoaW5nDQo+IGxpa2UgInNoYWRlciBwcm9ncmFtcyBjYW4gcnVuIGZvcmV2 ZXIgdGVjaG5pY2FsbHkiLg0KPiANCj4gV2UgY2FuIGFyZ3VlIGlmIHRob3NlIGNhc2VzIHNob3Vs ZCBub3QgcmVwb3J0IHJldm9jYWJsZSBlaXRoZXIsIGJ1dCBhdA0KPiBsZWFzdCB0aGlzIHdpbGwg d29yayAiY29ycmVjdGx5IiBldmVuIGlmIGl0IHRha2VzIGEgaHVnZSBhbW91bnQgb2YNCj4gdGlt ZS4NCg0KZ29vZCB0byBrbm93IHRob3NlIGJhY2tncm91bmQuDQoNCj4gDQo+IEkgd291bGRuJ3Qg bWluZCBzZWVpbmcgYSBzaG9ydGVyIHRpbWVvdXQgYW5kIHByaW50IG9uIHRoZSBmZW5jZSB0b28N Cj4ganVzdCBpbiBjYXNlLg0KPiANCg0KZWl0aGVyIHdheSBpcyBPSy4gSXQncyBub3QgZGlmZmlj dWx0IHRvIGZpZ3VyZSBvdXQgYSBsb25nIHdhaXQgYW55d2F5LiDwn5iKDQo=
which is something that patch tools - in particular patchwork - won't handle.
Thanks, Mauro
From: Mauro Carvalho Chehab mchehab+huawei@kernel.org Sent: Friday, January 30, 2026 1:43 PM
On Fri, 30 Jan 2026 03:12:02 +0000 "Tian, Kevin" kevin.tian@intel.com wrote:
From: Jason Gunthorpe jgg@ziepe.ca Sent: Thursday, January 29, 2026 10:59 PM
On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
Bear me if it's an ignorant question.
The commit msg of patch6 says that VFIO doesn't tolerate unbounded wait, which is the reason behind the 2nd timeout wait here.
As far as I understand dmabuf design a fence wait should complete eventually under kernel control, because these sleeps are sprinkled all around the kernel today.
I suspect that is not actually true for every HW, probably something like "shader programs can run forever technically".
We can argue if those cases should not report revocable either, but at least this will work "correctly" even if it takes a huge amount of time.
good to know those background.
I wouldn't mind seeing a shorter timeout and print on the fence too just in case.
either way is OK. It's not difficult to figure out a long wait anyway. 😊
Please don't use Outlook when answering to patches - or ensure that it is properly patched to only send plain text - which I don't think it is possible.
If you look on this message source code, it is not in plain text:
Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64
it's likely caused by the trailing smile icon. I'll pay attention to it.
Your message content is:
PiBGcm9tOiBKYXNvbiBHdW50aG9ycGUgPGpnZ0B6aWVwZS5jYT4NCj 4gU2VudDogVGh1cnNkYXks IEphbnVhcnkgMjksIDIwMjYgMTA6NTkgUE0NCj4gDQo+IE9uIFRodSw gSmFuIDI5LCAyMDI2IGF0 IDA3OjA2OjM3QU0gKzAwMDAsIFRpYW4sIEtldmluIHdyb3RlOg0KPiA+ IEJlYXIgbWUgaWYgaXQn cyBhbiBpZ25vcmFudCBxdWVzdGlvbi4NCj4gPg0KPiA+IFRoZSBjb21taX QgbXNnIG9mIHBhdGNo NiBzYXlzIHRoYXQgVkZJTyBkb2Vzbid0IHRvbGVyYXRlIHVuYm91bmRlZ A0KPiA+IHdhaXQsIHdo aWNoIGlzIHRoZSByZWFzb24gYmVoaW5kIHRoZSAybmQgdGltZW91dC B3YWl0IGhlcmUuDQo+IA0K PiBBcyBmYXIgYXMgSSB1bmRlcnN0YW5kIGRtYWJ1ZiBkZXNpZ24gYSB mZW5jZSB3YWl0IHNob3Vs ZCBjb21wbGV0ZQ0KPiBldmVudHVhbGx5IHVuZGVyIGtlcm5lbCBjb250 cm9sLCBiZWNhdXNlIHRo ZXNlIHNsZWVwcyBhcmUNCj4gc3ByaW5rbGVkIGFsbCBhcm91bmQgd GhlIGtlcm5lbCB0b2RheS4N Cj4gDQo+IEkgc3VzcGVjdCB0aGF0IGlzIG5vdCBhY3R1YWxseSB0cnVlIG ZvciBldmVyeSBIVywg cHJvYmFibHkgc29tZXRoaW5nDQo+IGxpa2UgInNoYWRlciBwcm9ncmF tcyBjYW4gcnVuIGZvcmV2 ZXIgdGVjaG5pY2FsbHkiLg0KPiANCj4gV2UgY2FuIGFyZ3VlIGlmIHRob3 NlIGNhc2VzIHNob3Vs ZCBub3QgcmVwb3J0IHJldm9jYWJsZSBlaXRoZXIsIGJ1dCBhdA0KPiBsZ WFzdCB0aGlzIHdpbGwg d29yayAiY29ycmVjdGx5IiBldmVuIGlmIGl0IHRha2VzIGEgaHVnZSBhbW 91bnQgb2YNCj4gdGlt ZS4NCg0KZ29vZCB0byBrbm93IHRob3NlIGJhY2tncm91bmQuDQoNCj4 gDQo+IEkgd291bGRuJ3Qg bWluZCBzZWVpbmcgYSBzaG9ydGVyIHRpbWVvdXQgYW5kIHByaW50I G9uIHRoZSBmZW5jZSB0b28N Cj4ganVzdCBpbiBjYXNlLg0KPiANCg0KZWl0aGVyIHdheSBpcyBPSy4gSX QncyBub3QgZGlmZmlj dWx0IHRvIGZpZ3VyZSBvdXQgYSBsb25nIHdhaXQgYW55d2F5LiDwn5i KDQo=
which is something that patch tools - in special patchwork - won't handle.
Thanks, Mauro
On 1/29/26 15:58, Jason Gunthorpe wrote:
On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
Bear me if it's an ignorant question.
The commit msg of patch6 says that VFIO doesn't tolerate unbounded wait, which is the reason behind the 2nd timeout wait here.
As far as I understand dmabuf design a fence wait should complete eventually under kernel control, because these sleeps are sprinkled all around the kernel today.
Well it's a bit different, but we indeed guarantee that dma_fences complete in finite time.
I suspect that is not actually true for every HW, probably something like "shader programs can run forever technically".
Nope, stuff like that is strictly forbidden.
Regards, Christian.
We can argue if those cases should not report revocable either, but at least this will work "correctly" even if it takes a huge amount of time.
I wouldn't mind seeing a shorter timeout and print on the fence too just in case.
Jason
On 1/24/26 20:14, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
dma-buf invalidation is handled asynchronously by the hardware, so VFIO must wait until all affected objects have been fully invalidated.
In addition, the dma-buf exporter is expecting that all importers unmap any buffers they previously mapped.
Fixes: 5d74781ebc86 ("vfio/pci: Add dma-buf export support for MMIO regions") Signed-off-by: Leon Romanovsky leonro@nvidia.com
 drivers/vfio/pci/vfio_pci_dmabuf.c | 71 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 68 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index d8ceafabef48..485515629fe4 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -17,6 +17,8 @@ struct vfio_pci_dma_buf {
 	struct dma_buf_phys_vec *phys_vec;
 	struct p2pdma_provider *provider;
 	u32 nr_ranges;
+	struct kref kref;
+	struct completion comp;
 	u8 revoked : 1;
 };

@@ -44,27 +46,46 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
 	return 0;
 }

+static void vfio_pci_dma_buf_done(struct kref *kref)
+{
+	struct vfio_pci_dma_buf *priv =
+		container_of(kref, struct vfio_pci_dma_buf, kref);
+
+	complete(&priv->comp);
+}
+
 static struct sg_table *
 vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
		      enum dma_data_direction dir)
 {
 	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+	struct sg_table *ret;

 	dma_resv_assert_held(priv->dmabuf->resv);
 	if (priv->revoked)
 		return ERR_PTR(-ENODEV);

-	return dma_buf_phys_vec_to_sgt(attachment, priv->provider,
-				       priv->phys_vec, priv->nr_ranges,
-				       priv->size, dir);
+	ret = dma_buf_phys_vec_to_sgt(attachment, priv->provider,
+				      priv->phys_vec, priv->nr_ranges,
+				      priv->size, dir);
+	if (IS_ERR(ret))
+		return ret;
+
+	kref_get(&priv->kref);
+	return ret;
 }

 static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
				    struct sg_table *sgt,
				    enum dma_data_direction dir)
 {
+	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+
+	dma_resv_assert_held(priv->dmabuf->resv);
+
 	dma_buf_free_sgt(attachment, sgt, dir);
+	kref_put(&priv->kref, vfio_pci_dma_buf_done);
 }

 static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
@@ -287,6 +308,9 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 		goto err_dev_put;
 	}

+	kref_init(&priv->kref);
+	init_completion(&priv->comp);
+
 	/* dma_buf_put() now frees priv */
 	INIT_LIST_HEAD(&priv->dmabufs_elm);
 	down_write(&vdev->memory_lock);
@@ -326,6 +350,8 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
 	lockdep_assert_held_write(&vdev->memory_lock);

 	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		unsigned long wait;
+
 		if (!get_file_active(&priv->dmabuf->file))
 			continue;

@@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
 		dma_resv_lock(priv->dmabuf->resv, NULL);
 		priv->revoked = revoked;
 		dma_buf_invalidate_mappings(priv->dmabuf);
+		dma_resv_wait_timeout(priv->dmabuf->resv,
+				      DMA_RESV_USAGE_BOOKKEEP, false,
+				      MAX_SCHEDULE_TIMEOUT);
 		dma_resv_unlock(priv->dmabuf->resv);
+		if (revoked) {
+			kref_put(&priv->kref, vfio_pci_dma_buf_done);
+			/* Let's wait till all DMA unmap are completed. */
+			wait = wait_for_completion_timeout(
+				&priv->comp, secs_to_jiffies(1));
+			/*
+			 * If you see this WARN_ON, it means that
+			 * importer didn't call unmap in response to
+			 * dma_buf_invalidate_mappings() which is not
+			 * allowed.
+			 */
+			WARN(!wait,
+			     "Timed out waiting for DMABUF unmap, importer has a broken invalidate_mapping()");
You can do the revoke to do your resource management, for example re-use the backing store for something else.
But it is mandatory that you keep the mapping around indefinitely until the importer closes it.
Before that you can't do things like runtime PM or remove or anything which would make the DMA addresses invalid.
As far as I can see vfio_pci_dma_buf_move() is used exactly for that use case so this here is an absolutely clear NAK from my side for this approach.
You can either split up the functionality of vfio_pci_dma_buf_move() into vfio_pci_dma_buf_invalidate_mappings() and vfio_pci_dma_buf_flush() and then call the latter whenever necessary, or you keep it in one function and block everybody until the importer has dropped all mappings.
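A rough sketch of that split (the two function names follow the suggestion above; everything else is only illustrative, not code from the series):

	/* Tell importers to stop using their mappings; does not wait. */
	static void vfio_pci_dma_buf_invalidate_mappings(struct vfio_pci_dma_buf *priv,
							 bool revoked)
	{
		dma_resv_lock(priv->dmabuf->resv, NULL);
		priv->revoked = revoked;
		dma_buf_invalidate_mappings(priv->dmabuf);
		dma_resv_unlock(priv->dmabuf->resv);
	}

	/*
	 * Block until every importer has finished its DMA and called unmap,
	 * so the exporter can safely do FLR, runtime PM, remove, etc.
	 */
	static void vfio_pci_dma_buf_flush(struct vfio_pci_dma_buf *priv)
	{
		dma_resv_lock(priv->dmabuf->resv, NULL);
		dma_resv_wait_timeout(priv->dmabuf->resv, DMA_RESV_USAGE_BOOKKEEP,
				      false, MAX_SCHEDULE_TIMEOUT);
		dma_resv_unlock(priv->dmabuf->resv);

		/* Drop the base reference so the completion fires on last unmap. */
		kref_put(&priv->kref, vfio_pci_dma_buf_done);
		wait_for_completion(&priv->comp);
	}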
+		} else {
+			/*
+			 * Kref is initialize again, because when revoke
+			 * was performed the reference counter was decreased
+			 * to zero to trigger completion.
+			 */
+			kref_init(&priv->kref);
+			/*
+			 * There is no need to wait as no mapping was
+			 * performed when the previous status was
+			 * priv->revoked == true.
+			 */
+			reinit_completion(&priv->comp);
+		}
 		fput(priv->dmabuf->file);
 	}
This is also extremely questionable. Why doesn't the dmabuf have a reference while on the linked list?
Regards, Christian.
 }

@@ -346,6 +402,8 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
 	down_write(&vdev->memory_lock);
 	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		unsigned long wait;
+
 		if (!get_file_active(&priv->dmabuf->file))
 			continue;

@@ -354,7 +412,14 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
 		priv->vdev = NULL;
 		priv->revoked = true;
 		dma_buf_invalidate_mappings(priv->dmabuf);
+		dma_resv_wait_timeout(priv->dmabuf->resv,
+				      DMA_RESV_USAGE_BOOKKEEP, false,
+				      MAX_SCHEDULE_TIMEOUT);
 		dma_resv_unlock(priv->dmabuf->resv);
+		kref_put(&priv->kref, vfio_pci_dma_buf_done);
+		wait = wait_for_completion_timeout(&priv->comp,
+						   secs_to_jiffies(1));
+		WARN_ON(!wait);
 		vfio_device_put_registration(&vdev->vdev);
 		fput(priv->dmabuf->file);
 	}
On Fri, Jan 30, 2026 at 09:30:59AM +0100, Christian König wrote:
On 1/24/26 20:14, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
dma-buf invalidation is handled asynchronously by the hardware, so VFIO must wait until all affected objects have been fully invalidated.
In addition, the dma-buf exporter is expecting that all importers unmap any buffers they previously mapped.
Fixes: 5d74781ebc86 ("vfio/pci: Add dma-buf export support for MMIO regions") Signed-off-by: Leon Romanovsky leonro@nvidia.com
drivers/vfio/pci/vfio_pci_dmabuf.c | 71 ++++++++++++++++++++++++++++++++++++-- 1 file changed, 68 insertions(+), 3 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index d8ceafabef48..485515629fe4 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -17,6 +17,8 @@ struct vfio_pci_dma_buf { struct dma_buf_phys_vec *phys_vec; struct p2pdma_provider *provider; u32 nr_ranges;
- struct kref kref;
- struct completion comp; u8 revoked : 1;
}; @@ -44,27 +46,46 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, return 0; } +static void vfio_pci_dma_buf_done(struct kref *kref) +{
- struct vfio_pci_dma_buf *priv =
container_of(kref, struct vfio_pci_dma_buf, kref);- complete(&priv->comp);
+}
static struct sg_table * vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment, enum dma_data_direction dir) { struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
- struct sg_table *ret;
dma_resv_assert_held(priv->dmabuf->resv); if (priv->revoked) return ERR_PTR(-ENODEV);
- return dma_buf_phys_vec_to_sgt(attachment, priv->provider,
priv->phys_vec, priv->nr_ranges,priv->size, dir);
- ret = dma_buf_phys_vec_to_sgt(attachment, priv->provider,
priv->phys_vec, priv->nr_ranges,priv->size, dir);- if (IS_ERR(ret))
return ret;- kref_get(&priv->kref);
- return ret;
} static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment, struct sg_table *sgt, enum dma_data_direction dir) {
- struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
- dma_resv_assert_held(priv->dmabuf->resv);
- dma_buf_free_sgt(attachment, sgt, dir);
- kref_put(&priv->kref, vfio_pci_dma_buf_done);
} static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) @@ -287,6 +308,9 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, goto err_dev_put; }
- kref_init(&priv->kref);
- init_completion(&priv->comp);
- /* dma_buf_put() now frees priv */ INIT_LIST_HEAD(&priv->dmabufs_elm); down_write(&vdev->memory_lock);
@@ -326,6 +350,8 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) lockdep_assert_held_write(&vdev->memory_lock); list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
unsigned long wait;- if (!get_file_active(&priv->dmabuf->file)) continue;
@@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) dma_resv_lock(priv->dmabuf->resv, NULL); priv->revoked = revoked; dma_buf_invalidate_mappings(priv->dmabuf);
dma_resv_wait_timeout(priv->dmabuf->resv,DMA_RESV_USAGE_BOOKKEEP, false,MAX_SCHEDULE_TIMEOUT); dma_resv_unlock(priv->dmabuf->resv);if (revoked) {kref_put(&priv->kref, vfio_pci_dma_buf_done);/* Let's wait till all DMA unmap are completed. */wait = wait_for_completion_timeout(&priv->comp, secs_to_jiffies(1));/** If you see this WARN_ON, it means that* importer didn't call unmap in response to* dma_buf_invalidate_mappings() which is not* allowed.*/WARN(!wait,"Timed out waiting for DMABUF unmap, importer has a broken invalidate_mapping()");You can do the revoke to do your resource management, for example re-use the backing store for something else.
But it is mandatory that you keep the mapping around indefinitely until the importer closes it.
Before that you can't do things like runtime PM or remove or anything which would make the DMA addresses invalid.
As far as I can see vfio_pci_dma_buf_move() is used exactly for that use case so this here is an absolutely clear NAK from my side for this approach.
You can either split up the functionality of vfio_pci_dma_buf_move() into vfio_pci_dma_buf_invalidate_mappings() and vfio_pci_dma_buf_flush() and then call the later whenever necessary or you keep it in one function and block everybody until the importer has dropped all mappings.
No problem, I can change it to be:
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index d087d018d547..53772a84c93b 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -357,23 +357,7 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
 		dma_resv_unlock(priv->dmabuf->resv);
 		if (revoked) {
 			kref_put(&priv->kref, vfio_pci_dma_buf_done);
-			/*
-			 * Let's wait for 1 second till all DMA unmap
-			 * are completed. It is supposed to catch dma-buf
-			 * importers which lied about their support
-			 * of dmabuf revoke. See dma_buf_invalidate_mappings()
-			 * for the expected behaviour.
-			 */
-			wait = wait_for_completion_timeout(
-				&priv->comp, secs_to_jiffies(1));
-			/*
-			 * If you see this WARN_ON, it means that
-			 * importer didn't call unmap in response to
-			 * dma_buf_invalidate_mappings() which is not
-			 * allowed.
-			 */
-			WARN(!wait,
-			     "Timed out waiting for DMABUF unmap, importer has a broken invalidate_mapping()");
+			wait_for_completion(&priv->comp);
 		} else {
 			/*
 			 * Kref is initialize again, because when revoke
Do you want me to send v6?
Thanks
} else {/** Kref is initialize again, because when revoke* was performed the reference counter was decreased* to zero to trigger completion.*/kref_init(&priv->kref);/** There is no need to wait as no mapping was* performed when the previous status was* priv->revoked == true.*/reinit_completion(&priv->comp); } fput(priv->dmabuf->file);}This is also extremely questionable. Why doesn't the dmabuf have a reference while on the linked list?
Regards, Christian.
} @@ -346,6 +402,8 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) down_write(&vdev->memory_lock); list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
unsigned long wait;- if (!get_file_active(&priv->dmabuf->file)) continue;
@@ -354,7 +412,14 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) priv->vdev = NULL; priv->revoked = true; dma_buf_invalidate_mappings(priv->dmabuf);
dma_resv_wait_timeout(priv->dmabuf->resv,DMA_RESV_USAGE_BOOKKEEP, false, dma_resv_unlock(priv->dmabuf->resv);MAX_SCHEDULE_TIMEOUT);kref_put(&priv->kref, vfio_pci_dma_buf_done);wait = wait_for_completion_timeout(&priv->comp,secs_to_jiffies(1)); vfio_device_put_registration(&vdev->vdev); fput(priv->dmabuf->file); }WARN_ON(!wait);
On 1/30/26 14:01, Leon Romanovsky wrote:
On Fri, Jan 30, 2026 at 09:30:59AM +0100, Christian König wrote:
On 1/24/26 20:14, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
dma-buf invalidation is handled asynchronously by the hardware, so VFIO must wait until all affected objects have been fully invalidated.
In addition, the dma-buf exporter is expecting that all importers unmap any buffers they previously mapped.
Fixes: 5d74781ebc86 ("vfio/pci: Add dma-buf export support for MMIO regions") Signed-off-by: Leon Romanovsky leonro@nvidia.com
drivers/vfio/pci/vfio_pci_dmabuf.c | 71 ++++++++++++++++++++++++++++++++++++-- 1 file changed, 68 insertions(+), 3 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index d8ceafabef48..485515629fe4 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -17,6 +17,8 @@ struct vfio_pci_dma_buf { struct dma_buf_phys_vec *phys_vec; struct p2pdma_provider *provider; u32 nr_ranges;
- struct kref kref;
- struct completion comp; u8 revoked : 1;
}; @@ -44,27 +46,46 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, return 0; } +static void vfio_pci_dma_buf_done(struct kref *kref) +{
- struct vfio_pci_dma_buf *priv =
container_of(kref, struct vfio_pci_dma_buf, kref);- complete(&priv->comp);
+}
static struct sg_table * vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment, enum dma_data_direction dir) { struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
- struct sg_table *ret;
dma_resv_assert_held(priv->dmabuf->resv); if (priv->revoked) return ERR_PTR(-ENODEV);
- return dma_buf_phys_vec_to_sgt(attachment, priv->provider,
priv->phys_vec, priv->nr_ranges,priv->size, dir);
- ret = dma_buf_phys_vec_to_sgt(attachment, priv->provider,
priv->phys_vec, priv->nr_ranges,priv->size, dir);- if (IS_ERR(ret))
return ret;- kref_get(&priv->kref);
- return ret;
} static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment, struct sg_table *sgt, enum dma_data_direction dir) {
- struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
- dma_resv_assert_held(priv->dmabuf->resv);
- dma_buf_free_sgt(attachment, sgt, dir);
- kref_put(&priv->kref, vfio_pci_dma_buf_done);
} static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) @@ -287,6 +308,9 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, goto err_dev_put; }
- kref_init(&priv->kref);
- init_completion(&priv->comp);
- /* dma_buf_put() now frees priv */ INIT_LIST_HEAD(&priv->dmabufs_elm); down_write(&vdev->memory_lock);
@@ -326,6 +350,8 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) lockdep_assert_held_write(&vdev->memory_lock); list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
unsigned long wait;- if (!get_file_active(&priv->dmabuf->file)) continue;
@@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) dma_resv_lock(priv->dmabuf->resv, NULL); priv->revoked = revoked; dma_buf_invalidate_mappings(priv->dmabuf);
dma_resv_wait_timeout(priv->dmabuf->resv,DMA_RESV_USAGE_BOOKKEEP, false,MAX_SCHEDULE_TIMEOUT); dma_resv_unlock(priv->dmabuf->resv);if (revoked) {kref_put(&priv->kref, vfio_pci_dma_buf_done);/* Let's wait till all DMA unmap are completed. */wait = wait_for_completion_timeout(&priv->comp, secs_to_jiffies(1));/** If you see this WARN_ON, it means that* importer didn't call unmap in response to* dma_buf_invalidate_mappings() which is not* allowed.*/WARN(!wait,"Timed out waiting for DMABUF unmap, importer has a broken invalidate_mapping()");You can do the revoke to do your resource management, for example re-use the backing store for something else.
But it is mandatory that you keep the mapping around indefinitely until the importer closes it.
Before that you can't do things like runtime PM or remove or anything which would make the DMA addresses invalid.
As far as I can see vfio_pci_dma_buf_move() is used exactly for that use case so this here is an absolutely clear NAK from my side for this approach.
You can either split up the functionality of vfio_pci_dma_buf_move() into vfio_pci_dma_buf_invalidate_mappings() and vfio_pci_dma_buf_flush() and then call the later whenever necessary or you keep it in one function and block everybody until the importer has dropped all mappings.
No problem, I can change it to be:
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index d087d018d547..53772a84c93b 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -357,23 +357,7 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) dma_resv_unlock(priv->dmabuf->resv); if (revoked) { kref_put(&priv->kref, vfio_pci_dma_buf_done);
/** Let's wait for 1 second till all DMA unmap* are completed. It is supposed to catch dma-buf* importers which lied about their support* of dmabuf revoke. See dma_buf_invalidate_mappings()* for the expected behaviour.*/wait = wait_for_completion_timeout(&priv->comp, secs_to_jiffies(1));/** If you see this WARN_ON, it means that* importer didn't call unmap in response to* dma_buf_invalidate_mappings() which is not* allowed.*/WARN(!wait,"Timed out waiting for DMABUF unmap, importer has a broken invalidate_mapping()");
wait_for_completion(&priv->comp); } else { /* * Kref is initialize again, because when revokeDo you want me to send v6?
That would work for me.
Question is if you really want to do it this way? See usually exporters try to avoid blocking such functions.
What exporters usually do instead is to grab references, e.g. call pm_runtime_get_sync() when either a DMA-buf, a DMA-buf attachment or in your case here a mapping of this attachment is made.
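For illustration, the reference-grabbing pattern described here could look roughly like this in an exporter's map/unmap callbacks. Names with an example_ prefix are hypothetical and the error handling is trimmed; this is a sketch, not code from the series.

	static struct sg_table *
	example_dma_buf_map(struct dma_buf_attachment *attachment,
			    enum dma_data_direction dir)
	{
		struct example_exporter *priv = attachment->dmabuf->priv;
		struct sg_table *sgt;

		/* Keep the device powered for as long as a mapping exists. */
		if (pm_runtime_get_sync(priv->dev) < 0) {
			pm_runtime_put_noidle(priv->dev);
			return ERR_PTR(-EIO);
		}

		sgt = example_build_sgt(attachment, dir);
		if (IS_ERR(sgt))
			pm_runtime_put(priv->dev);
		return sgt;
	}

	static void example_dma_buf_unmap(struct dma_buf_attachment *attachment,
					  struct sg_table *sgt,
					  enum dma_data_direction dir)
	{
		struct example_exporter *priv = attachment->dmabuf->priv;

		example_free_sgt(attachment, sgt, dir);
		pm_runtime_put(priv->dev);	/* drop the reference taken in map */
	}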
But all of this is just a suggestion, if you are fine with blocking then feel free to add my rb.
Regards, Christian.
Thanks
} else {/** Kref is initialize again, because when revoke* was performed the reference counter was decreased* to zero to trigger completion.*/kref_init(&priv->kref);/** There is no need to wait as no mapping was* performed when the previous status was* priv->revoked == true.*/reinit_completion(&priv->comp); } fput(priv->dmabuf->file);}This is also extremely questionable. Why doesn't the dmabuf have a reference while on the linked list?
Regards, Christian.
} @@ -346,6 +402,8 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) down_write(&vdev->memory_lock); list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
unsigned long wait;- if (!get_file_active(&priv->dmabuf->file)) continue;
@@ -354,7 +412,14 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) priv->vdev = NULL; priv->revoked = true; dma_buf_invalidate_mappings(priv->dmabuf);
dma_resv_wait_timeout(priv->dmabuf->resv,DMA_RESV_USAGE_BOOKKEEP, false, dma_resv_unlock(priv->dmabuf->resv);MAX_SCHEDULE_TIMEOUT);kref_put(&priv->kref, vfio_pci_dma_buf_done);wait = wait_for_completion_timeout(&priv->comp,secs_to_jiffies(1)); vfio_device_put_registration(&vdev->vdev); fput(priv->dmabuf->file); }WARN_ON(!wait);
On Fri, Jan 30, 2026 at 02:21:08PM +0100, Christian König wrote:
On 1/30/26 14:01, Leon Romanovsky wrote:
On Fri, Jan 30, 2026 at 09:30:59AM +0100, Christian König wrote:
On 1/24/26 20:14, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
dma-buf invalidation is handled asynchronously by the hardware, so VFIO must wait until all affected objects have been fully invalidated.
In addition, the dma-buf exporter is expecting that all importers unmap any buffers they previously mapped.
Fixes: 5d74781ebc86 ("vfio/pci: Add dma-buf export support for MMIO regions") Signed-off-by: Leon Romanovsky leonro@nvidia.com
drivers/vfio/pci/vfio_pci_dmabuf.c | 71 ++++++++++++++++++++++++++++++++++++-- 1 file changed, 68 insertions(+), 3 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index d8ceafabef48..485515629fe4 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -17,6 +17,8 @@ struct vfio_pci_dma_buf { struct dma_buf_phys_vec *phys_vec; struct p2pdma_provider *provider; u32 nr_ranges;
- struct kref kref;
- struct completion comp; u8 revoked : 1;
}; @@ -44,27 +46,46 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, return 0; } +static void vfio_pci_dma_buf_done(struct kref *kref) +{
- struct vfio_pci_dma_buf *priv =
container_of(kref, struct vfio_pci_dma_buf, kref);- complete(&priv->comp);
+}
static struct sg_table * vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment, enum dma_data_direction dir) { struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
- struct sg_table *ret;
dma_resv_assert_held(priv->dmabuf->resv); if (priv->revoked) return ERR_PTR(-ENODEV);
- return dma_buf_phys_vec_to_sgt(attachment, priv->provider,
priv->phys_vec, priv->nr_ranges,priv->size, dir);
- ret = dma_buf_phys_vec_to_sgt(attachment, priv->provider,
priv->phys_vec, priv->nr_ranges,priv->size, dir);- if (IS_ERR(ret))
return ret;- kref_get(&priv->kref);
- return ret;
} static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment, struct sg_table *sgt, enum dma_data_direction dir) {
- struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
- dma_resv_assert_held(priv->dmabuf->resv);
- dma_buf_free_sgt(attachment, sgt, dir);
- kref_put(&priv->kref, vfio_pci_dma_buf_done);
} static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) @@ -287,6 +308,9 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, goto err_dev_put; }
- kref_init(&priv->kref);
- init_completion(&priv->comp);
- /* dma_buf_put() now frees priv */ INIT_LIST_HEAD(&priv->dmabufs_elm); down_write(&vdev->memory_lock);
@@ -326,6 +350,8 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) lockdep_assert_held_write(&vdev->memory_lock); list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
unsigned long wait;- if (!get_file_active(&priv->dmabuf->file)) continue;
@@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) dma_resv_lock(priv->dmabuf->resv, NULL); priv->revoked = revoked; dma_buf_invalidate_mappings(priv->dmabuf);
dma_resv_wait_timeout(priv->dmabuf->resv,DMA_RESV_USAGE_BOOKKEEP, false,MAX_SCHEDULE_TIMEOUT); dma_resv_unlock(priv->dmabuf->resv);if (revoked) {kref_put(&priv->kref, vfio_pci_dma_buf_done);/* Let's wait till all DMA unmap are completed. */wait = wait_for_completion_timeout(&priv->comp, secs_to_jiffies(1));/** If you see this WARN_ON, it means that* importer didn't call unmap in response to* dma_buf_invalidate_mappings() which is not* allowed.*/WARN(!wait,"Timed out waiting for DMABUF unmap, importer has a broken invalidate_mapping()");You can do the revoke to do your resource management, for example re-use the backing store for something else.
But it is mandatory that you keep the mapping around indefinitely until the importer closes it.
Before that you can't do things like runtime PM or remove or anything which would make the DMA addresses invalid.
As far as I can see vfio_pci_dma_buf_move() is used exactly for that use case so this here is an absolutely clear NAK from my side for this approach.
You can either split up the functionality of vfio_pci_dma_buf_move() into vfio_pci_dma_buf_invalidate_mappings() and vfio_pci_dma_buf_flush() and then call the later whenever necessary or you keep it in one function and block everybody until the importer has dropped all mappings.
No problem, I can change it to be:
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index d087d018d547..53772a84c93b 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -357,23 +357,7 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) dma_resv_unlock(priv->dmabuf->resv); if (revoked) { kref_put(&priv->kref, vfio_pci_dma_buf_done);
/** Let's wait for 1 second till all DMA unmap* are completed. It is supposed to catch dma-buf* importers which lied about their support* of dmabuf revoke. See dma_buf_invalidate_mappings()* for the expected behaviour.*/wait = wait_for_completion_timeout(&priv->comp, secs_to_jiffies(1));/** If you see this WARN_ON, it means that* importer didn't call unmap in response to* dma_buf_invalidate_mappings() which is not* allowed.*/WARN(!wait,"Timed out waiting for DMABUF unmap, importer has a broken invalidate_mapping()");
wait_for_completion(&priv->comp); } else { /* * Kref is initialize again, because when revokeDo you want me to send v6?
That would work for me.
Question is if you really want to do it this way? See usually exporters try to avoid blocking such functions.
What exporters usually do instead is to grab references, e.g. call pm_runtime_get_sync() when either a DMA-buf, a DMA-buf attachment or in your case here a mapping of this attachment is made.
I view this as an enhancement that can be addressed later down the road.
But all of this is just a suggestion, if you are fine with blocking then feel free to add my rb.
It is fine for initial version. We need to start somewhere.
Thanks
Regards, Christian.
Thanks
		} else {
			/*
			 * Kref is initialize again, because when revoke
			 * was performed the reference counter was decreased
			 * to zero to trigger completion.
			 */
			kref_init(&priv->kref);
			/*
			 * There is no need to wait as no mapping was
			 * performed when the previous status was
			 * priv->revoked == true.
			 */
			reinit_completion(&priv->comp);
		}
		fput(priv->dmabuf->file);
	}

This is also extremely questionable. Why doesn't the dmabuf have a reference while on the linked list?
Regards, Christian.
}

@@ -346,6 +402,8 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
	down_write(&vdev->memory_lock);

	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
		unsigned long wait;

		if (!get_file_active(&priv->dmabuf->file))
			continue;

@@ -354,7 +412,14 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
		priv->vdev = NULL;
		priv->revoked = true;
		dma_buf_invalidate_mappings(priv->dmabuf);
		dma_resv_wait_timeout(priv->dmabuf->resv, DMA_RESV_USAGE_BOOKKEEP,
				      false, MAX_SCHEDULE_TIMEOUT);
		dma_resv_unlock(priv->dmabuf->resv);
		kref_put(&priv->kref, vfio_pci_dma_buf_done);
		wait = wait_for_completion_timeout(&priv->comp,
						   secs_to_jiffies(1));
		WARN_ON(!wait);
		vfio_device_put_registration(&vdev->vdev);
		fput(priv->dmabuf->file);
	}
On Fri, Jan 30, 2026 at 02:21:08PM +0100, Christian König wrote:
That would work for me.
Question is if you really want to do it this way? See usually exporters try to avoid blocking such functions.
Yes, it has to be this way, revoke is a synchronous user space triggered operation around things like FLR or device close. We can't defer it into some background operation like pm.
	}
	fput(priv->dmabuf->file);

This is also extremely questionable. Why doesn't the dmabuf have a reference while on the linked list?
If we hold a refcount while on the list then the FD can never be closed.
There is locking protecting the list so that it is safe and close continues to work right.
Jason
On 1/30/26 14:56, Jason Gunthorpe wrote:
On Fri, Jan 30, 2026 at 02:21:08PM +0100, Christian König wrote:
That would work for me.
Question is if you really want to do it this way? See usually exporters try to avoid blocking such functions.
Yes, it has to be this way, revoke is a synchronous user space triggered operation around things like FLR or device close. We can't defer it into some background operation like pm.
Yeah, but you only need that in a couple of use cases and not all.
Especially when the device is idle exporters usually don't want runtime PM to kick in and try to suspend the device while importers are still using it.
	}
	fput(priv->dmabuf->file);

This is also extremely questionable. Why doesn't the dmabuf have a reference while on the linked list?
If we hold a refcount while on the list then the FD can never be closed.
There is locking protecting the list so that it is safe and close continues to work right.
Ah! You use get_file_active(), yeah that should work.
Thanks, Christian.
Jason
On Fri, Jan 30, 2026 at 03:11:48PM +0100, Christian König wrote:
On 1/30/26 14:56, Jason Gunthorpe wrote:
On Fri, Jan 30, 2026 at 02:21:08PM +0100, Christian König wrote:
That would work for me.
Question is if you really want to do it this way? See usually exporters try to avoid blocking such functions.
Yes, it has to be this way, revoke is a synchronous user space triggered operation around things like FLR or device close. We can't defer it into some background operation like pm.
Yeah, but you only need that in a couple of use cases and not all.
Not all, that is why the dma_buf_attach_revocable() is there to distinguish this case from others.
Jason
On 1/30/26 15:44, Jason Gunthorpe wrote:
On Fri, Jan 30, 2026 at 03:11:48PM +0100, Christian König wrote:
On 1/30/26 14:56, Jason Gunthorpe wrote:
On Fri, Jan 30, 2026 at 02:21:08PM +0100, Christian König wrote:
That would work for me.
Question is if you really want to do it this way? See usually exporters try to avoid blocking such functions.
Yes, it has to be this way, revoke is a synchronous user space triggered operation around things like FLR or device close. We can't defer it into some background operation like pm.
Yeah, but you only need that in a couple of use cases and not all.
Not all, that is why the dma_buf_attach_revocable() is there to distinguish this case from others.
No, no that's not what I mean.
See on the one hand you have runtime PM which automatically shuts down your device after some time when the last user stops using it.
Then on the other hand you have an intentional revoke triggered by userspace.
As far as I've read up on the code currently both are handled the same way, and that doesn't sound correct to me.
Runtime PM should *not* trigger automatically when there are still mappings or even DMA-bufs in existence for the VFIO device.
Regards, Christian.
Jason
On Mon, Feb 02, 2026 at 09:42:22AM +0100, Christian König wrote:
On 1/30/26 15:44, Jason Gunthorpe wrote:
On Fri, Jan 30, 2026 at 03:11:48PM +0100, Christian König wrote:
On 1/30/26 14:56, Jason Gunthorpe wrote:
On Fri, Jan 30, 2026 at 02:21:08PM +0100, Christian König wrote:
That would work for me.
Question is if you really want to do it this way? See usually exporters try to avoid blocking such functions.
Yes, it has to be this way, revoke is a synchronous user space triggered operation around things like FLR or device close. We can't defer it into some background operation like pm.
Yeah, but you only need that in a couple of use cases and not all.
Not all, that is why the dma_buf_attach_revocable() is there to distinguish this case from others.
No, no that's not what I mean.
See on the one hand you have runtime PM which automatically shuts down your device after some time when the last user stops using it.
Then on the other hand you have an intentional revoke triggered by userspace.
As far as I've read up on the code currently both are handled the same way, and that doesn't sound correct to me.
Runtime PM should *not* trigger automatically when there are still mappings or even DMA-bufs in existence for the VFIO device.
I'm a little confused why we are talking about runtime PM - are you pointing out an issue in VFIO today where its PM support is not good?
I admit I don't know a lot about VFIO PM support.. Though I thought in the VFIO case PM was actually under userspace control as generally the PM control is delegated to the VM.
Through that lens, what is happening here is correct. If the VM requests to shut down VFIO PM (through a hypervisor vfio ioctl) then we do want to revoke the DMABUF so that the VM can't trigger an AER/etc by trying to access the sleeping PCI device.
I don't think VFIO uses automatic PM on a timer, that doesn't make sense for its programming model.
Jason
On 2/2/26 16:12, Jason Gunthorpe wrote:
On Mon, Feb 02, 2026 at 09:42:22AM +0100, Christian König wrote:
On 1/30/26 15:44, Jason Gunthorpe wrote:
On Fri, Jan 30, 2026 at 03:11:48PM +0100, Christian König wrote:
On 1/30/26 14:56, Jason Gunthorpe wrote:
On Fri, Jan 30, 2026 at 02:21:08PM +0100, Christian König wrote:
That would work for me.
Question is if you really want to do it this way? See usually exporters try to avoid blocking such functions.
Yes, it has to be this way, revoke is a synchronous user space triggered operation around things like FLR or device close. We can't defer it into some background operation like pm.
Yeah, but you only need that in a couple of use cases and not all.
Not all, that is why the dma_buf_attach_revocable() is there to distinguish this case from others.
No, no that's not what I mean.
See on the one hand you have runtime PM which automatically shuts down your device after some time when the last user stops using it.
Then on the other hand you have an intentional revoke triggered by userspace.
As far as I've read up on the code currently both are handled the same way, and that doesn't sound correct to me.
Runtime PM should *not* trigger automatically when there are still mappings or even DMA-bufs in existence for the VFIO device.
I'm a little confused why we are talking about runtime PM - are you pointing out an issue in VFIO today where its PM support is not good?
Exactly that, yes. This patch set here doesn't break it, but most likely makes the effect quite a bit worse.
I admit I don't know a lot about VFIO PM support.. Though I thought in the VFIO case PM was actually under userspace control as generally the PM control is delegated to the VM.
Through that lens, what is happening here is correct. If the VM requests to shut down VFIO PM (through a hypervisor vfio ioctl) then we do want to revoke the DMABUF so that the VM can't trigger an AER/etc by trying to access the sleeping PCI device.
I don't think VFIO uses automatic PM on a timer, that doesn't make sense for its programming model.
From your description I agree that this doesn't make sense, but from the code it looks like exactly that is done.
Grep for pm_runtime_* on drivers/vfio/pci, but could be that I misunderstood the functionality, e.g. didn't spend too much time on it.
Just keep it in the back of your mind and maybe double check if that is actually the desired behavior.
Regards, Christian.
Jason
On Mon, Feb 02, 2026 at 04:21:50PM +0100, Christian König wrote:
I admit I don't know a lot about VFIO PM support.. Though I thought in the VFIO case PM was actually under userspace control as generally the PM control is delegated to the VM.
Through that lens, what is happening here is correct. If the VM requests to shut down VFIO PM (through a hypervisor vfio ioctl) then we do want to revoke the DMABUF so that the VM can't trigger an AER/etc by trying to access the sleeping PCI device.
I don't think VFIO uses automatic PM on a timer, that doesn't make sense for its programming model.
From your description I agree that this doesn't make sense, but from the code it looks like exactly that is done.
Grep for pm_runtime_* on drivers/vfio/pci, but could be that I misunderstood the functionality, e.g. didn't spend too much time on it.
Just keep it in the back of your mind and maybe double check if that is actually the desired behavior.
I had a small conversation with AlexW and we think VFIO is OK (bugs excluded).
The use of the PM timer is still under userspace control, even though a timer is still involved.
Basically there are a series of IOCTLs defined in VFIO, like LOW_POWER_ENTRY, that all isolate the PCI device from userspace. The mmap is blocked with SIGBUS and the DMABUFs are revoked.
The VFIO uAPI contract requires userspace to stop touching the device immediately when using these IOCTLs. The PM timer may still be involved, but is an implementation detail.
Effectively VFIO has a device state "isolated" meaning that userspace cannot access the MMIO, and it enters this state based on various IOCTLs from userspace. It ties mmap and DMABUF together so that if mmap SIGBUS's the DMABUF is unmapped.
I understand your remarks, and this use of PM is certainly nothing that any other driver should copy, but it does make sense for VFIO. If there are bugs/issues we would continue to keep the overall property that SIGBUS == DMABUF unmapped and only adjust when that happens.
TBH, I don't think people use the VFIO PM feature very much.
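For reference, the userspace side of that flow, as a minimal sketch assuming the standard VFIO_DEVICE_FEATURE uAPI with the LOW_POWER_ENTRY feature (error handling omitted):

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* After this ioctl succeeds userspace must stop touching the device:
 * the BAR mmaps SIGBUS and any DMABUFs exported from them are revoked. */
static int vfio_enter_low_power(int device_fd)
{
	struct vfio_device_feature feature = {
		.argsz = sizeof(feature),
		.flags = VFIO_DEVICE_FEATURE_SET |
			 VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY,
	};

	return ioctl(device_fd, VFIO_DEVICE_FEATURE, &feature);
}

The corresponding LOW_POWER_EXIT feature re-enables access; the exact semantics are described in the VFIO uAPI header.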
Jason
From: Leon Romanovsky leonro@nvidia.com
The .invalidate_mapping() callback is documented as optional, yet it effectively became mandatory whenever importer_ops were provided. This led to cases where RDMA non-ODP code had to supply an empty stub.
Relax the checks in the dma-buf core so the callback can be omitted, allowing RDMA code to drop the unnecessary function.
Removing the stub allows the next patch to tell that RDMA does not support .invalidate_mapping() by checking for a NULL op.
Signed-off-by: Leon Romanovsky leonro@nvidia.com
---
 drivers/dma-buf/dma-buf.c             |  6 ++----
 drivers/infiniband/core/umem_dmabuf.c | 13 -------------
 2 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index cd68c1c0bfd7..1629312d364a 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -947,9 +947,6 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
 	if (WARN_ON(!dmabuf || !dev))
 		return ERR_PTR(-EINVAL);
 
-	if (WARN_ON(importer_ops && !importer_ops->invalidate_mappings))
-		return ERR_PTR(-EINVAL);
-
 	attach = kzalloc(sizeof(*attach), GFP_KERNEL);
 	if (!attach)
 		return ERR_PTR(-ENOMEM);
@@ -1260,7 +1257,8 @@ void dma_buf_invalidate_mappings(struct dma_buf *dmabuf)
 	dma_resv_assert_held(dmabuf->resv);
 
 	list_for_each_entry(attach, &dmabuf->attachments, node)
-		if (attach->importer_ops)
+		if (attach->importer_ops &&
+		    attach->importer_ops->invalidate_mappings)
 			attach->importer_ops->invalidate_mappings(attach);
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_invalidate_mappings, "DMA_BUF");
diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c
index d77a739cfe7a..256e34c15e6b 100644
--- a/drivers/infiniband/core/umem_dmabuf.c
+++ b/drivers/infiniband/core/umem_dmabuf.c
@@ -129,9 +129,6 @@ ib_umem_dmabuf_get_with_dma_device(struct ib_device *device,
 	if (check_add_overflow(offset, (unsigned long)size, &end))
 		return ret;
 
-	if (unlikely(!ops || !ops->invalidate_mappings))
-		return ret;
-
 	dmabuf = dma_buf_get(fd);
 	if (IS_ERR(dmabuf))
 		return ERR_CAST(dmabuf);
@@ -184,18 +181,8 @@ struct ib_umem_dmabuf *ib_umem_dmabuf_get(struct ib_device *device,
 }
 EXPORT_SYMBOL(ib_umem_dmabuf_get);
 
-static void
-ib_umem_dmabuf_unsupported_move_notify(struct dma_buf_attachment *attach)
-{
-	struct ib_umem_dmabuf *umem_dmabuf = attach->importer_priv;
-
-	ibdev_warn_ratelimited(umem_dmabuf->umem.ibdev,
-			       "Invalidate callback should not be called when memory is pinned\n");
-}
-
 static struct dma_buf_attach_ops ib_umem_dmabuf_attach_pinned_ops = {
 	.allow_peer2peer = true,
-	.invalidate_mappings = ib_umem_dmabuf_unsupported_move_notify,
 };
 
 struct ib_umem_dmabuf *
On 1/24/26 20:14, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
The .invalidate_mapping() callback is documented as optional, yet it effectively became mandatory whenever importer_ops were provided. This led to cases where RDMA non-ODP code had to supply an empty stub.
Relax the checks in the dma-buf core so the callback can be omitted, allowing RDMA code to drop the unnecessary function.
Removing the stub allows the next patch to tell that RDMA does not support .invalidate_mapping() by checking for a NULL op.
Signed-off-by: Leon Romanovsky leonro@nvidia.com
drivers/dma-buf/dma-buf.c | 6 ++---- drivers/infiniband/core/umem_dmabuf.c | 13 ------------- 2 files changed, 2 insertions(+), 17 deletions(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index cd68c1c0bfd7..1629312d364a 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -947,9 +947,6 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, if (WARN_ON(!dmabuf || !dev)) return ERR_PTR(-EINVAL);
- if (WARN_ON(importer_ops && !importer_ops->invalidate_mappings))
return ERR_PTR(-EINVAL);- attach = kzalloc(sizeof(*attach), GFP_KERNEL); if (!attach) return ERR_PTR(-ENOMEM);
@@ -1260,7 +1257,8 @@ void dma_buf_invalidate_mappings(struct dma_buf *dmabuf) dma_resv_assert_held(dmabuf->resv); list_for_each_entry(attach, &dmabuf->attachments, node)
if (attach->importer_ops)
if (attach->importer_ops &&attach->importer_ops->invalidate_mappings) attach->importer_ops->invalidate_mappings(attach);} EXPORT_SYMBOL_NS_GPL(dma_buf_invalidate_mappings, "DMA_BUF"); diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c index d77a739cfe7a..256e34c15e6b 100644 --- a/drivers/infiniband/core/umem_dmabuf.c +++ b/drivers/infiniband/core/umem_dmabuf.c @@ -129,9 +129,6 @@ ib_umem_dmabuf_get_with_dma_device(struct ib_device *device, if (check_add_overflow(offset, (unsigned long)size, &end)) return ret;
- if (unlikely(!ops || !ops->invalidate_mappings))
You should probably keep "if (unlikely(!ops)).." here.
Apart from that the patch looks good to me.
Regards, Christian.
return ret;- dmabuf = dma_buf_get(fd); if (IS_ERR(dmabuf)) return ERR_CAST(dmabuf);
@@ -184,18 +181,8 @@ struct ib_umem_dmabuf *ib_umem_dmabuf_get(struct ib_device *device, } EXPORT_SYMBOL(ib_umem_dmabuf_get); -static void -ib_umem_dmabuf_unsupported_move_notify(struct dma_buf_attachment *attach) -{
- struct ib_umem_dmabuf *umem_dmabuf = attach->importer_priv;
- ibdev_warn_ratelimited(umem_dmabuf->umem.ibdev,
"Invalidate callback should not be called when memory is pinned\n");-}
static struct dma_buf_attach_ops ib_umem_dmabuf_attach_pinned_ops = { .allow_peer2peer = true,
- .invalidate_mappings = ib_umem_dmabuf_unsupported_move_notify,
}; struct ib_umem_dmabuf *
On Fri, Jan 30, 2026 at 09:30:05AM +0100, Christian König wrote:
On 1/24/26 20:14, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
The .invalidate_mapping() callback is documented as optional, yet it effectively became mandatory whenever importer_ops were provided. This led to cases where RDMA non-ODP code had to supply an empty stub.
Relax the checks in the dma-buf core so the callback can be omitted, allowing RDMA code to drop the unnecessary function.
Removing the stub allows the next patch to tell that RDMA does not support .invalidate_mapping() by checking for a NULL op.
Signed-off-by: Leon Romanovsky leonro@nvidia.com
drivers/dma-buf/dma-buf.c | 6 ++---- drivers/infiniband/core/umem_dmabuf.c | 13 ------------- 2 files changed, 2 insertions(+), 17 deletions(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index cd68c1c0bfd7..1629312d364a 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -947,9 +947,6 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, if (WARN_ON(!dmabuf || !dev)) return ERR_PTR(-EINVAL);
- if (WARN_ON(importer_ops && !importer_ops->invalidate_mappings))
return ERR_PTR(-EINVAL);- attach = kzalloc(sizeof(*attach), GFP_KERNEL); if (!attach) return ERR_PTR(-ENOMEM);
@@ -1260,7 +1257,8 @@ void dma_buf_invalidate_mappings(struct dma_buf *dmabuf) dma_resv_assert_held(dmabuf->resv); list_for_each_entry(attach, &dmabuf->attachments, node)
if (attach->importer_ops)
if (attach->importer_ops &&attach->importer_ops->invalidate_mappings) attach->importer_ops->invalidate_mappings(attach);} EXPORT_SYMBOL_NS_GPL(dma_buf_invalidate_mappings, "DMA_BUF"); diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c index d77a739cfe7a..256e34c15e6b 100644 --- a/drivers/infiniband/core/umem_dmabuf.c +++ b/drivers/infiniband/core/umem_dmabuf.c @@ -129,9 +129,6 @@ ib_umem_dmabuf_get_with_dma_device(struct ib_device *device, if (check_add_overflow(offset, (unsigned long)size, &end)) return ret;
- if (unlikely(!ops || !ops->invalidate_mappings))
You should probably keep "if (unlikely(!ops)).." here.
The unlikely is useless in this path.
Apart from that the patch looks good to me.
Regards, Christian.
return ret;- dmabuf = dma_buf_get(fd); if (IS_ERR(dmabuf)) return ERR_CAST(dmabuf);
@@ -184,18 +181,8 @@ struct ib_umem_dmabuf *ib_umem_dmabuf_get(struct ib_device *device, } EXPORT_SYMBOL(ib_umem_dmabuf_get); -static void -ib_umem_dmabuf_unsupported_move_notify(struct dma_buf_attachment *attach) -{
- struct ib_umem_dmabuf *umem_dmabuf = attach->importer_priv;
- ibdev_warn_ratelimited(umem_dmabuf->umem.ibdev,
"Invalidate callback should not be called when memory is pinned\n");-}
static struct dma_buf_attach_ops ib_umem_dmabuf_attach_pinned_ops = { .allow_peer2peer = true,
- .invalidate_mappings = ib_umem_dmabuf_unsupported_move_notify,
}; struct ib_umem_dmabuf *
From: Leon Romanovsky leonro@nvidia.com
Some exporters need a flow to synchronously revoke access to the DMA-buf by importers. Once revoke is completed the importer is not permitted to touch the memory otherwise they may get IOMMU faults, AERs, or worse.
DMA-buf today defines a revoke flow, for both pinned and dynamic importers, which is broadly:
  dma_resv_lock(dmabuf->resv, NULL);

  // Prevent new mappings from being established
  priv->revoked = true;

  // Tell all importers to eventually unmap
  dma_buf_invalidate_mappings(dmabuf);

  // Wait for any inprogress fences on the old mapping
  dma_resv_wait_timeout(dmabuf->resv, DMA_RESV_USAGE_BOOKKEEP, false,
                        MAX_SCHEDULE_TIMEOUT);
  dma_resv_unlock(dmabuf->resv);

  // Wait for all importers to complete unmap
  wait_for_completion(&priv->unmapped_comp);
This works well, and an importer that continues to access the DMA-buf after unmapping it is very buggy.
However, the final wait for unmap is effectively unbounded. Several importers do not support invalidate_mappings() at all and won't unmap until userspace triggers it.
This unbounded wait is not suitable for exporters like VFIO and RDMA that need to issue revoke as part of their normal operations.
Add dma_buf_attach_revocable() to allow exporters to determine the difference between importers that can complete the above in bounded time, and those that can't. It can be called inside the exporter's attach op to reject incompatible importers.
Document these details about how dma_buf_invalidate_mappings() works and what the required sequence is to achieve a full revocation.
Signed-off-by: Leon Romanovsky leonro@nvidia.com
---
 drivers/dma-buf/dma-buf.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/dma-buf.h   |  9 +++------
 2 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 1629312d364a..f0e05227bda8 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -1242,13 +1242,59 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
 
+/**
+ * dma_buf_attach_revocable - check if a DMA-buf importer implements
+ * revoke semantics.
+ * @attach: the DMA-buf attachment to check
+ *
+ * Returns true if the DMA-buf importer can support the revoke sequence
+ * explained in dma_buf_invalidate_mappings() within bounded time. Meaning the
+ * importer implements invalidate_mappings() and ensures that unmap is called
+ * as a result.
+ */
+bool dma_buf_attach_revocable(struct dma_buf_attachment *attach)
+{
+	return attach->importer_ops &&
+	       attach->importer_ops->invalidate_mappings;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_attach_revocable, "DMA_BUF");
+
 /**
  * dma_buf_invalidate_mappings - notify attachments that DMA-buf is moving
  *
  * @dmabuf: [in] buffer which is moving
  *
  * Informs all attachments that they need to destroy and recreate all their
- * mappings.
+ * mappings. If the attachment is dynamic then the dynamic importer is expected
+ * to invalidate any caches it has of the mapping result and perform a new
+ * mapping request before allowing HW to do any further DMA.
+ *
+ * If the attachment is pinned then this informs the pinned importer that the
+ * underlying mapping is no longer available. Pinned importers may take this
+ * as a permanent revocation and never establish new mappings so exporters
+ * should not trigger it lightly.
+ *
+ * Upon return importers may continue to access the DMA-buf memory. The caller
+ * must do two additional waits to ensure that the memory is no longer being
+ * accessed:
+ * 1) Until dma_resv_wait_timeout() retires fences the importer is allowed to
+ *    fully access the memory.
+ * 2) Until the importer calls unmap it is allowed to speculatively
+ *    read-and-discard the memory. It must not write to the memory.
+ *
+ * A caller wishing to use dma_buf_invalidate_mappings() to fully stop access
+ * to the DMA-buf must wait for both. Dynamic callers can often use just the
+ * first.
+ *
+ * All importers providing an invalidate_mappings() op must ensure that unmap
+ * is called within bounded time after the op.
+ *
+ * Pinned importers that do not support an invalidate_mappings() op will
+ * eventually perform unmap when they are done with the buffer, which may be an
+ * unbounded time from calling this function. dma_buf_attach_revocable() can be
+ * used to prevent such importers from attaching.
+ *
+ * Importers are free to request a new mapping in parallel as this function
+ * returns.
  */
 void dma_buf_invalidate_mappings(struct dma_buf *dmabuf)
 {
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index d5c3ce2b3aa4..84a7ec8f5359 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -468,12 +468,8 @@ struct dma_buf_attach_ops {
 	 * called with this lock held as well. This makes sure that no mapping
 	 * is created concurrently with an ongoing move operation.
 	 *
-	 * Mappings stay valid and are not directly affected by this callback.
-	 * But the DMA-buf can now be in a different physical location, so all
-	 * mappings should be destroyed and re-created as soon as possible.
-	 *
-	 * New mappings can be created after this callback returns, and will
-	 * point to the new location of the DMA-buf.
+	 * See the kdoc for dma_buf_invalidate_mappings() for details on the
+	 * required behavior.
 	 */
 	void (*invalidate_mappings)(struct dma_buf_attachment *attach);
 };
@@ -601,6 +597,7 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *,
 void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *,
 			      enum dma_data_direction);
 void dma_buf_invalidate_mappings(struct dma_buf *dma_buf);
+bool dma_buf_attach_revocable(struct dma_buf_attachment *attach);
 int dma_buf_begin_cpu_access(struct dma_buf *dma_buf,
 			     enum dma_data_direction dir);
 int dma_buf_end_cpu_access(struct dma_buf *dma_buf,
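To make the importer-side obligation concrete, a rough sketch of a pinned importer that would satisfy dma_buf_attach_revocable() (not part of this patch; all example_* names, the quiesce helper and the work-queue structure are illustrative):

#include <linux/dma-buf.h>
#include <linux/workqueue.h>

struct example_importer {
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
	struct work_struct unmap_work;
};

/* hypothetical helper: stop and fence all device access to the buffer */
static void example_quiesce_hw(struct example_importer *imp);

static void example_unmap_work(struct work_struct *work)
{
	struct example_importer *imp =
		container_of(work, struct example_importer, unmap_work);

	/* HW access was already fenced in the callback, so this completes
	 * in bounded time without waiting on userspace. */
	dma_buf_unmap_attachment_unlocked(imp->attach, imp->sgt,
					  DMA_BIDIRECTIONAL);
	imp->sgt = NULL;
}

static void example_invalidate_mappings(struct dma_buf_attachment *attach)
{
	struct example_importer *imp = attach->importer_priv;

	/* Called with dmabuf->resv held: stop/fence HW access now ... */
	example_quiesce_hw(imp);
	/* ... and do the unmap outside the lock, in bounded time. */
	schedule_work(&imp->unmap_work);
}

static struct dma_buf_attach_ops example_attach_ops = {
	.allow_peer2peer = true,
	.invalidate_mappings = example_invalidate_mappings,
};

Deferring the unmap to a work item matters because invalidate_mappings() runs with the reservation lock held, so calling the unlocked unmap helper directly from the callback would self-deadlock.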
Hi Leon,
On Sat, Jan 24, 2026 at 09:14:18PM +0200, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
Some exporters need a flow to synchronously revoke access to the DMA-buf by importers. Once revoke is completed the importer is not permitted to touch the memory otherwise they may get IOMMU faults, AERs, or worse.
DMA-buf today defines a revoke flow, for both pinned and dynamic importers, which is broadly:
dma_resv_lock(dmabuf->resv, NULL); // Prevent new mappings from being established priv->revoked = true;
// Tell all importers to eventually unmap dma_buf_invalidate_mappings(dmabuf);
// Wait for any inprogress fences on the old mapping dma_resv_wait_timeout(dmabuf->resv, DMA_RESV_USAGE_BOOKKEEP, false, MAX_SCHEDULE_TIMEOUT); dma_resv_unlock(dmabuf->resv, NULL);
// Wait for all importers to complete unmap wait_for_completion(&priv->unmapped_comp);
This works well, and an importer that continues to access the DMA-buf after unmapping it is very buggy.
However, the final wait for unmap is effectively unbounded. Several importers do not support invalidate_mappings() at all and won't unmap until userspace triggers it.
This unbounded wait is not suitable for exporters like VFIO and RDMA that need to issue revoke as part of their normal operations.
Add dma_buf_attach_revocable() to allow exporters to determine the difference between importers that can complete the above in bounded time, and those that can't. It can be called inside the exporter's attach op to reject incompatible importers.
Document these details about how dma_buf_invalidate_mappings() works and what the required sequence is to achieve a full revocation.
Signed-off-by: Leon Romanovsky leonro@nvidia.com
drivers/dma-buf/dma-buf.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++- include/linux/dma-buf.h | 9 +++------ 2 files changed, 50 insertions(+), 7 deletions(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 1629312d364a..f0e05227bda8 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -1242,13 +1242,59 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, } EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF"); +/**
- dma_buf_attach_revocable - check if a DMA-buf importer implements
- revoke semantics.
- @attach: the DMA-buf attachment to check
- Returns true if the DMA-buf importer can support the revoke sequence
- explained in dma_buf_invalidate_mappings() within bounded time. Meaning the
- importer implements invalidate_mappings() and ensures that unmap is called as
- a result.
- */
+bool dma_buf_attach_revocable(struct dma_buf_attachment *attach) +{
- return attach->importer_ops &&
attach->importer_ops->invalidate_mappings;+} +EXPORT_SYMBOL_NS_GPL(dma_buf_attach_revocable, "DMA_BUF");
I noticed that Patch 5 removes the invalidate_mappings stub from umem_dmabuf.c, effectively making the callback NULL for an RDMA importer. Consequently, dma_buf_attach_revocable() (introduced here) will return false for these importers.
Since the cover letter mentions that VFIO will use dma_buf_attach_revocable() to prevent unbounded waits, this appears to effectively block paths like the VFIO-export -> RDMA-import path..
Given that RDMA is a significant consumer of dma-bufs, are there plans to implement proper revocation support in the IB/RDMA core (umem_dmabuf)?
It would be good to know if there's a plan for bringing such importers into compliance with the new revocation semantics so they can interop with VFIO OR are we completely ruling out users like RDMA / IB importing any DMABUFs exported by VFIO?
Thanks, Praan
On Mon, Jan 26, 2026 at 08:38:44PM +0000, Pranjal Shrivastava wrote:
I noticed that Patch 5 removes the invalidate_mappings stub from umem_dmabuf.c, effectively making the callback NULL for an RDMA importer. Consequently, dma_buf_attach_revocable() (introduced here) will return false for these importers.
Yes, that is the intention.
Since the cover letter mentions that VFIO will use dma_buf_attach_revocable() to prevent unbounded waits, this appears to effectively block paths like the VFIO-export -> RDMA-import path..
It remains usable with the ODP path and people are using that right now.
Given that RDMA is a significant consumer of dma-bufs, are there plans to implement proper revocation support in the IB/RDMA core (umem_dmabuf)?
This depends on each HW, they need a way to implement the revoke semantic. I can't guess what is possible, but I would hope that most HW could at least do a revoke on a real MR.
Eg a MR rereg operation to a kernel owned empty PD is an effective "revoke", and MR rereg is at least defined by standards so HW should implement it.
It would be good to know if there's a plan for bringing such importers into compliance with the new revocation semantics so they can interop with VFIO OR are we completely ruling out users like RDMA / IB importing any DMABUFs exported by VFIO?
It will be driver dependent, there is no one shot update here.
Jason
On 1/24/26 20:14, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
Some exporters need a flow to synchronously revoke access to the DMA-buf by importers. Once revoke is completed the importer is not permitted to touch the memory otherwise they may get IOMMU faults, AERs, or worse.
That approach is seriously not going to fly.
You can use the invalidate_mappings approach to trigger the importer to give back the mapping, but when the mapping is really given back is still completely on the importer side.
In other words you can't do the shoot-down revoke semantics you are trying to establish here.
Regards, Christian.
DMA-buf today defines a revoke flow, for both pinned and dynamic importers, which is broadly:
dma_resv_lock(dmabuf->resv, NULL); // Prevent new mappings from being established priv->revoked = true;
// Tell all importers to eventually unmap dma_buf_invalidate_mappings(dmabuf);
// Wait for any inprogress fences on the old mapping dma_resv_wait_timeout(dmabuf->resv, DMA_RESV_USAGE_BOOKKEEP, false, MAX_SCHEDULE_TIMEOUT); dma_resv_unlock(dmabuf->resv, NULL);
// Wait for all importers to complete unmap wait_for_completion(&priv->unmapped_comp);
This works well, and an importer that continues to access the DMA-buf after unmapping it is very buggy.
However, the final wait for unmap is effectively unbounded. Several importers do not support invalidate_mappings() at all and won't unmap until userspace triggers it.
This unbounded wait is not suitable for exporters like VFIO and RDMA that need to issue revoke as part of their normal operations.
Add dma_buf_attach_revocable() to allow exporters to determine the difference between importers that can complete the above in bounded time, and those that can't. It can be called inside the exporter's attach op to reject incompatible importers.
Document these details about how dma_buf_invalidate_mappings() works and what the required sequence is to achieve a full revocation.
Signed-off-by: Leon Romanovsky leonro@nvidia.com
drivers/dma-buf/dma-buf.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++- include/linux/dma-buf.h | 9 +++------ 2 files changed, 50 insertions(+), 7 deletions(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 1629312d364a..f0e05227bda8 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -1242,13 +1242,59 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, } EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF"); +/**
- dma_buf_attach_revocable - check if a DMA-buf importer implements
- revoke semantics.
- @attach: the DMA-buf attachment to check
- Returns true if the DMA-buf importer can support the revoke sequence
- explained in dma_buf_invalidate_mappings() within bounded time. Meaning the
- importer implements invalidate_mappings() and ensures that unmap is called as
- a result.
- */
+bool dma_buf_attach_revocable(struct dma_buf_attachment *attach) +{
- return attach->importer_ops &&
attach->importer_ops->invalidate_mappings;+} +EXPORT_SYMBOL_NS_GPL(dma_buf_attach_revocable, "DMA_BUF");
/**
- dma_buf_invalidate_mappings - notify attachments that DMA-buf is moving
- @dmabuf: [in] buffer which is moving
- Informs all attachments that they need to destroy and recreate all their
- mappings.
- mappings. If the attachment is dynamic then the dynamic importer is expected
- to invalidate any caches it has of the mapping result and perform a new
- mapping request before allowing HW to do any further DMA.
- If the attachment is pinned then this informs the pinned importer that the
- underlying mapping is no longer available. Pinned importers may take this is
- as a permanent revocation and never establish new mappings so exporters
- should not trigger it lightly.
- Upon return importers may continue to access the DMA-buf memory. The caller
- must do two additional waits to ensure that the memory is no longer being
- accessed:
- Until dma_resv_wait_timeout() retires fences the importer is allowed to
fully access the memory.
- Until the importer calls unmap it is allowed to speculatively
read-and-discard the memory. It must not write to the memory.
- A caller wishing to use dma_buf_invalidate_mappings() to fully stop access to
- the DMA-buf must wait for both. Dynamic callers can often use just the first.
- All importers providing a invalidate_mappings() op must ensure that unmap is
- called within bounded time after the op.
- Pinned importers that do not support a invalidate_mappings() op will
- eventually perform unmap when they are done with the buffer, which may be an
- ubounded time from calling this function. dma_buf_attach_revocable() can be
- used to prevent such importers from attaching.
- Importers are free to request a new mapping in parallel as this function
*/
- returns.
void dma_buf_invalidate_mappings(struct dma_buf *dmabuf) { diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index d5c3ce2b3aa4..84a7ec8f5359 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -468,12 +468,8 @@ struct dma_buf_attach_ops { * called with this lock held as well. This makes sure that no mapping * is created concurrently with an ongoing move operation. *
* Mappings stay valid and are not directly affected by this callback.* But the DMA-buf can now be in a different physical location, so all* mappings should be destroyed and re-created as soon as possible.** New mappings can be created after this callback returns, and will* point to the new location of the DMA-buf.
* See the kdoc for dma_buf_invalidate_mappings() for details on the */ void (*invalidate_mappings)(struct dma_buf_attachment *attach);* required behavior.}; @@ -601,6 +597,7 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *, void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *, enum dma_data_direction); void dma_buf_invalidate_mappings(struct dma_buf *dma_buf); +bool dma_buf_attach_revocable(struct dma_buf_attachment *attach); int dma_buf_begin_cpu_access(struct dma_buf *dma_buf, enum dma_data_direction dir); int dma_buf_end_cpu_access(struct dma_buf *dma_buf,
On Fri, Jan 30, 2026 at 09:43:22AM +0100, Christian König wrote:
On 1/24/26 20:14, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
Some exporters need a flow to synchronously revoke access to the DMA-buf by importers. Once revoke is completed the importer is not permitted to touch the memory otherwise they may get IOMMU faults, AERs, or worse.
That approach is seriously not going to fly.
You can use the invalidate_mappings approach to trigger the importer to give back the mapping, but when the mapping is really given back is still completely on the importer side.
Yes, and that is what this is all doing; there is the wait for the importer's unmap to happen in the sequence.
In other words you can't do the shot down revoke semantics you are trying to establish here.
All this is doing is saying if dma_buf_attach_revocable() == true then the importer will call unmap within bounded time after dma_buf_invalidate_mappings().
That's it. If the importing driver doesn't want to do that then it should make dma_buf_attach_revocable()=false.
VFIO/etc only want to interwork with importers that can do this.
Jason
From: Leon Romanovsky leonro@nvidia.com
Till now VFIO has rejected pinned importers, largely to avoid being used with the RDMA pinned importer that cannot handle a move_notify() to revoke access.
Using dma_buf_attach_revocable() it can tell the difference between pinned importers that support the flow described in dma_buf_invalidate_mappings() and those that don't.
Thus permit compatible pinned importers.
This is one of two items IOMMUFD requires to remove its private interface to VFIO's dma-buf.
Signed-off-by: Leon Romanovsky leonro@nvidia.com
---
 drivers/vfio/pci/vfio_pci_dmabuf.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 485515629fe4..3c8dc56e2238 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -22,16 +22,6 @@ struct vfio_pci_dma_buf {
 	u8 revoked : 1;
 };
 
-static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment)
-{
-	return -EOPNOTSUPP;
-}
-
-static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment)
-{
-	/* Do nothing */
-}
-
 static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
 				   struct dma_buf_attachment *attachment)
 {
@@ -43,6 +33,9 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
 	if (priv->revoked)
 		return -ENODEV;
 
+	if (!dma_buf_attach_revocable(attachment))
+		return -EOPNOTSUPP;
+
 	return 0;
 }
 
@@ -107,8 +100,6 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
 }
 
 static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
-	.pin = vfio_pci_dma_buf_pin,
-	.unpin = vfio_pci_dma_buf_unpin,
 	.attach = vfio_pci_dma_buf_attach,
 	.map_dma_buf = vfio_pci_dma_buf_map,
 	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
On Sat, 24 Jan 2026 21:14:19 +0200 Leon Romanovsky leon@kernel.org wrote:
From: Leon Romanovsky leonro@nvidia.com
Till now VFIO has rejected pinned importers, largely to avoid being used with the RDMA pinned importer that cannot handle a move_notify() to revoke access.
Using dma_buf_attach_revocable() it can tell the difference between pinned importers that support the flow described in dma_buf_invalidate_mappings() and those that don't.
Thus permit compatible pinned importers.
This is one of two items IOMMUFD requires to remove its private interface to VFIO's dma-buf.
Signed-off-by: Leon Romanovsky leonro@nvidia.com
drivers/vfio/pci/vfio_pci_dmabuf.c | 15 +++------------ 1 file changed, 3 insertions(+), 12 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index 485515629fe4..3c8dc56e2238 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -22,16 +22,6 @@ struct vfio_pci_dma_buf { u8 revoked : 1; }; -static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment) -{
- return -EOPNOTSUPP;
-}
-static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment) -{
- /* Do nothing */
-}
static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, struct dma_buf_attachment *attachment) { @@ -43,6 +33,9 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, if (priv->revoked) return -ENODEV;
- if (!dma_buf_attach_revocable(attachment))
return -EOPNOTSUPP;- return 0;
} @@ -107,8 +100,6 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) } static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
- .pin = vfio_pci_dma_buf_pin,
- .unpin = vfio_pci_dma_buf_unpin, .attach = vfio_pci_dma_buf_attach, .map_dma_buf = vfio_pci_dma_buf_map, .unmap_dma_buf = vfio_pci_dma_buf_unmap,
I'm not sure what the merge plan is for the remaining patches, but I'll toss my ack in so that it can all go through Christian's tree if he prefers. Thanks
Reviewed-by: Alex Williamson alex@shazbot.org
From: Leon Romanovsky leon@kernel.org Sent: Sunday, January 25, 2026 3:14 AM
From: Leon Romanovsky leonro@nvidia.com
Till now VFIO has rejected pinned importers, largely to avoid being used with the RDMA pinned importer that cannot handle a move_notify() to revoke access.
Using dma_buf_attach_revocable() it can tell the difference between pinned importers that support the flow described in dma_buf_invalidate_mappings() and those that don't.
Thus permit compatible pinned importers.
This is one of two items IOMMUFD requires to remove its private interface to VFIO's dma-buf.
Signed-off-by: Leon Romanovsky leonro@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
From: Leon Romanovsky leonro@nvidia.com
IOMMUFD relies on a private protocol with VFIO, and this always operated in pinned mode.
Now that VFIO can support pinned importers update IOMMUFD to invoke the normal dma-buf flow to request pin.
This isn't enough to allow IOMMUFD to work with other exporters, it still needs a way to get the physical address list which is another series.
IOMMUFD supports the defined revoke semantics. It immediately stops and fences access to the memory inside its invalidate_mappings() callback, and it currently doesn't use scatterlists so doesn't call map/unmap at all.
It is expected that a future revision can synchronously call unmap from the move_notify callback as well.
Acked-by: Christian König christian.koenig@amd.com
Signed-off-by: Leon Romanovsky leonro@nvidia.com
---
 drivers/iommu/iommufd/pages.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 76f900fa1687..a5eb2bc4ef48 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -1501,16 +1501,22 @@ static int iopt_map_dmabuf(struct iommufd_ctx *ictx, struct iopt_pages *pages,
 		mutex_unlock(&pages->mutex);
 	}
 
-	rc = sym_vfio_pci_dma_buf_iommufd_map(attach, &pages->dmabuf.phys);
+	rc = dma_buf_pin(attach);
 	if (rc)
 		goto err_detach;
 
+	rc = sym_vfio_pci_dma_buf_iommufd_map(attach, &pages->dmabuf.phys);
+	if (rc)
+		goto err_unpin;
+
 	dma_resv_unlock(dmabuf->resv);
 
 	/* On success iopt_release_pages() will detach and put the dmabuf. */
 	pages->dmabuf.attach = attach;
 	return 0;
 
+err_unpin:
+	dma_buf_unpin(attach);
 err_detach:
 	dma_resv_unlock(dmabuf->resv);
 	dma_buf_detach(dmabuf, attach);
@@ -1656,6 +1662,7 @@ void iopt_release_pages(struct kref *kref)
 	if (iopt_is_dmabuf(pages) && pages->dmabuf.attach) {
 		struct dma_buf *dmabuf = pages->dmabuf.attach->dmabuf;
 
+		dma_buf_unpin(pages->dmabuf.attach);
 		dma_buf_detach(dmabuf, pages->dmabuf.attach);
 		dma_buf_put(dmabuf);
 		WARN_ON(!list_empty(&pages->dmabuf.tracker));
From: Leon Romanovsky leon@kernel.org Sent: Sunday, January 25, 2026 3:14 AM
From: Leon Romanovsky leonro@nvidia.com
IOMMUFD relies on a private protocol with VFIO, and this always operated in pinned mode.
Now that VFIO can support pinned importers update IOMMUFD to invoke the normal dma-buf flow to request pin.
This isn't enough to allow IOMMUFD to work with other exporters, it still needs a way to get the physical address list which is another series.
IOMMUFD supports the defined revoke semantics. It immediately stops and fences access to the memory inside its invalidate_mappings() callback, and it currently doesn't use scatterlists so doesn't call map/unmap at all.
It is expected that a future revision can synchronously call unmap from the move_notify callback as well.
Acked-by: Christian König christian.koenig@amd.com Signed-off-by: Leon Romanovsky leonro@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Sat, Jan 24, 2026 at 09:14:20PM +0200, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
IOMMUFD relies on a private protocol with VFIO, and this always operated in pinned mode.
Now that VFIO can support pinned importers update IOMMUFD to invoke the normal dma-buf flow to request pin.
This isn't enough to allow IOMMUFD to work with other exporters, it still needs a way to get the physical address list which is another series.
IOMMUFD supports the defined revoke semantics. It immediately stops and fences access to the memory inside its invalidate_mappings() callback, and it currently doesn't use scatterlists so doesn't call map/unmap at all.
It is expected that a future revision can synchronously call unmap from the move_notify callback as well.
Acked-by: Christian König christian.koenig@amd.com Signed-off-by: Leon Romanovsky leonro@nvidia.com
drivers/iommu/iommufd/pages.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Jason