This patch series introduces dma-buf export support for RDMA/InfiniBand devices, enabling userspace applications to export PCI-backed RDMA memory regions (such as device memory or mlx5 UAR pages) as dma-buf file descriptors.
This allows PCI device memory to be shared with other kernel subsystems (e.g., graphics or media) or between userspace processes, via the standard dma-buf interface, avoiding unnecessary copies and enabling efficient peer-to-peer (P2P) DMA transfers. See [1] for background on dma-buf.
As part of this series, we introduce a new uverbs object of type FD for dma-buf export, along with the corresponding APIs for allocation and teardown. This object encapsulates all attributes required to export a dma-buf.
The implementation enforces P2P-only mappings and properly manages the resource lifecycle, including:
- Cleanup during driver removal or RDMA context destruction.
- Revocation via dma_buf_move_notify() when the underlying mmap entries are removed.
- Refactoring of common cleanup logic for reuse across FD uobject types.
The infrastructure is generic within uverbs, allowing individual drivers to easily integrate and supply their vendor-specific implementation.
The mlx5 driver is the first consumer of this new API, providing:
- Initialization of PCI peer-to-peer DMA support.
- mlx5-specific implementations of the mmap_get_pfns and pgoff_to_mmap_entry device operations required for dma-buf export.
[1] https://docs.kernel.org/driver-api/dma-buf.html
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
Yishai Hadas (2):
      RDMA/uverbs: Add DMABUF object type and operations
      RDMA/mlx5: Implement DMABUF export ops
 drivers/infiniband/core/Makefile                  |   1 +
 drivers/infiniband/core/device.c                  |   2 +
 drivers/infiniband/core/ib_core_uverbs.c          |  19 +++
 drivers/infiniband/core/rdma_core.c               |  63 ++++----
 drivers/infiniband/core/rdma_core.h               |   1 +
 drivers/infiniband/core/uverbs.h                  |  10 ++
 drivers/infiniband/core/uverbs_std_types_dmabuf.c | 172 ++++++++++++++++++++++
 drivers/infiniband/core/uverbs_uapi.c             |   1 +
 drivers/infiniband/hw/mlx5/main.c                 |  72 +++++++++
 include/rdma/ib_verbs.h                           |   9 ++
 include/rdma/uverbs_types.h                       |   1 +
 include/uapi/rdma/ib_user_ioctl_cmds.h            |  10 ++
 12 files changed, 335 insertions(+), 26 deletions(-)
---
base-commit: 325e3b5431ddd27c5f93156b36838a351e3b2f72
change-id: 20260108-dmabuf-export-0d598058dd1e
Best regards,
From: Yishai Hadas <yishaih@nvidia.com>
Expose DMABUF functionality to userspace through the uverbs interface, enabling InfiniBand/RDMA devices to export PCI based memory regions (e.g. device memory) as DMABUF file descriptors. This allows zero-copy sharing of RDMA memory with other subsystems that support the dma-buf framework.
A new UVERBS_OBJECT_DMABUF object type and allocation method were introduced.
During allocation, uverbs invokes the driver to supply the rdma_user_mmap_entry associated with the given page offset (pgoff).
Based on the returned rdma_user_mmap_entry, uverbs requests the driver to provide the corresponding physical-memory details as well as the driver’s PCI provider information.
Using this information, dma_buf_export() is called; if it succeeds, uobj->object is set to the underlying file pointer returned by the dma-buf framework.
The file descriptor number follows the standard uverbs allocation flow, but the file pointer comes from the dma-buf subsystem, including its own fops and private data.
Because of this, alloc_begin_fd_uobject() must handle cases where fd_type->fops is NULL, and both alloc_commit_fd_uobject() and alloc_abort_fd_uobject() must account for whether filp->private_data exists, since it is only populated after a successful dma_buf_export().
When an mmap entry is removed, uverbs iterates over its associated DMABUFs, marks them as revoked, and calls dma_buf_move_notify() so that their importers are notified.
The same procedure applies during the disassociate flow; final cleanup occurs when the application closes the file.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 drivers/infiniband/core/Makefile                  |   1 +
 drivers/infiniband/core/device.c                  |   2 +
 drivers/infiniband/core/ib_core_uverbs.c          |  19 +++
 drivers/infiniband/core/rdma_core.c               |  63 ++++----
 drivers/infiniband/core/rdma_core.h               |   1 +
 drivers/infiniband/core/uverbs.h                  |  10 ++
 drivers/infiniband/core/uverbs_std_types_dmabuf.c | 172 ++++++++++++++++++++++
 drivers/infiniband/core/uverbs_uapi.c             |   1 +
 include/rdma/ib_verbs.h                           |   9 ++
 include/rdma/uverbs_types.h                       |   1 +
 include/uapi/rdma/ib_user_ioctl_cmds.h            |  10 ++
 11 files changed, 263 insertions(+), 26 deletions(-)
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index f483e0c12444..a2a7a9d2e0d3 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -33,6 +33,7 @@ ib_umad-y := user_mad.o ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_marshall.o \ rdma_core.o uverbs_std_types.o uverbs_ioctl.o \ uverbs_std_types_cq.o \ + uverbs_std_types_dmabuf.o \ uverbs_std_types_dmah.o \ uverbs_std_types_flow_action.o uverbs_std_types_dm.o \ uverbs_std_types_mr.o uverbs_std_types_counters.o \ diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 4e09f6e0995e..416242b9c158 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -2765,6 +2765,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops) SET_DEVICE_OP(dev_ops, map_mr_sg); SET_DEVICE_OP(dev_ops, map_mr_sg_pi); SET_DEVICE_OP(dev_ops, mmap); + SET_DEVICE_OP(dev_ops, mmap_get_pfns); SET_DEVICE_OP(dev_ops, mmap_free); SET_DEVICE_OP(dev_ops, modify_ah); SET_DEVICE_OP(dev_ops, modify_cq); @@ -2775,6 +2776,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops) SET_DEVICE_OP(dev_ops, modify_srq); SET_DEVICE_OP(dev_ops, modify_wq); SET_DEVICE_OP(dev_ops, peek_cq); + SET_DEVICE_OP(dev_ops, pgoff_to_mmap_entry); SET_DEVICE_OP(dev_ops, pre_destroy_cq); SET_DEVICE_OP(dev_ops, poll_cq); SET_DEVICE_OP(dev_ops, port_groups); diff --git a/drivers/infiniband/core/ib_core_uverbs.c b/drivers/infiniband/core/ib_core_uverbs.c index b51bd7087a88..1ff53b8a0e89 100644 --- a/drivers/infiniband/core/ib_core_uverbs.c +++ b/drivers/infiniband/core/ib_core_uverbs.c @@ -5,9 +5,13 @@ * Copyright 2019 Marvell. All rights reserved. */ #include <linux/xarray.h> +#include <linux/dma-buf.h> +#include <linux/dma-resv.h> #include "uverbs.h" #include "core_priv.h"
+MODULE_IMPORT_NS("DMA_BUF"); + /** * rdma_umap_priv_init() - Initialize the private data of a vma * @@ -229,12 +233,24 @@ EXPORT_SYMBOL(rdma_user_mmap_entry_put); */ void rdma_user_mmap_entry_remove(struct rdma_user_mmap_entry *entry) { + struct ib_uverbs_dmabuf_file *uverbs_dmabuf, *tmp; + if (!entry) return;
+ mutex_lock(&entry->dmabufs_lock); xa_lock(&entry->ucontext->mmap_xa); entry->driver_removed = true; xa_unlock(&entry->ucontext->mmap_xa); + list_for_each_entry_safe(uverbs_dmabuf, tmp, &entry->dmabufs, dmabufs_elm) { + dma_resv_lock(uverbs_dmabuf->dmabuf->resv, NULL); + list_del(&uverbs_dmabuf->dmabufs_elm); + uverbs_dmabuf->revoked = true; + dma_buf_move_notify(uverbs_dmabuf->dmabuf); + dma_resv_unlock(uverbs_dmabuf->dmabuf->resv); + } + mutex_unlock(&entry->dmabufs_lock); + kref_put(&entry->ref, rdma_user_mmap_entry_free); } EXPORT_SYMBOL(rdma_user_mmap_entry_remove); @@ -274,6 +290,9 @@ int rdma_user_mmap_entry_insert_range(struct ib_ucontext *ucontext, return -EINVAL;
kref_init(&entry->ref); + INIT_LIST_HEAD(&entry->dmabufs); + mutex_init(&entry->dmabufs_lock); + entry->ucontext = ucontext;
/* diff --git a/drivers/infiniband/core/rdma_core.c b/drivers/infiniband/core/rdma_core.c index 18918f463361..3e0a8b9cd288 100644 --- a/drivers/infiniband/core/rdma_core.c +++ b/drivers/infiniband/core/rdma_core.c @@ -465,7 +465,7 @@ alloc_begin_fd_uobject(const struct uverbs_api_object *obj,
fd_type = container_of(obj->type_attrs, struct uverbs_obj_fd_type, type); - if (WARN_ON(fd_type->fops->release != &uverbs_uobject_fd_release && + if (WARN_ON(fd_type->fops && fd_type->fops->release != &uverbs_uobject_fd_release && fd_type->fops->release != &uverbs_async_event_release)) { ret = ERR_PTR(-EINVAL); goto err_fd; @@ -477,14 +477,16 @@ alloc_begin_fd_uobject(const struct uverbs_api_object *obj, goto err_fd; }
- /* Note that uverbs_uobject_fd_release() is called during abort */ - filp = anon_inode_getfile(fd_type->name, fd_type->fops, NULL, - fd_type->flags); - if (IS_ERR(filp)) { - ret = ERR_CAST(filp); - goto err_getfile; + if (fd_type->fops) { + /* Note that uverbs_uobject_fd_release() is called during abort */ + filp = anon_inode_getfile(fd_type->name, fd_type->fops, NULL, + fd_type->flags); + if (IS_ERR(filp)) { + ret = ERR_CAST(filp); + goto err_getfile; + } + uobj->object = filp; } - uobj->object = filp;
uobj->id = new_fd; return uobj; @@ -561,7 +563,9 @@ static void alloc_abort_fd_uobject(struct ib_uobject *uobj) { struct file *filp = uobj->object;
- fput(filp); + if (filp) + fput(filp); + put_unused_fd(uobj->id); }
@@ -628,11 +632,14 @@ static void alloc_commit_fd_uobject(struct ib_uobject *uobj) /* This shouldn't be used anymore. Use the file object instead */ uobj->id = 0;
- /* - * NOTE: Once we install the file we loose ownership of our kref on - * uobj. It will be put by uverbs_uobject_fd_release() - */ - filp->private_data = uobj; + if (!filp->private_data) { + /* + * NOTE: Once we install the file we loose ownership of our kref on + * uobj. It will be put by uverbs_uobject_fd_release() + */ + filp->private_data = uobj; + } + fd_install(fd, filp); }
@@ -802,21 +809,10 @@ const struct uverbs_obj_type_class uverbs_idr_class = { }; EXPORT_SYMBOL(uverbs_idr_class);
-/* - * Users of UVERBS_TYPE_ALLOC_FD should set this function as the struct - * file_operations release method. - */ -int uverbs_uobject_fd_release(struct inode *inode, struct file *filp) +int uverbs_uobject_release(struct ib_uobject *uobj) { struct ib_uverbs_file *ufile; - struct ib_uobject *uobj;
- /* - * This can only happen if the fput came from alloc_abort_fd_uobject() - */ - if (!filp->private_data) - return 0; - uobj = filp->private_data; ufile = uobj->ufile;
if (down_read_trylock(&ufile->hw_destroy_rwsem)) { @@ -843,6 +839,21 @@ int uverbs_uobject_fd_release(struct inode *inode, struct file *filp) uverbs_uobject_put(uobj); return 0; } + +/* + * Users of UVERBS_TYPE_ALLOC_FD should set this function as the struct + * file_operations release method. + */ +int uverbs_uobject_fd_release(struct inode *inode, struct file *filp) +{ + /* + * This can only happen if the fput came from alloc_abort_fd_uobject() + */ + if (!filp->private_data) + return 0; + + return uverbs_uobject_release(filp->private_data); +} EXPORT_SYMBOL(uverbs_uobject_fd_release);
/* diff --git a/drivers/infiniband/core/rdma_core.h b/drivers/infiniband/core/rdma_core.h index a59b087611cb..55f1e3558856 100644 --- a/drivers/infiniband/core/rdma_core.h +++ b/drivers/infiniband/core/rdma_core.h @@ -156,6 +156,7 @@ extern const struct uapi_definition uverbs_def_obj_counters[]; extern const struct uapi_definition uverbs_def_obj_cq[]; extern const struct uapi_definition uverbs_def_obj_device[]; extern const struct uapi_definition uverbs_def_obj_dm[]; +extern const struct uapi_definition uverbs_def_obj_dmabuf[]; extern const struct uapi_definition uverbs_def_obj_dmah[]; extern const struct uapi_definition uverbs_def_obj_flow_action[]; extern const struct uapi_definition uverbs_def_obj_intf[]; diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index 797e2fcc8072..66287e8e7ad7 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -133,6 +133,16 @@ struct ib_uverbs_completion_event_file { struct ib_uverbs_event_queue ev_queue; };
+struct ib_uverbs_dmabuf_file { + struct ib_uobject uobj; + struct dma_buf *dmabuf; + struct list_head dmabufs_elm; + struct rdma_user_mmap_entry *mmap_entry; + struct dma_buf_phys_vec phys_vec; + struct p2pdma_provider *provider; + u8 revoked :1; +}; + struct ib_uverbs_event { union { struct ib_uverbs_async_event_desc async; diff --git a/drivers/infiniband/core/uverbs_std_types_dmabuf.c b/drivers/infiniband/core/uverbs_std_types_dmabuf.c new file mode 100644 index 000000000000..ef5484022e77 --- /dev/null +++ b/drivers/infiniband/core/uverbs_std_types_dmabuf.c @@ -0,0 +1,172 @@ +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +/* + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved + */ + +#include <linux/dma-buf-mapping.h> +#include <linux/pci-p2pdma.h> +#include <linux/dma-resv.h> +#include <rdma/uverbs_std_types.h> +#include "rdma_core.h" +#include "uverbs.h" + +static int uverbs_dmabuf_attach(struct dma_buf *dmabuf, + struct dma_buf_attachment *attachment) +{ + struct ib_uverbs_dmabuf_file *priv = dmabuf->priv; + + if (!attachment->peer2peer) + return -EOPNOTSUPP; + + if (priv->revoked) + return -ENODEV; + + return 0; +} + +static struct sg_table * +uverbs_dmabuf_map(struct dma_buf_attachment *attachment, + enum dma_data_direction dir) +{ + struct ib_uverbs_dmabuf_file *priv = attachment->dmabuf->priv; + + dma_resv_assert_held(priv->dmabuf->resv); + + if (priv->revoked) + return ERR_PTR(-ENODEV); + + return dma_buf_phys_vec_to_sgt(attachment, priv->provider, + &priv->phys_vec, 1, priv->phys_vec.len, + dir); +} + +static void uverbs_dmabuf_unmap(struct dma_buf_attachment *attachment, + struct sg_table *sgt, + enum dma_data_direction dir) +{ + dma_buf_free_sgt(attachment, sgt, dir); +} + +static void uverbs_dmabuf_release(struct dma_buf *dmabuf) +{ + struct ib_uverbs_dmabuf_file *priv = dmabuf->priv; + + /* + * This can only happen if the fput came from alloc_abort_fd_uobject() + */ + if (!priv->uobj.context) + return; + + 
uverbs_uobject_release(&priv->uobj); +} + +static const struct dma_buf_ops uverbs_dmabuf_ops = { + .attach = uverbs_dmabuf_attach, + .map_dma_buf = uverbs_dmabuf_map, + .unmap_dma_buf = uverbs_dmabuf_unmap, + .release = uverbs_dmabuf_release, +}; + +static int UVERBS_HANDLER(UVERBS_METHOD_DMABUF_ALLOC)( + struct uverbs_attr_bundle *attrs) +{ + struct ib_uobject *uobj = + uverbs_attr_get(attrs, UVERBS_ATTR_ALLOC_DMABUF_HANDLE) + ->obj_attr.uobject; + struct ib_uverbs_dmabuf_file *uverbs_dmabuf = + container_of(uobj, struct ib_uverbs_dmabuf_file, uobj); + struct ib_device *ib_dev = attrs->context->device; + struct rdma_user_mmap_entry *mmap_entry; + DEFINE_DMA_BUF_EXPORT_INFO(exp_info); + off_t pg_off; + int ret; + + ret = uverbs_get_const(&pg_off, attrs, UVERBS_ATTR_ALLOC_DMABUF_PGOFF); + if (ret) + return ret; + + mmap_entry = ib_dev->ops.pgoff_to_mmap_entry(attrs->context, pg_off); + if (!mmap_entry) + return -EINVAL; + + ret = ib_dev->ops.mmap_get_pfns(mmap_entry, &uverbs_dmabuf->phys_vec, + &uverbs_dmabuf->provider); + if (ret) + goto err; + + exp_info.ops = &uverbs_dmabuf_ops; + exp_info.size = uverbs_dmabuf->phys_vec.len; + exp_info.flags = O_CLOEXEC; + exp_info.priv = uverbs_dmabuf; + + uverbs_dmabuf->dmabuf = dma_buf_export(&exp_info); + if (IS_ERR(uverbs_dmabuf->dmabuf)) { + ret = PTR_ERR(uverbs_dmabuf->dmabuf); + goto err; + } + + INIT_LIST_HEAD(&uverbs_dmabuf->dmabufs_elm); + mutex_lock(&mmap_entry->dmabufs_lock); + if (mmap_entry->driver_removed) + ret = -EIO; + else + list_add_tail(&uverbs_dmabuf->dmabufs_elm, &mmap_entry->dmabufs); + mutex_unlock(&mmap_entry->dmabufs_lock); + if (ret) + goto err_revoked; + + uobj->object = uverbs_dmabuf->dmabuf->file; + uverbs_dmabuf->mmap_entry = mmap_entry; + uverbs_finalize_uobj_create(attrs, UVERBS_ATTR_ALLOC_DMABUF_HANDLE); + return 0; + +err_revoked: + dma_buf_put(uverbs_dmabuf->dmabuf); +err: + rdma_user_mmap_entry_put(mmap_entry); + return ret; +} + +DECLARE_UVERBS_NAMED_METHOD( + UVERBS_METHOD_DMABUF_ALLOC, + 
UVERBS_ATTR_FD(UVERBS_ATTR_ALLOC_DMABUF_HANDLE, + UVERBS_OBJECT_DMABUF, + UVERBS_ACCESS_NEW, + UA_MANDATORY), + UVERBS_ATTR_PTR_IN(UVERBS_ATTR_ALLOC_DMABUF_PGOFF, + UVERBS_ATTR_TYPE(u64), + UA_MANDATORY)); + +static void uverbs_dmabuf_fd_destroy_uobj(struct ib_uobject *uobj, + enum rdma_remove_reason why) +{ + struct ib_uverbs_dmabuf_file *uverbs_dmabuf = + container_of(uobj, struct ib_uverbs_dmabuf_file, uobj); + + mutex_lock(&uverbs_dmabuf->mmap_entry->dmabufs_lock); + dma_resv_lock(uverbs_dmabuf->dmabuf->resv, NULL); + if (!uverbs_dmabuf->revoked) { + uverbs_dmabuf->revoked = true; + list_del(&uverbs_dmabuf->dmabufs_elm); + dma_buf_move_notify(uverbs_dmabuf->dmabuf); + } + dma_resv_unlock(uverbs_dmabuf->dmabuf->resv); + mutex_unlock(&uverbs_dmabuf->mmap_entry->dmabufs_lock); + + /* Matches the get done as part of pgoff_to_mmap_entry() */ + rdma_user_mmap_entry_put(uverbs_dmabuf->mmap_entry); +}; + +DECLARE_UVERBS_NAMED_OBJECT( + UVERBS_OBJECT_DMABUF, + UVERBS_TYPE_ALLOC_FD(sizeof(struct ib_uverbs_dmabuf_file), + uverbs_dmabuf_fd_destroy_uobj, + NULL, NULL, O_RDONLY), + &UVERBS_METHOD(UVERBS_METHOD_DMABUF_ALLOC)); + +const struct uapi_definition uverbs_def_obj_dmabuf[] = { + UAPI_DEF_CHAIN_OBJ_TREE_NAMED(UVERBS_OBJECT_DMABUF), + UAPI_DEF_OBJ_NEEDS_FN(mmap_get_pfns), + UAPI_DEF_OBJ_NEEDS_FN(pgoff_to_mmap_entry), + {} +}; diff --git a/drivers/infiniband/core/uverbs_uapi.c b/drivers/infiniband/core/uverbs_uapi.c index e00ea63175bd..38d0bbbee796 100644 --- a/drivers/infiniband/core/uverbs_uapi.c +++ b/drivers/infiniband/core/uverbs_uapi.c @@ -631,6 +631,7 @@ static const struct uapi_definition uverbs_core_api[] = { UAPI_DEF_CHAIN(uverbs_def_obj_cq), UAPI_DEF_CHAIN(uverbs_def_obj_device), UAPI_DEF_CHAIN(uverbs_def_obj_dm), + UAPI_DEF_CHAIN(uverbs_def_obj_dmabuf), UAPI_DEF_CHAIN(uverbs_def_obj_dmah), UAPI_DEF_CHAIN(uverbs_def_obj_flow_action), UAPI_DEF_CHAIN(uverbs_def_obj_intf), diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 
6c372a37c482..5be67013c8ae 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -43,6 +43,7 @@ #include <uapi/rdma/rdma_user_ioctl.h> #include <uapi/rdma/ib_user_ioctl_verbs.h> #include <linux/pci-tph.h> +#include <linux/dma-buf.h>
#define IB_FW_VERSION_NAME_MAX ETHTOOL_FWVERS_LEN
@@ -2363,6 +2364,9 @@ struct rdma_user_mmap_entry { unsigned long start_pgoff; size_t npages; bool driver_removed; + /* protects access to dmabufs */ + struct mutex dmabufs_lock; + struct list_head dmabufs; };
/* Return the offset (in bytes) the user should pass to libc's mmap() */ @@ -2500,6 +2504,11 @@ struct ib_device_ops { * Therefore needs to be implemented by the driver in mmap_free. */ void (*mmap_free)(struct rdma_user_mmap_entry *entry); + int (*mmap_get_pfns)(struct rdma_user_mmap_entry *entry, + struct dma_buf_phys_vec *phys_vec, + struct p2pdma_provider **provider); + struct rdma_user_mmap_entry *(*pgoff_to_mmap_entry)(struct ib_ucontext *ucontext, + off_t pg_off); void (*disassociate_ucontext)(struct ib_ucontext *ibcontext); int (*alloc_pd)(struct ib_pd *pd, struct ib_udata *udata); int (*dealloc_pd)(struct ib_pd *pd, struct ib_udata *udata); diff --git a/include/rdma/uverbs_types.h b/include/rdma/uverbs_types.h index 26ba919ac245..6a253b7dc5ea 100644 --- a/include/rdma/uverbs_types.h +++ b/include/rdma/uverbs_types.h @@ -186,6 +186,7 @@ struct ib_uverbs_file { extern const struct uverbs_obj_type_class uverbs_idr_class; extern const struct uverbs_obj_type_class uverbs_fd_class; int uverbs_uobject_fd_release(struct inode *inode, struct file *filp); +int uverbs_uobject_release(struct ib_uobject *uobj);
#define UVERBS_BUILD_BUG_ON(cond) (sizeof(char[1 - 2 * !!(cond)]) - \ sizeof(char)) diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h index 35da4026f452..72041c1b0ea5 100644 --- a/include/uapi/rdma/ib_user_ioctl_cmds.h +++ b/include/uapi/rdma/ib_user_ioctl_cmds.h @@ -56,6 +56,7 @@ enum uverbs_default_objects { UVERBS_OBJECT_COUNTERS, UVERBS_OBJECT_ASYNC_EVENT, UVERBS_OBJECT_DMAH, + UVERBS_OBJECT_DMABUF, };
enum { @@ -263,6 +264,15 @@ enum uverbs_methods_dmah { UVERBS_METHOD_DMAH_FREE, };
+enum uverbs_attrs_alloc_dmabuf_cmd_attr_ids { + UVERBS_ATTR_ALLOC_DMABUF_HANDLE, + UVERBS_ATTR_ALLOC_DMABUF_PGOFF, +}; + +enum uverbs_methods_dmabuf { + UVERBS_METHOD_DMABUF_ALLOC, +}; + enum uverbs_attrs_reg_dm_mr_cmd_attr_ids { UVERBS_ATTR_REG_DM_MR_HANDLE, UVERBS_ATTR_REG_DM_MR_OFFSET,
On Thu, Jan 08, 2026 at 01:11:14PM +0200, Edward Srouji wrote:
> void rdma_user_mmap_entry_remove(struct rdma_user_mmap_entry *entry)
> {
> +        struct ib_uverbs_dmabuf_file *uverbs_dmabuf, *tmp;
> +
>         if (!entry)
>                 return;
>
> +        mutex_lock(&entry->dmabufs_lock);
>         xa_lock(&entry->ucontext->mmap_xa);
>         entry->driver_removed = true;
>         xa_unlock(&entry->ucontext->mmap_xa);
> +        list_for_each_entry_safe(uverbs_dmabuf, tmp, &entry->dmabufs, dmabufs_elm) {
> +                dma_resv_lock(uverbs_dmabuf->dmabuf->resv, NULL);
> +                list_del(&uverbs_dmabuf->dmabufs_elm);
> +                uverbs_dmabuf->revoked = true;
> +                dma_buf_move_notify(uverbs_dmabuf->dmabuf);
> +                dma_resv_unlock(uverbs_dmabuf->dmabuf->resv);
> +        }
This will need the same wait that Christian pointed out for VFIO..
> diff --git a/drivers/infiniband/core/rdma_core.c b/drivers/infiniband/core/rdma_core.c
> index 18918f463361..3e0a8b9cd288 100644
> --- a/drivers/infiniband/core/rdma_core.c
> +++ b/drivers/infiniband/core/rdma_core.c
> @@ -465,7 +465,7 @@ alloc_begin_fd_uobject(const struct uverbs_api_object *obj,
>
>         fd_type = container_of(obj->type_attrs, struct uverbs_obj_fd_type, type);
> -        if (WARN_ON(fd_type->fops->release != &uverbs_uobject_fd_release &&
> +        if (WARN_ON(fd_type->fops && fd_type->fops->release != &uverbs_uobject_fd_release &&
>                      fd_type->fops->release != &uverbs_async_event_release)) {
>                 ret = ERR_PTR(-EINVAL);
>                 goto err_fd;
> @@ -477,14 +477,16 @@ alloc_begin_fd_uobject(const struct uverbs_api_object *obj,
>                 goto err_fd;
>         }
>
> -        /* Note that uverbs_uobject_fd_release() is called during abort */
> -        filp = anon_inode_getfile(fd_type->name, fd_type->fops, NULL,
> -                                  fd_type->flags);
> -        if (IS_ERR(filp)) {
> -                ret = ERR_CAST(filp);
> -                goto err_getfile;
> +        if (fd_type->fops) {
> +                /* Note that uverbs_uobject_fd_release() is called during abort */
> +                filp = anon_inode_getfile(fd_type->name, fd_type->fops, NULL,
> +                                          fd_type->flags);
> +                if (IS_ERR(filp)) {
> +                        ret = ERR_CAST(filp);
> +                        goto err_getfile;
> +                }
> +                uobj->object = filp;
>         }
> -        uobj->object = filp;
>
>         uobj->id = new_fd;
>         return uobj;
> @@ -561,7 +563,9 @@ static void alloc_abort_fd_uobject(struct ib_uobject *uobj)
> {
>         struct file *filp = uobj->object;
>
> -        fput(filp);
> +        if (filp)
> +                fput(filp);
> +
>         put_unused_fd(uobj->id);
> }
This stuff changing how the uobjects work should probably be in its own patch with its own explanation about creating a uobject that wraps an externally allocated file descriptor vs this automatic internal allocation.
> index 797e2fcc8072..66287e8e7ad7 100644
> --- a/drivers/infiniband/core/uverbs.h
> +++ b/drivers/infiniband/core/uverbs.h
> @@ -133,6 +133,16 @@ struct ib_uverbs_completion_event_file {
>         struct ib_uverbs_event_queue ev_queue;
> };
>
> +struct ib_uverbs_dmabuf_file {
> +        struct ib_uobject uobj;
> +        struct dma_buf *dmabuf;
> +        struct list_head dmabufs_elm;
> +        struct rdma_user_mmap_entry *mmap_entry;
> +        struct dma_buf_phys_vec phys_vec;
Oh, are we going to have weird merge conflicts with this Leon?
> +static int uverbs_dmabuf_attach(struct dma_buf *dmabuf,
> +                                struct dma_buf_attachment *attachment)
> +{
> +        struct ib_uverbs_dmabuf_file *priv = dmabuf->priv;
> +
> +        if (!attachment->peer2peer)
> +                return -EOPNOTSUPP;
> +
> +        if (priv->revoked)
> +                return -ENODEV;
This should only be checked in map
This should also eventually call the new revoke testing function Leon is adding
Jason
On Tue, Jan 20, 2026 at 02:15:20PM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 08, 2026 at 01:11:14PM +0200, Edward Srouji wrote:
> > void rdma_user_mmap_entry_remove(struct rdma_user_mmap_entry *entry)
> > {
> > +        struct ib_uverbs_dmabuf_file *uverbs_dmabuf, *tmp;
> > +
> >         if (!entry)
> >                 return;
> >
> > +        mutex_lock(&entry->dmabufs_lock);
> >         xa_lock(&entry->ucontext->mmap_xa);
> >         entry->driver_removed = true;
> >         xa_unlock(&entry->ucontext->mmap_xa);
> > +        list_for_each_entry_safe(uverbs_dmabuf, tmp, &entry->dmabufs, dmabufs_elm) {
> > +                dma_resv_lock(uverbs_dmabuf->dmabuf->resv, NULL);
> > +                list_del(&uverbs_dmabuf->dmabufs_elm);
> > +                uverbs_dmabuf->revoked = true;
> > +                dma_buf_move_notify(uverbs_dmabuf->dmabuf);
> > +                dma_resv_unlock(uverbs_dmabuf->dmabuf->resv);
> > +        }
>
> This will need the same wait that Christian pointed out for VFIO..
Yes, something like this is missing https://lore.kernel.org/all/20260120-dmabuf-revoke-v3-6-b7e0b07b8214@nvidia....
<...>
> > +struct ib_uverbs_dmabuf_file {
> > +        struct ib_uobject uobj;
> > +        struct dma_buf *dmabuf;
> > +        struct list_head dmabufs_elm;
> > +        struct rdma_user_mmap_entry *mmap_entry;
> > +        struct dma_buf_phys_vec phys_vec;
>
> Oh, are we going to have weird merge conflicts with this Leon?
No, Alex created a shared branch with the rename already applied for me. I had planned to merge it into the RDMA tree before taking this series, and then update dma_buf_phys_vec to phys_vec locally.
> > +static int uverbs_dmabuf_attach(struct dma_buf *dmabuf,
> > +                                struct dma_buf_attachment *attachment)
> > +{
> > +        struct ib_uverbs_dmabuf_file *priv = dmabuf->priv;
> > +
> > +        if (!attachment->peer2peer)
> > +                return -EOPNOTSUPP;
> > +
> > +        if (priv->revoked)
> > +                return -ENODEV;
>
> This should only be checked in map
I disagree with the word "only"; the more accurate word is "too". There is no need to allow a new importer to attach if this exporter is marked as revoked.
> This should also eventually call the new revoke testing function Leon is adding
We will add it once my series is accepted.
Thanks
Jason
On Wed, Jan 21, 2026 at 10:32:46AM +0200, Leon Romanovsky wrote:
> > > +static int uverbs_dmabuf_attach(struct dma_buf *dmabuf,
> > > +                                struct dma_buf_attachment *attachment)
> > > +{
> > > +        struct ib_uverbs_dmabuf_file *priv = dmabuf->priv;
> > > +
> > > +        if (!attachment->peer2peer)
> > > +                return -EOPNOTSUPP;
> > > +
> > > +        if (priv->revoked)
> > > +                return -ENODEV;
> >
> > This should only be checked in map
> I disagree with the word "only"; the more accurate word is "too". There is no need to allow a new importer to attach if this exporter is marked as revoked.
It must be checked during map; checking during attach as well is redundant and a bit confusing.
> > This should also eventually call the new revoke testing function Leon is adding
> We will add it once my series is accepted.
It should also refuse pinned importers with an always-fail pin op until we get that done. This is a case like VFIO where the lifecycle is more general and I don't want to accidentally allow things that shouldn't work.
Jason
On 21/01/2026 15:56, Jason Gunthorpe wrote:
> On Wed, Jan 21, 2026 at 10:32:46AM +0200, Leon Romanovsky wrote:
> > > > +static int uverbs_dmabuf_attach(struct dma_buf *dmabuf,
> > > > +                                struct dma_buf_attachment *attachment)
> > > > +{
> > > > +        struct ib_uverbs_dmabuf_file *priv = dmabuf->priv;
> > > > +
> > > > +        if (!attachment->peer2peer)
> > > > +                return -EOPNOTSUPP;
> > > > +
> > > > +        if (priv->revoked)
> > > > +                return -ENODEV;
> > >
> > > This should only be checked in map
> > I disagree with the word "only"; the more accurate word is "too". There is no need to allow a new importer to attach if this exporter is marked as revoked.
> It must be checked during map; checking during attach as well is redundant and a bit confusing.
OK, let's drop this check as part of the 'attach'.
> > This should also eventually call the new revoke testing function Leon is adding
> We will add it once my series is accepted.
> It should also refuse pinned importers with an always-fail pin op until we get that done. This is a case like VFIO where the lifecycle is more general and I don't want to accidentally allow things that shouldn't work.
Sure, will be part of V1.
Thanks, Yishai
On 20/01/2026 20:15, Jason Gunthorpe wrote:
> On Thu, Jan 08, 2026 at 01:11:14PM +0200, Edward Srouji wrote:
> > void rdma_user_mmap_entry_remove(struct rdma_user_mmap_entry *entry)
> > {
> > +        struct ib_uverbs_dmabuf_file *uverbs_dmabuf, *tmp;
> > +
> >         if (!entry)
> >                 return;
> >
> > +        mutex_lock(&entry->dmabufs_lock);
> >         xa_lock(&entry->ucontext->mmap_xa);
> >         entry->driver_removed = true;
> >         xa_unlock(&entry->ucontext->mmap_xa);
> > +        list_for_each_entry_safe(uverbs_dmabuf, tmp, &entry->dmabufs, dmabufs_elm) {
> > +                dma_resv_lock(uverbs_dmabuf->dmabuf->resv, NULL);
> > +                list_del(&uverbs_dmabuf->dmabufs_elm);
> > +                uverbs_dmabuf->revoked = true;
> > +                dma_buf_move_notify(uverbs_dmabuf->dmabuf);
> > +                dma_resv_unlock(uverbs_dmabuf->dmabuf->resv);
> > +        }
>
> This will need the same wait that Christian pointed out for VFIO..
Sure, I'll add.
> > diff --git a/drivers/infiniband/core/rdma_core.c b/drivers/infiniband/core/rdma_core.c
> > index 18918f463361..3e0a8b9cd288 100644
> > --- a/drivers/infiniband/core/rdma_core.c
> > +++ b/drivers/infiniband/core/rdma_core.c
> > @@ -465,7 +465,7 @@ alloc_begin_fd_uobject(const struct uverbs_api_object *obj,
> >
> >         fd_type = container_of(obj->type_attrs, struct uverbs_obj_fd_type, type);
> > -        if (WARN_ON(fd_type->fops->release != &uverbs_uobject_fd_release &&
> > +        if (WARN_ON(fd_type->fops && fd_type->fops->release != &uverbs_uobject_fd_release &&
> >                      fd_type->fops->release != &uverbs_async_event_release)) {
> >                 ret = ERR_PTR(-EINVAL);
> >                 goto err_fd;
> > @@ -477,14 +477,16 @@ alloc_begin_fd_uobject(const struct uverbs_api_object *obj,
> >                 goto err_fd;
> >         }
> >
> > -        /* Note that uverbs_uobject_fd_release() is called during abort */
> > -        filp = anon_inode_getfile(fd_type->name, fd_type->fops, NULL,
> > -                                  fd_type->flags);
> > -        if (IS_ERR(filp)) {
> > -                ret = ERR_CAST(filp);
> > -                goto err_getfile;
> > +        if (fd_type->fops) {
> > +                /* Note that uverbs_uobject_fd_release() is called during abort */
> > +                filp = anon_inode_getfile(fd_type->name, fd_type->fops, NULL,
> > +                                          fd_type->flags);
> > +                if (IS_ERR(filp)) {
> > +                        ret = ERR_CAST(filp);
> > +                        goto err_getfile;
> > +                }
> > +                uobj->object = filp;
> >         }
> > -        uobj->object = filp;
> >
> >         uobj->id = new_fd;
> >         return uobj;
> > @@ -561,7 +563,9 @@ static void alloc_abort_fd_uobject(struct ib_uobject *uobj)
> > {
> >         struct file *filp = uobj->object;
> >
> > -        fput(filp);
> > +        if (filp)
> > +                fput(filp);
> > +
> >         put_unused_fd(uobj->id);
> > }
> This stuff changing how the uobjects work should probably be in its own patch with its own explanation about creating a uobject that wraps an externally allocated file descriptor vs this automatic internal allocation.
Sure, I’ll split the current patch into two patches.
> > index 797e2fcc8072..66287e8e7ad7 100644
> > --- a/drivers/infiniband/core/uverbs.h
> > +++ b/drivers/infiniband/core/uverbs.h
> > @@ -133,6 +133,16 @@ struct ib_uverbs_completion_event_file {
> >         struct ib_uverbs_event_queue ev_queue;
> > };
> >
> > +struct ib_uverbs_dmabuf_file {
> > +        struct ib_uobject uobj;
> > +        struct dma_buf *dmabuf;
> > +        struct list_head dmabufs_elm;
> > +        struct rdma_user_mmap_entry *mmap_entry;
> > +        struct dma_buf_phys_vec phys_vec;
> Oh, are we going to have weird merge conflicts with this Leon?
> > +static int uverbs_dmabuf_attach(struct dma_buf *dmabuf,
> > +                                struct dma_buf_attachment *attachment)
> > +{
> > +        struct ib_uverbs_dmabuf_file *priv = dmabuf->priv;
> > +
> > +        if (!attachment->peer2peer)
> > +                return -EOPNOTSUPP;
> > +
> > +        if (priv->revoked)
> > +                return -ENODEV;
>
> This should only be checked in map
Please see Leon's answer on that.
Yishai
On Thu, Jan 08, 2026 at 01:11:14PM +0200, Edward Srouji wrote:
From: Yishai Hadas yishaih@nvidia.com
Expose DMABUF functionality to userspace through the uverbs interface, enabling InfiniBand/RDMA devices to export PCI based memory regions (e.g. device memory) as DMABUF file descriptors. This allows zero-copy sharing of RDMA memory with other subsystems that support the dma-buf framework.
A new UVERBS_OBJECT_DMABUF object type and allocation method were introduced.
During allocation, uverbs invokes the driver to supply the rdma_user_mmap_entry associated with the given page offset (pgoff).
Based on the returned rdma_user_mmap_entry, uverbs requests the driver to provide the corresponding physical-memory details as well as the driver’s PCI provider information.
Using this information, dma_buf_export() is called; if it succeeds, uobj->object is set to the underlying file pointer returned by the dma-buf framework.
The file descriptor number follows the standard uverbs allocation flow, but the file pointer comes from the dma-buf subsystem, including its own fops and private data.
Because of this, alloc_begin_fd_uobject() must handle cases where fd_type->fops is NULL, and both alloc_commit_fd_uobject() and alloc_abort_fd_uobject() must account for whether filp->private_data exists, since it is only populated after a successful dma_buf_export().
When an mmap entry is removed, uverbs iterates over its associated DMABUFs, marks them as revoked, and calls dma_buf_move_notify() so that their importers are notified.
The same procedure applies during the disassociate flow; final cleanup occurs when the application closes the file.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
 drivers/infiniband/core/Makefile                  |   1 +
 drivers/infiniband/core/device.c                  |   2 +
 drivers/infiniband/core/ib_core_uverbs.c          |  19 +++
 drivers/infiniband/core/rdma_core.c               |  63 ++++----
 drivers/infiniband/core/rdma_core.h               |   1 +
 drivers/infiniband/core/uverbs.h                  |  10 ++
 drivers/infiniband/core/uverbs_std_types_dmabuf.c | 172 ++++++++++++++++++++++
 drivers/infiniband/core/uverbs_uapi.c             |   1 +
 include/rdma/ib_verbs.h                           |   9 ++
 include/rdma/uverbs_types.h                       |   1 +
 include/uapi/rdma/ib_user_ioctl_cmds.h            |  10 ++
 11 files changed, 263 insertions(+), 26 deletions(-)
<...>
> +static struct sg_table *
> +uverbs_dmabuf_map(struct dma_buf_attachment *attachment,
> +		  enum dma_data_direction dir)
> +{
> +	struct ib_uverbs_dmabuf_file *priv = attachment->dmabuf->priv;
> +
> +	dma_resv_assert_held(priv->dmabuf->resv);
> +
> +	if (priv->revoked)
> +		return ERR_PTR(-ENODEV);
> +
> +	return dma_buf_phys_vec_to_sgt(attachment, priv->provider,
> +				       &priv->phys_vec, 1,
> +				       priv->phys_vec.len, dir);
> +}
> +
> +static void uverbs_dmabuf_unmap(struct dma_buf_attachment *attachment,
> +				struct sg_table *sgt,
> +				enum dma_data_direction dir)
> +{
> +	dma_buf_free_sgt(attachment, sgt, dir);
> +}
Unfortunately, it is not enough. Exporters should count their map<->unmap calls and make sure that they are equal.
See this VFIO change https://lore.kernel.org/kvm/20260124-dmabuf-revoke-v5-4-f98fca917e96@nvidia....
Thanks
From: Yishai Hadas <yishaih@nvidia.com>
Enable p2pdma on the mlx5 PCI device to allow DMABUF-based peer-to-peer DMA mappings.
Add implementation of the mmap_get_pfns and pgoff_to_mmap_entry device operations required for DMABUF support in the mlx5 RDMA driver.
The pgoff_to_mmap_entry operation converts a page offset to the corresponding rdma_user_mmap_entry by extracting the command and index from the offset and looking it up in the ucontext's mmap_xa.
The mmap_get_pfns operation retrieves the physical address and length from the mmap entry and obtains the p2pdma provider for the underlying PCI device, which is needed for peer-to-peer DMA operations with DMABUFs.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 drivers/infiniband/hw/mlx5/main.c | 72 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index e81080622283..f97c86c96d83 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2446,6 +2446,70 @@ static int mlx5_ib_mmap_clock_info_page(struct mlx5_ib_dev *dev,
 					virt_to_page(dev->mdev->clock_info));
 }
 
+static int phys_addr_to_bar(struct pci_dev *pdev, phys_addr_t pa)
+{
+	resource_size_t start, end;
+	int bar;
+
+	for (bar = 0; bar < PCI_STD_NUM_BARS; bar++) {
+		/* Skip BARs not present or not memory-mapped */
+		if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
+			continue;
+
+		start = pci_resource_start(pdev, bar);
+		end = pci_resource_end(pdev, bar);
+
+		if (!start || !end)
+			continue;
+
+		if (pa >= start && pa <= end)
+			return bar;
+	}
+
+	return -1;
+}
+
+static int mlx5_ib_mmap_get_pfns(struct rdma_user_mmap_entry *entry,
+				 struct dma_buf_phys_vec *phys_vec,
+				 struct p2pdma_provider **provider)
+{
+	struct mlx5_user_mmap_entry *mentry = to_mmmap(entry);
+	struct pci_dev *pdev = to_mdev(entry->ucontext->device)->mdev->pdev;
+	int bar;
+
+	phys_vec->paddr = mentry->address;
+	phys_vec->len = entry->npages * PAGE_SIZE;
+
+	bar = phys_addr_to_bar(pdev, phys_vec->paddr);
+	if (bar < 0)
+		return -EINVAL;
+
+	*provider = pcim_p2pdma_provider(pdev, bar);
+	/* If the kernel was not compiled with CONFIG_PCI_P2PDMA the
+	 * functionality is not supported.
+	 */
+	if (!*provider)
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
+static struct rdma_user_mmap_entry *
+mlx5_ib_pgoff_to_mmap_entry(struct ib_ucontext *ucontext, off_t pg_off)
+{
+	unsigned long entry_pgoff;
+	unsigned long idx;
+	u8 command;
+
+	pg_off = pg_off >> PAGE_SHIFT;
+	command = get_command(pg_off);
+	idx = get_extended_index(pg_off);
+
+	entry_pgoff = command << 16 | idx;
+
+	return rdma_user_mmap_entry_get_pgoff(ucontext, entry_pgoff);
+}
+
 static void mlx5_ib_mmap_free(struct rdma_user_mmap_entry *entry)
 {
 	struct mlx5_user_mmap_entry *mentry = to_mmmap(entry);
@@ -4360,7 +4424,13 @@ static int mlx5_ib_stage_init_init(struct mlx5_ib_dev *dev)
 	if (err)
 		goto err_mp;
 
+	err = pcim_p2pdma_init(mdev->pdev);
+	if (err && err != -EOPNOTSUPP)
+		goto err_dd;
+
 	return 0;
+err_dd:
+	mlx5_ib_data_direct_cleanup(dev);
 err_mp:
 	mlx5_ib_cleanup_multiport_master(dev);
 err:
@@ -4412,11 +4482,13 @@ static const struct ib_device_ops mlx5_ib_dev_ops = {
 	.map_mr_sg_pi = mlx5_ib_map_mr_sg_pi,
 	.mmap = mlx5_ib_mmap,
 	.mmap_free = mlx5_ib_mmap_free,
+	.mmap_get_pfns = mlx5_ib_mmap_get_pfns,
 	.modify_cq = mlx5_ib_modify_cq,
 	.modify_device = mlx5_ib_modify_device,
 	.modify_port = mlx5_ib_modify_port,
 	.modify_qp = mlx5_ib_modify_qp,
 	.modify_srq = mlx5_ib_modify_srq,
+	.pgoff_to_mmap_entry = mlx5_ib_pgoff_to_mmap_entry,
 	.pre_destroy_cq = mlx5_ib_pre_destroy_cq,
 	.poll_cq = mlx5_ib_poll_cq,
 	.post_destroy_cq = mlx5_ib_post_destroy_cq,
On Thu, Jan 08, 2026 at 01:11:15PM +0200, Edward Srouji wrote:
> +static int phys_addr_to_bar(struct pci_dev *pdev, phys_addr_t pa)
> +{
> +	resource_size_t start, end;
> +	int bar;
> +
> +	for (bar = 0; bar < PCI_STD_NUM_BARS; bar++) {
> +		/* Skip BARs not present or not memory-mapped */
> +		if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
> +			continue;
> +
> +		start = pci_resource_start(pdev, bar);
> +		end = pci_resource_end(pdev, bar);
> +
> +		if (!start || !end)
> +			continue;
> +
> +		if (pa >= start && pa <= end)
> +			return bar;
> +	}
Don't we know which of the two BARs the mmap entry came from based on its type? This seems like overkill..
Jason