In a typical dma-buf use case, a dmabuf exporter makes its buffer
available to an importer by mapping it using DMA APIs
such as dma_map_sgtable() or dma_map_resource(). However, this
is not desirable in some cases where the exporter and importer
are directly connected via a physical or virtual link (or
interconnect) and the importer can access the buffer without
having it DMA mapped.
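For reference, here is a minimal sketch of the conventional importer-side
flow that depends on the exporter DMA-mapping the buffer; this is the
path the series provides an alternative to (the helper name is
illustrative, error handling trimmed):

#include <linux/dma-buf.h>
#include <linux/err.h>

/* Conventional import: attach a device, get back a DMA-mapped sg_table. */
static struct sg_table *legacy_import(struct dma_buf *dmabuf,
                                      struct device *dev)
{
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;

        attach = dma_buf_attach(dmabuf, dev);
        if (IS_ERR(attach))
                return ERR_CAST(attach);

        /* The exporter DMA-maps the buffer for @dev here. */
        sgt = dma_buf_map_attachment_unlocked(attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(sgt))
                dma_buf_detach(dmabuf, attach);
        return sgt;
}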
So, to address this scenario, this patch series adds APIs to map/
unmap dmabufs via interconnects and also provides a helper to
identify the first common interconnect between the exporter and
importer. Furthermore, this patch series adds support for the
IOV interconnect in the vfio-pci and Intel Xe drivers.
The IOV interconnect is a virtual interconnect between an SRIOV
physical function (PF) and its virtual functions (VFs). For
the IOV interconnect, the addresses associated with a buffer are
shared using an xarray (instead of an sg_table) that is populated
with entries of type struct range.
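To illustrate that layout, here is a minimal, hypothetical sketch (the
helper name and locking are ours, not taken from the series) of how an
exporter could publish one address range into such an xarray:

#include <linux/range.h>
#include <linux/slab.h>
#include <linux/xarray.h>

/* Publish one per-buffer address range as a struct range entry. */
static int iov_publish_range(struct xarray *xa, unsigned long index,
                             u64 start, u64 end)
{
        struct range *r = kmalloc(sizeof(*r), GFP_KERNEL);

        if (!r)
                return -ENOMEM;

        r->start = start;
        r->end = end;

        /* xa_insert() returns -EBUSY if the index is already in use. */
        return xa_insert(xa, index, r, GFP_KERNEL);
}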
The dma-buf patches in this series are based on ideas/suggestions
provided by Jason Gunthorpe, Christian Koenig and Thomas Hellström.
Changelog:
RFC -> RFCv2:
- Add documentation for the new dma-buf APIs and types (Thomas)
- Change the interconnect type from enum to unique pointer (Thomas)
- Moved the new dma-buf APIs to a separate file
- Store a copy of the interconnect matching data in the attachment
- Simplified the macros to create and match interconnects
- Use struct device instead of struct pci_dev in match data
- Replace DRM_INTERCONNECT_DRIVER with XE_INTERCONNECT_VRAM during
address encoding (Matt, Thomas)
- Drop is_devmem_external and instead rely on bo->dma_data.dma_addr
to check for imported VRAM BOs (Matt)
- Pass XE_PAGE_SIZE as the last parameter to xe_bo_addr (Matt)
- Add a check to prevent malicious VF from accessing other VF's
addresses (Thomas)
- Fall back to the legacy (map_dma_buf) mapping method if mapping via
  interconnect fails
Patchset overview:
Patches 1-3: Add dma-buf APIs to map/unmap and match interconnects
Patch 4: Add support for IOV interconnect in vfio-pci driver
Patch 5: Add support for IOV interconnect in Xe driver
Patches 6-8: Create and use a new dma_addr array for LMEM based
dmabuf BOs to store translated addresses (DPAs)
This series is rebased on top of the following repo:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=…
Associated Qemu patch series:
https://lore.kernel.org/qemu-devel/20251003234138.85820-1-vivek.kasireddy@i…
Associated vfio-pci patch series:
https://lore.kernel.org/dri-devel/cover.1760368250.git.leon@kernel.org/
This series is tested using the following method:
- Run Qemu with the following relevant options:
qemu-system-x86_64 -m 4096m ....
-device ioh3420,id=root_port1,bus=pcie.0
-device x3130-upstream,id=upstream1,bus=root_port1
-device xio3130-downstream,id=downstream1,bus=upstream1,chassis=9
-device xio3130-downstream,id=downstream2,bus=upstream1,chassis=10
-device vfio-pci,host=0000:03:00.1,bus=downstream1
-device virtio-gpu,max_outputs=1,blob=true,xres=1920,yres=1080,bus=downstream2
-display gtk,gl=on
-object memory-backend-memfd,id=mem1,size=4096M
-machine q35,accel=kvm,memory-backend=mem1 ...
- Run Gnome Wayland with the following options in the Guest VM:
# cat /usr/lib/udev/rules.d/61-mutter-primary-gpu.rules
ENV{DEVNAME}=="/dev/dri/card1", TAG+="mutter-device-preferred-primary", TAG+="mutter-device-disable-kms-modifiers"
# XDG_SESSION_TYPE=wayland dbus-run-session -- /usr/bin/gnome-shell --wayland --no-x11 &
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Leon Romanovsky <leonro(a)nvidia.com>
Cc: Christian Koenig <christian.koenig(a)amd.com>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Cc: Simona Vetter <simona.vetter(a)ffwll.ch>
Cc: Matthew Brost <matthew.brost(a)intel.com>
Cc: Dongwon Kim <dongwon.kim(a)intel.com>
Vivek Kasireddy (8):
dma-buf: Add support for map/unmap APIs for interconnects
dma-buf: Add a helper to match interconnects between exporter/importer
dma-buf: Create and expose IOV interconnect to all exporters/importers
vfio/pci/dmabuf: Add support for IOV interconnect
drm/xe/dma_buf: Add support for IOV interconnect
drm/xe/pf: Add a helper function to get a VF's backing object in LMEM
drm/xe/bo: Create new dma_addr array for dmabuf BOs associated with
VFs
drm/xe/pt: Add an additional check for dmabuf BOs while doing bind
drivers/dma-buf/Makefile | 2 +-
drivers/dma-buf/dma-buf-interconnect.c | 164 +++++++++++++++++++++
drivers/dma-buf/dma-buf.c | 12 +-
drivers/gpu/drm/xe/xe_bo.c | 162 ++++++++++++++++++--
drivers/gpu/drm/xe/xe_bo_types.h | 6 +
drivers/gpu/drm/xe/xe_dma_buf.c | 17 ++-
drivers/gpu/drm/xe/xe_gt_sriov_pf_config.c | 24 +++
drivers/gpu/drm/xe/xe_gt_sriov_pf_config.h | 1 +
drivers/gpu/drm/xe/xe_pt.c | 8 +
drivers/gpu/drm/xe/xe_sriov_pf_types.h | 19 +++
drivers/vfio/pci/vfio_pci_dmabuf.c | 135 ++++++++++++++++-
include/linux/dma-buf-interconnect.h | 122 +++++++++++++++
include/linux/dma-buf.h | 41 ++++++
13 files changed, 691 insertions(+), 22 deletions(-)
create mode 100644 drivers/dma-buf/dma-buf-interconnect.c
create mode 100644 include/linux/dma-buf-interconnect.h
--
2.50.1
This series is the start of adding full DMABUF support to
iommufd. Currently it is limited to only work with VFIO's DMABUF exporter.
It sits on top of Leon's series to add a DMABUF exporter to VFIO:
https://lore.kernel.org/all/cover.1760368250.git.leon@kernel.org/
The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF fds, but
otherwise works the same as it does today for a memfd. The user can select
a slice of the FD to map into the ioas and, if the underlying alignment
requirements are met, it will be placed in the iommu_domain.
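As an illustration, a hedged userspace sketch of that flow, using the
IOMMU_IOAS_MAP_FILE layout as it exists for memfd today (the helper name
and field values are illustrative):

#include <linux/iommufd.h>
#include <sys/ioctl.h>

static int map_dmabuf_slice(int iommufd, __u32 ioas_id, int dmabuf_fd,
                            __u64 offset, __u64 length, __u64 iova)
{
        struct iommu_ioas_map_file cmd = {
                .size = sizeof(cmd),
                .flags = IOMMU_IOAS_MAP_FIXED_IOVA |
                         IOMMU_IOAS_MAP_READABLE |
                         IOMMU_IOAS_MAP_WRITEABLE,
                .ioas_id = ioas_id,
                .fd = dmabuf_fd,
                .start = offset,        /* slice offset within the dmabuf */
                .length = length,
                .iova = iova,           /* fixed IOVA within the ioas */
        };

        return ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &cmd);
}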
Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR
memory from VFIO to an iommu_domain controlled by iommufd. This is used
for PCI Peer to Peer support in VMs, and is the last feature that the VFIO
type 1 container has that iommufd couldn't do.
The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime
control and is a use-after-free security problem.
Instead iommufd relies on revocable DMABUFs. Whenever VFIO thinks there
should be no access to the MMIO it can shoot down the mapping in iommufd
which will unmap it from the iommu_domain. There is no automatic remap,
this is a safety protocol so the kernel doesn't get stuck. Userspace is
expected to know it is doing something that will revoke the dmabuf and
map/unmap it around the activity. E.g. when QEMU goes to issue an FLR it should
do the map/unmap to iommufd.
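A sketch of that protocol around an FLR, reusing map_dmabuf_slice() from
the sketch above (ioctl layouts follow today's uAPI; the flow itself is
illustrative, error handling trimmed):

#include <linux/iommufd.h>
#include <linux/vfio.h>
#include <sys/ioctl.h>

static int flr_with_unmap(int iommufd, __u32 ioas_id, int vfio_fd,
                          int dmabuf_fd, __u64 iova, __u64 length)
{
        struct iommu_ioas_unmap unmap = {
                .size = sizeof(unmap),
                .ioas_id = ioas_id,
                .iova = iova,
                .length = length,
        };

        if (ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap))
                return -1;

        if (ioctl(vfio_fd, VFIO_DEVICE_RESET))  /* revokes the dmabuf */
                return -1;

        /* No automatic remap: userspace re-establishes the mapping. */
        return map_dmabuf_slice(iommufd, ioas_id, dmabuf_fd, 0, length, iova);
}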
Since DMABUF is missing some key general features for this use case it
relies on a "private interconnect" between VFIO and iommufd via the
vfio_pci_dma_buf_iommufd_map() call.
The call confirms the DMABUF has revoke semantics and delivers a phys_addr
for the memory suitable for use with iommu_map().
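A purely hypothetical consumer-side sketch (the real signature is defined
by the series; the single-range return shape here is an assumption):

#include <linux/dma-buf.h>
#include <linux/iommu.h>

/* Assumed shape of the private interconnect call, for illustration. */
int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attach,
                                 phys_addr_t *phys, size_t *len);

static int iommufd_map_vfio_dmabuf(struct iommu_domain *domain,
                                   struct dma_buf_attachment *attach,
                                   unsigned long iova)
{
        phys_addr_t phys;
        size_t len;
        int ret;

        /* Confirms revoke semantics and yields a mappable phys_addr. */
        ret = vfio_pci_dma_buf_iommufd_map(attach, &phys, &len);
        if (ret)
                return ret;

        return iommu_map(domain, iova, phys, len,
                         IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
}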
Medium term there is a desire to expand the supported DMABUFs to include
GPU drivers to support DPDK/SPDK type use cases so future series will work
to add a general concept of revoke and a general negotiation of
interconnect to remove vfio_pci_dma_buf_iommufd_map().
I also plan another series to modify iommufd's vfio_compat to
transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI
of type1.
The latest series for interconnect negotiation to exchange a phys_addr is:
https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com
And the discussion for design of revoke is here:
https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/
This is on github: https://github.com/jgunthorpe/linux/commits/iommufd_dmabuf
The branch has various modifications to Leon's series I've suggested.
Jason Gunthorpe (8):
iommufd: Add DMABUF to iopt_pages
iommufd: Do not map/unmap revoked DMABUFs
iommufd: Allow a DMABUF to be revoked
iommufd: Allow MMIO pages in a batch
iommufd: Have pfn_reader process DMABUF iopt_pages
iommufd: Have iopt_map_file_pages convert the fd to a file
iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
iommufd/selftest: Add some tests for the dmabuf flow
drivers/iommu/iommufd/io_pagetable.c | 74 +++-
drivers/iommu/iommufd/io_pagetable.h | 53 ++-
drivers/iommu/iommufd/ioas.c | 8 +-
drivers/iommu/iommufd/iommufd_private.h | 13 +-
drivers/iommu/iommufd/iommufd_test.h | 10 +
drivers/iommu/iommufd/main.c | 10 +
drivers/iommu/iommufd/pages.c | 407 ++++++++++++++++--
drivers/iommu/iommufd/selftest.c | 142 ++++++
tools/testing/selftests/iommu/iommufd.c | 43 ++
tools/testing/selftests/iommu/iommufd_utils.h | 44 ++
10 files changed, 741 insertions(+), 63 deletions(-)
base-commit: fc882154e421f82677925d33577226e776bb07a4
--
2.43.0
This series adds AF_XDP zero copy support to the icssg driver.
Tests were performed on AM64x-EVM with the xdpsock application [1].
A clear improvement is seen in transmit (txonly) and receive (rxdrop)
for 64 byte packets. The 1500 byte test seems to be limited by line
rate (1G link), so no improvement in packet rate is seen there.
There is an issue with l2fwd: the benchmarking numbers show 0 for
64 byte packets after forwarding the first batch of packets, and I am
currently looking into it.
AF_XDP performance using 64 byte packets in Kpps.
Benchmark: XDP-SKB XDP-Native XDP-Native(ZeroCopy)
rxdrop 259 462 645
txonly 350 354 760
l2fwd 178 240 0
AF_XDP performance using 1500 byte packets in Kpps.
Benchmark: XDP-SKB XDP-Native XDP-Native(ZeroCopy)
rxdrop 82 82 82
txonly 81 82 82
l2fwd 81 82 82
[1]: https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-example
v3: https://lore.kernel.org/all/20251014105613.2808674-1-m-malladi@ti.com/
v3 -> v4:
- Rebased to the latest tip
Meghana Malladi (6):
net: ti: icssg-prueth: Add functions to create and destroy Rx/Tx
queues
net: ti: icssg-prueth: Add XSK pool helpers
net: ti: icssg-prueth: Add AF_XDP zero copy for TX
net: ti: icssg-prueth: Make emac_run_xdp function independent of page
net: ti: icssg-prueth: Add AF_XDP zero copy for RX
net: ti: icssg-prueth: Enable zero copy in XDP features
drivers/net/ethernet/ti/icssg/icssg_common.c | 471 ++++++++++++++++---
drivers/net/ethernet/ti/icssg/icssg_prueth.c | 394 +++++++++++++---
drivers/net/ethernet/ti/icssg/icssg_prueth.h | 25 +-
3 files changed, 741 insertions(+), 149 deletions(-)
base-commit: d550d63d0082268a31e93a10c64cbc2476b98b24
--
2.43.0
The Mesa issue referenced below pointed out a possible deadlock:
[ 1231.611031] Possible interrupt unsafe locking scenario:
[ 1231.611033] CPU0 CPU1
[ 1231.611034] ---- ----
[ 1231.611035] lock(&xa->xa_lock#17);
[ 1231.611038] local_irq_disable();
[ 1231.611039] lock(&fence->lock);
[ 1231.611041] lock(&xa->xa_lock#17);
[ 1231.611044] <Interrupt>
[ 1231.611045] lock(&fence->lock);
[ 1231.611047]
*** DEADLOCK ***
In this example, CPU0 would be any function accessing job->dependencies
through the xa_* functions that doesn't disable interrupts (e.g.
drm_sched_job_add_dependency, drm_sched_entity_kill_jobs_cb).
CPU1 is executing drm_sched_entity_kill_jobs_cb as a fence signalling
callback, and therefore in interrupt context. It will deadlock when
trying to grab the xa_lock, which is already held by CPU0.
Replacing all xa_* usage with their xa_*_irq counterparts would fix
this issue, but Christian pointed out another one: dma_fence_signal()
takes fence.lock, and so does dma_fence_add_callback().
dma_fence_signal() // locks f1.lock
-> drm_sched_entity_kill_jobs_cb()
-> foreach dependencies
-> dma_fence_add_callback() // locks f2.lock
This will deadlock if f1 and f2 share the same spinlock.
To fix both issues, the code iterating on dependencies and re-arming them
is moved out to drm_sched_entity_kill_jobs_work.
Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov(a)gmail.com>
Suggested-by: Christian König <christian.koenig(a)amd.com>
Reviewed-by: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer(a)amd.com>
---
drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++-----------
1 file changed, 19 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index c8e949f4a568..fe174a4857be 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
}
EXPORT_SYMBOL(drm_sched_entity_error);
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+ struct dma_fence_cb *cb);
+
static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
{
struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
-
- drm_sched_fence_scheduled(job->s_fence, NULL);
- drm_sched_fence_finished(job->s_fence, -ESRCH);
- WARN_ON(job->s_fence->parent);
- job->sched->ops->free_job(job);
-}
-
-/* Signal the scheduler finished fence when the entity in question is killed. */
-static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
- struct dma_fence_cb *cb)
-{
- struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
- finish_cb);
+ struct dma_fence *f;
unsigned long index;
- dma_fence_put(f);
-
/* Wait for all dependencies to avoid data corruptions */
xa_for_each(&job->dependencies, index, f) {
struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
@@ -220,6 +209,21 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
dma_fence_put(f);
}
+ drm_sched_fence_scheduled(job->s_fence, NULL);
+ drm_sched_fence_finished(job->s_fence, -ESRCH);
+ WARN_ON(job->s_fence->parent);
+ job->sched->ops->free_job(job);
+}
+
+/* Signal the scheduler finished fence when the entity in question is killed. */
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+ struct dma_fence_cb *cb)
+{
+ struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
+ finish_cb);
+
+ dma_fence_put(f);
+
INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
schedule_work(&job->work);
}
--
2.43.0
Changelog:
v5:
* Rebased on top of v6.18-rc1.
* Added more validation logic to make sure that DMA-BUF length doesn't
overflow in various scenarios.
* Hide the kernel config from users.
* Fixed type conversion issue. DMA ranges are exposed with u64 length,
but DMA-BUF uses "unsigned int" as a length for SG entries.
* Added a check so that VFIO drivers which report a BAR size
  different from PCI cannot use the DMA-BUF functionality.
v4: https://lore.kernel.org/all/cover.1759070796.git.leon@kernel.org
* Split pcim_p2pdma_provider() to two functions, one that initializes
array of providers and another to return right provider pointer.
v3: https://lore.kernel.org/all/cover.1758804980.git.leon@kernel.org
* Changed pcim_p2pdma_enable() to be pcim_p2pdma_provider().
* Cache provider in vfio_pci_dma_buf struct instead of BAR index.
* Removed misleading comment from pcim_p2pdma_provider().
* Moved MMIO check to be in pcim_p2pdma_provider().
v2: https://lore.kernel.org/all/cover.1757589589.git.leon@kernel.org/
* Added an extra patch which adds a new CONFIG, so subsequent patches
  can reuse it.
* Squashed "PCI/P2PDMA: Remove redundant bus_offset from map state"
into the other patch.
* Fixed revoke calls to be aligned with true->false semantics.
* Extended p2pdma_providers to be per-BAR and not global to the whole
  device.
* Fixed possible race between dmabuf states and revoke.
* Moved revoke to PCI BAR zap block.
v1: https://lore.kernel.org/all/cover.1754311439.git.leon@kernel.org
* Changed commit messages.
* Reused DMA_ATTR_MMIO attribute.
* Returned support for multiple DMA ranges per-DMABUF.
v0: https://lore.kernel.org/all/cover.1753274085.git.leonro@nvidia.com
---------------------------------------------------------------------------
Based on "[PATCH v6 00/16] dma-mapping: migrate to physical address-based API"
https://lore.kernel.org/all/cover.1757423202.git.leonro@nvidia.com/ series.
---------------------------------------------------------------------------
This series extends the VFIO PCI subsystem to support exporting MMIO
regions from PCI device BARs as dma-buf objects, enabling safe sharing of
non-struct page memory with controlled lifetime management. This allows RDMA
and other subsystems to import dma-buf FDs and build them into memory regions
for PCI P2P operations.
The series supports a use case for SPDK where an NVMe device will be
owned by SPDK through VFIO but interacting with a RDMA device. The RDMA
device may directly access the NVMe CMB or directly manipulate the NVMe
device's doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with
VFIO. This dmabuf approach is also usable by iommufd for generic
and safe P2P mappings.
In addition to the SPDK use-case mentioned above, the capability added
in this patch series can also be useful when a buffer (located in device
memory such as VRAM) needs to be shared between any two dGPU devices or
instances (assuming one of them is bound to VFIO PCI) as long as they
are P2P DMA compatible.
The implementation provides a revocable attachment mechanism using dma-buf
move operations. MMIO regions are normally pinned as BARs don't change
physical addresses, but access is revoked when the VFIO device is closed
or a PCI reset is issued. This ensures kernel self-defense against
potentially hostile userspace.
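As a sketch, the revoke described above maps onto the stock dma-buf move
machinery roughly as follows (exporter bookkeeping omitted; illustrative,
not the exact code from the series):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

/* Called when the device is closed or a PCI reset zaps the BARs. */
static void vfio_dmabuf_revoke(struct dma_buf *dmabuf)
{
        dma_resv_lock(dmabuf->resv, NULL);
        /* Importers drop their mappings in their move_notify callbacks. */
        dma_buf_move_notify(dmabuf);
        dma_resv_unlock(dmabuf->resv);
}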
The series includes significant refactoring of the PCI P2PDMA subsystem
to separate core P2P functionality from memory allocation features,
making it more modular and suitable for VFIO use cases that don't need
struct page support.
-----------------------------------------------------------------------
The series is based originally on
https://lore.kernel.org/all/20250307052248.405803-1-vivek.kasireddy@intel.c…
but heavily rewritten to be based on DMA physical API.
-----------------------------------------------------------------------
The WIP branch can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=…
Thanks
Leon Romanovsky (7):
PCI/P2PDMA: Separate the mmap() support from the core logic
PCI/P2PDMA: Simplify bus address mapping API
PCI/P2PDMA: Refactor to separate core P2P functionality from memory
allocation
PCI/P2PDMA: Export pci_p2pdma_map_type() function
types: move phys_vec definition to common header
vfio/pci: Enable peer-to-peer DMA transactions by default
vfio/pci: Add dma-buf export support for MMIO regions
Vivek Kasireddy (2):
vfio: Export vfio device get and put registration helpers
vfio/pci: Share the core device pointer while invoking feature
functions
block/blk-mq-dma.c | 7 +-
drivers/iommu/dma-iommu.c | 4 +-
drivers/pci/p2pdma.c | 175 ++++++++---
drivers/vfio/pci/Kconfig | 3 +
drivers/vfio/pci/Makefile | 2 +
drivers/vfio/pci/vfio_pci_config.c | 22 +-
drivers/vfio/pci/vfio_pci_core.c | 63 ++--
drivers/vfio/pci/vfio_pci_dmabuf.c | 446 +++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_priv.h | 23 ++
drivers/vfio/vfio_main.c | 2 +
include/linux/pci-p2pdma.h | 120 +++++---
include/linux/types.h | 5 +
include/linux/vfio.h | 2 +
include/linux/vfio_pci_core.h | 1 +
include/uapi/linux/vfio.h | 25 ++
kernel/dma/direct.c | 4 +-
mm/hmm.c | 2 +-
17 files changed, 785 insertions(+), 121 deletions(-)
create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
--
2.51.0
On Wed, Oct 29, 2025, at 18:50, Alex Mastro wrote:
> On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
>> + /*
>> + * dma_buf_fd() consumes the reference, when the file closes the dmabuf
>> + * will be released.
>> + */
>> + return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
>
> I think this still needs to unwind state on fd allocation error. Reference
> ownership is only transferred on success.
Yes, you are correct, I need to call dma_buf_put() in case of error. I will fix it.
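Something along these lines, reusing the names from the quoted hunk
(untested sketch):

        int fd = dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);

        if (fd < 0)
                dma_buf_put(priv->dmabuf); /* ownership not transferred */
        return fd;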
Thanks
>
>> +
>> +err_dev_put:
>> + vfio_device_put_registration(&vdev->vdev);
>> +err_free_phys:
>> + kfree(priv->phys_vec);
>> +err_free_priv:
>> + kfree(priv);
>> +err_free_ranges:
>> + kfree(dma_ranges);
>> + return ret;
>> +}