On 3/12/26 19:45, Matt Evans wrote:
Hi all,
There were various suggestions in the September 2025 thread "[TECH TOPIC] vfio, iommufd: Enabling user space drivers to vend more granular access to client processes" [0], and LPC discussions, around improving the situation for multi-process userspace driver designs. This RFC series implements some of these ideas.
(Thanks for feedback on v1! Revised series, with changes noted inline.)
Background: Multi-process USDs
The userspace driver scenario discussed in that thread involves a primary process driving a PCIe function through VFIO/iommufd, which manages the function-wide ownership/lifecycle. The function is designed to provide multiple distinct programming interfaces (for example, several independent MMIO register frames in one function), and the primary process delegates control of these interfaces to multiple independent client processes (which do the actual work). This scenario clearly relies on a HW design that provides appropriate isolation between the programming interfaces.
The two key needs are:
1. Mechanisms to safely delegate a subset of the device MMIO resources to a client process without over-sharing wider access (or influence over whole-device activities, such as reset).
2. Mechanisms to allow a client process to do its own iommufd management w.r.t. its address space, in a way that's isolated from DMA relating to other clients.
mmap() of VFIO DMABUFs
This RFC addresses #1 in "vfio/pci: Support mmap() of a VFIO DMABUF", implementing the proposals in [0] to add mmap() support to the existing VFIO DMABUF exporter.
This enables a userspace driver to define DMABUF ranges corresponding to sub-ranges of a BAR, and grant a given client (via a shared fd) the capability to access (only) those sub-ranges. The VFIO device fds would be kept private to the primary process. All the client can do with that fd is map (or iomap via iommufd) that specific subset of resources, and the impact of bugs/malice is contained.
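To make the idea of "DMABUF ranges corresponding to sub-ranges of a BAR" concrete, here is a minimal userspace sketch of the bookkeeping involved. It is not code from the series: struct model_range, MODEL_PAGE_SIZE and both helpers are hypothetical names, and the sketch assumes that the ranges of one DMABUF are laid out back to back in the mapping, in the order given. It validates a set of ranges against a BAR size and translates a page offset within the DMABUF (the vm_pgoff-style index) into a byte offset within the BAR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* Hypothetical stand-in for one exported sub-range of a BAR. */
struct model_range {
	uint64_t offset; /* byte offset into the BAR */
	uint64_t length; /* byte length of the sub-range */
};

/* Every range must be non-empty, page-aligned and inside the BAR. */
static bool model_ranges_valid(const struct model_range *r, size_t n,
			       uint64_t bar_size)
{
	for (size_t i = 0; i < n; i++) {
		if (r[i].length == 0 ||
		    r[i].offset % MODEL_PAGE_SIZE ||
		    r[i].length % MODEL_PAGE_SIZE)
			return false;
		/* Written this way to avoid overflow in offset + length. */
		if (r[i].offset > bar_size ||
		    r[i].length > bar_size - r[i].offset)
			return false;
	}
	return true;
}

/*
 * Translate a page offset within the buffer to a byte offset in the BAR,
 * treating the ranges as concatenated in order (an assumption of this
 * model). Returns false if pgoff lies beyond the end of the buffer.
 */
static bool model_pgoff_to_bar(const struct model_range *r, size_t n,
			       uint64_t pgoff, uint64_t *bar_off)
{
	uint64_t byte = pgoff * MODEL_PAGE_SIZE;

	for (size_t i = 0; i < n; i++) {
		if (byte < r[i].length) {
			*bar_off = r[i].offset + byte;
			return true;
		}
		byte -= r[i].length;
	}
	return false;
}
```

With two ranges {0x1000, 0x2000} and {0x8000, 0x1000}, page 2 of the buffer lands at BAR offset 0x8000, and page 3 is out of bounds. A client holding only the DMABUF fd can never reach BAR offsets outside the listed ranges.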
(We'll follow up on #2 separately, as a related-but-distinct problem. PASIDs are one way to achieve per-client isolation of DMA; another could be sharing of a single IOVA space via 'constrained' iommufds.)
New in v2: To achieve this, the existing VFIO BAR mmap() path is converted to use DMABUFs behind the scenes, in "vfio/pci: Convert BAR mmap() to use a DMABUF" plus new helper functions, as Jason/Christian suggested in the v1 discussion [3].
This means:
Both regular and new DMABUF BAR mappings share the same vm_ops, i.e. mmap()ing DMABUFs is a smaller change on top of the existing mmap().
The zapping of mappings occurs via vfio_pci_dma_buf_move(), and the vfio_pci_zap_bars() originally paired with the _move()s can go away. Each DMABUF has a unique address_space.
It's a step towards future iommufd VFIO Type1 emulation implementing P2P, since iommufd can now get a DMABUF from a VA that it's mapping for IO; the VMAs' vm_file is that of the backing DMABUF.
Revocation/reclaim
Mapping a BAR subset is useful, but the lifetime of access granted to a client needs to be managed well. For example, a protocol between the primary process and the client can indicate when the client is done, and when it's safe to reuse the resources elsewhere, but cleanup can't practically be cooperative.
For robustness, we enable the driver to make the resources guaranteed-inaccessible when it chooses, so that it can re-assign them to other uses in future.
"vfio/pci: Permanently revoke a DMABUF on request" adds a new VFIO device fd ioctl, VFIO_DEVICE_PCI_DMABUF_REVOKE. This takes a DMABUF fd parameter previously exported (from that device!) and permanently revokes the DMABUF. This notifies/detaches importers, zaps PTEs for any mappings, and guarantees no future attachment/import/map/access is possible by any means.
A primary driver process would use this operation when the client's tenure ends to reclaim "loaned-out" MMIO interfaces, at which point the interfaces could be safely re-used.
New in v2: ioctl() on VFIO device fd, rather than DMABUF fd. A DMABUF is revoked using code common to vfio_pci_dma_buf_move(), selectively zapping mappings (after waiting for completion on the dma_buf_invalidate_mappings() request).
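The intended revocation semantics can be summarised as a small state machine. The following is a userspace toy model only, not the kernel implementation: struct model_dmabuf and the model_*() helpers are hypothetical, and the -ENODEV return value is an arbitrary choice for the model rather than the error the real code reports. It shows the three properties described above: importers are detached, populated PTEs are zapped, and nothing can attach or fault afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model of a revocable export; not kernel code. */
struct model_dmabuf {
	bool revoked;             /* set once, never cleared */
	unsigned int attachments; /* live importer attachments */
	unsigned int ptes;        /* populated CPU PTEs */
};

/* An importer attaching to the buffer. */
static int model_attach(struct model_dmabuf *b)
{
	if (b->revoked)
		return -ENODEV; /* model-only error value */
	b->attachments++;
	return 0;
}

/* A CPU fault on an existing mapping populating one PTE. */
static int model_fault(struct model_dmabuf *b)
{
	if (b->revoked)
		return -ENODEV; /* model-only error value */
	b->ptes++;
	return 0;
}

/* Permanent revoke: detach importers, zap PTEs, refuse future use. */
static void model_revoke(struct model_dmabuf *b)
{
	b->revoked = true;
	b->attachments = 0;
	b->ptes = 0;
}
```

Revoking twice is harmless in this model, and there is deliberately no "un-revoke" helper: once revoked, the primary process would export a fresh DMABUF to lend the same resources to a new client.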
BAR mapping access attributes
Inspired by Alex [Mastro] and Jason's comments in [0] and Mahmoud's work in [1] with the goal of controlling CPU access attributes for VFIO BAR mappings (e.g. WC), we can decorate DMABUFs with access attributes that are then used by a mapping's PTEs.
I've proposed reserving a field in struct vfio_device_feature_dma_buf's flags to specify an attribute for its ranges. Although that keeps the (UAPI) struct unchanged, it means all ranges in a DMABUF share the same attribute. I feel a single attribute-to-mmap() relation is logical/reasonable. An application can also create multiple DMABUFs to describe any BAR layout and mix of attributes.
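As an illustration of carrying one attribute per DMABUF in a flags word, here is a small sketch. The bit layout, the MODEL_* names and the attribute values are all hypothetical and are not the UAPI values from the patch; only the general pattern (a reserved bitfield inside an existing 32-bit flags member, with unknown bits rejected) is the point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: low 4 bits of flags hold the attribute. */
#define MODEL_ATTR_SHIFT 0u
#define MODEL_ATTR_MASK  (0xfu << MODEL_ATTR_SHIFT)

enum model_attr {
	MODEL_ATTR_DEFAULT = 0, /* same as a plain BAR mmap() */
	MODEL_ATTR_WC = 1,      /* write-combining */
};

/* Store an attribute in the flags word, leaving other bits untouched. */
static uint32_t model_flags_set_attr(uint32_t flags, enum model_attr a)
{
	return (flags & ~MODEL_ATTR_MASK) |
	       (((uint32_t)a << MODEL_ATTR_SHIFT) & MODEL_ATTR_MASK);
}

/*
 * Decode the attribute. Unknown flag bits and unknown attribute values
 * are rejected, so that they stay available for future extensions.
 */
static bool model_flags_get_attr(uint32_t flags, enum model_attr *out)
{
	uint32_t v = (flags & MODEL_ATTR_MASK) >> MODEL_ATTR_SHIFT;

	if (flags & ~MODEL_ATTR_MASK)
		return false;
	if (v > MODEL_ATTR_WC)
		return false;
	*out = (enum model_attr)v;
	return true;
}
```

Because the attribute lives in the per-DMABUF flags, a BAR needing both default and WC regions would be described by two DMABUFs, one per attribute.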
Tests
(Still sharing the [RFC ONLY] userspace test/demo program for context, not for merge.)
It illustrates & tests various map/revoke cases, but doesn't use the existing VFIO selftests and relies on a (tweaked) QEMU EDU function. I'm (still) working on integrating the scenarios into the existing VFIO selftests.
This code has been tested with DMABUFs of single/multiple ranges, aliasing mmap()s, aliasing ranges across DMABUFs, vm_pgoff > 0, revocation, and shutdown/cleanup scenarios; hugepage mappings also seem to work correctly. I've also lightly tested WC mappings (by observing that the resulting PTEs have the correct attributes...).
Fin
v2 is based on next-20260310 (to build on Leon's recent series "vfio: Wait for dma-buf invalidation to complete" [2]).
Please share your thoughts! I'd like to de-RFC if we feel this approach is now fair.
I only skimmed over it, but at least offhand I couldn't find anything fundamentally wrong.
The locking order seems to change in patch #6. In general I strongly recommend enabling lockdep while testing anyway, but especially when I see such changes.
In addition, it might also be a good idea to have a lockdep initcall function which defines the locking order that all the VFIO code should follow.
See function dma_resv_lockdep() for an example of how to do that. Especially with mmap support and all the locks involved with that, it has proven to be good practice to have something like that.
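To illustrate why priming the order once at init helps, here is a toy userspace model of an order validator. It is not lockdep and not kernel code: the model_* names and the three lock classes are hypothetical, the actual VFIO locks and their intended order are not stated here, and the model only tracks pairwise "held while taking" relations rather than a full dependency graph.

```c
#include <assert.h>
#include <stdbool.h>

enum { MODEL_NLOCKS = 3 };

/* Hypothetical lock classes; stand-ins for whatever locks are involved. */
enum model_lock { MODEL_LOCK_A, MODEL_LOCK_B, MODEL_LOCK_C };

struct model_validator {
	/* before[x][y]: x was held at some point while y was taken */
	bool before[MODEL_NLOCKS][MODEL_NLOCKS];
	bool held[MODEL_NLOCKS];
};

/* Returns false if taking l would invert an ordering recorded earlier. */
static bool model_acquire(struct model_validator *v, enum model_lock l)
{
	if (v->held[l])
		return false; /* recursive acquire not modelled */
	for (int h = 0; h < MODEL_NLOCKS; h++)
		if (v->held[h] && v->before[l][h])
			return false; /* l -> h known, now h -> l */
	for (int h = 0; h < MODEL_NLOCKS; h++)
		if (v->held[h])
			v->before[h][l] = true;
	v->held[l] = true;
	return true;
}

static void model_release(struct model_validator *v, enum model_lock l)
{
	v->held[l] = false;
}

/* Priming: take every lock once, nested, in the documented order. */
static void model_prime(struct model_validator *v)
{
	model_acquire(v, MODEL_LOCK_A);
	model_acquire(v, MODEL_LOCK_B);
	model_acquire(v, MODEL_LOCK_C);
	model_release(v, MODEL_LOCK_C);
	model_release(v, MODEL_LOCK_B);
	model_release(v, MODEL_LOCK_A);
}
```

After model_prime(), taking B while holding C is reported immediately, even if no code path has ever exercised the correct A, B, C nesting at runtime. That is the benefit of declaring the order up front: an inversion is caught the first time it runs, rather than only when both orders happen to be observed.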
Regards, Christian.
Many thanks,
Matt
References:
Changelog:
v2: Respin based on the feedback/suggestions:
Transform the existing VFIO BAR mmap path to also use DMABUFs behind the scenes, and then simply share that code for explicitly-mapped DMABUFs.
Refactor the export itself out of vfio_pci_core_feature_dma_buf, so that it is shared with a new vfio_pci_core_mmap_prep_dmabuf helper used by the regular VFIO mmap to create a DMABUF.
Revoke buffers using a VFIO device fd ioctl
v1: https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/
Matt Evans (10):
  vfio/pci: Set up VFIO barmap before creating a DMABUF
  vfio/pci: Clean up DMABUFs before disabling function
  vfio/pci: Add helper to look up PFNs for DMABUFs
  vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
  vfio/pci: Convert BAR mmap() to use a DMABUF
  vfio/pci: Remove vfio_pci_zap_bars()
  vfio/pci: Support mmap() of a VFIO DMABUF
  vfio/pci: Permanently revoke a DMABUF on request
  vfio/pci: Add mmap() attributes to DMABUF feature
  [RFC ONLY] selftests: vfio: Add standalone vfio_dmabuf_mmap_test
 drivers/vfio/pci/Kconfig                      |   3 +-
 drivers/vfio/pci/Makefile                     |   3 +-
 drivers/vfio/pci/vfio_pci_config.c            |  18 +-
 drivers/vfio/pci/vfio_pci_core.c              | 123 +--
 drivers/vfio/pci/vfio_pci_dmabuf.c            | 425 +++++++--
 drivers/vfio/pci/vfio_pci_priv.h              |  46 +-
 include/uapi/linux/vfio.h                     |  42 +-
 tools/testing/selftests/vfio/Makefile         |   1 +
 .../vfio/standalone/vfio_dmabuf_mmap_test.c   | 837 ++++++++++++++++++
 9 files changed, 1339 insertions(+), 159 deletions(-)
 create mode 100644 tools/testing/selftests/vfio/standalone/vfio_dmabuf_mmap_test.c