From: Leon Romanovsky <leonro(a)nvidia.com>
---------------------------------------------------------------------------
Based on blk and DMA patches which will be sent during coming merge window.
---------------------------------------------------------------------------
This series extends the VFIO PCI subsystem to support exporting MMIO regions
from PCI device BARs as dma-buf objects, enabling safe sharing of non-struct
page memory with controlled lifetime management. This allows RDMA and other
subsystems to import dma-buf FDs and build them into memory regions for PCI
P2P operations.
The series supports a use case for SPDK where a NVMe device will be owned
by SPDK through VFIO but interacting with a RDMA device. The RDMA device
may directly access the NVMe CMB or directly manipulate the NVMe device's
doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with
VFIO. This dmabuf approach can be usable by iommufd as well for generic
and safe P2P mappings.
In addition to the SPDK use-case mentioned above, the capability added
in this patch series can also be useful when a buffer (located in device
memory such as VRAM) needs to be shared between any two dGPU devices or
instances (assuming one of them is bound to VFIO PCI) as long as they
are P2P DMA compatible.
The implementation provides a revocable attachment mechanism using dma-buf
move operations. MMIO regions are normally pinned as BARs don't change
physical addresses, but access is revoked when the VFIO device is closed
or a PCI reset is issued. This ensures kernel self-defense against
potentially hostile userspace.
The series includes significant refactoring of the PCI P2PDMA subsystem
to separate core P2P functionality from memory allocation features,
making it more modular and suitable for VFIO use cases that don't need
struct page support.
-----------------------------------------------------------------------
This is based on
https://lore.kernel.org/all/20250307052248.405803-1-vivek.kasireddy@intel.c…
but heavily rewritten to be based on DMA physical API.
-----------------------------------------------------------------------
The WIP branch can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=…
Thanks
Leon Romanovsky (8):
PCI/P2PDMA: Remove redundant bus_offset from map state
PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner
abstraction
PCI/P2PDMA: Simplify bus address mapping API
PCI/P2PDMA: Refactor to separate core P2P functionality from memory
allocation
PCI/P2PDMA: Export pci_p2pdma_map_type() function
types: move phys_vec definition to common header
vfio/pci: Enable peer-to-peer DMA transactions by default
vfio/pci: Add dma-buf export support for MMIO regions
Vivek Kasireddy (2):
vfio: Export vfio device get and put registration helpers
vfio/pci: Share the core device pointer while invoking feature
functions
block/blk-mq-dma.c | 7 +-
drivers/iommu/dma-iommu.c | 4 +-
drivers/pci/p2pdma.c | 144 +++++++++----
drivers/vfio/pci/Kconfig | 20 ++
drivers/vfio/pci/Makefile | 2 +
drivers/vfio/pci/vfio_pci_config.c | 22 +-
drivers/vfio/pci/vfio_pci_core.c | 59 ++++--
drivers/vfio/pci/vfio_pci_dmabuf.c | 321 +++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_priv.h | 23 +++
drivers/vfio/vfio_main.c | 2 +
include/linux/dma-buf.h | 1 +
include/linux/pci-p2pdma.h | 114 +++++-----
include/linux/types.h | 5 +
include/linux/vfio.h | 2 +
include/linux/vfio_pci_core.h | 4 +
include/uapi/linux/vfio.h | 19 ++
kernel/dma/direct.c | 4 +-
mm/hmm.c | 2 +-
18 files changed, 631 insertions(+), 124 deletions(-)
create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
--
2.50.1
On Mon, Jul 28, 2025 at 11:07:34AM -0600, Logan Gunthorpe wrote:
>
>
> On 2025-07-28 10:41, Leon Romanovsky wrote:
> > On Mon, Jul 28, 2025 at 10:12:31AM -0600, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2025-07-27 13:05, Jason Gunthorpe wrote:
> >>> On Fri, Jul 25, 2025 at 10:30:46AM -0600, Logan Gunthorpe wrote:
> >>>>
> >>>>
> >>>> On 2025-07-24 02:13, Leon Romanovsky wrote:
> >>>>> On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> >>>>>> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> >>>>>>> From: Leon Romanovsky <leonro(a)nvidia.com>
> >>>>>>>
> >>>>>>> Export the pci_p2pdma_map_type() function to allow external modules
> >>>>>>> and subsystems to determine the appropriate mapping type for P2PDMA
> >>>>>>> transfers between a provider and target device.
> >>>>>>
> >>>>>> External modules have no business doing this.
> >>>>>
> >>>>> VFIO PCI code is built as module. There is no way to access PCI p2p code
> >>>>> without exporting functions in it.
> >>>>
> >>>> The solution that would make more sense to me would be for either
> >>>> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
> >>>> P2PDMA case.
> >>>
> >>> This has nothing to do with dma-iommu.c, the decisions here still need
> >>> to be made even if dma-iommu.c is not compiled in.
> >>
> >> Doesn't it though? Every single call in patch 10 to the newly exported
> >> PCI functions calls into the the dma-iommu functions.
Patch 10 has lots of flows, only one will end up in dma-iommu.c
vfio_pci_dma_buf_map() calls pci_p2pdma_bus_addr_map(),
dma_iova_link(), dma_map_phys().
Only iova_link would call to dma-iommu.c - if dma_map_phys() is called
we know that dma-iommu.c won't be called by it.
> >> If there were non-iommu paths then I would expect the code would
> >> use the regular DMA api directly which would then call in to
> >> dma-iommu.
> >
> > If p2p type is PCI_P2PDMA_MAP_BUS_ADDR, there will no dma-iommu and DMA
> > at all.
>
> I understand that and it is completely beside my point.
>
> If the dma mapping for P2P memory doesn't need to create an iommu
> mapping then that's fine. But it should be the dma-iommu layer to decide
> that.
So above, we can't use dma-iommu.c, it might not be compiled into the
kernel but the dma_map_phys() path is still valid.
> It's not a decision that should be made by every driver doing this
> kind of thing.
Sort of, I think we are trying to get to some place where there are
subsystem, or at least data structure specific helpers that do this
(ie nvme has BIO helpers), but the helpers should be running this
logic directly for performance. Leon hasn't done it but I think we
should see helpers for DMABUF too encapsulating the logic shown in
patch 10. I think we need to prove it out these basic points first
before trying to go and convert a bunch of GPU drivers.
The vfio in patch 10 is not the full example since it only has a
single scatter/gather" effectively, but the generalized version loops
over pci_p2pdma_bus_addr_map(), dma_iova_link(), dma_map_phys() for
each page.
Part of the new API design is to only do one kind of mapping operation
at once, and part of the design is we know that the P2P type is fixed.
It makes no performance sense to check the type inside the
pci_p2pdma_bus_addr_map()/ dma_iova_link()/dma_map_phys() within the
per-page loop.
I do think some level of abstraction has been lost here in pursuit of
performance. If someone does have a better way to structure this
without a performance hit then fantastic, but thats going back and
revising the new DMA API. This just builds on top of that, and yes, it
is not so abstract.
Jason
On Mon, Jul 28, 2025 at 10:12:31AM -0600, Logan Gunthorpe wrote:
>
>
> On 2025-07-27 13:05, Jason Gunthorpe wrote:
> > On Fri, Jul 25, 2025 at 10:30:46AM -0600, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2025-07-24 02:13, Leon Romanovsky wrote:
> >>> On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> >>>> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> >>>>> From: Leon Romanovsky <leonro(a)nvidia.com>
> >>>>>
> >>>>> Export the pci_p2pdma_map_type() function to allow external modules
> >>>>> and subsystems to determine the appropriate mapping type for P2PDMA
> >>>>> transfers between a provider and target device.
> >>>>
> >>>> External modules have no business doing this.
> >>>
> >>> VFIO PCI code is built as module. There is no way to access PCI p2p code
> >>> without exporting functions in it.
> >>
> >> The solution that would make more sense to me would be for either
> >> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
> >> P2PDMA case.
> >
> > This has nothing to do with dma-iommu.c, the decisions here still need
> > to be made even if dma-iommu.c is not compiled in.
>
> Doesn't it though? Every single call in patch 10 to the newly exported
> PCI functions calls into the the dma-iommu functions. If there were
> non-iommu paths then I would expect the code would use the regular DMA
> api directly which would then call in to dma-iommu.
If p2p type is PCI_P2PDMA_MAP_BUS_ADDR, there will no dma-iommu and DMA
at all.
+static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
+ struct dma_buf_attachment *attachment)
+{
+ struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+ if (!attachment->peer2peer)
+ return -EOPNOTSUPP;
+
+ if (priv->revoked)
+ return -ENODEV;
+
+ switch (pci_p2pdma_map_type(priv->vdev->provider, attachment->dev)) {
+ case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+ break;
+ case PCI_P2PDMA_MAP_BUS_ADDR:
+ /*
+ * There is no need in IOVA at all for this flow.
+ * We rely on attachment->priv == NULL as a marker
+ * for this mode.
+ */
+ return 0;
+ default:
+ return -EINVAL;
+ }
+
+ attachment->priv = kzalloc(sizeof(struct dma_iova_state), GFP_KERNEL);
+ if (!attachment->priv)
+ return -ENOMEM;
+
+ dma_iova_try_alloc(attachment->dev, attachment->priv, 0, priv->phys_vec.len);
+ return 0;
+}
We've discussed a number of times of how some heap names are bad, but
not really what makes a good heap name.
Let's document what we expect the heap names to look like.
Reviewed-by: Andrew Davis <afd(a)ti.com>
Reviewed-by: Bagas Sanjaya <bagasdotme(a)gmail.com>
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
Changes in v4:
- Dropped *all* the cacheable mentions
- Link to v3: https://lore.kernel.org/r/20250717-dma-buf-heap-names-doc-v3-1-d2dbb4b95ef6…
Changes in v3:
- Grammar, spelling fixes
- Remove the cacheable / uncacheable name suggestion
- Link to v2: https://lore.kernel.org/r/20250616-dma-buf-heap-names-doc-v2-1-8ae43174cdbf…
Changes in v2:
- Added justifications for each requirement / suggestions
- Added a mention and example of buffer attributes
- Link to v1: https://lore.kernel.org/r/20250520-dma-buf-heap-names-doc-v1-1-ab31f74809ee…
---
Documentation/userspace-api/dma-buf-heaps.rst | 35 +++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/Documentation/userspace-api/dma-buf-heaps.rst b/Documentation/userspace-api/dma-buf-heaps.rst
index 535f49047ce6450796bf4380c989e109355efc05..1ced2720f929432661182f1a3a88aa1ff80bd6af 100644
--- a/Documentation/userspace-api/dma-buf-heaps.rst
+++ b/Documentation/userspace-api/dma-buf-heaps.rst
@@ -21,5 +21,40 @@ following heaps:
usually created either through the kernel commandline through the
`cma` parameter, a memory region Device-Tree node with the
`linux,cma-default` property set, or through the `CMA_SIZE_MBYTES` or
`CMA_SIZE_PERCENTAGE` Kconfig options. Depending on the platform, it
might be called ``reserved``, ``linux,cma``, or ``default-pool``.
+
+Naming Convention
+=================
+
+``dma-buf`` heaps name should meet a number of constraints:
+
+- The name must be stable, and must not change from one version to the other.
+ Userspace identifies heaps by their name, so if the names ever change, we
+ would be likely to introduce regressions.
+
+- The name must describe the memory region the heap will allocate from, and
+ must uniquely identify it in a given platform. Since userspace applications
+ use the heap name as the discriminant, it must be able to tell which heap it
+ wants to use reliably if there's multiple heaps.
+
+- The name must not mention implementation details, such as the allocator. The
+ heap driver will change over time, and implementation details when it was
+ introduced might not be relevant in the future.
+
+- The name should describe properties of the buffers that would be allocated.
+ Doing so will make heap identification easier for userspace. Such properties
+ are:
+
+ - ``contiguous`` for physically contiguous buffers;
+
+ - ``protected`` for encrypted buffers not accessible the OS;
+
+- The name may describe intended usage. Doing so will make heap identification
+ easier for userspace applications and users.
+
+For example, assuming a platform with a reserved memory region located
+at the RAM address 0x42000000, intended to allocate video framebuffers,
+physically contiguous, and backed by the CMA kernel allocator, good
+names would be ``memory@42000000-contiguous`` or ``video@42000000``, but
+``cma-video`` wouldn't.
---
base-commit: 038d61fd642278bab63ee8ef722c50d10ab01e8f
change-id: 20250520-dma-buf-heap-names-doc-31261aa0cfe6
Best regards,
--
Maxime Ripard <mripard(a)kernel.org>
We've discussed a number of times of how some heap names are bad, but
not really what makes a good heap name.
Let's document what we expect the heap names to look like.
Reviewed-by: Bagas Sanjaya <bagasdotme(a)gmail.com>
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
Changes in v3:
- Grammar, spelling fixes
- Remove the cacheable / uncacheable name suggestion
- Link to v2: https://lore.kernel.org/r/20250616-dma-buf-heap-names-doc-v2-1-8ae43174cdbf…
Changes in v2:
- Added justifications for each requirement / suggestions
- Added a mention and example of buffer attributes
- Link to v1: https://lore.kernel.org/r/20250520-dma-buf-heap-names-doc-v1-1-ab31f74809ee…
---
Documentation/userspace-api/dma-buf-heaps.rst | 35 +++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/Documentation/userspace-api/dma-buf-heaps.rst b/Documentation/userspace-api/dma-buf-heaps.rst
index 535f49047ce6450796bf4380c989e109355efc05..3ee4e7961fe390ba356a2125d53b060546c3e4a6 100644
--- a/Documentation/userspace-api/dma-buf-heaps.rst
+++ b/Documentation/userspace-api/dma-buf-heaps.rst
@@ -21,5 +21,40 @@ following heaps:
usually created either through the kernel commandline through the
`cma` parameter, a memory region Device-Tree node with the
`linux,cma-default` property set, or through the `CMA_SIZE_MBYTES` or
`CMA_SIZE_PERCENTAGE` Kconfig options. Depending on the platform, it
might be called ``reserved``, ``linux,cma``, or ``default-pool``.
+
+Naming Convention
+=================
+
+``dma-buf`` heaps name should meet a number of constraints:
+
+- The name must be stable, and must not change from one version to the other.
+ Userspace identifies heaps by their name, so if the names ever change, we
+ would be likely to introduce regressions.
+
+- The name must describe the memory region the heap will allocate from, and
+ must uniquely identify it in a given platform. Since userspace applications
+ use the heap name as the discriminant, it must be able to tell which heap it
+ wants to use reliably if there's multiple heaps.
+
+- The name must not mention implementation details, such as the allocator. The
+ heap driver will change over time, and implementation details when it was
+ introduced might not be relevant in the future.
+
+- The name should describe properties of the buffers that would be allocated.
+ Doing so will make heap identification easier for userspace. Such properties
+ are:
+
+ - ``contiguous`` for physically contiguous buffers;
+
+ - ``protected`` for encrypted buffers not accessible the OS;
+
+- The name may describe intended usage. Doing so will make heap identification
+ easier for userspace applications and users.
+
+For example, assuming a platform with a reserved memory region located at the
+RAM address 0x42000000, intended to allocate video framebuffers, physically
+contiguous, and backed by the CMA kernel allocator, good names would be
+``memory@42000000-cacheable-contiguous`` or ``video@42000000``, but
+``cma-video`` wouldn't.
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250520-dma-buf-heap-names-doc-31261aa0cfe6
Best regards,
--
Maxime Ripard <mripard(a)kernel.org>
On Fri, Jul 25, 2025 at 10:30:46AM -0600, Logan Gunthorpe wrote:
>
>
> On 2025-07-24 02:13, Leon Romanovsky wrote:
> > On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> >> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky <leonro(a)nvidia.com>
> >>>
> >>> Export the pci_p2pdma_map_type() function to allow external modules
> >>> and subsystems to determine the appropriate mapping type for P2PDMA
> >>> transfers between a provider and target device.
> >>
> >> External modules have no business doing this.
> >
> > VFIO PCI code is built as module. There is no way to access PCI p2p code
> > without exporting functions in it.
>
> The solution that would make more sense to me would be for either
> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
> P2PDMA case.
This has nothing to do with dma-iommu.c, the decisions here still need
to be made even if dma-iommu.c is not compiled in.
It could be exported from the main dma code, but I think it would just
be a 1 line wrapper around the existing function? I'd rather rename
the functions and leave them in the p2pdma.c files...
Jason
On Fri, Jul 25, 2025 at 05:34:40AM +0000, Kasireddy, Vivek wrote:
> Hi Leon,
>
> > Subject: Re: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO
> > regions
> >
> > > >
> > > > From: Leon Romanovsky <leonro(a)nvidia.com>
> > > >
> > > > Add support for exporting PCI device MMIO regions through dma-buf,
> > > > enabling safe sharing of non-struct page memory with controlled
> > > > lifetime management. This allows RDMA and other subsystems to
> > import
> > > > dma-buf FDs and build them into memory regions for PCI P2P
> > operations.
> > > >
> > > > The implementation provides a revocable attachment mechanism using
> > > > dma-buf move operations. MMIO regions are normally pinned as BARs
> > > > don't change physical addresses, but access is revoked when the VFIO
> > > > device is closed or a PCI reset is issued. This ensures kernel
> > > > self-defense against potentially hostile userspace.
> > > >
> > > > Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
> > > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy(a)intel.com>
> > > > Signed-off-by: Leon Romanovsky <leonro(a)nvidia.com>
> > > > ---
> > > > drivers/vfio/pci/Kconfig | 20 ++
> > > > drivers/vfio/pci/Makefile | 2 +
> > > > drivers/vfio/pci/vfio_pci_config.c | 22 +-
> > > > drivers/vfio/pci/vfio_pci_core.c | 25 ++-
> > > > drivers/vfio/pci/vfio_pci_dmabuf.c | 321
> > +++++++++++++++++++++++++++++
> > > > drivers/vfio/pci/vfio_pci_priv.h | 23 +++
> > > > include/linux/dma-buf.h | 1 +
> > > > include/linux/vfio_pci_core.h | 3 +
> > > > include/uapi/linux/vfio.h | 19 ++
> > > > 9 files changed, 431 insertions(+), 5 deletions(-)
> > > > create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
> >
> > <...>
> >
> > > > +static int validate_dmabuf_input(struct vfio_pci_core_device *vdev,
> > > > + struct vfio_device_feature_dma_buf
> > *dma_buf)
> > > > +{
> > > > + struct pci_dev *pdev = vdev->pdev;
> > > > + u32 bar = dma_buf->region_index;
> > > > + u64 offset = dma_buf->offset;
> > > > + u64 len = dma_buf->length;
> > > > + resource_size_t bar_size;
> > > > + u64 sum;
> > > > +
> > > > + /*
> > > > + * For PCI the region_index is the BAR number like everything else.
> > > > + */
> > > > + if (bar >= VFIO_PCI_ROM_REGION_INDEX)
> > > > + return -ENODEV;
> >
> > <...>
> >
> > > > +/**
> > > > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> > > > + * regions selected.
> > > > + *
> > > > + * open_flags are the typical flags passed to open(2), eg O_RDWR,
> > > > O_CLOEXEC,
> > > > + * etc. offset/length specify a slice of the region to create the dmabuf
> > from.
> > > > + * nr_ranges is the total number of (P2P DMA) ranges that comprise the
> > > > dmabuf.
> > > Any particular reason why you dropped the option (nr_ranges) of creating
> > a
> > > single dmabuf from multiple ranges of an MMIO region?
> >
> > I did it for two reasons. First, I wanted to simplify the code in order
> > to speed-up discussion over the patchset itself. Second, I failed to
> > find justification for need of multiple ranges, as the number of BARs
> > are limited by VFIO_PCI_ROM_REGION_INDEX (6) and same functionality
> > can be achieved by multiple calls to DMABUF import.
> I don't think the same functionality can be achieved by multiple calls to
> dmabuf import. AFAIU, a dmabuf (as of today) is backed by a SGL that can
> have multiple entries because it represents a scattered buffer (multiple
> non-contiguous entries in System RAM or an MMIO region).
I don't know all the reasons why SG was chosen, but one of the main
reasons is that DMA SG API was the only one possible way to handle p2p
transfers (peer2peer flag).
> But in this patch you are constraining it such that only one entry associated with a
> buffer would be included, which effectively means that we cannot create
> a dmabuf to represent scattered buffers (located in a single MMIO region
> such as VRAM or other device memory) anymore.
Yes
>
> >
> > >
> > > Restricting the dmabuf to a single range (or having to create multiple
> > dmabufs
> > > to represent multiple regions/ranges associated with a single scattered
> > buffer)
> > > would be very limiting and may not work in all cases. For instance, in my
> > use-case,
> > > I am trying to share a large (4k mode) framebuffer (FB) located in GPU's
> > VRAM
> > > between two (p2p compatible) GPU devices. And, this would probably not
> > work
> > > given that allocating a large contiguous FB (nr_ranges = 1) in VRAM may
> > not be
> > > feasible when there is memory pressure.
> >
> > Can you please help me and point to the place in the code where this can
> > fail?
> > I'm probably missing something basic as there are no large allocations
> > in the current patchset.
> Sorry, I was not very clear. What I meant is that it is not prudent to assume that
> there will only be one range associated with an MMIO region which we need to
> consider while creating a dmabuf. And, I was pointing out my use-case as an
> example where vfio-pci needs to create a dmabuf for a large buffer (FB) that
> would likely be scattered (and not contiguous) in an MMIO region (such as VRAM).
>
> Let me further explain with my use-case. Here is a link to my Qemu-based test:
> https://gitlab.freedesktop.org/Vivek/qemu/-/commit/b2bdb16d9cfaf55384c95b1f…
Ohh, thanks. I'll add nr_ranges in next version. I see that you are
using same region_index for all ranges and this is how I would like to
keep it: "multiple nr_ranges, same region_index".
Thanks
On Fri, Jul 25, 2025 at 01:12:35PM -0600, Logan Gunthorpe wrote:
>
>
> On 2025-07-25 12:54, Leon Romanovsky wrote:
> >> The solution that would make more sense to me would be for either
> >> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
> >> P2PDMA case. dma-iommu.c already uses those same interfaces and thus
> >> there would be no need to export the low level helpers from the p2pdma code.
> >
> > I had same idea in early versions of DMA phys API discussion and it was
> > pointed (absolutely right) that this is layering violation.
>
> Respectfully, I have to disagree with this. Having the layer (ie.
> dma-iommu) that normally checks how to handle a P2PDMA request now check
> how to handle these DMA requests is the exact opposite of a layering
> violation.
I'm aware of your implementation and have feeling that it was very
influenced by NVMe requirements, so the end result is very tailored
for it. Other users have very different paths if p2p is taken. Just
see last VFIO patch in this series, it skips all DMA logic.
> Expecting every driver that wants to do P2PDMA to have to
> figure out for themselves how to map the memory before calling into the
> DMA API doesn't seem like a good design choice to me.
We had this discussion earlier too on previous versions. The summary is
that p2p capable devices are very special anyway. They need to work with
p2p natively. BTW, the implementation is not supposed to be in the
drivers, but in their respective subsystems.
>
> > So unfortunately, I think that dma*.c|h is not right place for p2p
> > type check.
>
> dma*.c is already where those checks are done. I'm not sure patches to
> remove the code from that layer and put it into the NVMe driver would
> make a lot of sense (and then, of course, we'd have to put it into every
> other driver that wants to participate in p2p transactions).
I don't have plans to remove existing checks right now, but NVMe was already
converted to new DMA phys API.
Thanks
>
> Logan
>
>
On Fri, Jul 25, 2025 at 10:30:46AM -0600, Logan Gunthorpe wrote:
>
>
> On 2025-07-24 02:13, Leon Romanovsky wrote:
> > On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> >> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky <leonro(a)nvidia.com>
> >>>
> >>> Export the pci_p2pdma_map_type() function to allow external modules
> >>> and subsystems to determine the appropriate mapping type for P2PDMA
> >>> transfers between a provider and target device.
> >>
> >> External modules have no business doing this.
> >
> > VFIO PCI code is built as module. There is no way to access PCI p2p code
> > without exporting functions in it.
>
> The solution that would make more sense to me would be for either
> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
> P2PDMA case. dma-iommu.c already uses those same interfaces and thus
> there would be no need to export the low level helpers from the p2pdma code.
I had same idea in early versions of DMA phys API discussion and it was
pointed (absolutely right) that this is layering violation.
At that time, that remark wasn't such clear to me because HMM code
performs check for p2p on every page and has call to dma_iova_try_alloc()
before that check. But this VFIO DMABUF code shows it much more clearer.
The p2p check is performed before any DMA calls and in case of PCI_P2PDMA_MAP_BUS_ADDR
p2p type between DMABUF exporter device and DMABUF importer device, we
don't call dma_iova_try_alloc() or any DMA API at all.
So unfortunately, I think that dma*.c|h is not right place for p2p
type check.
Thanks
>
> Logan
>
On Thu, Jul 24, 2025 at 05:13:49AM +0000, Kasireddy, Vivek wrote:
> Hi Leon,
>
> > Subject: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO
> > regions
> >
> > From: Leon Romanovsky <leonro(a)nvidia.com>
> >
> > Add support for exporting PCI device MMIO regions through dma-buf,
> > enabling safe sharing of non-struct page memory with controlled
> > lifetime management. This allows RDMA and other subsystems to import
> > dma-buf FDs and build them into memory regions for PCI P2P operations.
> >
> > The implementation provides a revocable attachment mechanism using
> > dma-buf move operations. MMIO regions are normally pinned as BARs
> > don't change physical addresses, but access is revoked when the VFIO
> > device is closed or a PCI reset is issued. This ensures kernel
> > self-defense against potentially hostile userspace.
> >
> > Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy(a)intel.com>
> > Signed-off-by: Leon Romanovsky <leonro(a)nvidia.com>
> > ---
> > drivers/vfio/pci/Kconfig | 20 ++
> > drivers/vfio/pci/Makefile | 2 +
> > drivers/vfio/pci/vfio_pci_config.c | 22 +-
> > drivers/vfio/pci/vfio_pci_core.c | 25 ++-
> > drivers/vfio/pci/vfio_pci_dmabuf.c | 321 +++++++++++++++++++++++++++++
> > drivers/vfio/pci/vfio_pci_priv.h | 23 +++
> > include/linux/dma-buf.h | 1 +
> > include/linux/vfio_pci_core.h | 3 +
> > include/uapi/linux/vfio.h | 19 ++
> > 9 files changed, 431 insertions(+), 5 deletions(-)
> > create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
<...>
> > +static int validate_dmabuf_input(struct vfio_pci_core_device *vdev,
> > + struct vfio_device_feature_dma_buf *dma_buf)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + u32 bar = dma_buf->region_index;
> > + u64 offset = dma_buf->offset;
> > + u64 len = dma_buf->length;
> > + resource_size_t bar_size;
> > + u64 sum;
> > +
> > + /*
> > + * For PCI the region_index is the BAR number like everything else.
> > + */
> > + if (bar >= VFIO_PCI_ROM_REGION_INDEX)
> > + return -ENODEV;
<...>
> > +/**
> > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> > + * regions selected.
> > + *
> > + * open_flags are the typical flags passed to open(2), eg O_RDWR,
> > O_CLOEXEC,
> > + * etc. offset/length specify a slice of the region to create the dmabuf from.
> > + * nr_ranges is the total number of (P2P DMA) ranges that comprise the
> > dmabuf.
> Any particular reason why you dropped the option (nr_ranges) of creating a
> single dmabuf from multiple ranges of an MMIO region?
I did it for two reasons. First, I wanted to simplify the code in order
to speed-up discussion over the patchset itself. Second, I failed to
find justification for need of multiple ranges, as the number of BARs
are limited by VFIO_PCI_ROM_REGION_INDEX (6) and same functionality
can be achieved by multiple calls to DMABUF import.
>
> Restricting the dmabuf to a single range (or having to create multiple dmabufs
> to represent multiple regions/ranges associated with a single scattered buffer)
> would be very limiting and may not work in all cases. For instance, in my use-case,
> I am trying to share a large (4k mode) framebuffer (FB) located in GPU's VRAM
> between two (p2p compatible) GPU devices. And, this would probably not work
> given that allocating a large contiguous FB (nr_ranges = 1) in VRAM may not be
> feasible when there is memory pressure.
Can you please help me and point to the place in the code where this can fail?
I'm probably missing something basic as there are no large allocations
in the current patchset.
>
> Furthermore, since you are adding a new UAPI with this patch/feature, as you know,
> we cannot go back and tweak it (to add support for nr_ranges > 1) should there
> be a need in the future, but you can always use nr_ranges = 1 anytime. Therefore,
> I think it makes sense to be flexible in terms of the number of ranges to include
> while creating a dmabuf instead of restricting ourselves to one range.
I'm not a big fan of over-engineering. Let's first understand if this
case is needed.
Thanks
>
> Thanks,
> Vivek
>
> > + *
> > + * Return: The fd number on success, -1 and errno is set on failure.
> > + */
> > +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> > +
> > +struct vfio_device_feature_dma_buf {
> > + __u32 region_index;
> > + __u32 open_flags;
> > + __u64 offset;
> > + __u64 length;
> > +};
> > +
> > /* -------- API for Type1 VFIO IOMMU -------- */
> >
> > /**
> > --
> > 2.50.1
>