 
            From: Leon Romanovsky leonro@nvidia.com
---------------------------------------------------------------------------
Based on blk and DMA patches which will be sent during coming merge window.
---------------------------------------------------------------------------
This series extends the VFIO PCI subsystem to support exporting MMIO regions from PCI device BARs as dma-buf objects, enabling safe sharing of non-struct page memory with controlled lifetime management. This allows RDMA and other subsystems to import dma-buf FDs and build them into memory regions for PCI P2P operations.
The series supports a use case for SPDK where an NVMe device is owned by SPDK through VFIO while interacting with an RDMA device. The RDMA device may directly access the NVMe CMB or directly manipulate the NVMe device's doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with VFIO. This dmabuf approach can be used by iommufd as well for generic and safe P2P mappings.
In addition to the SPDK use-case mentioned above, the capability added in this patch series can also be useful when a buffer (located in device memory such as VRAM) needs to be shared between any two dGPU devices or instances (assuming one of them is bound to VFIO PCI) as long as they are P2P DMA compatible.
The implementation provides a revocable attachment mechanism using dma-buf move operations. MMIO regions are normally pinned as BARs don't change physical addresses, but access is revoked when the VFIO device is closed or a PCI reset is issued. This ensures kernel self-defense against potentially hostile userspace.
The series includes significant refactoring of the PCI P2PDMA subsystem to separate core P2P functionality from memory allocation features, making it more modular and suitable for VFIO use cases that don't need struct page support.
-----------------------------------------------------------------------
This is based on
https://lore.kernel.org/all/20250307052248.405803-1-vivek.kasireddy@intel.co...
but heavily rewritten to be based on DMA physical API.
-----------------------------------------------------------------------
The WIP branch can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=d...
Thanks
Leon Romanovsky (8):
  PCI/P2PDMA: Remove redundant bus_offset from map state
  PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction
  PCI/P2PDMA: Simplify bus address mapping API
  PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
  PCI/P2PDMA: Export pci_p2pdma_map_type() function
  types: move phys_vec definition to common header
  vfio/pci: Enable peer-to-peer DMA transactions by default
  vfio/pci: Add dma-buf export support for MMIO regions

Vivek Kasireddy (2):
  vfio: Export vfio device get and put registration helpers
  vfio/pci: Share the core device pointer while invoking feature functions
 block/blk-mq-dma.c                 |   7 +-
 drivers/iommu/dma-iommu.c          |   4 +-
 drivers/pci/p2pdma.c               | 144 +++++++++----
 drivers/vfio/pci/Kconfig           |  20 ++
 drivers/vfio/pci/Makefile          |   2 +
 drivers/vfio/pci/vfio_pci_config.c |  22 +-
 drivers/vfio/pci/vfio_pci_core.c   |  59 ++++--
 drivers/vfio/pci/vfio_pci_dmabuf.c | 321 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_priv.h   |  23 +++
 drivers/vfio/vfio_main.c           |   2 +
 include/linux/dma-buf.h            |   1 +
 include/linux/pci-p2pdma.h         | 114 +++++-----
 include/linux/types.h              |   5 +
 include/linux/vfio.h               |   2 +
 include/linux/vfio_pci_core.h      |   4 +
 include/uapi/linux/vfio.h          |  19 ++
 kernel/dma/direct.c                |   4 +-
 mm/hmm.c                           |   2 +-
 18 files changed, 631 insertions(+), 124 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
 
            From: Leon Romanovsky leonro@nvidia.com
Remove the bus_off field from pci_p2pdma_map_state since it duplicates information already available in the pgmap structure. The bus_off value is only used in one location (pci_p2pdma_bus_addr_map) and is always identical to the bus_offset stored in the P2P pagemap.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com
Signed-off-by: Leon Romanovsky leonro@nvidia.com
---
 drivers/pci/p2pdma.c       | 1 -
 include/linux/pci-p2pdma.h | 3 +--
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 8d955c25aed36..fe347ed7fd8f4 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -1009,7 +1009,6 @@ void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
 {
 	state->pgmap = page_pgmap(page);
 	state->map = pci_p2pdma_map_type(state->pgmap, dev);
-	state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
 }
 
 /**
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 075c20b161d98..b502fc8b49bf9 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -146,7 +146,6 @@ enum pci_p2pdma_map_type {
 struct pci_p2pdma_map_state {
 	struct dev_pagemap *pgmap;
 	enum pci_p2pdma_map_type map;
-	u64 bus_off;
 };
 
 /* helper for pci_p2pdma_state(), do not use directly */
@@ -186,7 +185,7 @@ static inline dma_addr_t
 pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
 {
 	WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
-	return paddr + state->bus_off;
+	return paddr + to_p2p_pgmap(state->pgmap)->bus_offsetf;
 }
 
 #endif /* _LINUX_PCI_P2P_H */
 
            Looks good:
Reviewed-by: Christoph Hellwig hch@lst.de
 
            From: Leon Romanovsky leonro@nvidia.com
Extract the core P2PDMA provider information (device owner and bus offset) from the dev_pagemap into a dedicated p2pdma_provider structure. This creates a cleaner separation between the memory management layer and the P2PDMA functionality.
The new p2pdma_provider structure contains:
- owner: pointer to the providing device
- bus_offset: computed offset for non-host transactions
This refactoring simplifies the P2PDMA state management by removing the need to access pgmap internals directly. The pci_p2pdma_map_state now stores a pointer to the provider instead of the pgmap, making the API more explicit and easier to understand.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com
Signed-off-by: Leon Romanovsky leonro@nvidia.com
---
 drivers/pci/p2pdma.c       | 42 +++++++++++++++++++++-----------------
 include/linux/pci-p2pdma.h | 18 ++++++++++++----
 2 files changed, 37 insertions(+), 23 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index fe347ed7fd8f4..5a310026bd24f 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -28,9 +28,8 @@ struct pci_p2pdma {
 };
 
 struct pci_p2pdma_pagemap {
-	struct pci_dev *provider;
-	u64 bus_offset;
 	struct dev_pagemap pgmap;
+	struct p2pdma_provider mem;
 };
 
 static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
@@ -204,8 +203,8 @@ static void p2pdma_page_free(struct page *page)
 {
 	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
 	/* safe to dereference while a reference is held to the percpu ref */
-	struct pci_p2pdma *p2pdma =
-		rcu_dereference_protected(pgmap->provider->p2pdma, 1);
+	struct pci_p2pdma *p2pdma = rcu_dereference_protected(
+		to_pci_dev(pgmap->mem.owner)->p2pdma, 1);
 	struct percpu_ref *ref;
 
 	gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page),
@@ -270,14 +269,15 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 
 static void pci_p2pdma_unmap_mappings(void *data)
 {
-	struct pci_dev *pdev = data;
+	struct pci_p2pdma_pagemap *p2p_pgmap = data;
 
 	/*
 	 * Removing the alloc attribute from sysfs will call
 	 * unmap_mapping_range() on the inode, teardown any existing userspace
 	 * mappings and prevent new ones from being created.
 	 */
-	sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
+	sysfs_remove_file_from_group(&p2p_pgmap->mem.owner->kobj,
+				     &p2pmem_alloc_attr.attr,
 				     p2pmem_group.name);
 }
 
@@ -328,10 +328,9 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	pgmap->nr_range = 1;
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
 	pgmap->ops = &p2pdma_pgmap_ops;
-
-	p2p_pgmap->provider = pdev;
-	p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
-		pci_resource_start(pdev, bar);
+	p2p_pgmap->mem.owner = &pdev->dev;
+	p2p_pgmap->mem.bus_offset =
+		pci_bus_address(pdev, bar) - pci_resource_start(pdev, bar);
 
 	addr = devm_memremap_pages(&pdev->dev, pgmap);
 	if (IS_ERR(addr)) {
@@ -340,7 +339,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	}
 
 	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
-					 pdev);
+					 p2p_pgmap);
 	if (error)
 		goto pages_free;
 
@@ -973,16 +972,16 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
-						    struct device *dev)
+static enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
 {
 	enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
-	struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
+	struct pci_dev *pdev = to_pci_dev(provider->owner);
 	struct pci_dev *client;
 	struct pci_p2pdma *p2pdma;
 	int dist;
 
-	if (!provider->p2pdma)
+	if (!pdev->p2pdma)
 		return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 
 	if (!dev_is_pci(dev))
@@ -991,7 +990,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 	client = to_pci_dev(dev);
 
 	rcu_read_lock();
-	p2pdma = rcu_dereference(provider->p2pdma);
+	p2pdma = rcu_dereference(pdev->p2pdma);
 
 	if (p2pdma)
 		type = xa_to_value(xa_load(&p2pdma->map_types,
@@ -999,7 +998,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 	rcu_read_unlock();
 
 	if (type == PCI_P2PDMA_MAP_UNKNOWN)
-		return calc_map_type_and_dist(provider, client, &dist, true);
+		return calc_map_type_and_dist(pdev, client, &dist, true);
 
 	return type;
 }
@@ -1007,8 +1006,13 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
 		struct device *dev, struct page *page)
 {
-	state->pgmap = page_pgmap(page);
-	state->map = pci_p2pdma_map_type(state->pgmap, dev);
+	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page));
+
+	if (state->mem == &p2p_pgmap->mem)
+		return;
+
+	state->mem = &p2p_pgmap->mem;
+	state->map = pci_p2pdma_map_type(&p2p_pgmap->mem, dev);
 }
 
 /**
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index b502fc8b49bf9..27a2c399f47da 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -16,6 +16,16 @@ struct block_device;
 struct scatterlist;
 
+/**
+ * struct p2pdma_provider
+ *
+ * A p2pdma provider is a range of MMIO address space available to the CPU.
+ */
+struct p2pdma_provider {
+	struct device *owner;
+	u64 bus_offset;
+};
+
 #ifdef CONFIG_PCI_P2PDMA
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 		u64 offset);
@@ -144,10 +154,11 @@ enum pci_p2pdma_map_type {
 };
 
 struct pci_p2pdma_map_state {
-	struct dev_pagemap *pgmap;
+	struct p2pdma_provider *mem;
 	enum pci_p2pdma_map_type map;
 };
 
+
 /* helper for pci_p2pdma_state(), do not use directly */
 void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
 		struct device *dev, struct page *page);
@@ -166,8 +177,7 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
 		struct page *page)
 {
 	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
-		if (state->pgmap != page_pgmap(page))
-			__pci_p2pdma_update_state(state, dev, page);
+		__pci_p2pdma_update_state(state, dev, page);
 		return state->map;
 	}
 	return PCI_P2PDMA_MAP_NONE;
@@ -185,7 +195,7 @@ static inline dma_addr_t
 pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
 {
 	WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
-	return paddr + to_p2p_pgmap(state->pgmap)->bus_offsetf;
+	return paddr + state->mem->bus_offset;
 }
 
 #endif /* _LINUX_PCI_P2P_H */
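
As a rough illustration of the state handling in this patch, the userspace sketch below models how the map state caches a provider pointer and only repeats the routing lookup when the provider changes. It is not kernel code: struct provider, struct map_state, lookup_map_type() and the lookups counter are hypothetical stand-ins invented for this example, and the map type is simply stored in the provider rather than computed from PCI topology.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; names only loosely mirror the patch. */
enum map_type { MAP_UNKNOWN = 0, MAP_BUS_ADDR, MAP_THRU_HOST_BRIDGE };

struct provider {
	uint64_t bus_offset;
	enum map_type type;	/* assumption: pre-baked result of the routing lookup */
};

struct map_state {
	const struct provider *mem;	/* cached provider, NULL when empty */
	enum map_type map;		/* cached map type for that provider */
};

/* Counts how often the (in reality expensive) lookup runs. */
static unsigned int lookups;

static enum map_type lookup_map_type(const struct provider *p)
{
	lookups++;
	return p->type;
}

/*
 * Refresh the cached state only when the provider differs from the one
 * seen on the previous call, then return the cached map type.
 */
static enum map_type update_state(struct map_state *state,
				  const struct provider *p)
{
	assert(state && p);

	if (state->mem != p) {
		state->mem = p;
		state->map = lookup_map_type(p);
	}
	return state->map;
}
```

Calling update_state() repeatedly with the same provider runs the lookup once; switching to a different provider triggers exactly one more lookup.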
 
            On Wed, Jul 23, 2025 at 04:00:03PM +0300, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
Extract the core P2PDMA provider information (device owner and bus offset) from the dev_pagemap into a dedicated p2pdma_provider structure. This creates a cleaner separation between the memory management layer and the P2PDMA functionality.
The new p2pdma_provider structure contains:
- owner: pointer to the providing device
- bus_offset: computed offset for non-host transactions
This refactoring simplifies the P2PDMA state management by removing the need to access pgmap internals directly. The pci_p2pdma_map_state now stores a pointer to the provider instead of the pgmap, making the API more explicit and easier to understand.
I really don't see how anything becomes cleaner or simpler here. It adds a new structure that only exists embedded in the existing one and more code for no apparent benefit.
 
            On Thu, Jul 24, 2025 at 09:51:45AM +0200, Christoph Hellwig wrote:
On Wed, Jul 23, 2025 at 04:00:03PM +0300, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
Extract the core P2PDMA provider information (device owner and bus offset) from the dev_pagemap into a dedicated p2pdma_provider structure. This creates a cleaner separation between the memory management layer and the P2PDMA functionality.
The new p2pdma_provider structure contains:
- owner: pointer to the providing device
- bus_offset: computed offset for non-host transactions
This refactoring simplifies the P2PDMA state management by removing the need to access pgmap internals directly. The pci_p2pdma_map_state now stores a pointer to the provider instead of the pgmap, making the API more explicit and easier to understand.
I really don't see how anything becomes cleaner or simpler here. It adds a new structure that only exists embedded in the existing one and more code for no apparent benefit.
Please, see last patch in the series https://lore.kernel.org/all/aea452cc27ca9e5169f7279d7b524190c39e7260.1753274... It gives me a way to call p2p code with stable pointer for whole BAR.
Thanks
 
            On Thu, Jul 24, 2025 at 10:55:33AM +0300, Leon Romanovsky wrote:
Please, see last patch in the series https://lore.kernel.org/all/aea452cc27ca9e5169f7279d7b524190c39e7260.1753274... It gives me a way to call p2p code with stable pointer for whole BAR.
That simply can't work. So I guess you're trying to do the same stupid things shut down before again? I might as well not waste my time reviewing this.
 
            On Thu, Jul 24, 2025 at 09:59:22AM +0200, Christoph Hellwig wrote:
On Thu, Jul 24, 2025 at 10:55:33AM +0300, Leon Romanovsky wrote:
Please, see last patch in the series https://lore.kernel.org/all/aea452cc27ca9e5169f7279d7b524190c39e7260.1753274... It gives me a way to call p2p code with stable pointer for whole BAR.
That simply can't work. So I guess you're trying to do the same stupid things shut down before again? I might as well not waste my time reviewing this.
I'm not aware of anything that is not acceptable in this series.
This series focuses on replacing the dma_map_resource() call from v3 https://lore.kernel.org/all/20250307052248.405803-4-vivek.kasireddy@intel.co... with the proper API:
	if (!state) {
		addr = pci_p2pdma_bus_addr_map(provider, phys_vec->paddr);
	} else if (dma_use_iova(state)) {
		ret = dma_iova_link(attachment->dev, state, phys_vec->paddr, 0,
				    phys_vec->len, dir, DMA_ATTR_SKIP_CPU_SYNC);
		if (ret)
			goto err_free_table;

		ret = dma_iova_sync(attachment->dev, state, 0, phys_vec->len);
		if (ret)
			goto err_unmap_dma;

		addr = state->addr;
	} else {
		addr = dma_map_phys(attachment->dev, phys_vec->paddr,
				    phys_vec->len, dir, DMA_ATTR_SKIP_CPU_SYNC);
		ret = dma_mapping_error(attachment->dev, addr);
		if (ret)
			goto err_free_table;
	}
Thanks
 
            On Thu, Jul 24, 2025 at 09:59:22AM +0200, Christoph Hellwig wrote:
On Thu, Jul 24, 2025 at 10:55:33AM +0300, Leon Romanovsky wrote:
Please, see last patch in the series https://lore.kernel.org/all/aea452cc27ca9e5169f7279d7b524190c39e7260.1753274... It gives me a way to call p2p code with stable pointer for whole BAR.
That simply can't work.
Why not?
That's the whole point of this, to remove struct page and use something else as a handle for the p2p when doing the DMA API stuff.
The caller must make sure the lifetimes all work out. The handle must live longer than any active DMAs, etc, etc. DMABUF with invalidation lets vfio do that.
This is why the DMA api code was taught to use phys_addr_t and not touch the struct page so it could work with struct-pageless memory.
The idea was to end up with two layers in the P2P code where the lower layer only works on the handle, and then there is an optional struct page/genalloc/etc layer for places that want struct page and mmap.
Jason
 
            On Sun, Jul 27, 2025 at 03:51:58PM -0300, Jason Gunthorpe wrote:
On Thu, Jul 24, 2025 at 09:59:22AM +0200, Christoph Hellwig wrote:
On Thu, Jul 24, 2025 at 10:55:33AM +0300, Leon Romanovsky wrote:
Please, see last patch in the series https://lore.kernel.org/all/aea452cc27ca9e5169f7279d7b524190c39e7260.1753274... It gives me a way to call p2p code with stable pointer for whole BAR.
That simply can't work.
Why not?
That's the whole point of this, to remove struct page and use something else as a handle for the p2p when doing the DMA API stuff.
Because the struct page is the only thing that:
a) dma-mapping works on
b) is the only place we can discover the routing information, but also more importantly ensure that the underlying page is still present and the device is not hot unplugged, or in a very theoretical worst case replaced by something else.
 
            On Tue, Jul 29, 2025 at 09:52:09AM +0200, Christoph Hellwig wrote:
On Sun, Jul 27, 2025 at 03:51:58PM -0300, Jason Gunthorpe wrote:
On Thu, Jul 24, 2025 at 09:59:22AM +0200, Christoph Hellwig wrote:
On Thu, Jul 24, 2025 at 10:55:33AM +0300, Leon Romanovsky wrote:
Please, see last patch in the series https://lore.kernel.org/all/aea452cc27ca9e5169f7279d7b524190c39e7260.1753274... It gives me a way to call p2p code with stable pointer for whole BAR.
That simply can't work.
Why not?
That's the whole point of this, to remove struct page and use something else as a handle for the p2p when doing the DMA API stuff.
Because the struct page is the only thing that:
a) dma-mapping works on
b) is the only place we can discover the routing information, but also more importantly ensure that the underlying page is still present and the device is not hot unplugged, or in a very theoretical worst case replaced by something else.
It is correct in general case, but here we are talking about MMIO memory, which is "connected" to device X and routing information is stable.
Thanks
 
            On Tue, Jul 29, 2025 at 11:53:36AM +0300, Leon Romanovsky wrote:
Because the struct page is the only thing that:
a) dma-mapping works on
b) is the only place we can discover the routing information, but also more importantly ensure that the underlying page is still present and the device is not hot unplugged, or in a very theoretical worst case replaced by something else.
It is correct in general case, but here we are talking about MMIO memory, which is "connected" to device X and routing information is stable.
MMIO is literally the only thing we support to P2P to/from as that is how PCIe P2P is defined. And no, it's not stable - devices can be unplugged, and BARs can be reenumerated.
 
            On Tue, Jul 29, 2025 at 12:41:00PM +0200, Christoph Hellwig wrote:
On Tue, Jul 29, 2025 at 11:53:36AM +0300, Leon Romanovsky wrote:
Because the struct page is the only thing that:
a) dma-mapping works on
b) is the only place we can discover the routing information, but also more importantly ensure that the underlying page is still present and the device is not hot unplugged, or in a very theoretical worst case replaced by something else.
It is correct in general case, but here we are talking about MMIO memory, which is "connected" to device X and routing information is stable.
MMIO is literally the only thing we support to P2P to/from as that is how PCIe P2P is defined. And no, it's not stable - devices can be unplugged, and BARs can be reenumerated.
I have a feeling that we are drifting from the current patchset to more general discussion.
The whole idea of new DMA API is to provide flexibility to the callers (subsystems) who are perfectly aware of their data and limitations to implement direct addressing natively.
In this series, the device is controlled by VFIO and DMABUF. It is not possible to unplug it without VFIO noticing. In such a case, the p2pdma_provider and related routing information (DMABUF) will be reevaluated.
So for VFIO + DMABUF, the pointer is very stable.
For other cases (general case), the flow is not changed. Users will continue to call to old and well-known pci_p2pdma_state() to calculate p2p type.
Thanks
 
            On Tue, Jul 29, 2025 at 09:52:09AM +0200, Christoph Hellwig wrote:
On Sun, Jul 27, 2025 at 03:51:58PM -0300, Jason Gunthorpe wrote:
On Thu, Jul 24, 2025 at 09:59:22AM +0200, Christoph Hellwig wrote:
On Thu, Jul 24, 2025 at 10:55:33AM +0300, Leon Romanovsky wrote:
Please, see last patch in the series https://lore.kernel.org/all/aea452cc27ca9e5169f7279d7b524190c39e7260.1753274... It gives me a way to call p2p code with stable pointer for whole BAR.
That simply can't work.
Why not?
That's the whole point of this, to remove struct page and use something else as a handle for the p2p when doing the DMA API stuff.
Because the struct page is the only thing that:
a) dma-mapping works on
The main point of the "dma-mapping: migrate to physical address-based API" series was to remove the struct page dependencies in the DMA API:
https://lore.kernel.org/all/cover.1750854543.git.leon@kernel.org/
If it is not complete, then it needs more fixing.
b) is the only place we can discover the routing information,
This patch adds the p2pdma_provider structure to discover the routing information, this is exactly the problem being solved here.
but also more importantly ensure that the underlying page is still present and the device is not hot unplugged, or in a very theoretical worst case replaced by something else.
I already answered this, for DMABUF the DMABUF invalidation scheme is used to control the lifetime and no DMA mapping outlives the provider, and the provider doesn't outlive the driver.
Hotplug works fine. VFIO gets the driver removal callback, it invalidates all the DMABUFs, refuses to re-validate them, destroys the P2P provider, and ends its driver. There is no lifetime issue.
Obviously you cannot use the new p2provider mechanism without some kind of protection against use after hot unplug, but it doesn't have to be struct page based.
Jason
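
The teardown ordering described in this reply can be sketched as a small userspace model. Everything here is hypothetical and invented for illustration: struct provider, struct export, export_map(), export_revoke() and provider_destroy() are not VFIO or dma-buf APIs, and clearing active_maps stands in for importers tearing down their mappings when they are told to invalidate. The only point is the order: revoke every export first, check that no mapping survives, and only then let the provider go away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a P2P provider owned by a driver. */
struct provider {
	bool alive;
};

/* Hypothetical model of one exported buffer that importers can map. */
struct export {
	struct provider *prov;
	bool revoked;
	unsigned int active_maps;
};

/* Mapping succeeds only while the export is valid and the provider lives. */
static int export_map(struct export *e)
{
	if (e->revoked || !e->prov || !e->prov->alive)
		return -1;
	e->active_maps++;
	return 0;
}

/*
 * Revoke: refuse future mappings and drop the existing ones. In this
 * model the importer's reaction to invalidation is folded into one step.
 */
static void export_revoke(struct export *e)
{
	e->revoked = true;
	e->active_maps = 0;
}

/* Driver removal: invalidate everything before the provider disappears. */
static void provider_destroy(struct provider *p, struct export *exports,
			     size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		export_revoke(&exports[i]);

	for (i = 0; i < n; i++) {
		/* No mapping may outlive the provider. */
		assert(exports[i].active_maps == 0);
		exports[i].prov = NULL;
	}
	p->alive = false;
}
```

After provider_destroy() returns, export_map() fails for every export, which mirrors the "refuses to re-validate" step in the description above.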
 
            On Wed, Jul 23, 2025 at 04:00:03PM +0300, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
Extract the core P2PDMA provider information (device owner and bus offset) from the dev_pagemap into a dedicated p2pdma_provider structure. This creates a cleaner separation between the memory management layer and the P2PDMA functionality.
The new p2pdma_provider structure contains:
- owner: pointer to the providing device
- bus_offset: computed offset for non-host transactions
This refactoring simplifies the P2PDMA state management by removing the need to access pgmap internals directly. The pci_p2pdma_map_state now stores a pointer to the provider instead of the pgmap, making the API more explicit and easier to understand.
Based on the conversation how about this as a commit message:
PCI/P2PDMA: Separate the mmap() support from the core logic
Currently the P2PDMA code requires a pgmap and a struct page to function. This was serving three important purposes:
- DMA API compatibility, where scatterlist required a struct page as input
- Life cycle management, the percpu_ref is used to prevent UAF during device hot unplug
- A way to get the P2P provider data through the pci_p2pdma_pagemap
The DMA API now has a new flow, and has gained phys_addr_t support, so it no longer needs struct pages to perform P2P mapping.
Lifecycle management can be delegated to the user, DMABUF for instance has a suitable invalidation protocol that does not require struct page.
Finding the P2P provider data can also be managed by the caller without need to look it up from the phys_addr.
Split the P2PDMA code into two layers. The optional upper layer, effectively, provides a way to mmap() P2P memory into a VMA by providing struct page, pgmap, a genalloc and sysfs.
The lower layer provides the actual P2P infrastructure and is wrapped up in a new struct p2pdma_provider. Rework the mmap layer to use new p2pdma_provider based APIs.
Drivers that do not want to put P2P memory into VMA's can allocate a struct p2pdma_provider after probe() starts and free it before remove() completes. When DMA mapping the driver must convey the struct p2pdma_provider to the DMA mapping code along with a phys_addr of the MMIO BAR slice to map. The driver must ensure that no DMA mapping outlives the lifetime of the struct p2pdma_provider.
The intended target of this new API layer is DMABUF. There is usually only a single p2pdma_provider for a DMABUF exporter. Most drivers can establish the p2pdma_provider during probe, access the single instance during DMABUF attach and use that to drive the DMA mapping.
DMABUF provides an invalidation mechanism that can guarantee all DMA is halted and the DMA mappings are undone prior to destroying the struct p2pdma_provider. This ensures there is no UAF through DMABUFs that are lingering past driver removal.
The new p2pdma_provider layer cannot be used to create P2P memory that can be mapped into VMA's, be used with pin_user_pages(), O_DIRECT, and so on. These use cases must still use the mmap() layer. The p2pdma_provider layer is principally for DMABUF-like use cases where DMABUF natively manages the life cycle and access instead of vmas/pin_user_pages()/struct page.
Jason
 
            From: Leon Romanovsky leonro@nvidia.com
Update the pci_p2pdma_bus_addr_map() function to take a direct pointer to the p2pdma_provider structure instead of the pci_p2pdma_map_state. This simplifies the API by removing the need for callers to extract the provider from the state structure.
The change updates all callers across the kernel (block layer, IOMMU, DMA direct, and HMM) to pass the provider pointer directly, making the code more explicit and reducing unnecessary indirection. This also removes the runtime warning check since callers now have direct control over which provider they use.
Signed-off-by: Leon Romanovsky leonro@nvidia.com
---
 block/blk-mq-dma.c         | 2 +-
 drivers/iommu/dma-iommu.c  | 4 ++--
 include/linux/pci-p2pdma.h | 7 +++----
 kernel/dma/direct.c        | 4 ++--
 mm/hmm.c                   | 2 +-
 5 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index 37e2142be4f7d..eeac653e3f3bd 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -79,7 +79,7 @@ static inline bool blk_can_dma_map_iova(struct request *req,
 
 static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
 {
-	iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
+	iter->addr = pci_p2pdma_bus_addr_map(iter->p2pdma.mem, vec->paddr);
 	iter->len = vec->len;
 	return true;
 }
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index cd4bc22efa966..1853a969e1978 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1427,8 +1427,8 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 			 * as a bus address, __finalise_sg() will copy the dma
 			 * address into the output segment.
 			 */
-			s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
-						sg_phys(s));
+			s->dma_address = pci_p2pdma_bus_addr_map(
+				p2pdma_state.mem, sg_phys(s));
 			sg_dma_len(s) = sg->length;
 			sg_dma_mark_bus_address(s);
 			continue;
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 27a2c399f47da..eef96636c67e6 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -186,16 +186,15 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
 /**
  * pci_p2pdma_bus_addr_map - Translate a physical address to a bus address
  *			     for a PCI_P2PDMA_MAP_BUS_ADDR transfer.
- * @state: P2P state structure
+ * @provider: P2P provider structure
  * @paddr: physical address to map
  *
  * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer.
  */
 static inline dma_addr_t
-pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
+pci_p2pdma_bus_addr_map(struct p2pdma_provider *provider, phys_addr_t paddr)
 {
-	WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
-	return paddr + state->mem->bus_offset;
+	return paddr + provider->bus_offset;
 }
 
 #endif /* _LINUX_PCI_P2P_H */
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index fa75e30700730..de34ee5903766 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -484,8 +484,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 			}
 			break;
 		case PCI_P2PDMA_MAP_BUS_ADDR:
-			sg->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
-					sg_phys(sg));
+			sg->dma_address = pci_p2pdma_bus_addr_map(
+				p2pdma_state.mem, sg_phys(sg));
 			sg_dma_mark_bus_address(sg);
 			continue;
 		default:
diff --git a/mm/hmm.c b/mm/hmm.c
index 9354fae3ae06f..f9970b0e527ed 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -755,7 +755,7 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
 		break;
 	case PCI_P2PDMA_MAP_BUS_ADDR:
 		pfns[idx] |= HMM_PFN_P2PDMA_BUS | HMM_PFN_DMA_MAPPED;
-		return pci_p2pdma_bus_addr_map(p2pdma_state->mem, paddr);
+		return pci_p2pdma_bus_addr_map(p2pdma_state, paddr);
 	default:
 		return DMA_MAPPING_ERROR;
 	}
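
For readers unfamiliar with the bus_offset arithmetic, here is a minimal userspace sketch of the translation that pci_p2pdma_bus_addr_map() performs after this patch. The struct and both helpers are simplified stand-ins written for this example (make_bus_offset() in particular is hypothetical and only mirrors the "bus address minus resource start" expression seen in the series), and the addresses used below are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the provider: only the offset matters here. */
struct provider {
	uint64_t bus_offset;
};

/*
 * The offset is the BAR's PCI bus address minus its CPU physical start.
 * Unsigned arithmetic wraps modulo 2^64, so a bus address below the CPU
 * address still yields an offset that translates correctly when added.
 */
static uint64_t make_bus_offset(uint64_t bar_bus_addr, uint64_t bar_cpu_start)
{
	return bar_bus_addr - bar_cpu_start;
}

/* Translate a CPU physical address inside the BAR to a bus address. */
static uint64_t bus_addr_map(const struct provider *p, uint64_t paddr)
{
	assert(p);
	return paddr + p->bus_offset;
}
```

With a BAR at CPU physical address 0x4000000000 and bus address 0xe0000000, the physical address 0x4000001000 translates to bus address 0xe0001000. When the two address spaces coincide the offset is zero and the address passes through unchanged.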
 
            On Wed, Jul 23, 2025 at 04:00:04PM +0300, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
Update the pci_p2pdma_bus_addr_map() function to take a direct pointer to the p2pdma_provider structure instead of the pci_p2pdma_map_state. This simplifies the API by removing the need for callers to extract the provider from the state structure.
The change updates all callers across the kernel (block layer, IOMMU, DMA direct, and HMM) to pass the provider pointer directly, making the code more explicit and reducing unnecessary indirection. This also removes the runtime warning check since callers now have direct control over which provider they use.
Again I don't actually see any simplification here. But maybe I'm missing the ultimate goal here.
 
            From: Leon Romanovsky leonro@nvidia.com
Refactor the PCI P2PDMA subsystem to separate the core peer-to-peer DMA functionality from the optional memory allocation layer. This creates a two-tier architecture:
The core layer provides P2P mapping functionality for physical addresses based on PCI device MMIO BARs and integrates with the DMA API for mapping operations. This layer is required for all P2PDMA users.
The optional upper layer provides memory allocation capabilities including gen_pool allocator, struct page support, and sysfs interface for user space access.
This separation allows subsystems like VFIO to use only the core P2P mapping functionality without the overhead of memory allocation features they don't need. The core functionality is now available through the new pci_p2pdma_enable() function that returns a p2pdma_provider structure.
Signed-off-by: Leon Romanovsky leonro@nvidia.com
---
 drivers/pci/p2pdma.c       | 108 +++++++++++++++++++++++++------------
 include/linux/pci-p2pdma.h |   5 ++
 2 files changed, 80 insertions(+), 33 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 5a310026bd24f..8e2525618d922 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -25,11 +25,12 @@ struct pci_p2pdma {
 	struct gen_pool *pool;
 	bool p2pmem_published;
 	struct xarray map_types;
+	struct p2pdma_provider mem;
 };
 
 struct pci_p2pdma_pagemap {
 	struct dev_pagemap pgmap;
-	struct p2pdma_provider mem;
+	struct p2pdma_provider *mem;
 };
 
 static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
@@ -204,7 +205,7 @@ static void p2pdma_page_free(struct page *page)
 	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
 	/* safe to dereference while a reference is held to the percpu ref */
 	struct pci_p2pdma *p2pdma = rcu_dereference_protected(
-		to_pci_dev(pgmap->mem.owner)->p2pdma, 1);
+		to_pci_dev(pgmap->mem->owner)->p2pdma, 1);
 	struct percpu_ref *ref;
 
 	gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page),
@@ -227,44 +228,77 @@ static void pci_p2pdma_release(void *data)
 
 	/* Flush and disable pci_alloc_p2p_mem() */
 	pdev->p2pdma = NULL;
-	synchronize_rcu();
+	if (p2pdma->pool)
+		synchronize_rcu();
+	xa_destroy(&p2pdma->map_types);
+
+	if (!p2pdma->pool)
+		return;
 
 	gen_pool_destroy(p2pdma->pool);
 	sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
-	xa_destroy(&p2pdma->map_types);
 }
 
-static int pci_p2pdma_setup(struct pci_dev *pdev)
+/**
+ * pci_p2pdma_enable - Enable peer-to-peer DMA support for a PCI device
+ * @pdev: The PCI device to enable P2PDMA for
+ *
+ * This function initializes the peer-to-peer DMA infrastructure for a PCI
+ * device. It allocates and sets up the necessary data structures to support
+ * P2PDMA operations, including mapping type tracking.
+ */
+struct p2pdma_provider *pci_p2pdma_enable(struct pci_dev *pdev)
 {
-	int error = -ENOMEM;
 	struct pci_p2pdma *p2p;
+	int ret;
 
 	p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
 	if (!p2p)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	xa_init(&p2p->map_types);
+	p2p->mem.owner = &pdev->dev;
+	/* On all p2p platforms bus_offset is the same for all BARs */
+	p2p->mem.bus_offset =
+		pci_bus_address(pdev, 0) - pci_resource_start(pdev, 0);
 
-	p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
-	if (!p2p->pool)
-		goto out;
+	ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+	if (ret)
+		goto out_p2p;
 
-	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
-	if (error)
-		goto out_pool_destroy;
+	rcu_assign_pointer(pdev->p2pdma, p2p);
+	return &p2p->mem;
 
-	error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
-	if (error)
+out_p2p:
+	devm_kfree(&pdev->dev, p2p);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_enable);
+
+static int pci_p2pdma_setup_pool(struct pci_dev *pdev)
+{
+	struct pci_p2pdma *p2pdma;
+	int ret;
+
+	p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
+	if (p2pdma->pool)
+		/* We already setup pools, do nothing, */
+		return 0;
+
+	p2pdma->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+	if (!p2pdma->pool)
+		return -ENOMEM;
+
+	ret = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
+	if (ret)
 		goto out_pool_destroy;
- rcu_assign_pointer(pdev->p2pdma, p2p); return 0;
out_pool_destroy: - gen_pool_destroy(p2p->pool); -out: - devm_kfree(&pdev->dev, p2p); - return error; + gen_pool_destroy(p2pdma->pool); + p2pdma->pool = NULL; + return ret; }
static void pci_p2pdma_unmap_mappings(void *data) @@ -276,7 +310,7 @@ static void pci_p2pdma_unmap_mappings(void *data) * unmap_mapping_range() on the inode, teardown any existing userspace * mappings and prevent new ones from being created. */ - sysfs_remove_file_from_group(&p2p_pgmap->mem.owner->kobj, + sysfs_remove_file_from_group(&p2p_pgmap->mem->owner->kobj, &p2pmem_alloc_attr.attr, p2pmem_group.name); } @@ -295,6 +329,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset) { struct pci_p2pdma_pagemap *p2p_pgmap; + struct p2pdma_provider *mem; struct dev_pagemap *pgmap; struct pci_p2pdma *p2pdma; void *addr; @@ -312,15 +347,22 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, if (size + offset > pci_resource_len(pdev, bar)) return -EINVAL;
- if (!pdev->p2pdma) { - error = pci_p2pdma_setup(pdev); + p2pdma = rcu_dereference_protected(pdev->p2pdma, 1); + if (!p2pdma) { + mem = pci_p2pdma_enable(pdev); + if (IS_ERR(mem)) + return PTR_ERR(mem); + + error = pci_p2pdma_setup_pool(pdev); if (error) return error; }
p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL); - if (!p2p_pgmap) - return -ENOMEM; + if (!p2p_pgmap) { + error = -ENOMEM; + goto free_pool; + }
pgmap = &p2p_pgmap->pgmap; pgmap->range.start = pci_resource_start(pdev, bar) + offset; @@ -328,9 +370,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, pgmap->nr_range = 1; pgmap->type = MEMORY_DEVICE_PCI_P2PDMA; pgmap->ops = &p2pdma_pgmap_ops; - p2p_pgmap->mem.owner = &pdev->dev; - p2p_pgmap->mem.bus_offset = - pci_bus_address(pdev, bar) - pci_resource_start(pdev, bar); + p2p_pgmap->mem = mem;
addr = devm_memremap_pages(&pdev->dev, pgmap); if (IS_ERR(addr)) { @@ -343,7 +383,6 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, if (error) goto pages_free;
- p2pdma = rcu_dereference_protected(pdev->p2pdma, 1); error = gen_pool_add_owner(p2pdma->pool, (unsigned long)addr, pci_bus_address(pdev, bar) + offset, range_len(&pgmap->range), dev_to_node(&pdev->dev), @@ -359,7 +398,10 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, pages_free: devm_memunmap_pages(&pdev->dev, pgmap); pgmap_free: - devm_kfree(&pdev->dev, pgmap); + devm_kfree(&pdev->dev, p2p_pgmap); +free_pool: + sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group); + gen_pool_destroy(p2pdma->pool); return error; } EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource); @@ -1008,11 +1050,11 @@ void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state, { struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page));
- if (state->mem == &p2p_pgmap->mem) + if (state->mem == p2p_pgmap->mem) return;
- state->mem = &p2p_pgmap->mem; - state->map = pci_p2pdma_map_type(&p2p_pgmap->mem, dev); + state->mem = p2p_pgmap->mem; + state->map = pci_p2pdma_map_type(p2p_pgmap->mem, dev); }
/** diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h index eef96636c67e6..83f11dc8659a7 100644 --- a/include/linux/pci-p2pdma.h +++ b/include/linux/pci-p2pdma.h @@ -27,6 +27,7 @@ struct p2pdma_provider { };
#ifdef CONFIG_PCI_P2PDMA +struct p2pdma_provider *pci_p2pdma_enable(struct pci_dev *pdev); int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset); int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients, @@ -45,6 +46,10 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev, ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev, bool use_p2pdma); #else /* CONFIG_PCI_P2PDMA */ +static inline struct p2pdma_provider *pci_p2pdma_enable(struct pci_dev *pdev) +{ + return ERR_PTR(-EOPNOTSUPP); +} static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset) {
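The patch above computes bus_offset once, from BAR 0, as pci_bus_address() minus pci_resource_start(), and relies on that offset being the same for every BAR. The userspace sketch below only models that arithmetic. The names model_provider, model_provider_init() and model_bus_addr() are hypothetical and are not kernel API.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for struct p2pdma_provider. */
struct model_provider {
	uint64_t bus_offset;	/* PCI bus address minus CPU physical address */
};

/* Mirrors the patch: offset = bus address of BAR 0 - resource start of BAR 0. */
static inline void model_provider_init(struct model_provider *p,
				       uint64_t bar0_bus_addr,
				       uint64_t bar0_phys_start)
{
	/* Unsigned arithmetic wraps, so a "negative" offset still round-trips. */
	p->bus_offset = bar0_bus_addr - bar0_phys_start;
}

/* Translate a CPU physical address inside a BAR to a PCI bus address. */
static inline uint64_t model_bus_addr(const struct model_provider *p,
				      uint64_t paddr)
{
	return paddr + p->bus_offset;
}
```

With made-up addresses, a BAR 0 at bus address 0x80000000 and CPU physical address 0x3800000000 translates physical 0x3800001000 to bus address 0x80001000.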
 
            From: Leon Romanovsky leonro@nvidia.com
Export the pci_p2pdma_map_type() function to allow external modules and subsystems to determine the appropriate mapping type for P2PDMA transfers between a provider and target device.
The function determines whether peer-to-peer DMA transfers can be done directly through PCI switches (PCI_P2PDMA_MAP_BUS_ADDR) or must go through the host bridge (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE), or if the transfer is not supported at all.
This export enables subsystems like VFIO to properly handle P2PDMA operations by querying the mapping type before attempting transfers, ensuring correct DMA address programming and error handling.
Signed-off-by: Leon Romanovsky leonro@nvidia.com --- drivers/pci/p2pdma.c | 15 ++++++- include/linux/pci-p2pdma.h | 85 +++++++++++++++++++++----------------- 2 files changed, 59 insertions(+), 41 deletions(-)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index 8e2525618d922..326c7d88a1690 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -1014,8 +1014,18 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish) } EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
-static enum pci_p2pdma_map_type -pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev) +/** + * pci_p2pdma_map_type - Determine the mapping type for P2PDMA transfers + * @provider: P2PDMA provider structure + * @dev: Target device for the transfer + * + * Determines how peer-to-peer DMA transfers should be mapped between + * the provider and the target device. The mapping type indicates whether + * the transfer can be done directly through PCI switches or must go + * through the host bridge. + */ +enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider, + struct device *dev) { enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED; struct pci_dev *pdev = to_pci_dev(provider->owner); @@ -1044,6 +1054,7 @@ pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
return type; } +EXPORT_SYMBOL_GPL(pci_p2pdma_map_type);
void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state, struct device *dev, struct page *page) diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h index 83f11dc8659a7..dea98baee5ce2 100644 --- a/include/linux/pci-p2pdma.h +++ b/include/linux/pci-p2pdma.h @@ -26,6 +26,45 @@ struct p2pdma_provider { u64 bus_offset; };
+enum pci_p2pdma_map_type { + /* + * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before + * the mapping type has been calculated. Exported routines for the API + * will never return this value. + */ + PCI_P2PDMA_MAP_UNKNOWN = 0, + + /* + * Not a PCI P2PDMA transfer. + */ + PCI_P2PDMA_MAP_NONE, + + /* + * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will + * traverse the host bridge and the host bridge is not in the + * allowlist. DMA Mapping routines should return an error when + * this is returned. + */ + PCI_P2PDMA_MAP_NOT_SUPPORTED, + + /* + * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to + * each other directly through a PCI switch and the transaction will + * not traverse the host bridge. Such a mapping should program + * the DMA engine with PCI bus addresses. + */ + PCI_P2PDMA_MAP_BUS_ADDR, + + /* + * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk + * to each other, but the transaction traverses a host bridge on the + * allowlist. In this case, a normal mapping either with CPU physical + * addresses (in the case of dma-direct) or IOVA addresses (in the + * case of IOMMUs) should be used to program the DMA engine. 
+ */ + PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, +}; + #ifdef CONFIG_PCI_P2PDMA struct p2pdma_provider *pci_p2pdma_enable(struct pci_dev *pdev); int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, @@ -45,6 +84,8 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev, bool *use_p2pdma); ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev, bool use_p2pdma); +enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider, + struct device *dev); #else /* CONFIG_PCI_P2PDMA */ static inline struct p2pdma_provider *pci_p2pdma_enable(struct pci_dev *pdev) { @@ -105,6 +146,11 @@ static inline ssize_t pci_p2pdma_enable_show(char *page, { return sprintf(page, "none\n"); } +static inline enum pci_p2pdma_map_type +pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev) +{ + return PCI_P2PDMA_MAP_NOT_SUPPORTED; +} #endif /* CONFIG_PCI_P2PDMA */
@@ -119,45 +165,6 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client) return pci_p2pmem_find_many(&client, 1); }
-enum pci_p2pdma_map_type { - /* - * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before - * the mapping type has been calculated. Exported routines for the API - * will never return this value. - */ - PCI_P2PDMA_MAP_UNKNOWN = 0, - - /* - * Not a PCI P2PDMA transfer. - */ - PCI_P2PDMA_MAP_NONE, - - /* - * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will - * traverse the host bridge and the host bridge is not in the - * allowlist. DMA Mapping routines should return an error when - * this is returned. - */ - PCI_P2PDMA_MAP_NOT_SUPPORTED, - - /* - * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to - * each other directly through a PCI switch and the transaction will - * not traverse the host bridge. Such a mapping should program - * the DMA engine with PCI bus addresses. - */ - PCI_P2PDMA_MAP_BUS_ADDR, - - /* - * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk - * to each other, but the transaction traverses a host bridge on the - * allowlist. In this case, a normal mapping either with CPU physical - * addresses (in the case of dma-direct) or IOVA addresses (in the - * case of IOMMUs) should be used to program the DMA engine. - */ - PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, -}; - struct pci_p2pdma_map_state { struct p2pdma_provider *mem; enum pci_p2pdma_map_type map;
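The enum comments above say which address a caller should program into the DMA engine for each mapping type. A minimal userspace sketch of that decision follows. It assumes the caller already has both candidate addresses; the model_* names are hypothetical, and the -EINVAL return is an arbitrary choice for this sketch, not what the kernel returns.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical mirror of enum pci_p2pdma_map_type (userspace model). */
enum model_map_type {
	MODEL_MAP_UNKNOWN = 0,
	MODEL_MAP_NONE,
	MODEL_MAP_NOT_SUPPORTED,
	MODEL_MAP_BUS_ADDR,
	MODEL_MAP_THRU_HOST_BRIDGE,
};

/*
 * Pick the address for a P2P transfer. host_dma_addr stands in for
 * whatever a normal DMA mapping (dma-direct or IOMMU) would return.
 */
static inline int model_pick_dma_addr(enum model_map_type type,
				      uint64_t paddr, uint64_t bus_offset,
				      uint64_t host_dma_addr, uint64_t *out)
{
	switch (type) {
	case MODEL_MAP_BUS_ADDR:
		/* Traffic stays below a PCI switch: use the PCI bus address. */
		*out = paddr + bus_offset;
		return 0;
	case MODEL_MAP_THRU_HOST_BRIDGE:
		/* Traffic crosses an allowlisted host bridge: normal mapping. */
		*out = host_dma_addr;
		return 0;
	default:
		/* UNKNOWN, NONE and NOT_SUPPORTED are rejected in this sketch. */
		return -EINVAL;
	}
}
```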
 
            On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky leonro@nvidia.com
> Export the pci_p2pdma_map_type() function to allow external modules and subsystems to determine the appropriate mapping type for P2PDMA transfers between a provider and target device.
External modules have no business doing this.
 
            On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
>> From: Leon Romanovsky leonro@nvidia.com
>> Export the pci_p2pdma_map_type() function to allow external modules and subsystems to determine the appropriate mapping type for P2PDMA transfers between a provider and target device.
> External modules have no business doing this.
VFIO PCI code is built as module. There is no way to access PCI p2p code without exporting functions in it.
Thanks
 
            On Thu, Jul 24, 2025 at 11:13:21AM +0300, Leon Romanovsky wrote:
> On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
>> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
>>> From: Leon Romanovsky leonro@nvidia.com
>>> Export the pci_p2pdma_map_type() function to allow external modules and subsystems to determine the appropriate mapping type for P2PDMA transfers between a provider and target device.
>> External modules have no business doing this.
> VFIO PCI code is built as module. There is no way to access PCI p2p code without exporting functions in it.
We never ever export anything for "external" modules, and you really should know that.
 
            On Tue, Jul 29, 2025 at 09:52:30AM +0200, Christoph Hellwig wrote:
> On Thu, Jul 24, 2025 at 11:13:21AM +0300, Leon Romanovsky wrote:
>> On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
>>> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
>>>> From: Leon Romanovsky leonro@nvidia.com
>>>> Export the pci_p2pdma_map_type() function to allow external modules and subsystems to determine the appropriate mapping type for P2PDMA transfers between a provider and target device.
>>> External modules have no business doing this.
>> VFIO PCI code is built as module. There is no way to access PCI p2p code without exporting functions in it.
> We never ever export anything for "external" modules, and you really should know that.
It is just a wrong word in commit message. I clearly need it for vfio-pci module and nothing more.
"Never attribute to malice that which is adequately explained by stupidity." - Hanlon's razor.
Thanks
 
            On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
>> From: Leon Romanovsky leonro@nvidia.com
>> Export the pci_p2pdma_map_type() function to allow external modules and subsystems to determine the appropriate mapping type for P2PDMA transfers between a provider and target device.
> External modules have no business doing this.
So what's the plan?
Today the new DMA API broadly has the pattern:
	switch (pci_p2pdma_state(p2pdma_state, dev, page)) {
	[..]
	if (dma_use_iova(state)) {
		ret = dma_iova_link(dev, state, paddr, offset,
	[..]
	} else {
		dma_addr = dma_map_page(dev, page, 0, map->dma_entry_size,
	[..]
You can't fully use the new API flow without calling pci_p2pdma_state(), which is also not exported today.
Is the idea the full new DMA API flow should not be available to modules? We did export dma_iova_link().
Otherwise, the p2p step needs two functions - a struct page-full and a struct page-less version, and they need to be exported.
The names here are not so good, it would be nicer to have them be a dma_* prefixed function since they are used with the other dma_ functions.
Jason
 
            From: Leon Romanovsky leonro@nvidia.com
Move the struct phys_vec definition from block/blk-mq-dma.c to include/linux/types.h to make it available for use across the kernel.
The phys_vec structure represents a physical address range with a length, which is used by the new physical address-based DMA mapping API. This structure is already used by the block layer and will be needed by upcoming VFIO patches for dma-buf operations.
Moving this definition to types.h provides a centralized location for this common data structure and eliminates code duplication across subsystems that need to work with physical address ranges.
Signed-off-by: Leon Romanovsky leonro@nvidia.com --- block/blk-mq-dma.c | 5 ----- include/linux/types.h | 5 +++++ 2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c index eeac653e3f3bd..b0fa53c353d9d 100644 --- a/block/blk-mq-dma.c +++ b/block/blk-mq-dma.c @@ -5,11 +5,6 @@ #include <linux/blk-mq-dma.h> #include "blk.h"
-struct phys_vec { - phys_addr_t paddr; - u32 len; -}; - static bool blk_map_iter_next(struct request *req, struct req_iterator *iter, struct phys_vec *vec) { diff --git a/include/linux/types.h b/include/linux/types.h index 6dfdb8e8e4c35..2bc56681b2e62 100644 --- a/include/linux/types.h +++ b/include/linux/types.h @@ -170,6 +170,11 @@ typedef u64 phys_addr_t; typedef u32 phys_addr_t; #endif
+struct phys_vec { + phys_addr_t paddr; + u32 len; +}; + typedef phys_addr_t resource_size_t;
/*
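Since phys_vec is just a physical start address plus a 32-bit length, code that builds a list of them has to care about two things: whether adjacent ranges are contiguous, and whether a merged length still fits in u32. The userspace sketch below illustrates both checks. The model_* names and the merging helper are hypothetical and do not appear in the series, and phys_addr_t is assumed to be 64 bits wide.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t model_phys_addr_t;	/* assumption: 64-bit phys_addr_t */

/* Same two fields as struct phys_vec, renamed for the userspace model. */
struct model_phys_vec {
	model_phys_addr_t paddr;
	uint32_t len;
};

/*
 * Append [paddr, paddr + len) to vecs[]. If the new range starts exactly
 * where the last entry ends and the combined length fits in 32 bits, the
 * last entry is extended instead. Returns the new entry count. The caller
 * must guarantee that vecs[] has room for one more entry.
 */
static inline size_t model_phys_vec_add(struct model_phys_vec *vecs, size_t n,
					model_phys_addr_t paddr, uint32_t len)
{
	if (n > 0) {
		struct model_phys_vec *last = &vecs[n - 1];
		bool contiguous = last->paddr + last->len == paddr;
		bool fits = (uint64_t)last->len + len <= UINT32_MAX;

		if (contiguous && fits) {
			last->len += len;
			return n;
		}
	}
	vecs[n].paddr = paddr;
	vecs[n].len = len;
	return n + 1;
}
```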
 
            From: Vivek Kasireddy vivek.kasireddy@intel.com
These helpers are useful for managing additional references taken on the device from other associated VFIO modules.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Vivek Kasireddy vivek.kasireddy@intel.com Signed-off-by: Leon Romanovsky leonro@nvidia.com --- drivers/vfio/vfio_main.c | 2 ++ include/linux/vfio.h | 2 ++ 2 files changed, 4 insertions(+)
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c index 1fd261efc582d..620a3ee5d04db 100644 --- a/drivers/vfio/vfio_main.c +++ b/drivers/vfio/vfio_main.c @@ -171,11 +171,13 @@ void vfio_device_put_registration(struct vfio_device *device) if (refcount_dec_and_test(&device->refcount)) complete(&device->comp); } +EXPORT_SYMBOL_GPL(vfio_device_put_registration);
bool vfio_device_try_get_registration(struct vfio_device *device) { return refcount_inc_not_zero(&device->refcount); } +EXPORT_SYMBOL_GPL(vfio_device_try_get_registration);
/* * VFIO driver API diff --git a/include/linux/vfio.h b/include/linux/vfio.h index 707b00772ce1f..ba65bbdffd0b2 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -293,6 +293,8 @@ static inline void vfio_put_device(struct vfio_device *device) int vfio_register_group_dev(struct vfio_device *device); int vfio_register_emulated_iommu_dev(struct vfio_device *device); void vfio_unregister_group_dev(struct vfio_device *device); +bool vfio_device_try_get_registration(struct vfio_device *device); +void vfio_device_put_registration(struct vfio_device *device);
int vfio_assign_device_set(struct vfio_device *device, void *set_id); unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set);
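The two helpers exported above follow the usual "try to take a reference unless the count already hit zero" pattern: refcount_inc_not_zero() on the get side, and a completion signalled when the last reference is dropped on the put side. A userspace model of that pattern with C11 atomics might look like the following. The model_* names are hypothetical, and a plain flag stands in for the kernel completion.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct model_device {
	atomic_uint refcount;	/* 0 means the last reference is gone */
	atomic_bool released;	/* stands in for complete(&device->comp) */
};

/* Take a reference only if the count is still non-zero. */
static inline bool model_try_get(struct model_device *d)
{
	unsigned int old = atomic_load(&d->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&d->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the final put signals the waiter. */
static inline void model_put(struct model_device *d)
{
	if (atomic_fetch_sub(&d->refcount, 1) == 1)
		atomic_store(&d->released, true);
}
```

The point of the try-get form is that a caller holding only a raw pointer cannot resurrect a device whose unregistration has already started.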
 
            From: Leon Romanovsky leonro@nvidia.com
Make sure that all VFIO PCI devices have peer-to-peer capabilities enabled, so that their MMIO memory can be exported through DMABUF.
Signed-off-by: Leon Romanovsky leonro@nvidia.com --- drivers/vfio/pci/vfio_pci_core.c | 4 ++++ include/linux/vfio_pci_core.h | 1 + 2 files changed, 5 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 6328c3a05bcdd..1e675daab5753 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -29,6 +29,7 @@ #include <linux/nospec.h> #include <linux/sched/mm.h> #include <linux/iommufd.h> +#include <linux/pci-p2pdma.h> #if IS_ENABLED(CONFIG_EEH) #include <asm/eeh.h> #endif @@ -2091,6 +2092,9 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev) INIT_LIST_HEAD(&vdev->dummy_resources_list); INIT_LIST_HEAD(&vdev->ioeventfds_list); INIT_LIST_HEAD(&vdev->sriov_pfs_item); + vdev->provider = pci_p2pdma_enable(vdev->pdev); + if (IS_ERR(vdev->provider)) + return PTR_ERR(vdev->provider); init_rwsem(&vdev->memory_lock); xa_init(&vdev->ctx);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index fbb472dd99b36..b017fae251811 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -94,6 +94,7 @@ struct vfio_pci_core_device { struct vfio_pci_core_device *sriov_pf_core_dev; struct notifier_block nb; struct rw_semaphore memory_lock; + struct p2pdma_provider *provider; };
/* Will be exported for vfio pci drivers usage */
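The hunk above checks the result of pci_p2pdma_enable() with IS_ERR() and PTR_ERR(), because the function returns either a valid pointer or an errno value encoded in the pointer. For readers unfamiliar with that convention, here is a userspace model of the encoding. The model_* helpers are hypothetical re-implementations for illustration, not the kernel macros themselves.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *model_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer. */
static inline long model_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* Error pointers occupy the top MODEL_MAX_ERRNO values of the address space. */
static inline bool model_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}
```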
 
            From: Vivek Kasireddy vivek.kasireddy@intel.com
There is no need to share the main device pointer (struct vfio_device *) with all the feature functions as they only need the core device pointer. Therefore, extract the core device pointer once in the caller (vfio_pci_core_ioctl_feature) and share it instead.
Signed-off-by: Vivek Kasireddy vivek.kasireddy@intel.com Signed-off-by: Leon Romanovsky leonro@nvidia.com --- drivers/vfio/pci/vfio_pci_core.c | 30 +++++++++++++----------------- 1 file changed, 13 insertions(+), 17 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 1e675daab5753..5512d13bb8899 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -301,11 +301,9 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev, return 0; }
-static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags, +static int vfio_pci_core_pm_entry(struct vfio_pci_core_device *vdev, u32 flags, void __user *arg, size_t argsz) { - struct vfio_pci_core_device *vdev = - container_of(device, struct vfio_pci_core_device, vdev); int ret;
ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0); @@ -322,12 +320,10 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags, }
static int vfio_pci_core_pm_entry_with_wakeup( - struct vfio_device *device, u32 flags, + struct vfio_pci_core_device *vdev, u32 flags, struct vfio_device_low_power_entry_with_wakeup __user *arg, size_t argsz) { - struct vfio_pci_core_device *vdev = - container_of(device, struct vfio_pci_core_device, vdev); struct vfio_device_low_power_entry_with_wakeup entry; struct eventfd_ctx *efdctx; int ret; @@ -378,11 +374,9 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev) up_write(&vdev->memory_lock); }
-static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags, +static int vfio_pci_core_pm_exit(struct vfio_pci_core_device *vdev, u32 flags, void __user *arg, size_t argsz) { - struct vfio_pci_core_device *vdev = - container_of(device, struct vfio_pci_core_device, vdev); int ret;
ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0); @@ -1475,11 +1469,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, } EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
-static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, - uuid_t __user *arg, size_t argsz) +static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev, + u32 flags, uuid_t __user *arg, + size_t argsz) { - struct vfio_pci_core_device *vdev = - container_of(device, struct vfio_pci_core_device, vdev); uuid_t uuid; int ret;
@@ -1506,16 +1499,19 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, void __user *arg, size_t argsz) { + struct vfio_pci_core_device *vdev = + container_of(device, struct vfio_pci_core_device, vdev); + switch (flags & VFIO_DEVICE_FEATURE_MASK) { case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY: - return vfio_pci_core_pm_entry(device, flags, arg, argsz); + return vfio_pci_core_pm_entry(vdev, flags, arg, argsz); case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP: - return vfio_pci_core_pm_entry_with_wakeup(device, flags, + return vfio_pci_core_pm_entry_with_wakeup(vdev, flags, arg, argsz); case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT: - return vfio_pci_core_pm_exit(device, flags, arg, argsz); + return vfio_pci_core_pm_exit(vdev, flags, arg, argsz); case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN: - return vfio_pci_core_feature_token(device, flags, arg, argsz); + return vfio_pci_core_feature_token(vdev, flags, arg, argsz); default: return -ENOTTY; }
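The refactoring above does the container_of() conversion once in the dispatcher and hands the core device to every feature handler. A self-contained userspace sketch of the same shape is shown below. All model_* names are hypothetical, and the two handlers are placeholders rather than real VFIO features.

```c
#include <stddef.h>

struct model_vfio_device {
	int id;
};

struct model_core_device {
	int bar_count;
	struct model_vfio_device vdev;	/* embedded generic device */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Handlers take the core device directly; no per-handler conversion. */
static inline int model_feature_a(struct model_core_device *core)
{
	return core->bar_count;
}

static inline int model_feature_b(struct model_core_device *core)
{
	return core->vdev.id;
}

/* The dispatcher converts once and shares the result. */
static inline int model_ioctl_feature(struct model_vfio_device *device,
				      int which)
{
	struct model_core_device *core =
		model_container_of(device, struct model_core_device, vdev);

	switch (which) {
	case 0:
		return model_feature_a(core);
	case 1:
		return model_feature_b(core);
	default:
		return -1;
	}
}
```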
 
            On Wed, 23 Jul 2025 16:00:10 +0300 Leon Romanovsky leon@kernel.org wrote:
> From: Vivek Kasireddy vivek.kasireddy@intel.com
> There is no need to share the main device pointer (struct vfio_device *) with all the feature functions as they only need the core device pointer. Therefore, extract the core device pointer once in the caller (vfio_pci_core_ioctl_feature) and share it instead.
> Signed-off-by: Vivek Kasireddy vivek.kasireddy@intel.com
> Signed-off-by: Leon Romanovsky leonro@nvidia.com
> drivers/vfio/pci/vfio_pci_core.c | 30 +++++++++++++-----------------
> 1 file changed, 13 insertions(+), 17 deletions(-)
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 1e675daab5753..5512d13bb8899 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -301,11 +301,9 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
>  	return 0;
>  }
>
> -static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
> +static int vfio_pci_core_pm_entry(struct vfio_pci_core_device *vdev, u32 flags,
>  				  void __user *arg, size_t argsz)
>  {
> -	struct vfio_pci_core_device *vdev =
> -		container_of(device, struct vfio_pci_core_device, vdev);
>  	int ret;
>
>  	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
> @@ -322,12 +320,10 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
>  }
>
>  static int vfio_pci_core_pm_entry_with_wakeup(
> -	struct vfio_device *device, u32 flags,
> +	struct vfio_pci_core_device *vdev, u32 flags,
>  	struct vfio_device_low_power_entry_with_wakeup __user *arg,
>  	size_t argsz)
I'm tempted to fix the line wrapping here, but I think this patch stands on its own. Even if it's rather trivial, it makes sense to consolidate and standardize on the vfio_pci_core_device getting passed around within vfio_pci_core.c. Any reason not to split this off? Thanks,
Alex
>  {
> -	struct vfio_pci_core_device *vdev =
> -		container_of(device, struct vfio_pci_core_device, vdev);
>  	struct vfio_device_low_power_entry_with_wakeup entry;
>  	struct eventfd_ctx *efdctx;
>  	int ret;
> @@ -378,11 +374,9 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
>  	up_write(&vdev->memory_lock);
>  }
>
> -static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
> +static int vfio_pci_core_pm_exit(struct vfio_pci_core_device *vdev, u32 flags,
>  				 void __user *arg, size_t argsz)
>  {
> -	struct vfio_pci_core_device *vdev =
> -		container_of(device, struct vfio_pci_core_device, vdev);
>  	int ret;
>
>  	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
> @@ -1475,11 +1469,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>
> -static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
> -				       uuid_t __user *arg, size_t argsz)
> +static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
> +				       u32 flags, uuid_t __user *arg,
> +				       size_t argsz)
>  {
> -	struct vfio_pci_core_device *vdev =
> -		container_of(device, struct vfio_pci_core_device, vdev);
>  	uuid_t uuid;
>  	int ret;
>
> @@ -1506,16 +1499,19 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
>  int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  				void __user *arg, size_t argsz)
>  {
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +
>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>  	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
> -		return vfio_pci_core_pm_entry(device, flags, arg, argsz);
> +		return vfio_pci_core_pm_entry(vdev, flags, arg, argsz);
>  	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP:
> -		return vfio_pci_core_pm_entry_with_wakeup(device, flags,
> +		return vfio_pci_core_pm_entry_with_wakeup(vdev, flags,
>  							  arg, argsz);
>  	case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
> -		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
> +		return vfio_pci_core_pm_exit(vdev, flags, arg, argsz);
>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
> -		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
>  	default:
>  		return -ENOTTY;
>  	}
 
            On Mon, Jul 28, 2025 at 02:55:53PM -0600, Alex Williamson wrote:
> On Wed, 23 Jul 2025 16:00:10 +0300 Leon Romanovsky leon@kernel.org wrote:
>> From: Vivek Kasireddy vivek.kasireddy@intel.com
>> There is no need to share the main device pointer (struct vfio_device *) with all the feature functions as they only need the core device pointer. Therefore, extract the core device pointer once in the caller (vfio_pci_core_ioctl_feature) and share it instead.
>> Signed-off-by: Vivek Kasireddy vivek.kasireddy@intel.com
>> Signed-off-by: Leon Romanovsky leonro@nvidia.com
>> drivers/vfio/pci/vfio_pci_core.c | 30 +++++++++++++-----------------
>> 1 file changed, 13 insertions(+), 17 deletions(-)
<...>
>>  static int vfio_pci_core_pm_entry_with_wakeup(
>> -	struct vfio_device *device, u32 flags,
>> +	struct vfio_pci_core_device *vdev, u32 flags,
>>  	struct vfio_device_low_power_entry_with_wakeup __user *arg,
>>  	size_t argsz)
> I'm tempted to fix the line wrapping here, but I think this patch stands on its own. Even if it's rather trivial, it makes sense to consolidate and standardize on the vfio_pci_core_device getting passed around within vfio_pci_core.c. Any reason not to split this off?
No problem, I will send it separately after merge window ends.
Thanks
> Thanks,
> Alex
 
            From: Leon Romanovsky leonro@nvidia.com
Add support for exporting PCI device MMIO regions through dma-buf, enabling safe sharing of non-struct page memory with controlled lifetime management. This allows RDMA and other subsystems to import dma-buf FDs and build them into memory regions for PCI P2P operations.
The implementation provides a revocable attachment mechanism using dma-buf move operations. MMIO regions are normally pinned as BARs don't change physical addresses, but access is revoked when the VFIO device is closed or a PCI reset is issued. This ensures kernel self-defense against potentially hostile userspace.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Vivek Kasireddy vivek.kasireddy@intel.com Signed-off-by: Leon Romanovsky leonro@nvidia.com --- drivers/vfio/pci/Kconfig | 20 ++ drivers/vfio/pci/Makefile | 2 + drivers/vfio/pci/vfio_pci_config.c | 22 +- drivers/vfio/pci/vfio_pci_core.c | 25 ++- drivers/vfio/pci/vfio_pci_dmabuf.c | 321 +++++++++++++++++++++++++++++ drivers/vfio/pci/vfio_pci_priv.h | 23 +++ include/linux/dma-buf.h | 1 + include/linux/vfio_pci_core.h | 3 + include/uapi/linux/vfio.h | 19 ++ 9 files changed, 431 insertions(+), 5 deletions(-) create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index 2b0172f546652..55ae888bf26ae 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -55,6 +55,26 @@ config VFIO_PCI_ZDEV_KVM
To enable s390x KVM vfio-pci extensions, say Y.
+config VFIO_PCI_DMABUF + bool "VFIO PCI extensions for DMA-BUF" + depends on VFIO_PCI_CORE + depends on PCI_P2PDMA && DMA_SHARED_BUFFER + default y + help + Enable support for VFIO PCI extensions that allow exporting + device MMIO regions as DMA-BUFs for peer devices to access via + peer-to-peer (P2P) DMA. + + This feature enables a VFIO-managed PCI device to export a portion + of its MMIO BAR as a DMA-BUF file descriptor, which can be passed + to other userspace drivers or kernel subsystems capable of + initiating DMA to that region. + + Say Y here if you want to enable VFIO DMABUF-based MMIO export + support for peer-to-peer DMA use cases. + + If unsure, say N. + source "drivers/vfio/pci/mlx5/Kconfig"
source "drivers/vfio/pci/hisilicon/Kconfig" diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index cf00c0a7e55c8..f9155e9c5f630 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -2,7 +2,9 @@
vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o + obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o +vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
vfio-pci-y := vfio_pci.o vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index 8f02f236b5b4b..7e23387a43b4d 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -589,10 +589,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY); new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
- if (!new_mem) + if (!new_mem) { vfio_pci_zap_and_down_write_memory_lock(vdev); - else + vfio_pci_dma_buf_move(vdev, true); + } else { down_write(&vdev->memory_lock); + }
/* * If the user is writing mem/io enable (new_mem/io) and we @@ -627,6 +629,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, *virt_cmd &= cpu_to_le16(~mask); *virt_cmd |= cpu_to_le16(new_cmd & mask);
+ if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); }
@@ -707,12 +711,16 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm) static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state) { - if (state >= PCI_D3hot) + if (state >= PCI_D3hot) { vfio_pci_zap_and_down_write_memory_lock(vdev); - else + vfio_pci_dma_buf_move(vdev, true); + } else { down_write(&vdev->memory_lock); + }
vfio_pci_set_power_state(vdev, state); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); }
@@ -900,7 +908,10 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) { vfio_pci_zap_and_down_write_memory_lock(vdev); + vfio_pci_dma_buf_move(vdev, true); pci_try_reset_function(vdev->pdev); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); } } @@ -982,7 +993,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) { vfio_pci_zap_and_down_write_memory_lock(vdev); + vfio_pci_dma_buf_move(vdev, true); pci_try_reset_function(vdev->pdev); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); } } diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 5512d13bb8899..e5ab5d1cafd9c 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -29,7 +29,9 @@ #include <linux/nospec.h> #include <linux/sched/mm.h> #include <linux/iommufd.h> +#ifdef CONFIG_VFIO_PCI_DMABUF #include <linux/pci-p2pdma.h> +#endif #if IS_ENABLED(CONFIG_EEH) #include <asm/eeh.h> #endif @@ -288,6 +290,8 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev, * semaphore. */ vfio_pci_zap_and_down_write_memory_lock(vdev); + vfio_pci_dma_buf_move(vdev, true); + if (vdev->pm_runtime_engaged) { up_write(&vdev->memory_lock); return -EINVAL; @@ -371,6 +375,8 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev) */ down_write(&vdev->memory_lock); __vfio_pci_runtime_pm_exit(vdev); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); }
@@ -691,6 +697,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev) #endif vfio_pci_core_disable(vdev);
+ vfio_pci_dma_buf_cleanup(vdev); + mutex_lock(&vdev->igate); if (vdev->err_trigger) { eventfd_ctx_put(vdev->err_trigger); @@ -1223,7 +1231,10 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev, */ vfio_pci_set_power_state(vdev, PCI_D0);
+ vfio_pci_dma_buf_move(vdev, true); ret = pci_try_reset_function(vdev->pdev); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock);
return ret; @@ -1512,6 +1523,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, return vfio_pci_core_pm_exit(vdev, flags, arg, argsz); case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN: return vfio_pci_core_feature_token(vdev, flags, arg, argsz); + case VFIO_DEVICE_FEATURE_DMA_BUF: + return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz); default: return -ENOTTY; } @@ -2088,9 +2101,13 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev) INIT_LIST_HEAD(&vdev->dummy_resources_list); INIT_LIST_HEAD(&vdev->ioeventfds_list); INIT_LIST_HEAD(&vdev->sriov_pfs_item); +#ifdef CONFIG_VFIO_PCI_DMABUF vdev->provider = pci_p2pdma_enable(vdev->pdev); if (IS_ERR(vdev->provider)) return PTR_ERR(vdev->provider); + + INIT_LIST_HEAD(&vdev->dmabufs); +#endif init_rwsem(&vdev->memory_lock); xa_init(&vdev->ctx);
@@ -2473,11 +2490,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, * cause the PCI config space reset without restoring the original * state (saved locally in 'vdev->pm_save'). */ - list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) + list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) { + vfio_pci_dma_buf_move(vdev, true); vfio_pci_set_power_state(vdev, PCI_D0); + }
ret = pci_reset_bus(pdev);
+ list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); + vdev = list_last_entry(&dev_set->device_list, struct vfio_pci_core_device, vdev.dev_set_list);
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c new file mode 100644 index 0000000000000..5fefcdecd1329 --- /dev/null +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -0,0 +1,321 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. + */ +#include <linux/dma-buf.h> +#include <linux/pci-p2pdma.h> +#include <linux/dma-resv.h> + +#include "vfio_pci_priv.h" + +MODULE_IMPORT_NS("DMA_BUF"); + +struct vfio_pci_dma_buf { + struct dma_buf *dmabuf; + struct vfio_pci_core_device *vdev; + struct list_head dmabufs_elm; + struct phys_vec phys_vec; + u8 revoked : 1; +}; + +static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, + struct dma_buf_attachment *attachment) +{ + struct vfio_pci_dma_buf *priv = dmabuf->priv; + + if (!attachment->peer2peer) + return -EOPNOTSUPP; + + if (priv->revoked) + return -ENODEV; + + switch (pci_p2pdma_map_type(priv->vdev->provider, attachment->dev)) { + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: + break; + case PCI_P2PDMA_MAP_BUS_ADDR: + /* + * There is no need in IOVA at all for this flow. + * We rely on attachment->priv == NULL as a marker + * for this mode. 
+ */ + return 0; + default: + return -EINVAL; + } + + attachment->priv = kzalloc(sizeof(struct dma_iova_state), GFP_KERNEL); + if (!attachment->priv) + return -ENOMEM; + + dma_iova_try_alloc(attachment->dev, attachment->priv, 0, priv->phys_vec.len); + return 0; +} + +static void vfio_pci_dma_buf_detach(struct dma_buf *dmabuf, + struct dma_buf_attachment *attachment) +{ + kfree(attachment->priv); +} + +static void fill_sg_entry(struct scatterlist *sgl, unsigned int length, + dma_addr_t addr) +{ + sg_set_page(sgl, NULL, length, 0); + sg_dma_address(sgl) = addr; + sg_dma_len(sgl) = length; +} + +static struct sg_table * +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment, + enum dma_data_direction dir) +{ + struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv; + struct p2pdma_provider *provider = priv->vdev->provider; + struct dma_iova_state *state = attachment->priv; + struct phys_vec *phys_vec = &priv->phys_vec; + struct scatterlist *sgl; + struct sg_table *sgt; + dma_addr_t addr; + int ret; + + dma_resv_assert_held(priv->dmabuf->resv); + + sgt = kzalloc(sizeof(*sgt), GFP_KERNEL); + if (!sgt) + return ERR_PTR(-ENOMEM); + + ret = sg_alloc_table(sgt, 1, GFP_KERNEL | __GFP_ZERO); + if (ret) + goto err_kfree_sgt; + + sgl = sgt->sgl; + + if (!state) { + addr = pci_p2pdma_bus_addr_map(provider, phys_vec->paddr); + } else if (dma_use_iova(state)) { + ret = dma_iova_link(attachment->dev, state, phys_vec->paddr, 0, + phys_vec->len, dir, DMA_ATTR_SKIP_CPU_SYNC); + if (ret) + goto err_free_table; + + ret = dma_iova_sync(attachment->dev, state, 0, phys_vec->len); + if (ret) + goto err_unmap_dma; + + addr = state->addr; + } else { + addr = dma_map_phys(attachment->dev, phys_vec->paddr, + phys_vec->len, dir, DMA_ATTR_SKIP_CPU_SYNC); + ret = dma_mapping_error(attachment->dev, addr); + if (ret) + goto err_free_table; + } + + fill_sg_entry(sgl, phys_vec->len, addr); + return sgt; + +err_unmap_dma: + dma_iova_destroy(attachment->dev, state, phys_vec->len, dir, + 
DMA_ATTR_SKIP_CPU_SYNC); +err_free_table: + sg_free_table(sgt); +err_kfree_sgt: + kfree(sgt); + return ERR_PTR(ret); +} + +static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment, + struct sg_table *sgt, + enum dma_data_direction dir) +{ + struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv; + struct dma_iova_state *state = attachment->priv; + struct scatterlist *sgl; + int i; + + if (!state) + ; /* Do nothing */ + else if (dma_use_iova(state)) + dma_iova_destroy(attachment->dev, state, priv->phys_vec.len, + dir, DMA_ATTR_SKIP_CPU_SYNC); + else + for_each_sgtable_dma_sg(sgt, sgl, i) + dma_unmap_phys(attachment->dev, sg_dma_address(sgl), + sg_dma_len(sgl), dir, + DMA_ATTR_SKIP_CPU_SYNC); + + sg_free_table(sgt); + kfree(sgt); +} + +static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) +{ + struct vfio_pci_dma_buf *priv = dmabuf->priv; + + /* + * Either this or vfio_pci_dma_buf_cleanup() will remove from the list. + * The refcount prevents both. + */ + if (priv->vdev) { + down_write(&priv->vdev->memory_lock); + list_del_init(&priv->dmabufs_elm); + up_write(&priv->vdev->memory_lock); + vfio_device_put_registration(&priv->vdev->vdev); + } + kfree(priv); +} + +static const struct dma_buf_ops vfio_pci_dmabuf_ops = { + .attach = vfio_pci_dma_buf_attach, + .detach = vfio_pci_dma_buf_detach, + .map_dma_buf = vfio_pci_dma_buf_map, + .release = vfio_pci_dma_buf_release, + .unmap_dma_buf = vfio_pci_dma_buf_unmap, +}; + +static void dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv, + struct vfio_device_feature_dma_buf *dma_buf) +{ + struct pci_dev *pdev = priv->vdev->pdev; + + priv->phys_vec.len = dma_buf->length; + priv->phys_vec.paddr = pci_resource_start(pdev, dma_buf->region_index); + priv->phys_vec.paddr += dma_buf->offset; +} + +static int validate_dmabuf_input(struct vfio_pci_core_device *vdev, + struct vfio_device_feature_dma_buf *dma_buf) +{ + struct pci_dev *pdev = vdev->pdev; + u32 bar = dma_buf->region_index; + u64 offset = 
dma_buf->offset; + u64 len = dma_buf->length; + resource_size_t bar_size; + u64 sum; + + /* + * For PCI the region_index is the BAR number like everything else. + */ + if (bar >= VFIO_PCI_ROM_REGION_INDEX) + return -ENODEV; + + if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM)) + return -EINVAL; + + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) + return -EINVAL; + + bar_size = pci_resource_len(pdev, bar); + if (check_add_overflow(offset, len, &sum) || sum > bar_size) + return -EINVAL; + + return 0; +} + +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf __user *arg, + size_t argsz) +{ + struct vfio_device_feature_dma_buf get_dma_buf = {}; + DEFINE_DMA_BUF_EXPORT_INFO(exp_info); + struct vfio_pci_dma_buf *priv; + int ret; + + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET, + sizeof(get_dma_buf)); + if (ret != 1) + return ret; + + if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf))) + return -EFAULT; + + ret = validate_dmabuf_input(vdev, &get_dma_buf); + if (ret) + return ret; + + priv = kzalloc(sizeof(*priv), GFP_KERNEL); + if (!priv) + return -ENOMEM; + + priv->vdev = vdev; + dma_ranges_to_p2p_phys(priv, &get_dma_buf); + + if (!vfio_device_try_get_registration(&vdev->vdev)) { + ret = -ENODEV; + goto err_free_priv; + } + + exp_info.ops = &vfio_pci_dmabuf_ops; + exp_info.size = priv->phys_vec.len; + exp_info.flags = get_dma_buf.open_flags; + exp_info.priv = priv; + + priv->dmabuf = dma_buf_export(&exp_info); + if (IS_ERR(priv->dmabuf)) { + ret = PTR_ERR(priv->dmabuf); + goto err_dev_put; + } + + /* dma_buf_put() now frees priv */ + INIT_LIST_HEAD(&priv->dmabufs_elm); + down_write(&vdev->memory_lock); + dma_resv_lock(priv->dmabuf->resv, NULL); + priv->revoked = !__vfio_pci_memory_enabled(vdev); + list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs); + dma_resv_unlock(priv->dmabuf->resv); + up_write(&vdev->memory_lock); + + /* + * dma_buf_fd() consumes the reference, when the 
file closes the dmabuf + * will be released. + */ + return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags); + +err_dev_put: + vfio_device_put_registration(&vdev->vdev); +err_free_priv: + kfree(priv); + return ret; +} + +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) +{ + struct vfio_pci_dma_buf *priv; + struct vfio_pci_dma_buf *tmp; + + lockdep_assert_held_write(&vdev->memory_lock); + + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { + if (!get_file_active(&priv->dmabuf->file)) + continue; + + if (priv->revoked != revoked) { + dma_resv_lock(priv->dmabuf->resv, NULL); + priv->revoked = revoked; + dma_buf_move_notify(priv->dmabuf); + dma_resv_unlock(priv->dmabuf->resv); + } + dma_buf_put(priv->dmabuf); + } +} + +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) +{ + struct vfio_pci_dma_buf *priv; + struct vfio_pci_dma_buf *tmp; + + down_write(&vdev->memory_lock); + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { + if (!get_file_active(&priv->dmabuf->file)) + continue; + + dma_resv_lock(priv->dmabuf->resv, NULL); + list_del_init(&priv->dmabufs_elm); + priv->vdev = NULL; + priv->revoked = true; + dma_buf_move_notify(priv->dmabuf); + dma_resv_unlock(priv->dmabuf->resv); + vfio_device_put_registration(&vdev->vdev); + dma_buf_put(priv->dmabuf); + } + up_write(&vdev->memory_lock); +} diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index a9972eacb2936..28a405f8b97c9 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -107,4 +107,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev) return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; }
+#ifdef CONFIG_VFIO_PCI_DMABUF +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf __user *arg, + size_t argsz); +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev); +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked); +#else +static inline int +vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf __user *arg, + size_t argsz) +{ + return -ENOTTY; +} +static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) +{ +} +static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, + bool revoked) +{ +} +#endif + #endif diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index d58e329ac0e71..f14b413aae48d 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -483,6 +483,7 @@ struct dma_buf_attach_ops { * @dev: device attached to the buffer. * @node: list of dma_buf_attachment, protected by dma_resv lock of the dmabuf. * @peer2peer: true if the importer can handle peer resources without pages. + * #state: DMA structure to provide support for physical addresses DMA interface * @priv: exporter specific attachment data. * @importer_ops: importer operations for this attachment, if provided * dma_buf_map/unmap_attachment() must be called with the dma_resv lock held. diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index b017fae251811..548cbb51bf146 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -94,7 +94,10 @@ struct vfio_pci_core_device { struct vfio_pci_core_device *sriov_pf_core_dev; struct notifier_block nb; struct rw_semaphore memory_lock; +#ifdef CONFIG_VFIO_PCI_DMABUF struct p2pdma_provider *provider; + struct list_head dmabufs; +#endif };
/* Will be exported for vfio pci drivers usage */ diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 5764f315137f9..ad8e303697f97 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -1468,6 +1468,25 @@ struct vfio_device_feature_bus_master { }; #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
+/** + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the + * regions selected. + * + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC, + * etc. offset/length specify a slice of the region to create the dmabuf from. + * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf. + * + * Return: The fd number on success, -1 and errno is set on failure. + */ +#define VFIO_DEVICE_FEATURE_DMA_BUF 11 + +struct vfio_device_feature_dma_buf { + __u32 region_index; + __u32 open_flags; + __u64 offset; + __u64 length; +}; + /* -------- API for Type1 VFIO IOMMU -------- */
/**
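As an illustration of how user space might prepare the request for the uAPI proposed above, here is a hedged sketch. The SKETCH_ and sketch_ names are hypothetical local mirrors, not the real header definitions: the layout assumes the generic feature header (argsz, flags) is followed directly by the fields of struct vfio_device_feature_dma_buf, SKETCH_FEATURE_GET is assumed to match VFIO_DEVICE_FEATURE_GET, and a 4 KiB page size is assumed for the alignment pre-check that mirrors validate_dmabuf_input().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>

/* Sketch-local mirrors of the proposed uAPI; these names are hypothetical. */
#define SKETCH_FEATURE_GET	(1u << 16)	/* assumed VFIO_DEVICE_FEATURE_GET */
#define SKETCH_FEATURE_DMA_BUF	11u		/* value proposed by this patch */
#define SKETCH_PAGE_SIZE	4096ull		/* assumed page size */

struct sketch_dma_buf_req {
	/* Generic feature header. */
	uint32_t argsz;
	uint32_t flags;
	/* Payload, in the order of struct vfio_device_feature_dma_buf. */
	uint32_t region_index;
	uint32_t open_flags;
	uint64_t offset;
	uint64_t length;
};

/* Fill a request for a slice of BAR 'bar'; returns 0 or -EINVAL. */
static int sketch_fill_req(struct sketch_dma_buf_req *req, uint32_t bar,
			   uint64_t offset, uint64_t length)
{
	assert(req != NULL);

	/* The kernel side rejects slices that are not page aligned. */
	if ((offset | length) & (SKETCH_PAGE_SIZE - 1))
		return -EINVAL;

	memset(req, 0, sizeof(*req));
	req->argsz = sizeof(*req);
	req->flags = SKETCH_FEATURE_GET | SKETCH_FEATURE_DMA_BUF;
	req->region_index = bar;
	req->open_flags = O_RDWR;
	req->offset = offset;
	req->length = length;
	return 0;
}
```

The filled request would then be handed to the VFIO_DEVICE_FEATURE ioctl on the device fd, which per the comment above returns the new dma-buf fd on success. That call is omitted here because it needs a real VFIO device.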
 
            On 2025-07-23 2:00 pm, Leon Romanovsky wrote: [...]
+static struct sg_table *
+vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
+		     enum dma_data_direction dir)
+{
+	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+	struct p2pdma_provider *provider = priv->vdev->provider;
+	struct dma_iova_state *state = attachment->priv;
+	struct phys_vec *phys_vec = &priv->phys_vec;
+	struct scatterlist *sgl;
+	struct sg_table *sgt;
+	dma_addr_t addr;
+	int ret;
+
+	dma_resv_assert_held(priv->dmabuf->resv);
+
+	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
+	if (!sgt)
+		return ERR_PTR(-ENOMEM);
+
+	ret = sg_alloc_table(sgt, 1, GFP_KERNEL | __GFP_ZERO);
+	if (ret)
+		goto err_kfree_sgt;
+
+	sgl = sgt->sgl;
+
+	if (!state) {
+		addr = pci_p2pdma_bus_addr_map(provider, phys_vec->paddr);
+	} else if (dma_use_iova(state)) {
+		ret = dma_iova_link(attachment->dev, state, phys_vec->paddr, 0,
+				    phys_vec->len, dir, DMA_ATTR_SKIP_CPU_SYNC);
The supposed benefits of this API are only for replacing scatterlists where multiple disjoint pages are being mapped. In this case with just one single contiguous mapping, it is clearly objectively worse to have to bounce in and out of the IOMMU layer 3 separate times and store a dma_map_state, to achieve the exact same operations that a single call to iommu_dma_map_resource() will perform more efficiently and with no external state required.
Oh yeah, and mapping MMIO with regular memory attributes (IOMMU_CACHE) rather than appropriate ones (IOMMU_MMIO), as this will end up doing, isn't guaranteed not to end badly either (e.g. if the system interconnect ends up merging consecutive write bursts and exceeding the target root port's MPS.)
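A minimal sketch of the attribute distinction raised here, as a user-space model rather than kernel code. The SKETCH_ flag values and the sketch_prot() helper are local to this example and purely illustrative; the kernel's IOMMU_* definitions are authoritative. It assumes that a non-MMIO target is coherent RAM and therefore gets the cacheable attribute, while an MMIO target gets the MMIO attribute and never the cacheable one.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch-local flag values; illustrative only. */
enum {
	SKETCH_IOMMU_READ  = 1 << 0,
	SKETCH_IOMMU_WRITE = 1 << 1,
	SKETCH_IOMMU_CACHE = 1 << 2,
	SKETCH_IOMMU_MMIO  = 1 << 4,
};

enum sketch_dir {
	SKETCH_BIDIRECTIONAL,
	SKETCH_TO_DEVICE,
	SKETCH_FROM_DEVICE,
};

/* Pick IOMMU protection bits for a mapping of RAM or of a device BAR. */
static int sketch_prot(enum sketch_dir dir, bool target_is_mmio)
{
	int prot = 0;

	switch (dir) {
	case SKETCH_BIDIRECTIONAL:
		prot = SKETCH_IOMMU_READ | SKETCH_IOMMU_WRITE;
		break;
	case SKETCH_TO_DEVICE:		/* device reads from the target */
		prot = SKETCH_IOMMU_READ;
		break;
	case SKETCH_FROM_DEVICE:	/* device writes to the target */
		prot = SKETCH_IOMMU_WRITE;
		break;
	default:
		assert(0 && "unknown direction");
	}

	/* A BAR must not be mapped with normal cacheable memory attributes. */
	prot |= target_is_mmio ? SKETCH_IOMMU_MMIO : SKETCH_IOMMU_CACHE;
	return prot;
}
```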
+		if (ret)
+			goto err_free_table;
+
+		ret = dma_iova_sync(attachment->dev, state, 0, phys_vec->len);
+		if (ret)
+			goto err_unmap_dma;
+
+		addr = state->addr;
+	} else {
+		addr = dma_map_phys(attachment->dev, phys_vec->paddr,
+				    phys_vec->len, dir, DMA_ATTR_SKIP_CPU_SYNC);
And again, if the IOMMU is in bypass (the idea of P2P with vfio-noiommu simply isn't worth entertaining) then what purpose do you imagine this call serves at all, other than to hilariously crash under "swiotlb=force"? Even in the case that phys_to_dma(phys_vec->paddr) != phys_vec->paddr, in almost all circumstances (both hardware offsets and CoCo environments with address-based aliasing), it is more likely than not that the latter is still the address you want and the former is wrong (and liable to lead to corruption or fatal system errors), because MMIO and memory remain fundamentally different things.
AFAICS you're *depending* on this call being an effective no-op, and thus only demonstrating that the dma_map_phys() idea is still entirely unnecessary.
+		ret = dma_mapping_error(attachment->dev, addr);
+		if (ret)
+			goto err_free_table;
+	}
+
+	fill_sg_entry(sgl, phys_vec->len, addr);
+	return sgt;
+
+err_unmap_dma:
+	dma_iova_destroy(attachment->dev, state, phys_vec->len, dir,
+			 DMA_ATTR_SKIP_CPU_SYNC);
+err_free_table:
+	sg_free_table(sgt);
+err_kfree_sgt:
+	kfree(sgt);
+	return ERR_PTR(ret);
+}
+
+static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
+				   struct sg_table *sgt,
+				   enum dma_data_direction dir)
+{
+	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+	struct dma_iova_state *state = attachment->priv;
+	struct scatterlist *sgl;
+	int i;
+
+	if (!state)
+		; /* Do nothing */
+	else if (dma_use_iova(state))
+		dma_iova_destroy(attachment->dev, state, priv->phys_vec.len,
+				 dir, DMA_ATTR_SKIP_CPU_SYNC);
+	else
+		for_each_sgtable_dma_sg(sgt, sgl, i)
The table always has exactly one entry...
Thanks, Robin.
+			dma_unmap_phys(attachment->dev, sg_dma_address(sgl),
+				       sg_dma_len(sgl), dir,
+				       DMA_ATTR_SKIP_CPU_SYNC);
+
+	sg_free_table(sgt);
+	kfree(sgt);
+}
 
            On Tue, Jul 29, 2025 at 08:44:21PM +0100, Robin Murphy wrote:
In this case with just one single contiguous mapping, it is clearly objectively worse to have to bounce in and out of the IOMMU layer 3 separate times and store a dma_map_state,
The non-contiguous mappings are coming back; they were in earlier drafts of this. Regardless, the point is to show how to use the general API that we would want to bring into the DRM drivers that don't have contiguity, even though VFIO is a bit special.
Oh yeah, and mapping MMIO with regular memory attributes (IOMMU_CACHE) rather than appropriate ones (IOMMU_MMIO), as this will end up doing, isn't guaranteed not to end badly either (e.g. if the system interconnect ends up merging consecutive write bursts and exceeding the target root port's MPS.)
Yes, I recently noticed this too; it should be fixed.
But so we are all on the same page, a lot of the PCI P2P systems are set up so P2P does not transit through the IOMMU. It either takes the ACS path through a switch or it uses ATS and takes a different ACS path through a switch. It only transits through the IOMMU in misconfigured systems or in the rarer case of P2P between root ports.
And again, if the IOMMU is in bypass (the idea of P2P with vfio-noiommu simply isn't worth entertaining)
Not quite. DMABUF is sort of upside down.
For example if we are exporting a DMABUF from VFIO and importing it to RDMA then RDMA will call VFIO to make an attachment and the above VFIO code will perform the DMA map to the RDMA struct device. DMABUF returns a dma mapped scatterlist back to the RDMA driver.
The above dma_map_phys(rdma_dev,...) can be in bypass because the RDMA device can legitimately be in bypass, or not have an IOMMU, or whatever.
AFAICS you're *depending* on this call being an effective no-op, and thus only demonstrating that the dma_map_phys() idea is still entirely unnecessary.
It should not be a full no-op, and it should be closer to dma_map_resource() to avoid the MMIO issues.
It should be failing for cases where it is not supported (i.e. swiotlb=force), it should still be calling the legacy dma_ops, and it should be undoing any CC mangling with the address. (Also, pci_p2pdma_bus_addr_map() needs to deal with any CC issues too.)
Jason
 
            On Tue, Jul 29, 2025 at 05:13:51PM -0300, Jason Gunthorpe wrote:
On Tue, Jul 29, 2025 at 08:44:21PM +0100, Robin Murphy wrote:
In this case with just one single contiguous mapping, it is clearly objectively worse to have to bounce in and out of the IOMMU layer 3 separate times and store a dma_map_state,
The non-contiguous mappings are coming back; they were in earlier drafts of this. Regardless, the point is to show how to use the general API that we would want to bring into the DRM drivers that don't have contiguity, even though VFIO is a bit special.
Yes, DMA ranges will come back in v2.
Thanks
 
            On 2025-07-29 9:13 pm, Jason Gunthorpe wrote:
On Tue, Jul 29, 2025 at 08:44:21PM +0100, Robin Murphy wrote:
In this case with just one single contiguous mapping, it is clearly objectively worse to have to bounce in and out of the IOMMU layer 3 separate times and store a dma_map_state,
The non-contiguous mappings are coming back; they were in earlier drafts of this. Regardless, the point is to show how to use the general API that we would want to bring into the DRM drivers that don't have contiguity, even though VFIO is a bit special.
Oh yeah, and mapping MMIO with regular memory attributes (IOMMU_CACHE) rather than appropriate ones (IOMMU_MMIO), as this will end up doing, isn't guaranteed not to end badly either (e.g. if the system interconnect ends up merging consecutive write bursts and exceeding the target root port's MPS.)
Yes, I recently noticed this too; it should be fixed.
But so we are all on the same page, a lot of the PCI P2P systems are set up so P2P does not transit through the IOMMU. It either takes the ACS path through a switch or it uses ATS and takes a different ACS path through a switch. It only transits through the IOMMU in misconfigured systems or in the rarer case of P2P between root ports.
For non-ATS (and ATS Untranslated traffic), my understanding is that we rely on ACS upstream redirect to send transactions all the way up to the root port for translation (and without that then they are indeed pure bus addresses, take the pci_p2pdma_bus_addr_map() case, and the rest of this is all irrelevant). In Arm system terms, simpler root ports may well have to run that traffic out to an external SMMU TBU, at which point any P2P would loop back externally through the memory space window in the system interconnect PA space, as opposed to DTI-ATS root complexes that effectively implement their own internal translation agent on the PCIe side. Thus on some systems, even P2P behind a single root port may end up looking functionally the same as the cross-RP case, but in general cross-RP *is* something that people seem to care about as well. We're seeing more and more systems where each slot has its own RP as a separate segment, rather than giant root complexes with a host bridge and everyone on one big happy root bus together.
And again, if the IOMMU is in bypass (the idea of P2P with vfio-noiommu simply isn't worth entertaining)
Not quite. DMABUF is sort of upside down.
For example if we are exporting a DMABUF from VFIO and importing it to RDMA then RDMA will call VFIO to make an attachment and the above VFIO code will perform the DMA map to the RDMA struct device. DMABUF returns a dma mapped scatterlist back to the RDMA driver.
The above dma_map_phys(rdma_dev,...) can be in bypass because the RDMA device can legitimately be in bypass, or not have an IOMMU, or whatever.
I understand how dma-buf works - obviously DMA mapping for the VFIO device itself while it's not even attached to its default domain would be silly. I mean that any system that has 64-bit coherent PCIe behind an IOMMU such that this VFIO exporter could exist, is realistically going to have the same (or equivalent) IOMMU in front of any potential importers as well. *Especially* if you expect the normal case for P2P to be within a single hierarchy. Thus I was simply commenting that IOMMU_DOMAIN_IDENTITY is the *only* realistic reason to actually expect to interact with dma-direct here.
But of course, if it's not dma-direct because we're on POWER with TCE, rather than VFIO Type1 implying an iommu-dma/dma-direct arch, then who knows? I imagine the complete absence of any mention means this hasn't been tried, or possibly even considered?
AFAICS you're *depending* on this call being an effective no-op, and thus only demonstrating that the dma_map_phys() idea is still entirely unnecessary.
It should not be a full no-op, and it should be closer to dma_map_resource() to avoid the MMIO issues.
I don't get what you mean by "not be a full no-op", can you clarify exactly what you think it should be doing? Even if it's just the dma_capable() mask check equivalent to dma_direct_map_resource(), you don't actually want that here either - in that case you'd want to fail the entire attachment to begin with since it can never work.
It should be failing for cases where it is not supported (i.e. swiotlb=force), it should still be calling the legacy dma_ops, and it should be undoing any CC mangling with the address. (Also, pci_p2pdma_bus_addr_map() needs to deal with any CC issues too.)
Um, my whole point is that the "legacy DMA ops" cannot be called, because they still assume page-backed memory, so at best are guaranteed to fail; any "CC mangling" assumed for memory is most likely wrong for MMIO, and there simply is no "deal with" at this point.
A device BAR is simply not under control of the trusted hypervisor the same way memory is; whatever (I/G)PA it is at must already be the correct address, if the aliasing scheme even applies at all. Sticking to Arm CCA terminology for example, if a device in shared state tries to import a BAR from a device in locked/private state, there is no notion of touching the shared alias and hoping it somehow magically works (at best it might throw the exporting device into TDISP error state terminally); that attachment simply cannot be allowed. If a shared resource exists in the shared IPA space to begin with, dma_to_phys() will do the wrong thing, and even phys_to_dma() would technically not walk dma_range_map correctly, because both assume "phys" represents kernel memory. However it's also all moot since any attempt at any combination will fail anyway due to SWIOTLB being forced by is_realm_world().
(OK, I admit "crash" wasn't strictly the right word to use there - I keep forgetting that some of the P2P scatterlist support in dma-direct ended up affecting the map_page path too, even though that was never really the functional intent - but hey, the overall result of failing to work as expected is the same.)
Thanks, Robin.
 
            On Wed, Jul 30, 2025 at 03:49:45PM +0100, Robin Murphy wrote:
On 2025-07-29 9:13 pm, Jason Gunthorpe wrote:
On Tue, Jul 29, 2025 at 08:44:21PM +0100, Robin Murphy wrote:
In this case with just one single contiguous mapping, it is clearly objectively worse to have to bounce in and out of the IOMMU layer 3 separate times and store a dma_map_state,
The non-contiguous mappings are coming back; they were in earlier drafts of this. Regardless, the point is to show how to use the general API that we would want to bring into the DRM drivers that don't have contiguity, even though VFIO is a bit special.
Oh yeah, and mapping MMIO with regular memory attributes (IOMMU_CACHE) rather than appropriate ones (IOMMU_MMIO), as this will end up doing, isn't guaranteed not to end badly either (e.g. if the system interconnect ends up merging consecutive write bursts and exceeding the target root port's MPS.)
Yes, I recently noticed this too; it should be fixed.
But so we are all on the same page, a lot of the PCI P2P systems are set up so P2P does not transit through the IOMMU. It either takes the ACS path through a switch or it uses ATS and takes a different ACS path through a switch. It only transits through the IOMMU in misconfigured systems or in the rarer case of P2P between root ports.
For non-ATS (and ATS Untranslated traffic), my understanding is that we rely on ACS upstream redirect to send transactions all the way up to the root port for translation (and without that then they are indeed pure bus addresses, take the pci_p2pdma_bus_addr_map() case,
My point is it is common for real systems to take the pci_p2pdma_bus_addr_map() path. Going through the RP is too slow.
all irrelevant). In Arm system terms, simpler root ports may well have to run that traffic out to an external SMMU TBU, at which point any P2P would loop back externally through the memory space window in the system
Many real systems simply don't support this at all :(
But of course, if it's not dma-direct because we're on POWER with TCE, rather than VFIO Type1 implying an iommu-dma/dma-direct arch, then who knows? I imagine the complete absence of any mention means this hasn't been tried, or possibly even considered?
POWER uses dma_ops and the point of this design is that dma_map_phys() will still call the dma_ops. See below.
I don't get what you mean by "not be a full no-op", can you clarify exactly what you think it should be doing? Even if it's just the dma_capable() mask check equivalent to dma_direct_map_resource(), you don't actually want that here either - in that case you'd want to fail the entire attachment to begin with since it can never work.
The expectation would be if the dma mapping can't succeed then the phys map should fail. So if dma_capable() or whatever is not OK then fail inside the loop and unwind back to failing the whole attach.
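A minimal userspace sketch of that expectation, assuming a hypothetical per-range mapper: each range is checked against an addressing limit (standing in for a dma_capable()-style test), and the first failure unwinds everything mapped so far so the whole attach fails. All names here (`sketch_range`, `sketch_map_one()`, `sketch_attach_map_all()`) are invented for illustration and are not kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAP_ERROR UINT64_MAX

struct sketch_range {
	uint64_t phys;	/* start of the MMIO range */
	size_t len;	/* length in bytes */
	uint64_t dma;	/* filled in on success */
	int mapped;	/* 1 while the range is mapped */
};

/* Identity-map one range if it fits under dma_mask, else report an error. */
static uint64_t sketch_map_one(uint64_t dma_mask, uint64_t phys, size_t len)
{
	if (len == 0 || phys > dma_mask || len - 1 > dma_mask - phys)
		return SKETCH_MAP_ERROR;
	return phys;
}

static void sketch_unmap_one(struct sketch_range *r)
{
	r->dma = 0;
	r->mapped = 0;
}

/* Returns 0 on success. On failure returns -1 and leaves nothing mapped. */
static int sketch_attach_map_all(uint64_t dma_mask, struct sketch_range *r, size_t n)
{
	size_t i;

	assert(r != NULL || n == 0);

	for (i = 0; i < n; i++) {
		uint64_t dma = sketch_map_one(dma_mask, r[i].phys, r[i].len);

		if (dma == SKETCH_MAP_ERROR)
			goto unwind;
		r[i].dma = dma;
		r[i].mapped = 1;
	}
	return 0;

unwind:
	/* Undo the ranges mapped before the failing one, newest first. */
	while (i-- > 0)
		sketch_unmap_one(&r[i]);
	return -1;
}
```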
It should be failing for cases where it is not supported (ie swiotlb=force), it should still be calling the legacy dma_ops, and it should be undoing any CC mangling with the address. (also the pci_p2pdma_bus_addr_map() needs to deal with any CC issues too)
Um, my whole point is that the "legacy DMA ops" cannot be called, because they still assume page-backed memory, so at best are guaranteed to fail; any "CC mangling" assumed for memory is most likely wrong for MMIO, and there simply is no "deal with" at this point.
I think we all agreed it should use the resource path. So legacy DMA ops, including POWER, should end up calling
struct dma_map_ops {
	dma_addr_t (*map_resource)(struct device *dev, phys_addr_t phys_addr,
				   size_t size, enum dma_data_direction dir,
				   unsigned long attrs);
And if that is NULL it should fail.
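The dispatch rule described here can be modelled in a few lines of userspace C. This is only a sketch: `sketch_map_ops` is a simplified stand-in that keeps a single hook with a reduced signature, and `sketch_map_mmio()` and `sketch_offset_map_resource()` are hypothetical helpers, not functions from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAPPING_ERROR UINT64_MAX

/* Simplified model of an ops table; only the resource hook is kept. */
struct sketch_map_ops {
	uint64_t (*map_resource)(uint64_t phys_addr, size_t size);
};

/* Map MMIO through the resource hook, failing if the backend has none. */
static uint64_t sketch_map_mmio(const struct sketch_map_ops *ops,
				uint64_t phys_addr, size_t size)
{
	assert(ops != NULL);

	if (!ops->map_resource)
		return SKETCH_MAPPING_ERROR;
	return ops->map_resource(phys_addr, size);
}

/* Example backend: an arbitrary fixed offset, purely for illustration. */
static uint64_t sketch_offset_map_resource(uint64_t phys_addr, size_t size)
{
	(void)size;
	return phys_addr + 0x1000;
}
```

The page-based hooks are never consulted in this model, which matches the objection that they assume page-backed memory.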
A device BAR is simply not under control of the trusted hypervisor the same way memory is;
I'm not sure what you mean? I think it is, at least for CC I expect ACS to be set up to force translation and this squarely puts access to the MMIO BAR under control of the S2 translation.
In ARM terms I expect that the RMM's S2 will contain the MMIO BAR at the shared IPA (ie top bit set), which will match where the CPU should access it? Linux's IOMMU S2 should mirror this and put the MMIO BAR at the shared IPA. Meaning upon locking the MMIO phys_addr_t effectively moves?
At least I would be surprised to hear that shared MMIO was placed in the private IPA space??
Outside CC we do have a rare configuration where the ACS is not forcing translation and then your remarks are true. Hypervisor must enforce IPA == GPA == bus addr. It's a painful configuration to make work.
Sticking to Arm CCA terminology for example, if a device in shared state tries to import a BAR from a device in locked/private state, there is no notion of touching the shared alias and hoping it somehow magically works (at best it might throw the exporting device into TDISP error state terminally);
Right, we don't support T=1 DMA yet, or locked devices, but when we do the p2pdma layer needs to be fixed up to catch this and reject it.
I think it is pretty easy, the p2pdma_provider struct can record if the exporting struct device has shared or private MMIO. Then when doing the mapping we require that private MMIO be accessed from T=1.
This should be addressed as part of enabling PCI T=1 support, eg in ARM terms along with Aneesh's series "ARM CCA Device Assignment support"
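The check proposed above could look roughly like the following userspace sketch. The `private_mmio` and `trusted` fields are hypothetical: the thread only suggests that the provider could record this, so nothing here should be read as an existing field of the p2pdma provider structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical exporter-side record: is the BAR in private (locked) state? */
struct sketch_provider {
	bool private_mmio;
};

/* Hypothetical importer-side record: can this device issue T=1 DMA? */
struct sketch_importer {
	bool trusted;
};

/*
 * Private MMIO may only be reached by a T=1 importer. Shared MMIO is
 * allowed for any importer here; the thread does not discuss restricting
 * that case, so this is an assumption of the sketch.
 */
static bool sketch_p2p_allowed(const struct sketch_provider *p,
			       const struct sketch_importer *i)
{
	assert(p != NULL && i != NULL);

	if (p->private_mmio && !i->trusted)
		return false;
	return true;
}
```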
simply cannot be allowed. If a shared resource exists in the shared IPA space to begin with, dma_to_phys() will do the wrong thing, and even phys_to_dma() would technically not walk dma_range_map correctly, because both assume "phys" represents kernel memory.
As above for CC I am expecting that translation will always be required. The S2 in both the RMM and hypervisor SMMUs should both have shared accessibility for whatever phys_addr the CPU is using.
So phys_to_dma() just needs to return the normal CPU phys_addr_t to work, and this looks believable to me. ARM forces the shared IPA through dma_addr_unencrypted(), but it is already wrong for the core code to call that function for "encrypted" MMIO.
Not sure about the ranges or dma_to_phys(), I doubt anyone has ever tested this so it probably doesn't work - but I don't see anything architecturally catastrophic here, just some bugs.
However it's also all moot since any attempt at any combination will fail anyway due to SWIOTLB being forced by is_realm_world().
Yep.
Basically P2P for ARM CCA today needs some bug fixing and testing - not surprising. ARM CCA is already rare, and even we don't use P2P under any CC architecture today.
I'm sure it will be fixed as a separate work, at least we will soon care about P2P on ARM CCA working.
Regardless, from a driver perspective none of the CC detail should leak into VFIO. The P2P APIs and the DMA APIs are the right place to abstract it away, and yes they probably fail to do so right now.
I'm guessing that if DMA_ATTR_MMIO is agreed then a DMA_ATTR_MMIO_ENCRYPTED would be the logical step. That should provide enough detail that the DMA API can compute correct addressing.
Maybe this whole discussion improves the case for DMA_ATTR_MMIO.
Jason
 
            On Wed, 23 Jul 2025 16:00:01 +0300 Leon Romanovsky leon@kernel.org wrote:
From: Leon Romanovsky leonro@nvidia.com
Based on blk and DMA patches which will be sent during coming merge window.
This series extends the VFIO PCI subsystem to support exporting MMIO regions from PCI device BARs as dma-buf objects, enabling safe sharing of non-struct page memory with controlled lifetime management. This allows RDMA and other subsystems to import dma-buf FDs and build them into memory regions for PCI P2P operations.
The series supports a use case for SPDK where a NVMe device will be owned by SPDK through VFIO but interacting with a RDMA device. The RDMA device may directly access the NVMe CMB or directly manipulate the NVMe device's doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with VFIO. This dmabuf approach can be usable by iommufd as well for generic and safe P2P mappings.
I think this will eventually enable DMA mapping of device MMIO through an IOMMUFD IOAS for the VM P2P use cases, right? How do we get from what appears to be a point-to-point mapping between two devices to a shared IOVA between multiple devices? I'm guessing we need IOMMUFD to support something like IOMMU_IOAS_MAP_FILE for dma-buf, but I can't connect all the dots. Thanks,
Alex
 
            On Wed, Jul 30, 2025 at 01:58:46PM -0600, Alex Williamson wrote:
On Wed, 23 Jul 2025 16:00:01 +0300 Leon Romanovsky leon@kernel.org wrote:
From: Leon Romanovsky leonro@nvidia.com
Based on blk and DMA patches which will be sent during coming merge window.
This series extends the VFIO PCI subsystem to support exporting MMIO regions from PCI device BARs as dma-buf objects, enabling safe sharing of non-struct page memory with controlled lifetime management. This allows RDMA and other subsystems to import dma-buf FDs and build them into memory regions for PCI P2P operations.
The series supports a use case for SPDK where a NVMe device will be owned by SPDK through VFIO but interacting with a RDMA device. The RDMA device may directly access the NVMe CMB or directly manipulate the NVMe device's doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with VFIO. This dmabuf approach can be usable by iommufd as well for generic and safe P2P mappings.
I think this will eventually enable DMA mapping of device MMIO through an IOMMUFD IOAS for the VM P2P use cases, right?
This is the plan
How do we get from what appears to be a point-to-point mapping between two devices to a shared IOVA between multiple devices?
You have it right below, it is a point to point mapping between the vfio device and the iommufd.
I'm guessing we need IOMMUFD to support something like IOMMU_IOAS_MAP_FILE for dma-buf,
1) The dma phys series which needs more work
2) This series to get basic 'movable' DMABUF support in VFIO
3) Add 'revokable' as a DMABUF concept and implement it with mlx5 and vfio
4) Add some way to get the phys_addr list from the DMABUF
5) IOMMU_IOAS_MAP_FILE using a revokable attachment and the phys_addr list. When VFIO does FLR the iommufd can remove the IOPTEs and then put them back when FLR is done.
It is not so much more code, but I think every step will take a lot of work to get agreements.
Then we reuse all of the above with some tweaks for the CC problems too.
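To make the last step of that plan concrete, here is a small userspace model of a revocable attachment: the importer holds a physical address list, drops its installed entries when the exporter starts a reset, and reinstalls them when the reset finishes. Every name (`sketch_attachment`, `sketch_reset_begin()`, `sketch_reset_done()`) is hypothetical; the real interfaces are still to be agreed, as the plan itself says.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_RANGES 4

struct sketch_attachment {
	uint64_t phys[SKETCH_MAX_RANGES];	/* phys_addr list from the exporter */
	size_t nr;				/* valid entries in phys[] */
	bool revoked;				/* true while the exporter is in reset */
	size_t nr_ioptes;			/* entries currently installed */
};

/* Bring the installed entries in line with the current revoke state. */
static void sketch_sync_ioptes(struct sketch_attachment *a)
{
	assert(a != NULL && a->nr <= SKETCH_MAX_RANGES);
	a->nr_ioptes = a->revoked ? 0 : a->nr;
}

/* Exporter begins a reset: the importer must stop using the mapping. */
static void sketch_reset_begin(struct sketch_attachment *a)
{
	a->revoked = true;
	sketch_sync_ioptes(a);
}

/* Reset finished: the importer may put its entries back. */
static void sketch_reset_done(struct sketch_attachment *a)
{
	a->revoked = false;
	sketch_sync_ioptes(a);
}
```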
Jason