 
On Tue, Jul 29, 2025 at 02:54:13PM -0600, Logan Gunthorpe wrote:
On 2025-07-28 17:11, Jason Gunthorpe wrote:
If the dma mapping for P2P memory doesn't need to create an iommu mapping then that's fine. But it should be the dma-iommu layer to decide that.
So above, we can't use dma-iommu.c, it might not be compiled into the kernel but the dma_map_phys() path is still valid.
This is an easily solved problem. I did a very rough sketch below to say it's really not that hard. (Note it has some rough edges that could be cleaned up and I based it off Leon's git repo which appears to not be the same as what was posted, but the core concept is sound).
I did hope for something like this in the early days, but it proved not so easy to get agreements on details :(
My feeling was we should get some actual examples of using this thing and then it is far easier to discuss ideas, like yours here, to improve it. Many of the discussions got confused because there was not enough actual user code for everyone to refer to.
For instance the nvme use case is a big driver for the API design, and it is quite different from these simpler flows. This idea needs to be checked against how it would work there.
Maybe this idea could also have provider = NULL meaning it is CPU cacheable memory?
+static inline void dma_iova_try_alloc_p2p(struct p2pdma_provider *provider,
+		struct device *dev, struct dma_iova_state *state, phys_addr_t phys,
+		size_t size)
+{
+}
Can't be empty - PCI_P2PDMA_MAP_THRU_HOST_BRIDGE vs PCI_P2PDMA_MAP_BUS_ADDR still matters so it still must set dma_iova_state::bus_addr to get dma_map_phys_prealloc() to do the right thing.
Still, it would make sense to put something like that in dma/mapping.c and rely on the static inline stub for dma_iova_try_alloc().
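To make the point about the non-empty stub concrete, here is a minimal userspace sketch, not kernel code. Every `model_*` / `MODEL_*` name is hypothetical and only mirrors the roles of PCI_P2PDMA_MAP_BUS_ADDR, PCI_P2PDMA_MAP_THRU_HOST_BRIDGE and dma_iova_state::bus_addr from the discussion. It also assumes the compiled-out dma_iova_try_alloc() stub simply reports that no IOVA was allocated.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two P2P map types discussed above. */
enum model_map_type {
	MODEL_MAP_BUS_ADDR,
	MODEL_MAP_THRU_HOST_BRIDGE,
};

/* Hypothetical stand-in for struct dma_iova_state with the bus_addr flag. */
struct model_state {
	uint64_t addr;
	uint64_t size;
	bool bus_addr;
};

/* Models the stub used when no IOMMU DMA support is built: never allocates. */
static inline bool model_iova_try_alloc(struct model_state *st, uint64_t phys,
					size_t size)
{
	(void)st;
	(void)phys;
	(void)size;
	return false;
}

/*
 * Even without IOVA support the helper still records the map type, so a
 * later mapping call can choose between the bus address and dma_map_phys().
 */
static inline void model_try_alloc_p2p(enum model_map_type type,
				       struct model_state *st, uint64_t phys,
				       size_t size)
{
	st->addr = 0;
	st->size = 0;
	st->bus_addr = (type == MODEL_MAP_BUS_ADDR);
	if (st->bus_addr)
		return;		/* bus addresses never need an IOVA */

	/* Falls through to the stub; a false return leaves size at 0. */
	(void)model_iova_try_alloc(st, phys, size);
}
```

The only state that survives in the no-IOMMU case is the flag, which is exactly the part an empty body would lose.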
 	for (i = 0; i < priv->nr_ranges; i++) {
-		if (!state) {
-			addr = pci_p2pdma_bus_addr_map(provider,
-						       phys_vec[i].paddr);
-		} else if (dma_use_iova(state)) {
-			ret = dma_iova_link(attachment->dev, state,
-					    phys_vec[i].paddr, 0,
-					    phys_vec[i].len, dir, attrs);
-			if (ret)
-				goto err_unmap_dma;
-			mapped_len += phys_vec[i].len;
-		} else {
-			addr = dma_map_phys(attachment->dev, phys_vec[i].paddr,
-					    phys_vec[i].len, dir, attrs);
-			ret = dma_mapping_error(attachment->dev, addr);
-			if (ret)
-				goto err_unmap_dma;
-		}
+		addr = dma_map_phys_prealloc(attachment->dev, phys_vec[i].paddr,
+					     phys_vec[i].len, dir, attrs, state,
+					     provider);
There was a draft of something like this at some point. The DMA_MAPPING_USE_IOVA is a new twist though.
 #define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))

 struct dma_iova_state {
 	dma_addr_t addr;
 	u64 __size;
+	bool bus_addr;
 };
Growing this structure has been strongly pushed back on. This probably can be solved in some other way, a bitfield on size perhaps.
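As a rough illustration of the "bitfield on size" idea, here is a userspace sketch that keeps the state at two 64-bit words by stealing high bits of the size field. The struct, the flag name and the bit positions are all assumptions made for this example, not an existing or proposed kernel layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for a two-word state: address plus size-with-flags. */
struct iova_state_model {
	uint64_t addr;
	uint64_t size_and_flags;	/* low 62 bits: size, top 2 bits: flags */
};

/* Hypothetical flag bit; the position is an assumption for this sketch. */
#define MODEL_FLAG_BUS_ADDR	(1ULL << 62)
/* Reserve the top two bits for flags so the size can never alias them. */
#define MODEL_FLAG_MASK		(3ULL << 62)

static inline void model_set_size(struct iova_state_model *s, uint64_t size)
{
	assert((size & MODEL_FLAG_MASK) == 0);	/* size must fit below the flags */
	s->size_and_flags = (s->size_and_flags & MODEL_FLAG_MASK) | size;
}

static inline uint64_t model_size(const struct iova_state_model *s)
{
	return s->size_and_flags & ~MODEL_FLAG_MASK;
}

static inline void model_set_bus_addr(struct iova_state_model *s, bool on)
{
	if (on)
		s->size_and_flags |= MODEL_FLAG_BUS_ADDR;
	else
		s->size_and_flags &= ~MODEL_FLAG_BUS_ADDR;
}

static inline bool model_is_bus_addr(const struct iova_state_model *s)
{
	return (s->size_and_flags & MODEL_FLAG_BUS_ADDR) != 0;
}
```

The trade-off is that every reader of the size has to go through an accessor that masks the flag bits, in exchange for not growing the structure.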
+dma_addr_t dma_map_phys_prealloc(struct device *dev, phys_addr_t phys, size_t size,
+		enum dma_data_direction dir, unsigned long attrs,
+		struct dma_iova_state *state, struct p2pdma_provider *provider)
+{
+	int ret;
+
+	if (state->bus_addr)
+		return pci_p2pdma_bus_addr_map(provider, phys);
+
+	if (dma_use_iova(state)) {
+		ret = dma_iova_link(dev, state, phys, 0, size, dir, attrs);
+		if (ret)
+			return DMA_MAPPING_ERROR;
+		return DMA_MAPPING_USE_IOVA;
+	}
+
+	return dma_map_phys(dev, phys, size, dir, attrs);
+}
+EXPORT_SYMBOL_GPL(dma_map_phys_prealloc);
I would be tempted to inline this
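For reference, the three-way dispatch is small enough that an inline version is easy to picture. The following is a userspace model only: the `model_*` helpers are hypothetical stand-ins for pci_p2pdma_bus_addr_map(), dma_iova_link() and dma_map_phys(), the bus-offset arithmetic is an assumption, and the value chosen for the USE_IOVA sentinel is made up for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAPPING_ERROR	(~(uint64_t)0)
/* Hypothetical sentinel value; the sketch in this thread does not define it. */
#define MODEL_MAPPING_USE_IOVA	(~(uint64_t)0 - 1)

struct model_prealloc_state {
	bool bus_addr;		/* peer traffic stays on the PCI bus */
	bool use_iova;		/* an IOVA range was preallocated */
	uint64_t linked;	/* bytes linked into the IOVA range so far */
};

/* Stand-in for the bus address translation (assumed: phys plus an offset). */
static inline uint64_t model_bus_addr_map(uint64_t bus_offset, uint64_t phys)
{
	return phys + bus_offset;
}

/* Stand-in for linking a range into the preallocated IOVA; always succeeds. */
static inline int model_iova_link(struct model_prealloc_state *st, size_t size)
{
	st->linked += size;
	return 0;
}

/* Stand-in for a direct mapping; modelled as identity. */
static inline uint64_t model_map_phys(uint64_t phys)
{
	return phys;
}

static inline uint64_t model_map_phys_prealloc(struct model_prealloc_state *st,
					       uint64_t bus_offset,
					       uint64_t phys, size_t size)
{
	if (st->bus_addr)
		return model_bus_addr_map(bus_offset, phys);

	if (st->use_iova) {
		if (model_iova_link(st, size))
			return MODEL_MAPPING_ERROR;
		return MODEL_MAPPING_USE_IOVA;
	}

	return model_map_phys(phys);
}
```

A caller has to treat the sentinel as "the address comes from the IOVA state, not from the return value", which is the part of the interface that would need the most care.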
Overall, yeah I would certainly welcome improvements like this if everyone can agree, but I'd really like to see nvme merged before we start working on ideas. That way the proposal can be properly evaluated by all the stakeholders.
Jason