On Wed, Jan 15, 2025 at 09:55:29AM +0100, Simona Vetter wrote:
> I think for 90% of exporters a pfn would fit, but there are some really odd ones where you cannot get a cpu pfn by design. So we need to keep the pfn-less interfaces around. But ideally for the pfn-capable exporters we'd have helpers/common code that just implements all the other interfaces.
There is no way to have a dma address without a PFN in Linux right now. How would you generate them? That implies you have an IOMMU that can generate IOVAs for something that doesn't have a physical address at all.
Or do you mean some that don't have pages associated with them, and thus have pfn_valid fail on them? They still have a PFN, just not one that is valid to use in most of the Linux MM.
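To make that last point concrete, here is a minimal kernel-style sketch (not from the thread; bar_phys is just a stand-in for some arbitrary MMIO physical address, say a PCI BAR): the PFN always exists because it is just the physical address shifted down, but pfn_valid() can still fail because no struct page backs it.

    #include <linux/pfn.h>
    #include <linux/mmzone.h>
    #include <linux/types.h>

    static bool has_struct_page(phys_addr_t bar_phys)
    {
            unsigned long pfn = PHYS_PFN(bar_phys); /* the PFN always exists */

            /* ...but it may not be usable with most of the Linux MM: */
            return pfn_valid(pfn);
    }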
On Wed, Jan 15, 2025 at 10:32:34AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 15, 2025 at 09:55:29AM +0100, Simona Vetter wrote:
> > I think for 90% of exporters a pfn would fit, but there are some really odd ones where you cannot get a cpu pfn by design. So we need to keep the pfn-less interfaces around. But ideally for the pfn-capable exporters we'd have helpers/common code that just implements all the other interfaces.
> There is no way to have a dma address without a PFN in Linux right now. How would you generate them? That implies you have an IOMMU that can generate IOVAs for something that doesn't have a physical address at all.
> Or do you mean some that don't have pages associated with them, and thus have pfn_valid fail on them? They still have a PFN, just not one that is valid to use in most of the Linux MM.
He is talking about private interconnect hidden inside clusters of devices.
Ie the system may have many GPUs and those GPUs have their own private interconnect between them. It is not PCI, and packets don't transit through the CPU SOC at all, so the IOMMU is not involved.
DMA can happen on that private interconnect, but from a Linux perspective it is not DMA API DMA, and the addresses used to describe it are not part of the CPU address space. The initiating device will have a way to choose which path the DMA goes through when setting up the DMA.
Effectively, if you look at one of these complex GPU systems you will have a physical bit of memory, say HBM located on the GPU. Then, from an OS perspective, we have a whole bunch of different representations/addresses of that very same memory. A Grace/Hopper system would have at least three different addresses (ZONE_MOVABLE, a PCI MMIO aperture, and a global NVLink address). Each address effectively represents a different physical interconnect path, and an initiator may have three different routes/addresses available to reach the same physical target memory.
Part of what DMABUF needs to do is pick which of these multiple paths will be used between exporter and importer.
So, the hack today has the DMABUF exporter GPU driver understand that the importer is part of the private interconnect and then generate a scatterlist with a NULL sg_page, but an sg_dma_address that encodes the private global address on the hidden interconnect. Somehow the importer knows this has happened and programs its HW to use the private path.
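As a rough illustration of that hack (not lifted from any particular driver; map_over_private_fabric() and its private_addr/size parameters are made up for the sketch), the exporter-side pattern looks roughly like:

    #include <linux/dma-buf.h>
    #include <linux/err.h>
    #include <linux/scatterlist.h>
    #include <linux/slab.h>

    /*
     * Hypothetical exporter path for an importer known to sit on the
     * private interconnect: the single scatterlist entry carries no
     * struct page (sg_page() stays NULL) and sg_dma_address() holds
     * the private fabric address rather than a DMA API mapping.
     */
    static struct sg_table *map_over_private_fabric(u64 private_addr,
                                                    unsigned int size)
    {
            struct sg_table *sgt;

            sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
            if (!sgt)
                    return ERR_PTR(-ENOMEM);

            if (sg_alloc_table(sgt, 1, GFP_KERNEL)) {
                    kfree(sgt);
                    return ERR_PTR(-ENOMEM);
            }

            /* No sg_set_page(): the page pointer is left NULL on purpose. */
            sg_dma_address(sgt->sgl) = private_addr; /* not a DMA API address */
            sg_dma_len(sgt->sgl) = size;

            return sgt;
    }

The importer then has to know, out of band, that these dma_address values are fabric addresses and not something the DMA API produced.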
Jason
On Wed, Jan 15, 2025 at 09:34:19AM -0400, Jason Gunthorpe wrote:
> > Or do you mean some that don't have pages associated with them, and thus have pfn_valid fail on them? They still have a PFN, just not one that is valid to use in most of the Linux MM.
> He is talking about private interconnect hidden inside clusters of devices.
> Ie the system may have many GPUs and those GPUs have their own private interconnect between them. It is not PCI, and packets don't transit through the CPU SOC at all, so the IOMMU is not involved.
> DMA can happen on that private interconnect, but from a Linux perspective it is not DMA API DMA, and the addresses used to describe it are not part of the CPU address space. The initiating device will have a way to choose which path the DMA goes through when setting up the DMA.
So how is this in any way relevant to dma_buf which operates on a dma_addr_t right now and thus by definition can't be used for these?
On Thu, Jan 16, 2025 at 06:33:48AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 15, 2025 at 09:34:19AM -0400, Jason Gunthorpe wrote:
> > > Or do you mean some that don't have pages associated with them, and thus have pfn_valid fail on them? They still have a PFN, just not one that is valid to use in most of the Linux MM.
> > He is talking about private interconnect hidden inside clusters of devices.
> > Ie the system may have many GPUs and those GPUs have their own private interconnect between them. It is not PCI, and packets don't transit through the CPU SOC at all, so the IOMMU is not involved.
> > DMA can happen on that private interconnect, but from a Linux perspective it is not DMA API DMA, and the addresses used to describe it are not part of the CPU address space. The initiating device will have a way to choose which path the DMA goes through when setting up the DMA.
> So how is this in any way relevant to dma_buf which operates on a dma_addr_t right now and thus by definition can't be used for these?
Oh, well since this private stuff exists the DRM folks implemented it and used dmabuf to hook it together through the uAPI. To make it work it abuses scatterlist and dma_addr_t to carry this other information.
Thus the pushback in this thread: we can't naively fix up dmabuf, because this non-dma_addr_t abuse exists and is uAPI. So it also needs some improved architecture to move forward :\
Basically, the scatterlist in the dmabuf API does not follow any of the normal rules a scatterlist should follow. It is not using the semantics of dma_addr_t even though that is the type. It is really just an array of addr/len pairs - we can't reason about it in the normal way :(
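Illustratively, an importer that understands this convention ends up walking the table as bare {address, length} pairs, something like the sketch below (program_hw_window() is a made-up stand-in for whatever the importing driver actually programs with the address):

    #include <linux/scatterlist.h>
    #include <linux/types.h>

    /* Hypothetical hook into the importing driver's hardware setup. */
    void program_hw_window(u64 addr, unsigned int len);

    /*
     * Walk the table purely as {address, length} pairs.  sg_page() is
     * never consulted (it may be NULL), and the addresses are only
     * meaningful to peers on the private interconnect, not to the CPU
     * or the DMA API.
     */
    static void program_importer(struct sg_table *sgt)
    {
            struct scatterlist *sg;
            int i;

            for_each_sgtable_dma_sg(sgt, sg, i)
                    program_hw_window(sg_dma_address(sg), sg_dma_len(sg));
    }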
Jason