Am 22.01.25 um 12:04 schrieb Simona Vetter:
On Tue, Jan 21, 2025 at 01:36:33PM -0400, Jason Gunthorpe wrote:
On Tue, Jan 21, 2025 at 05:11:32PM +0100, Simona Vetter wrote:
On Mon, Jan 20, 2025 at 03:48:04PM -0400, Jason Gunthorpe wrote:
On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote:
On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote:
On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:
What is going wrong with your email? You replied to Simona, but Simona Vetter simona.vetter@ffwll.ch is dropped from the To/CC list??? I added the address back, but seems like a weird thing to happen.
Might also be funny mailing list stuff, depending on how you get these. I read mails over lore and pretty much ignore cc (unless it's not also on any list, since those tend to be security issues) because I get cc'ed on way too much stuff for that to be a useful signal.
Oh I see, you are sending a Mail-followup-to header that excludes your address, so you don't get any emails at all.. My mutt is dropping you as well.
I'm having all kinds of funny phenomena with AMD's mail servers since coming back from Xmas vacation.
From the news it looks like Outlook on Windows has a new major security issue where just viewing a mail can compromise the system, and my educated guess is that our IT guys went into panic mode because of this and changed something.
[SNIP] I have been assuming that dmabuf mmap remains unchanged, that exporters will continue to implement that mmap() callback as today.
That sounds really really good to me because that was my major concern when you noted that you want to have PFNs to build up KVM page tables.
But you don't want to handle mmap() on your own, you basically don't want to have a VMA for this stuff at all, correct?
My main interest has been what data structure is produced in the attach APIs.
Eg today we have a struct dma_buf_attachment that returns a sg_table.
I'm expecting some kind of new data structure, let's call it a "physical list", that is some efficient encoding of meta/addr/len tuples that works well with the new DMA API. Matthew has been calling this thing phyr..
I would not use a data structure at all. Instead we should have something like an iterator/cursor based approach similar to what the new DMA API is doing.
So, I imagine, struct dma_buf_attachment gaining an optional feature negotiation and then we have in dma_buf_attachment:
    union {
            struct sg_table *sgt;
            struct physical_list *phyr;
    };
That's basically it, an alternative to scatterlist that has a clean architecture.
I would rather suggest something like dma_buf_attachment() getting the offset and size to map and returning a cursor object you can use to get your address, length and access attributes.
And then you can iterate over this cursor and fill in your importer data structure with the necessary information.
This way neither the exporter nor the importer needs to convert their data back and forth between their specific representations of the information.
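A minimal sketch of how such a cursor flow could look (the names dma_buf_attach_map_cursor(), dma_buf_attach_cursor_next() and the field layout below are purely hypothetical illustrations of the idea, not an existing or proposed API):

    /* Hypothetical cursor the exporter fills in one segment at a time. */
    struct dma_buf_map_cursor {
            u64 offset;          /* current offset into the dma-buf */
            u64 remaining;       /* bytes left to walk */
            u64 addr;            /* address of the current segment */
            u64 len;             /* length of the current segment */
            unsigned long attrs; /* cacheable/encrypted/... attributes */
    };

    /* Exporter sets the cursor up for the range [offset, offset + size). */
    int dma_buf_attach_map_cursor(struct dma_buf_attachment *attach,
                                  u64 offset, u64 size,
                                  struct dma_buf_map_cursor *cur);

    /* Importer walks the range and fills its own structures directly,
     * without any intermediate sg_table/phyr allocation in between. */
    while (dma_buf_attach_cursor_next(attach, cur))
            importer_add_segment(imp, cur->addr, cur->len, cur->attrs);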
Now, if you are asking if the current dmabuf mmap callback can be improved with the above? Maybe? phyr should have the necessary information inside it to populate a VMA - eventually even fully correctly with all the right cacheable/encrypted/forbidden/etc flags.
That won't work like this.
See the exporter needs to be informed about page faults on the VMA to eventually wait for operations to end and sync caches.
Otherwise we either potentially allow access to freed up or re-used memory or run into issues with device cache coherency.
So, you could imagine that exporters could just have one routine to generate the phyr list and that goes into the attachment, goes into some common code to fill VMA PTEs, and some other common code that will convert it into the DMABUF scatterlist. If performance is not a concern with these data structure conversions it could be an appealing simplification.
And yes, I could imagine the meta information being descriptive enough to support the private interconnect cases, the common code could detect private meta information and just cleanly fail.
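Expressed as rough helper signatures (all of the names here are hypothetical, just to make the shape of the idea concrete):

    /* Exporter produces one canonical physical list per attachment. */
    struct phys_list *exporter_get_phys_list(struct dma_buf_attachment *attach);

    /* Common code then consumes that one representation for the different users. */
    int dma_buf_insert_phys_list(struct vm_area_struct *vma,
                                 struct phys_list *plist);           /* fill VMA PTEs */
    struct sg_table *dma_buf_map_phys_list(struct device *dev,
                                           struct phys_list *plist); /* legacy sgt path */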
I'm kinda leaning towards entirely separate dma-buf interfaces for the new phyr stuff, because I fear that adding that to the existing ones will only make the chaos worse. But that aside sounds all reasonable, and even that could just be too much worry on my side and mixing phyr into existing attachments (with a pile of importer/exporter flags probably) is fine.
I lean in the other direction.
Dmitry and Thomas have done a really good job at cleaning up all the interaction between dynamic and static exporters / importers.
Especially the fact that we now have consistent locking for map_dma_buf() and unmap_dma_buf() should make that transition rather straightforward.
For the existing dma-buf importers/exporters I'm kinda hoping for a pure dma_addr_t based list eventually. Going all the way to a phyr based approach for everyone might be too much churn, there's some real bad cruft there. It's not going to work for every case, but it covers a lot of them and might be less pain for existing importers.
The point is we have use cases that won't work without exchanging DMA addresses any more.
For example we have cases where multiple devices are in the same IOMMU domain and re-use their DMA address mappings.
But in theory it should be possible to use phyr everywhere eventually, as long as there's no obviously api-rules-breaking way to go from a phyr back to a struct page even when that exists.
I would rather say we should stick to DMA addresses as much as possible.
What we can do is to add an address space description to the addresses, e.g. whether it's a PCIe bus addr in IOMMU domain X, a device private bus addr, or, in the case of sharing with iommufd and KVM, a PFN.
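As a rough illustration of what such a tag could look like (hypothetical naming, not a proposal):

    /* Hypothetical description of which address space an entry lives in. */
    enum dma_buf_addr_space {
            DMA_BUF_ADDR_PCI_IOVA,       /* PCIe bus addr in IOMMU domain X */
            DMA_BUF_ADDR_DEVICE_PRIVATE, /* device private interconnect addr */
            DMA_BUF_ADDR_PFN,            /* raw PFN, e.g. for iommufd/KVM */
    };

    struct dma_buf_phys_entry {
            enum dma_buf_addr_space space;
            u64 addr;
            u64 len;
    };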
Regards, Christian.
At least the device mapping / dma_buf_attachment side should be doable with just the pfn and the new dma-api?
Yes, that would be my first goal post. Figure out some meta information and a container data structure that allows struct page-less P2P mapping through the new DMA API.
I'm hoping we can get to something where we describe not just how the pfns should be DMA mapped, but also can describe how they should be CPU mapped. For instance that this PFN space is always mapped uncachable, in CPU and in IOMMU.
I was pondering whether dma_mmap and friends would be a good place to prototype this and go for a fully generic implementation. But then even those have _wc/_uncached variants.
Given that the inability to correctly DMA map P2P MMIO without struct page is a current pain point and current source of hacks in dmabuf exporters, I wanted to make resolving that a priority.
However, if you mean what I described above for "fully generic [dmabuf mmap] implementation", then we'd have the phyr datastructure as a dependency to attempt that work.
phyr, and particularly the meta information, has a number of stakeholders. I was thinking of going first with rdma's memory registration flow because we are now pretty close to being able to do such a big change, and it can demonstrate most of the requirements.
But that doesn't mean mmap couldn't go concurrently on the same agreed datastructure if people are interested.
Yeah cpu mmap needs a lot more, going with a very limited p2p use-case first only makes sense.
We also have current bugs in the iommu/vfio side where we are fudging CC stuff, like assuming CPU memory is encrypted (not always true) and that MMIO is non-encrypted (not always true)
tbf CC pte flags I just don't grok at all. I've once tried to understand what current exporters and gpu drivers do and just gave up. But that's also a bit why I'm worried here because it's an enigma to me.
For CC, inside the secure world, is some information if each PFN inside the VM is 'encrypted' or not. Any VM PTE (including the IOPTEs) pointing at the PFN must match the secure world's view of 'encrypted'. The VM can ask the secure world to change its view at runtime.
The way CC has been bolted on to the kernel so far largely hides this from drivers, so it is difficult to tell in driver code if the PFN you have is 'encrypted' or not. Right now the general rule (that is not always true) is that struct page CPU memory is encrypted and everything else is decrypted.
So right now, you can mostly ignore it and the above assumption largely happens for you transparently.
However, soon we will have encrypted P2P MMIO which will stress this hiding strategy.
It's already breaking with stuff like virtual gpu drivers, vmwgfx is fiddling around with these bits (at least last I tried to understand this all) and I think a few others do too.
I thought iommuv2 (or whatever linux calls these) has full fault support and could support current move semantics. But yeah for iommu without fault support we need some kind of pin or a newly formalized revoke model.
No, this is HW dependent, including PCI device, and I'm aware of no HW that fully implements this in a way that could be useful to implement arbitrary move semantics for VFIO..
Hm I thought we've had at least prototypes floating around of device fault repair, but I guess that only works with ATS/pasid stuff and not general iommu traffic from devices. Definitely needs some device cooperation since the timeouts of a full fault are almost endless.
Yes, exactly. What all real devices I'm aware of have done is make a subset of their traffic work with ATS and PRI, but not all their traffic. Without *all* traffic you can't make any generic assumption in the iommu that a transient non-present won't be fatal to the device.
Stuff like dmabuf move semantics rely on transient non-present being non-disruptive...
Ah now I get it, at the iommu level you have to pessimistically assume whether a device can handle a fault, and none can for all traffic. I was thinking too much about the driver level, where generally the dma-bufs you import are only used for the subset of device functions that can cope with faults on many devices.
Cheers, Sima
On Wed, Jan 22, 2025 at 02:29:09PM +0100, Christian König wrote:
I'm having all kind of funny phenomena with AMDs mail servers since coming back from xmas vacation.
:(
A few years back our IT fully migrated our email into the Office 365 cloud and gave up all the crazy half on-prem stuff they were doing. The mail started working perfectly after that, as long as you use MS's servers directly :\
But you don't want to handle mmap() on your own, you basically don't want to have a VMA for this stuff at all, correct?
Right, we have no interest in mmap, VMAs or struct page in rdma/kvm/iommu.
My main interest has been what data structure is produced in the attach APIs.
Eg today we have a struct dma_buf_attachment that returns a sg_table.
I'm expecting some kind of new data structure, let's call it a "physical list", that is some efficient encoding of meta/addr/len tuples that works well with the new DMA API. Matthew has been calling this thing phyr..
I would not use a data structure at all. Instead we should have something like an iterator/cursor based approach similar to what the new DMA API is doing.
I'm certainly open to this idea. There may be some technical challenges, it is a big change from scatterlist today, and function-pointer-per-page sounds like bad performance if there are a lot of pages..
RDMA would probably have to stuff this immediately into something like a phyr anyhow because it needs the full extent of the thing being mapped to figure out what the HW page size and geometry should be - that would be trivial though, and a RDMA problem.
Now, if you are asking if the current dmabuf mmap callback can be improved with the above? Maybe? phyr should have the necessary information inside it to populate a VMA - eventually even fully correctly with all the right cacheable/encrypted/forbidden/etc flags.
That won't work like this.
Note I said "populate a VMA", ie a helper to build the VMA PTEs only.
See the exporter needs to be informed about page faults on the VMA to eventually wait for operations to end and sync caches.
All of this would still have to be provided outside in the same way as today.
For example we have cases where multiple devices are in the same IOMMU domain and re-use their DMA address mappings.
IMHO this is just another flavour of "private" address flow between two cooperating drivers.
It is not a "dma address" in the sense of a dma_addr_t that was output from the DMA API. I think that subtle distinction is very important. When I say pfn/dma address I'm really only talking about standard DMA API flows, used by generic drivers.
IMHO, DMABUF needs a private address "escape hatch", and cooperating drivers should do whatever they want when using that flow. The address is *fully private*, so the co-operating drivers can do whatever they want. iommu_map in exporter and pass an IOVA? Fine! pass a PFN and iommu_map in the importer? Also fine! Private is private.
But in theory it should be possible to use phyr everywhere eventually, as long as there's no obviously api-rules-breaking way to go from a phyr back to a struct page even when that exists.
I would rather say we should stick to DMA addresses as much as possible.
I remain skeptical of this.. Aside from all the technical reasons I already outlined..
I think it is too much work to have the exporters conditionally build all sorts of different representations of the same thing depending on the importer. Like having a lot of DRM drivers generate both a PFN and DMA mapped list in their export code doesn't sound very appealing to me at all.
It makes sense that a driver would be able to conditionally generate private and generic based on negotiation, but IMHO, not more than one flavour of generic..
Jason
Am 22.01.25 um 15:37 schrieb Jason Gunthorpe:
My main interest has been what data structure is produced in the attach APIs.
Eg today we have a struct dma_buf_attachment that returns a sg_table.
I'm expecting some kind of new data structure, let's call it a "physical list", that is some efficient encoding of meta/addr/len tuples that works well with the new DMA API. Matthew has been calling this thing phyr..
I would not use a data structure at all. Instead we should have something like an iterator/cursor based approach similar to what the new DMA API is doing.
I'm certainly open to this idea. There may be some technical challenges, it is a big change from scatterlist today, and function-pointer-per-page sounds like bad performance if there are a lot of pages..
RDMA would probably have to stuff this immediately into something like a phyr anyhow because it needs the full extent of the thing being mapped to figure out what the HW page size and geometry should be - that would be trivial though, and a RDMA problem.
Now, if you are asking if the current dmabuf mmap callback can be improved with the above? Maybe? phyr should have the necessary information inside it to populate a VMA - eventually even fully correctly with all the right cacheable/encrypted/forbidden/etc flags.
That won't work like this.
Note I said "populate a VMA", ie a helper to build the VMA PTEs only.
See the exporter needs to be informed about page faults on the VMA to eventually wait for operations to end and sync caches.
All of this would still have to be provided outside in the same way as today.
For example we have cases where multiple devices are in the same IOMMU domain and re-use their DMA address mappings.
IMHO this is just another flavour of "private" address flow between two cooperating drivers.
Well that's the point. The importer is not cooperating here.
The importer doesn't have the slightest idea that it is sharing its DMA addresses with the exporter.
All the importer gets is: when you want to access this information, use this address here.
It is not a "dma address" in the sense of a dma_addr_t that was output from the DMA API. I think that subtle distinction is very important. When I say pfn/dma address I'm really only talking about standard DMA API flows, used by generic drivers.
IMHO, DMABUF needs a private address "escape hatch", and cooperating drivers should do whatever they want when using that flow. The address is *fully private*, so the co-operating drivers can do whatever they want. iommu_map in exporter and pass an IOVA? Fine! pass a PFN and iommu_map in the importer? Also fine! Private is private.
But in theory it should be possible to use phyr everywhere eventually, as long as there's no obviously api-rules-breaking way to go from a phyr back to a struct page even when that exists.
I would rather say we should stick to DMA addresses as much as possible.
I remain skeptical of this.. Aside from all the technical reasons I already outlined..
I think it is too much work to have the exporters conditionally build all sorts of different representations of the same thing depending on the importer. Like having a lot of DRM drivers generate both a PFN and DMA mapped list in their export code doesn't sound very appealing to me at all.
Well from experience I can say that it is actually the other way around.
We have a very limited number of exporters and a lot of different importers. So having complexity in the exporter instead of the importer is absolutely beneficial.
PFN is the special case, in other words this is the private address passed around. And I will push hard to not support that in the DRM drivers nor any DMA buf heap.
It makes sense that a driver would be able to conditionally generate private and generic based on negotiation, but IMHO, not more than one flavour of generic..
I still strongly think that the exporter should talk with the DMA API to setup the access path for the importer and *not* the importer directly.
Regards, Christian.
Jason
On Wed, Jan 22, 2025 at 03:59:11PM +0100, Christian König wrote:
For example we have cases where multiple devices are in the same IOMMU domain and re-use their DMA address mappings.
IMHO this is just another flavour of "private" address flow between two cooperating drivers.
Well that's the point. The importer is not cooperating here.
If the private address relies on a shared iommu_domain controlled by the driver, then yes, the importer MUST be cooperating. For instance, if you send the same private address into RDMA it will explode because it doesn't have any notion of shared iommu_domain mappings, and it certainly doesn't setup any such shared domains.
The importer doesn't have the slightest idea that it is sharing its DMA addresses with the exporter.
Of course it does. The importer driver would have had to explicitly set this up! The normal kernel behavior is that all drivers get private iommu_domains controlled by the DMA API. If your driver is doing something else *it did it deliberately*.
Some of that mess in tegra host1x around this area is not well structured, it should not be implicitly setting up domains for drivers. It is old code that hasn't been updated to use the new iommu subsystem approach for driver controlled non-DMA API domains.
The new iommu architecture has the probing driver disable the DMA API and can then manipulate its iommu domain however it likes, safely. Ie the probing driver is aware of and participating in disabling the DMA API.
Again, either you are using the DMA API and you work in generic ways with generic devices or it is "private" and only co-operating drivers can interwork with private addresses. A private address must not ever be sent to a DMA API using driver and vice versa.
IMHO this is an important architecture point and why Christoph was frowning on abusing dma_addr_t to represent things that did NOT come out of the DMA API.
We have a very limited number of exporters and a lot of different importers. So having complexity in the exporter instead of the importer is absolutely beneficial.
Isn't every DRM driver both an importer and exporter? That is what I was expecting at least..
I still strongly think that the exporter should talk with the DMA API to setup the access path for the importer and *not* the importer directly.
It is contrary to the design of the new API which wants to co-optimize mapping and HW setup together as one unit.
For instance in RDMA we want to hint and control the way the IOMMU mapping works in the DMA API to optimize the RDMA HW side. I can't do those optimizations if I'm not in control of the mapping.
The same is probably true on the GPU side too, you want IOVAs that have tidy alignment with your PTE structure, but only the importer understands its own HW to make the correct hints to the DMA API.
Jason
Am 23.01.25 um 14:59 schrieb Jason Gunthorpe:
On Wed, Jan 22, 2025 at 03:59:11PM +0100, Christian König wrote:
For example we have cases where multiple devices are in the same IOMMU domain and re-use their DMA address mappings.
IMHO this is just another flavour of "private" address flow between two cooperating drivers.
Well that's the point. The importer is not cooperating here.
If the private address relies on a shared iommu_domain controlled by the driver, then yes, the importer MUST be cooperating. For instance, if you send the same private address into RDMA it will explode because it doesn't have any notion of shared iommu_domain mappings, and it certainly doesn't setup any such shared domains.
Hui? Why the heck should a driver own its iommu domain?
The domain is owned and assigned by the PCI subsystem under Linux.
The importer doesn't have the slightest idea that it is sharing its DMA addresses with the exporter.
Of course it does. The importer driver would have had to explicitly set this up! The normal kernel behavior is that all drivers get private iommu_domains controlled by the DMA API. If your driver is doing something else *it did it deliberately*.
As far as I know that is simply not correct. Currently IOMMU domains/groups are usually shared between devices.
Especially multi function devices get only a single IOMMU domain.
Some of that mess in tegra host1x around this area is not well structured, it should not be implicitly setting up domains for drivers. It is old code that hasn't been updated to use the new iommu subsystem approach for driver controlled non-DMA API domains.
The new iommu architecture has the probing driver disable the DMA API and can then manipulate its iommu domain however it likes, safely. Ie the probing driver is aware of and participating in disabling the DMA API.
Why the heck should we do this?
That drivers manage all of that on their own sounds like a massive step in the wrong direction.
Again, either you are using the DMA API and you work in generic ways with generic devices or it is "private" and only co-operating drivers can interwork with private addresses. A private address must not ever be sent to a DMA API using driver and vice versa.
IMHO this is an important architecture point and why Christoph was frowning on abusing dma_addr_t to represent things that did NOT come out of the DMA API.
We have a very limited number of exporters and a lot of different importers. So having complexity in the exporter instead of the importer is absolutely beneficial.
Isn't every DRM driver both an importer and exporter? That is what I was expecting at least..
I still strongly think that the exporter should talk with the DMA API to setup the access path for the importer and *not* the importer directly.
It is contrary to the design of the new API which wants to co-optimize mapping and HW setup together as one unit.
Yeah and I'm really questioning this design goal. That sounds like totally going into the wrong direction just because of the RDMA drivers.
For instance in RDMA we want to hint and control the way the IOMMU mapping works in the DMA API to optimize the RDMA HW side. I can't do those optimizations if I'm not in control of the mapping.
Why? What is the technical background here?
The same is probably true on the GPU side too, you want IOVAs that have tidy alignment with your PTE structure, but only the importer understands its own HW to make the correct hints to the DMA API.
Yeah but then express those as requirements to the DMA API and not move all the important decisions into the driver where they are implemented over and over again and potentially broken half the time.
See drivers are supposed to be simple, small and stupid. They should be controlled by the core OS and not allowed to do whatever they want.
Driver developers cannot be trusted to always get everything right if you make it as complicated as this.
Regards, Christian.
Jason
On Thu, Jan 23, 2025 at 03:35:21PM +0100, Christian König wrote:
Sending it as text mail once more.
Am 23.01.25 um 15:32 schrieb Christian König:
Am 23.01.25 um 14:59 schrieb Jason Gunthorpe:
On Wed, Jan 22, 2025 at 03:59:11PM +0100, Christian König wrote:
For example we have cases where multiple devices are in the same IOMMU domain and re-use their DMA address mappings.
IMHO this is just another flavour of "private" address flow between two cooperating drivers.
Well that's the point. The importer is not cooperating here.
If the private address relies on a shared iommu_domain controlled by the driver, then yes, the importer MUST be cooperating. For instance, if you send the same private address into RDMA it will explode because it doesn't have any notion of shared iommu_domain mappings, and it certainly doesn't setup any such shared domains.
Hui? Why the heck should a driver own its iommu domain?
I don't know, you are the one saying the drivers have special shared iommu_domains so DMA BUF needs some special design to accommodate it.
I'm aware that DRM drivers do directly call into the iommu subsystem and do directly manage their own IOVA. I assumed this is what you were talking about. See below.
The domain is owned and assigned by the PCI subsystem under Linux.
That domain is *exclusively* owned by the DMA API and is only accessed via maps created by DMA API calls.
If you are using the DMA API correctly then all of this is abstracted and none of it matters to you. There is no concept of "shared domains" in the DMA API.
You call the DMA API, you get a dma_addr_t that is valid for a *single* device, you program it in HW. That is all. There is no reason to dig deeper than this.
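For reference, that contract is just the regular streaming DMA API; a minimal sketch (error paths trimmed, hw_queue_write() is a made-up driver helper):

    #include <linux/dma-mapping.h>

    /* Map a CPU buffer for DMA by one specific device. */
    dma_addr_t dma = dma_map_single(dev, buf, size, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, dma))
            return -ENOMEM;

    /* 'dma' is only meaningful for 'dev'; program it into that device's
     * HW and nothing else. */
    hw_queue_write(hwq, dma, size);

    dma_unmap_single(dev, dma, size, DMA_TO_DEVICE);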
The importer doesn't have the slightest idea that it is sharing its DMA addresses with the exporter.
Of course it does. The importer driver would have had to explicitly set this up! The normal kernel behavior is that all drivers get private iommu_domains controlled by the DMA API. If your driver is doing something else *it did it deliberately*.
As far as I know that is simply not correct. Currently IOMMU domains/groups are usually shared between devices.
No, the opposite. The iommu subsystem tries to maximally isolate devices up to the HW limit.
On server platforms every device is expected to get its own iommu domain.
Especially multi function devices get only a single IOMMU domain.
Only if the PCI HW doesn't support ACS.
This is all DMA API internal details you shouldn't even be talking about at the DMA BUF level. It is all hidden and simply does not matter to DMA BUF at all.
The new iommu architecture has the probing driver disable the DMA API and can then manipulate its iommu domain however it likes, safely. Ie the probing driver is aware of and participating in disabling the DMA API.
Why the heck should we do this?
That drivers manage all of that on their own sounds like a massive step in the wrong direction.
I am talking about DRM drivers that HAVE to manage their own for some reason I don't know. eg:
    drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c:  tdev->iommu.domain = iommu_domain_alloc(&platform_bus_type);
    drivers/gpu/drm/msm/msm_iommu.c:                     domain = iommu_paging_domain_alloc(dev);
    drivers/gpu/drm/rockchip/rockchip_drm_drv.c:         private->domain = iommu_paging_domain_alloc(private->iommu_dev);
    drivers/gpu/drm/tegra/drm.c:                         tegra->domain = iommu_paging_domain_alloc(dma_dev);
    drivers/gpu/host1x/dev.c:                            host->domain = iommu_paging_domain_alloc(host->dev);
Normal simple drivers should never be calling these functions!
If you are calling these functions you are not using the DMA API, and, yes, some cases like tegra host1x are actively sharing these special domains across multiple devices and drivers.
If you want to pass an IOVA in one of these special driver-created domains then it would be some private address in DMABUF that only works on drivers that have understood they attached to these manually created domains. No DMA API involvement here.
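A minimal sketch of that driver-owned-domain pattern, assuming the probing driver has deliberately taken the device away from DMA API control (the IOVA allocation and how the address is handed to a cooperating importer are left out / hypothetical):

    struct iommu_domain *dom = iommu_paging_domain_alloc(dev);
    if (IS_ERR(dom))
            return PTR_ERR(dom);

    /* The driver now owns the translation for this device. */
    ret = iommu_attach_device(dom, dev);
    if (ret) {
            iommu_domain_free(dom);
            return ret;
    }

    /* The driver manages its own IOVA space and mappings... */
    ret = iommu_map(dom, iova, paddr, size,
                    IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);

    /* ...and 'iova' is a private address: only importers that know they
     * are attached to this same driver-created domain may ever use it. */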
I still strongly think that the exporter should talk with the DMA API to setup the access path for the importer and *not* the importer directly.
It is contrary to the design of the new API which wants to co-optimize mapping and HW setup together as one unit.
Yeah and I'm really questioning this design goal. That sounds like totally going into the wrong direction just because of the RDMA drivers.
Actually it is storage that motivates this. It is just pointless to allocate a dma_addr_t list in the fast path when you don't need it. You can stream the dma_addr_t directly into HW structures that are necessary and already allocated.
For instance in RDMA we want to hint and control the way the IOMMU mapping works in the DMA API to optimize the RDMA HW side. I can't do those optimizations if I'm not in control of the mapping.
Why? What is the technical background here?
dma-iommu.c chooses an IOVA alignment based on its own reasoning that is not always compatible with the HW. The HW can optimize if the IOVA alignment meets certain restrictions. Much like page tables in a GPU.
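A toy illustration of why the importer cares: the HW can only use a large page size if the IOVA the DMA layer picked is aligned to it, so a poorly aligned IOVA forces the HW page tables down to small pages (the helper below only illustrates the selection logic, it is not kernel code):

    /* pgsz_bitmap: bit N set means the HW page tables support 2^N byte pages.
     * Return the largest HW page size usable for a mapping that starts at
     * 'iova' and is 'len' bytes long, or 0 if none fits. */
    static unsigned long best_hw_page_size(unsigned long pgsz_bitmap,
                                           unsigned long iova, unsigned long len)
    {
            /* The start address and the length limit the usable alignment. */
            unsigned long align = 1UL << __ffs(iova | len);
            /* Keep only the supported page sizes that are <= that alignment. */
            unsigned long usable = pgsz_bitmap & (align | (align - 1));

            return usable ? 1UL << __fls(usable) : 0;
    }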
The same is probably true on the GPU side too, you want IOVAs that have tidy alignment with your PTE structure, but only the importer understands its own HW to make the correct hints to the DMA API.
Yeah but then express those as requirements to the DMA API and not move all the important decisions into the driver where they are implemented over and over again and potentially broken half the time.
It would be in the DMA API, just the per-mapping portion of the API.
Same as the multipath, the ATS, and more. It is all per-mapping decisions of the executing HW, not global decisions or something like that.
Jason
Am 23.01.25 um 16:02 schrieb Jason Gunthorpe:
On Thu, Jan 23, 2025 at 03:35:21PM +0100, Christian König wrote:
Sending it as text mail once more.
Am 23.01.25 um 15:32 schrieb Christian König:
Am 23.01.25 um 14:59 schrieb Jason Gunthorpe:
On Wed, Jan 22, 2025 at 03:59:11PM +0100, Christian König wrote:
For example we have cases where multiple devices are in the same IOMMU domain and re-use their DMA address mappings.
IMHO this is just another flavour of "private" address flow between two cooperating drivers.
Well that's the point. The importer is not cooperating here.
If the private address relies on a shared iommu_domain controlled by the driver, then yes, the importer MUST be cooperating. For instance, if you send the same private address into RDMA it will explode because it doesn't have any notion of shared iommu_domain mappings, and it certainly doesn't setup any such shared domains.
Hui? Why the heck should a driver own its iommu domain?
I don't know, you are the one saying the drivers have special shared iommu_domains so DMA BUF needs some special design to accommodate it.
I'm aware that DRM drivers do directly call into the iommu subsystem and do directly manage their own IOVA. I assumed this is what you were talking about. See below.
No, no, there are many more cases where drivers simply assume that they are in the same iommu domain for different devices. E.g. that different PCI endpoints can use the same dma_addr_t.
For example those classic sound devices for HDMI audio on graphics cards work like this. It's been a very long time since I looked into that, but I think this is even a HW limitation.
In other words if the device handled by the generic ALSA driver and the GPU are not in the same iommu domain you run into trouble.
The domain is owned and assigned by the PCI subsystem under Linux.
That domain is *exclusively* owned by the DMA API and is only accessed via maps created by DMA API calls.
If you are using the DMA API correctly then all of this is abstracted and none of it matters to you. There is no concept of "shared domains" in the DMA API.
Well it might never have been documented, but I know of quite a bunch of different cases that assume that a DMA addr will just ultimately work for some other device/driver as well.
Off hand I know at least the generic ALSA driver case, some V4L driver (but that might use the same PCI endpoint, not 100% sure) and a multi GPU case which works like this.
You call the DMA API, you get a dma_addr_t that is valid for a *single* device, you program it in HW. That is all. There is no reason to dig deeper than this.
The importer doesn't have the slightest idea that it is sharing its DMA addresses with the exporter.
Of course it does. The importer driver would have had to explicitly set this up! The normal kernel behavior is that all drivers get private iommu_domains controlled by the DMA API. If your driver is doing something else *it did it deliberately*.
As far as I know that is simply not correct. Currently IOMMU domains/groups are usually shared between devices.
No, the opposite. The iommu subsystem tries to maximally isolate devices up to the HW limit.
On server platforms every device is expected to get its own iommu domain.
Especially multi function devices get only a single IOMMU domain.
Only if the PCI HW doesn't support ACS.
Ah, yes that can certainly be.
This is all DMA API internal details you shouldn't even be talking about at the DMA BUF level. It is all hidden and simply does not matter to DMA BUF at all.
Well we somehow need to support the existing use cases with the new API.
The new iommu architecture has the probing driver disable the DMA API and can then manipulate its iommu domain however it likes, safely. Ie the probing driver is aware of and participating in disabling the DMA API.
Why the heck should we do this?
That drivers manage all of that on their own sounds like a massive step in the wrong direction.
I am talking about DRM drivers that HAVE to manage their own for some reason I don't know. eg:
    drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c:  tdev->iommu.domain = iommu_domain_alloc(&platform_bus_type);
    drivers/gpu/drm/msm/msm_iommu.c:                     domain = iommu_paging_domain_alloc(dev);
    drivers/gpu/drm/rockchip/rockchip_drm_drv.c:         private->domain = iommu_paging_domain_alloc(private->iommu_dev);
    drivers/gpu/drm/tegra/drm.c:                         tegra->domain = iommu_paging_domain_alloc(dma_dev);
    drivers/gpu/host1x/dev.c:                            host->domain = iommu_paging_domain_alloc(host->dev);
Normal simple drivers should never be calling these functions!
If you are calling these functions you are not using the DMA API, and, yes, some cases like tegra host1x are actively sharing these special domains across multiple devices and drivers.
If you want to pass an IOVA in one of these special driver-created domains then it would be some private address in DMABUF that only works on drivers that have understood they attached to these manually created domains. No DMA API involvement here.
That won't fly like this. That would break at least the ALSA use case and potentially quite a bunch of others.
I still strongly think that the exporter should talk with the DMA API to setup the access path for the importer and *not* the importer directly.
It is contrary to the design of the new API which wants to co-optimize mapping and HW setup together as one unit.
Yeah and I'm really questioning this design goal. That sounds like totally going into the wrong direction just because of the RDMA drivers.
Actually it is storage that motivates this. It is just pointless to allocate a dma_addr_t list in the fast path when you don't need it. You can stream the dma_addr_t directly into HW structures that are necessary and already allocated.
That's what I can 100% agree on.
For GPUs it's basically the same, e.g. converting from the dma_addr_t to your native representation is just additional overhead nobody needs.
For instance in RDMA we want to hint and control the way the IOMMU mapping works in the DMA API to optimize the RDMA HW side. I can't do those optimizations if I'm not in control of the mapping.
Why? What is the technical background here?
dma-iommu.c chooses an IOVA alignment based on its own reasoning that is not always compatible with the HW. The HW can optimize if the IOVA alignment meets certain restrictions. Much like page tables in a GPU.
Yeah, but why can't we tell the DMA API those restrictions instead of letting the driver manage the address space themselves?
The same is probably true on the GPU side too, you want IOVAs that have tidy alignment with your PTE structure, but only the importer understands its own HW to make the correct hints to the DMA API.
Yeah but then express those as requirements to the DMA API and not move all the important decisions into the driver where they are implemented over and over again and potentially broken half the time.
It would be in the DMA API, just the per-mapping portion of the API.
Same as the multipath, the ATS, and more. It is all per-mapping decisions of the executing HW, not global decisions or something like that.
So the DMA API has some structure or similar to describe the necessary per-mapping properties?
Regards, Christian.
Jason
On Thu, Jan 23, 2025 at 04:48:29PM +0100, Christian König wrote:
No, no, there are many more cases where drivers simply assume that they are in the same iommu domain for different devices.
This is an illegal assumption and invalid way to use the DMA API. Do not do that, do not architect things in DMABUF to permit that.
The dma_addr_t out of the DMA API is only usable by the device passed in, period full stop. If you want to use it with two devices then call the DMA API twice.
E.g. that different PCI endpoints can use the same dma_addr_t.
For example those classic sound devices for HDMI audio on graphics cards work like this. In other words if the device handled by the generic ALSA driver and the GPU are not in the same iommu domain you run into trouble.
Yes, I recall this weird AMD issue as well. IIRC the solution is not clean or "correct". :( I vaguely recall it was caused by a HW bug...
Well it might never have been documented, but I know of quite a bunch of different cases that assume that a DMA addr will just ultimately work for some other device/driver as well.
Again, illegal assumption, breaks the abstraction.
This is all DMA API internal details you shouldn't even be talking about at the DMA BUF level. It is all hidden and simply does not matter to DMA BUF at all.
Well we somehow need to support the existing use cases with the new API.
Call the DMA API multiple times, once per device. That is the only correct way to handle this today. DMABUF is already architected like this, each and every attach should be dma mapping and generating a scatterlist for every unique importing device.
Improving it to somehow avoid the redundant DMA API map would require new DMA API work.
Do NOT randomly assume that devices share dma_addr_t, there is no architected way to ever discover this, it is a complete violation of all the API abstractions.
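That per-importing-device mapping is what the existing dma-buf attach flow already expresses; roughly (real dma-buf API, error handling omitted):

    #include <linux/dma-buf.h>

    /* Device A and device B each get their own attachment and mapping
     * of the same buffer. */
    struct dma_buf_attachment *a = dma_buf_attach(dmabuf, dev_a);
    struct dma_buf_attachment *b = dma_buf_attach(dmabuf, dev_b);

    /* Each map goes through the DMA API for that specific device, so the
     * resulting dma_addr_t values are only valid for dev_a / dev_b
     * respectively, even if they happen to be numerically identical. */
    struct sg_table *sgt_a = dma_buf_map_attachment_unlocked(a, DMA_BIDIRECTIONAL);
    struct sg_table *sgt_b = dma_buf_map_attachment_unlocked(b, DMA_BIDIRECTIONAL);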
If you want to pass an IOVA in one of these special driver-created domains then it would be some private address in DMABUF that only works on drivers that have understood they attached to these manually created domains. No DMA API involvement here.
That won't fly like this. That would break at least the ALSA use case and potentially quite a bunch of others.
Your AMD ALSA weirdness is not using custom iommu_domains (nor should it), it is a different problem.
dma-iommu.c chooses an IOVA alignment based on its own reasoning that is not always compatible with the HW. The HW can optimize if the IOVA alignment meets certain restrictions. Much like page tables in a GPU.
Yeah, but why can't we tell the DMA API those restrictions instead of letting the driver manage the address space themselves?
How do you propose to do this per-mapping operation without having the HW driver actually call the mapping operation?
Same as the multipath, the ATS, and more. It is all per-mapping decisions of the executing HW, not global decisions or something like that.
So the DMA API has some structure or similar to describe the necessary per-mapping properties?
Not fully yet (though some multipath is supported), but I want to slowly move in this direction to solve all of these problems we have :(
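Purely as a speculative sketch of that direction (nothing like this exists in the DMA API today), per-mapping properties could eventually be passed as something like:

    /* Hypothetical per-mapping hints an importer could hand to the DMA API. */
    struct dma_map_hints {
            unsigned long iova_align; /* preferred IOVA alignment, e.g. the HW PTE granule */
            bool use_ats;             /* route through ATS when available */
            bool multipath;           /* allow an alternate PCI path, e.g. P2P via a switch */
    };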
Jason