On 17.01.25 at 15:42, Simona Vetter wrote:
On Wed, Jan 15, 2025 at 11:06:53AM +0100, Christian König wrote:
[SNIP]
Anything missing?
Well as far as I can see this use case is not a good fit for the DMA-buf interfaces in the first place. DMA-buf deals with devices and buffer exchange.
What's necessary here instead is to give an importing VM full access to some memory for its specific use case.
That full access includes CPU and DMA mappings, modifying caching attributes, potentially setting encryption keys for specific ranges, etc.
In other words, the importer here needs to be able to do a lot of things that we don't want most DMA-buf importers to do.
This proposal isn't about forcing existing exporters to allow importers to do new stuff. That stays as-is, because it would break things.
It's about adding yet another interface to get at the underlying data, and we have tons of those already. The only difference is that if we don't butcher the design, we'll be able to implement all the existing dma-buf interfaces on top of this new pfn interface, for some neat maximal compatibility.
That sounds like you missed my concern.
When an exporter and an importer agree that they want to exchange PFNs instead of DMA addresses then that is perfectly fine.
The problems start when you define the semantics of how those PFNs, DMA addresses, private bus addresses, or whatever are exchanged differently from what we have documented for DMA-buf.
These semantics are very well defined for DMA-buf now, because that is really important; otherwise things usually seem to work under testing (e.g. without memory pressure) and then totally fall apart in production environments.
In other words, we have defined which lock you need to hold when calling which functions, what a DMA fence is, when exchanged addresses are valid, etc.
But fundamentally there's never been an expectation that you can take any arbitrary dma-buf, pass it to any arbitrary importer, and have it work. The fundamental promise is that if it _does_ work, then
- it's zero copy
- and fast, or as fast as we can make it
I don't see this as any different from all the much more specific proposals and existing code, where a subset of importers/exporters have special rules so that e.g. gpu interconnect or vfio uuid based sharing works. PFN-based sharing is just yet another flavor that exists to get the maximum amount of speed out of interconnects.
Please take another look at what is proposed here. The function is called dma_buf_get_pfn_*unlocked* !
This is not following DMA-buf semantics for exchanging addresses and keeping them valid, but rather something more like userptrs.
Inserting PFNs into CPU (or probably also IOMMU) page tables has different semantics than what DMA-buf usually does, because as soon as the address is written into the page table it is made public. So you need some kind of mechanism to make sure that the address you made public stays valid for as long as it is public.
The usual I/O operations we encapsulate with DMA fences have fundamentally different semantics, because we have the lock which enforces that things stay valid and then a DMA fence which notes how long they need to stay valid for an operation to complete.
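To spell out what I mean by those documented semantics, here is a very rough sketch of a dynamic importer (nothing from this series, all the my_* names are made up): addresses are only handed out under the reservation lock, and their lifetime is bounded by a fence the exporter can wait on before moving the buffer.

/*
 * Rough sketch only; my_importer_map() and the cached *out_sgt are
 * made-up placeholders, not code from this series.
 */
#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/dma-fence.h>

static int my_importer_map(struct dma_buf_attachment *attach,
			   struct dma_fence *job_done_fence,
			   struct sg_table **out_sgt)
{
	struct dma_buf *dmabuf = attach->dmabuf;
	struct sg_table *sgt;
	int ret;

	/* Addresses are only handed out while the reservation lock is held. */
	ret = dma_resv_lock(dmabuf->resv, NULL);
	if (ret)
		return ret;

	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt)) {
		ret = PTR_ERR(sgt);
		goto out_unlock;
	}

	/*
	 * The fence tells the exporter how long the addresses stay in use;
	 * the backing store must not move before it signals.
	 */
	ret = dma_resv_reserve_fences(dmabuf->resv, 1);
	if (ret) {
		dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
		goto out_unlock;
	}
	dma_resv_add_fence(dmabuf->resv, job_done_fence, DMA_RESV_USAGE_WRITE);

	/* The importer keeps the mapping cached until move_notify. */
	*out_sgt = sgt;
out_unlock:
	dma_resv_unlock(dmabuf->resv);
	return ret;
}

An unlocked PFN getter sidesteps exactly this lock/fence lifetime contract, which is why it smells like userptr to me.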
Regards, Christian.
Cheers, Sima
The semantics for things like pin vs revocable vs dynamic/moveable seem similar, but that's basically it.
As far as I know the TEE subsystem also represents its allocations as file descriptors. If I'm not completely mistaken this use case most likely fits better there.
I feel like this is small enough that the m-l archives are good enough. For some of the bigger projects we do in graphics we sometimes create entries in our kerneldoc with wip design consensus and things like that. But feels like overkill here.
My general desire is to move all of RDMA's MR process away from scatterlist and work using only the new DMA API. This will save *huge* amounts of memory in common workloads and be the basis for non-struct page DMA support, including P2P.
Yeah, a more memory-efficient structure than the scatterlist would be really nice. That would even benefit the very special dma-buf exporters where you cannot get a pfn but only the dma_addr_t, although most of those (maybe even all?) have contig buffers, so your scatterlist has only one entry. But it would definitely be nice from a design pov.
Completely agree on that part.
Scatterlists have some design flaws, especially mixing the input and output parameters of the DMA API into the same structure.
In addition to that, DMA addresses are basically missing information about which bus they belong to and details of how the access should be made (e.g. snoop vs no-snoop, etc.).
Aside: A way to more efficiently create compressed scatterlists would be neat too, because a lot of drivers hand-roll that and it's a bit brittle and kinda silly to duplicate. With compressed I mean just a single entry for a contig range; in practice, thanks to huge pages/folios and allocators trying to hand out contig ranges when there's plenty of memory, that saves a lot of memory too. But currently it's a bit of a pain to construct these efficiently, mostly it's just a two-pass approach and then trying to free surplus memory or krealloc to fit. Also I don't have good ideas here, but dma-api folks might have some from looking at too many things that create scatterlists.
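Just to spell out the kind of hand-rolled two-pass construction I mean (purely illustrative sketch, not from any driver, build_compressed_sgt() is a made-up name):

#include <linux/scatterlist.h>
#include <linux/mm.h>

static int build_compressed_sgt(struct sg_table *sgt, struct page **pages,
				unsigned int npages, gfp_t gfp)
{
	struct scatterlist *sg;
	unsigned int i, runs = 1, run_len;
	int ret;

	if (!npages)
		return -EINVAL;

	/* Pass 1: count contiguous runs of pages. */
	for (i = 1; i < npages; i++)
		if (page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1)
			runs++;

	ret = sg_alloc_table(sgt, runs, gfp);
	if (ret)
		return ret;

	/* Pass 2: fill exactly one entry per contiguous run. */
	sg = sgt->sgl;
	run_len = 1;
	for (i = 1; i <= npages; i++) {
		if (i < npages &&
		    page_to_pfn(pages[i]) == page_to_pfn(pages[i - 1]) + 1) {
			run_len++;
			continue;
		}
		sg_set_page(sg, pages[i - run_len], run_len * PAGE_SIZE, 0);
		sg = sg_next(sg);
		run_len = 1;
	}
	return 0;
}

iirc sg_alloc_table_from_pages() already does this kind of coalescing for the plain page-array case, which is part of why all the open-coded variants elsewhere feel like silly duplication.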
I mailed with Christoph about that a while back as well and we both agreed that it would probably be a good idea to start defining a data structure to better encapsulate DMA addresses.
It's just that nobody had time for that yet and/or I wasn't looped in on the final discussion about it.
Regards, Christian.
-Sima
On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:
What is going wrong with your email? You replied to Simona, but Simona Vetter simona.vetter@ffwll.ch is dropped from the To/CC list??? I added the address back, but seems like a weird thing to happen.
Please take another look at what is proposed here. The function is called dma_buf_get_pfn_*unlocked* !
I don't think Simona and I are defending the implementation in this series. This series needs work.
We have been talking about what the implementation should be. I think we've all been clear on the idea that the DMA buf locking rules should apply to any description of the memory, regardless of whether the addresses are CPU, DMA, or private.
I agree that any "get unlocked" concept seems nonsensical and wrong within dmabuf.
Inserting PFNs into CPU (or probably also IOMMU) page tables has different semantics than what DMA-buf usually does, because as soon as the address is written into the page table it is made public.
Not really.
The KVM/CPU side is fully compatible with move semantics: it has restartable page faults and can implement dmabuf's move locking scheme. It can use the resv lock, the fences, move_notify and so on to implement it. It is a bug if this series isn't doing that.
The iommu cannot support move semantics. It would need the existing pin semantics (i.e. we call dma_buf_pin() and don't support move_notify). To work with VFIO we would need to formalize the revoke semantics that Simona was discussing.
We already implement both of these modalities in rdma; the locking API is fine and workable with CPU pfns just as well.
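Roughly, the two modalities look like this (sketch only; the my_* names are placeholders and attach/teardown details are elided):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

static void my_invalidate_mappings(void *importer_priv);

/*
 * Modality 1: movable import (what KVM/the CPU can do). The exporter may
 * move the backing store and tells us through move_notify; we tear down
 * our mappings and re-map on the next (restartable) fault.
 */
static void my_move_notify(struct dma_buf_attachment *attach)
{
	/* Called by the exporter with the dma_resv lock held. */
	my_invalidate_mappings(attach->importer_priv);
}

static const struct dma_buf_attach_ops my_movable_ops = {
	.allow_peer2peer = true,
	.move_notify = my_move_notify,
};
/* attach = dma_buf_dynamic_attach(dmabuf, dev, &my_movable_ops, priv); */

/*
 * Modality 2: pinned import (what the iommu/VFIO path needs). No
 * move_notify; the memory must not move until dma_buf_unpin(). Revoke
 * would have to be layered on top of this.
 */
static int my_pin_attachment(struct dma_buf_attachment *attach)
{
	int ret;

	ret = dma_resv_lock(attach->dmabuf->resv, NULL);
	if (ret)
		return ret;
	ret = dma_buf_pin(attach);
	dma_resv_unlock(attach->dmabuf->resv);
	return ret;
}

Both follow the same dmabuf locking rules; the only difference is whether the importer can react to move_notify or has to pin.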
I've imagined a staged flow here:
1) The new DMA API lands.
2) We improve the new DMA API to be fully struct page free, including setting up P2P.
3) VFIO provides a dmabuf exporter using the new DMA API's P2P support. We'd have to continue with the scatterlist hacks for now. VFIO would be a move_notify exporter. This should work with RDMA.
4) RDMA works to drop scatterlist from its internal flows and use the new DMA API instead.
5) VFIO/RDMA implement a new non-scatterlist DMABUF op to demonstrate the post-scatterlist world and deprecate the scatterlist hacks.
6) We add revoke semantics to dmabuf, and VFIO/RDMA implement them.
7) iommufd can import a pinnable, revokable dmabuf using CPU pfns through the non-scatterlist op.
8) Relevant GPU drivers implement the non-scatterlist op and RDMA removes support for the deprecated scatterlist hacks.
Xu's series has jumped ahead a bit and is missing infrastructure to build it properly.
Jason