Am 22.01.25 um 12:04 schrieb Simona Vetter:
On Tue, Jan 21, 2025 at 01:36:33PM -0400, Jason Gunthorpe wrote:
On Tue, Jan 21, 2025 at 05:11:32PM +0100, Simona Vetter wrote:
On Mon, Jan 20, 2025 at 03:48:04PM -0400, Jason Gunthorpe wrote:
On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote:
On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote:
On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:
What is going wrong with your email? You replied to Simona, but Simona Vetter simona.vetter@ffwll.ch is dropped from the To/CC list??? I added the address back, but seems like a weird thing to happen.
Might also be funny mailing list stuff, depending on how you get these. I read mails over lore and pretty much ignore cc (unless it's not also on any list, since those tend to be security issues) because I get cc'ed on way too much stuff for that to be a useful signal.
Oh I see, you are sending a Mail-followup-to header that excludes your address, so you don't get any emails at all.. My mutt is dropping you as well.
I'm having all kinds of funny phenomena with AMD's mail servers since coming back from Xmas vacation.
From the news it looks like Outlook on Windows has a new major security issue where just viewing a mail can compromise the system, and my educated guess is that our IT guys went into panic mode because of this and changed something.
[SNIP] I have been assuming that dmabuf mmap remains unchanged, that exporters will continue to implement that mmap() callback as today.
That sounds really really good to me because that was my major concern when you noted that you want to have PFNs to build up KVM page tables.
But you don't want to handle mmap() on your own, you basically don't want to have a VMA for this stuff at all, correct?
My main interest has been what data structure is produced in the attach APIs.
Eg today we have a struct dma_buf_attachment that returns a sg_table.
I'm expecting some kind of new data structure, let's call it a "physical list", that is some efficient encoding of meta/addr/len tuples that works well with the new DMA API. Matthew has been calling this thing phyr..
I would not use a data structure at all. Instead we should have something like an iterator/cursor based approach similar to what the new DMA API is doing.
So, I imagine, struct dma_buf_attachment gaining an optional feature negotiation and then we have in dma_buf_attachment:
    union {
            struct sg_table *sgt;
            struct physical_list *phyr;
    };
That's basically it, an alternative to scatterlist that has a clean architecture.
I would rather suggest something like dma_buf_attachment() getting the offset and size to map and returning a cursor object you can use to get your address, length and access attributes.
And then you can iterate over this cursor and fill in your importer data structure with the necessary information.
This way neither the exporter nor the importer needs to convert their data back and forth between their specific representations of the information.
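A minimal sketch of how such a cursor flow could look (the names dma_buf_attach_map_cursor(), dma_buf_attach_cursor_next() and the field layout below are purely hypothetical illustrations of the idea, not an existing or proposed API):

    /* Hypothetical cursor the exporter fills in one segment at a time. */
    struct dma_buf_map_cursor {
            u64 offset;          /* current offset into the dma-buf */
            u64 remaining;       /* bytes left to walk */
            u64 addr;            /* address of the current segment */
            u64 len;             /* length of the current segment */
            unsigned long attrs; /* cacheable/encrypted/... attributes */
    };

    /* Exporter sets the cursor up for the range [offset, offset + size). */
    int dma_buf_attach_map_cursor(struct dma_buf_attachment *attach,
                                  u64 offset, u64 size,
                                  struct dma_buf_map_cursor *cur);

    /* Importer walks the range and fills its own structures directly,
     * without any intermediate sg_table/phyr allocation in between. */
    while (dma_buf_attach_cursor_next(attach, cur))
            importer_add_segment(imp, cur->addr, cur->len, cur->attrs);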
Now, if you are asking if the current dmabuf mmap callback can be improved with the above? Maybe? phyr should have the necessary information inside it to populate a VMA - eventually even fully correctly with all the right cacheable/encrypted/forbidden/etc flags.
That won't work like this.
See the exporter needs to be informed about page faults on the VMA to eventually wait for operations to end and sync caches.
Otherwise we either potentially allow access to freed up or re-used memory or run into issues with device cache coherency.
So, you could imagine that exporters could just have one routine to generate the phyr list and that goes into the attachment, goes into some common code to fill VMA PTEs, and some other common code that will convert it into the DMABUF scatterlist. If performance is not a concern with these data structure conversions it could be an appealing simplification.
And yes, I could imagine the meta information being descriptive enough to support the private interconnect cases, the common code could detect private meta information and just cleanly fail.
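Expressed as rough helper signatures (all of the names here are hypothetical, just to make the shape of the idea concrete):

    /* Exporter produces one canonical physical list per attachment. */
    struct phys_list *exporter_get_phys_list(struct dma_buf_attachment *attach);

    /* Common code then consumes that one representation for the different users. */
    int dma_buf_insert_phys_list(struct vm_area_struct *vma,
                                 struct phys_list *plist);           /* fill VMA PTEs */
    struct sg_table *dma_buf_map_phys_list(struct device *dev,
                                           struct phys_list *plist); /* legacy sgt path */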
I'm kinda leaning towards entirely separate dma-buf interfaces for the new phyr stuff, because I fear that adding that to the existing ones will only make the chaos worse. But that aside sounds all reasonable, and even that could just be too much worry on my side and mixing phyr into existing attachments (with a pile of importer/exporter flags probably) is fine.
I lean in the other direction.
Dmitry and Thomas have done a really good job at cleaning up all the interaction between dynamic and static exporters / importers.
Especially the fact that we now have consistent locking for map_dma_buf() and unmap_dma_buf() should make that transition rather straightforward.
For the existing dma-buf importers/exporters I'm kinda hoping for a pure dma_addr_t based list eventually. Going all the way to a phyr based approach for everyone might be too much churn, there's some real bad cruft there. It's not going to work for every case, but it covers a lot of them and might be less pain for existing importers.
The point is we have use cases that won't work without exchanging DMA addresses any more.
For example we have cases where multiple devices are in the same IOMMU domain and re-use their DMA address mappings.
But in theory it should be possible to use phyr everywhere eventually, as long as there's no obviously api-rules-breaking way to go from a phyr back to a struct page even when that exists.
I would rather say we should stick to DMA addresses as much as possible.
What we can do is to add an address space description to the addresses, e.g. whether it's a PCIe bus addr in IOMMU domain X, a device private bus addr, or, in the case of sharing with iommufd and KVM, a PFN.
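As a rough illustration of what such a tag could look like (hypothetical naming, not a proposal):

    /* Hypothetical description of which address space an entry lives in. */
    enum dma_buf_addr_space {
            DMA_BUF_ADDR_PCI_IOVA,       /* PCIe bus addr in IOMMU domain X */
            DMA_BUF_ADDR_DEVICE_PRIVATE, /* device private interconnect addr */
            DMA_BUF_ADDR_PFN,            /* raw PFN, e.g. for iommufd/KVM */
    };

    struct dma_buf_phys_entry {
            enum dma_buf_addr_space space;
            u64 addr;
            u64 len;
    };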
Regards, Christian.
At least the device mapping / dma_buf_attachment side should be doable with just the pfn and the new dma-api?
Yes, that would be my first goal post. Figure out some meta information and a container data structure that allows struct page-less P2P mapping through the new DMA API.
I'm hoping we can get to something where we describe not just how the pfns should be DMA mapped, but also can describe how they should be CPU mapped. For instance that this PFN space is always mapped uncachable, in CPU and in IOMMU.
I was pondering whether dma_mmap and friends would be a good place to prototype this and go for a fully generic implementation. But then even those have _wc/_uncached variants.
Given that the inability to correctly DMA map P2P MMIO without struct page is a current pain point and current source of hacks in dmabuf exporters, I wanted to make resolving that a priority.
However, if you mean what I described above for "fully generic [dmabuf mmap] implementation", then we'd have the phyr datastructure as a dependency to attempt that work.
phyr, and particularly the meta information, has a number of stakeholders. I was thinking of going first with rdma's memory registration flow because we are now pretty close to being able to do such a big change, and it can demonstrate most of the requirements.
But that doesn't mean mmap couldn't go concurrently on the same agreed datastructure if people are interested.
Yeah cpu mmap needs a lot more, going with a very limited p2p use-case first only makes sense.
We also have current bugs in the iommu/vfio side where we are fudging CC stuff, like assuming CPU memory is encrypted (not always true) and that MMIO is non-encrypted (not always true)
tbf CC pte flags I just don't grok at all. I've once tried to understand what current exporters and gpu drivers do and just gave up. But that's also a bit why I'm worried here because it's an enigma to me.
For CC, inside the secure world, is some information if each PFN inside the VM is 'encrypted' or not. Any VM PTE (including the IOPTEs) pointing at the PFN must match the secure world's view of 'encrypted'. The VM can ask the secure world to change its view at runtime.
The way CC has been bolted on to the kernel so far largely hides this from drivers, so it is difficult to tell in driver code if the PFN you have is 'encrypted' or not. Right now the general rule (that is not always true) is that struct page CPU memory is encrypted and everything else is decrypted.
So right now, you can mostly ignore it and the above assumption largely happens for you transparently.
However, soon we will have encrypted P2P MMIO which will stress this hiding strategy.
It's already breaking with stuff like virtual gpu drivers, vmwgfx is fiddling around with these bits (at least last I tried to understand this all) and I think a few others do too.
I thought iommuv2 (or whatever linux calls these) has full fault support and could support current move semantics. But yeah for iommu without fault support we need some kind of pin or a newly formalized revoke model.
No, this is HW dependent, including PCI device, and I'm aware of no HW that fully implements this in a way that could be useful to implement arbitrary move semantics for VFIO..
Hm I thought we've had at least prototypes floating around of device fault repair, but I guess that only works with ATS/pasid stuff and not general iommu traffic from devices. Definitely needs some device cooperation since the timeouts of a full fault are almost endless.
Yes, exactly. What all real devices I'm aware of have done is make a subset of their traffic work with ATS and PRI, but not all their traffic. Without *all* traffic you can't make any generic assumption in the iommu that a transient non-present won't be fatal to the device.
Stuff like dmabuf move semantics rely on transient non-present being non-disruptive...
Ah now I get it, at the iommu level you have to pessimistically assume whether a device can handle a fault, and none can for all traffic. I was thinking too much about the driver level, where generally the dma-bufs you import are only used for the subset of device functions that can cope with faults on many devices.
Cheers, Sima
On Wed, Jan 22, 2025 at 02:29:09PM +0100, Christian König wrote:
I'm having all kind of funny phenomena with AMDs mail servers since coming back from xmas vacation.
:(
A few years back our IT fully migrated our email into the Office 365 cloud and gave up all the crazy half on-prem stuff they were doing. The mail started working perfectly after that, as long as you use MS's servers directly :\
But you don't want to handle mmap() on your own, you basically don't want to have a VMA for this stuff at all, correct?
Right, we have no interest in mmap, VMAs or struct page in rdma/kvm/iommu.
My main interest has been what data structure is produced in the attach APIs.
Eg today we have a struct dma_buf_attachment that returns a sg_table.
I'm expecting some kind of new data structure, let's call it a "physical list", that is some efficient encoding of meta/addr/len tuples that works well with the new DMA API. Matthew has been calling this thing phyr..
I would not use a data structure at all. Instead we should have something like an iterator/cursor based approach similar to what the new DMA API is doing.
I'm certainly open to this idea. There may be some technical challenges, it is a big change from scatterlist today, and function-pointer-per-page sounds like bad performance if there are a lot of pages..
RDMA would probably have to stuff this immediately into something like a phyr anyhow because it needs the full extent of the thing being mapped to figure out what the HW page size and geometry should be - that would be trivial though, and a RDMA problem.
Now, if you are asking if the current dmabuf mmap callback can be improved with the above? Maybe? phyr should have the necessary information inside it to populate a VMA - eventually even fully correctly with all the right cacheable/encrypted/forbidden/etc flags.
That won't work like this.
Note I said "populate a VMA", ie a helper to build the VMA PTEs only.
See the exporter needs to be informed about page faults on the VMA to eventually wait for operations to end and sync caches.
All of this would still have to be provided outside in the same way as today.
For example we have cases where multiple devices are in the same IOMMU domain and re-use their DMA address mappings.
IMHO this is just another flavour of "private" address flow between two cooperating drivers.
It is not a "dma address" in the sense of a dma_addr_t that was output from the DMA API. I think that subtle distinction is very important. When I say pfn/dma address I'm really only talking about standard DMA API flows, used by generic drivers.
IMHO, DMABUF needs a private address "escape hatch", and cooperating drivers should do whatever they want when using that flow. The address is *fully private*, so the co-operating drivers can do whatever they want. iommu_map in exporter and pass an IOVA? Fine! pass a PFN and iommu_map in the importer? Also fine! Private is private.
But in theory it should be possible to use phyr everywhere eventually, as long as there's no obviously api-rules-breaking way to go from a phyr back to a struct page even when that exists.
I would rather say we should stick to DMA addresses as much as possible.
I remain skeptical of this.. Aside from all the technical reasons I already outlined..
I think it is too much work to have the exporters conditionally build all sorts of different representations of the same thing depending on the importer. Like having a lot of DRM drivers generate both a PFN and DMA mapped list in their export code doesn't sound very appealing to me at all.
It makes sense that a driver would be able to conditionally generate private and generic based on negotiation, but IMHO, not more than one flavour of generic..
Jason
Am 22.01.25 um 15:37 schrieb Jason Gunthorpe:
My main interest has been what data structure is produced in the attach APIs.
Eg today we have a struct dma_buf_attachment that returns a sg_table.
I'm expecting some kind of new data structure, let's call it a "physical list", that is some efficient encoding of meta/addr/len tuples that works well with the new DMA API. Matthew has been calling this thing phyr..
I would not use a data structure at all. Instead we should have something like an iterator/cursor based approach similar to what the new DMA API is doing.
I'm certainly open to this idea. There may be some technical challenges, it is a big change from scatterlist today, and function-pointer-per-page sounds like bad performance if there are a lot of pages..
RDMA would probably have to stuff this immediately into something like a phyr anyhow because it needs the full extent of the thing being mapped to figure out what the HW page size and geometry should be - that would be trivial though, and a RDMA problem.
Now, if you are asking if the current dmabuf mmap callback can be improved with the above? Maybe? phyr should have the necessary information inside it to populate a VMA - eventually even fully correctly with all the right cacheable/encrypted/forbidden/etc flags.
That won't work like this.
Note I said "populate a VMA", ie a helper to build the VMA PTEs only.
See the exporter needs to be informed about page faults on the VMA to eventually wait for operations to end and sync caches.
All of this would still have to be provided outside in the same way as today.
For example we have cases where multiple devices are in the same IOMMU domain and re-use their DMA address mappings.
IMHO this is just another flavour of "private" address flow between two cooperating drivers.
Well that's the point. The importer is not cooperating here.
The importer doesn't have the slightest idea that it is sharing its DMA addresses with the exporter.
All the importer gets is: when you want to access this information, use this address here.
It is not a "dma address" in the sense of a dma_addr_t that was output from the DMA API. I think that subtle distinction is very important. When I say pfn/dma address I'm really only talking about standard DMA API flows, used by generic drivers.
IMHO, DMABUF needs a private address "escape hatch", and cooperating drivers should do whatever they want when using that flow. The address is *fully private*, so the co-operating drivers can do whatever they want. iommu_map in exporter and pass an IOVA? Fine! pass a PFN and iommu_map in the importer? Also fine! Private is private.
But in theory it should be possible to use phyr everywhere eventually, as long as there's no obviously api-rules-breaking way to go from a phyr back to a struct page even when that exists.
I would rather say we should stick to DMA addresses as much as possible.
I remain skeptical of this.. Aside from all the technical reasons I already outlined..
I think it is too much work to have the exporters conditionally build all sorts of different representations of the same thing depending on the importer. Like having a lot of DRM drivers generate both a PFN and DMA mapped list in their export code doesn't sound very appealing to me at all.
Well from experience I can say that it is actually the other way around.
We have a very limited number of exporters and a lot of different importers. So having complexity in the exporter instead of the importer is absolutely beneficial.
PFN is the special case, in other words this is the private address passed around. And I will push hard to not support that in the DRM drivers nor any DMA buf heap.
It makes sense that a driver would be able to conditionally generate private and generic based on negotiation, but IMHO, not more than one flavour of generic..
I still strongly think that the exporter should talk with the DMA API to setup the access path for the importer and *not* the importer directly.
Regards, Christian.
Jason
On Wed, Jan 22, 2025 at 03:59:11PM +0100, Christian König wrote:
For example we have cases where multiple devices are in the same IOMMU domain and re-use their DMA address mappings.
IMHO this is just another flavour of "private" address flow between two cooperating drivers.
Well that's the point. The importer is not cooperating here.
If the private address relies on a shared iommu_domain controlled by the driver, then yes, the importer MUST be cooperating. For instance, if you send the same private address into RDMA it will explode because it doesn't have any notion of shared iommu_domain mappings, and it certainly doesn't setup any such shared domains.
The importer doesn't have the slightest idea that it is sharing its DMA addresses with the exporter.
Of course it does. The importer driver would have had to explicitly set this up! The normal kernel behavior is that all drivers get private iommu_domains controlled by the DMA API. If your driver is doing something else *it did it deliberately*.
Some of that mess in tegra host1x around this area is not well structured, it should not be implicitly setting up domains for drivers. It is old code that hasn't been updated to use the new iommu subsystem approach for driver controlled non-DMA API domains.
The new iommu architecture has the probing driver disable the DMA API and can then manipulate its iommu domain however it likes, safely. Ie the probing driver is aware of and participating in disabling the DMA API.
Again, either you are using the DMA API and you work in generic ways with generic devices or it is "private" and only co-operating drivers can interwork with private addresses. A private address must not ever be sent to a DMA API using driver and vice versa.
IMHO this is an important architecture point and why Christoph was frowning on abusing dma_addr_t to represent things that did NOT come out of the DMA API.
We have a very limited number of exporters and a lot of different importers. So having complexity in the exporter instead of the importer is absolutely beneficial.
Isn't every DRM driver both an importer and exporter? That is what I was expecting at least..
I still strongly think that the exporter should talk with the DMA API to setup the access path for the importer and *not* the importer directly.
It is contrary to the design of the new API which wants to co-optimize mapping and HW setup together as one unit.
For instance in RDMA we want to hint and control the way the IOMMU mapping works in the DMA API to optimize the RDMA HW side. I can't do those optimizations if I'm not in control of the mapping.
The same is probably true on the GPU side too, you want IOVAs that have tidy alignment with your PTE structure, but only the importer understands its own HW to make the correct hints to the DMA API.
Jason
Am 23.01.25 um 14:59 schrieb Jason Gunthorpe:
On Wed, Jan 22, 2025 at 03:59:11PM +0100, Christian König wrote:
For example we have cases where multiple devices are in the same IOMMU domain and re-use their DMA address mappings.
IMHO this is just another flavour of "private" address flow between two cooperating drivers.
Well that's the point. The importer is not cooperating here.
If the private address relies on a shared iommu_domain controlled by the driver, then yes, the importer MUST be cooperating. For instance, if you send the same private address into RDMA it will explode because it doesn't have any notion of shared iommu_domain mappings, and it certainly doesn't setup any such shared domains.
Hui? Why the heck should a driver own its iommu domain?
The domain is owned and assigned by the PCI subsystem under Linux.
The importer doesn't have the slightest idea that it is sharing its DMA addresses with the exporter.
Of course it does. The importer driver would have had to explicitly set this up! The normal kernel behavior is that all drivers get private iommu_domains controlled by the DMA API. If your driver is doing something else *it did it deliberately*.
As far as I know that is simply not correct. Currently IOMMU domains/groups are usually shared between devices.
Especially multi function devices get only a single IOMMU domain.
Some of that mess in tegra host1x around this area is not well structured, it should not be implicitly setting up domains for drivers. It is old code that hasn't been updated to use the new iommu subsystem approach for driver controlled non-DMA API domains.
The new iommu architecture has the probing driver disable the DMA API and can then manipulate its iommu domain however it likes, safely. Ie the probing driver is aware of and participating in disabling the DMA API.
Why the heck should we do this?
That drivers manage all of that on their own sounds like a massive step in the wrong direction.
Again, either you are using the DMA API and you work in generic ways with generic devices or it is "private" and only co-operating drivers can interwork with private addresses. A private address must not ever be sent to a DMA API using driver and vice versa.
IMHO this is an important architecture point and why Christoph was frowning on abusing dma_addr_t to represent things that did NOT come out of the DMA API.
We have a very limited number of exporters and a lot of different importers. So having complexity in the exporter instead of the importer is absolutely beneficial.
Isn't every DRM driver both an importer and exporter? That is what I was expecting at least..
I still strongly think that the exporter should talk with the DMA API to setup the access path for the importer and *not* the importer directly.
It is contrary to the design of the new API which wants to co-optimize mapping and HW setup together as one unit.
Yeah and I'm really questioning this design goal. That sounds like totally going into the wrong direction just because of the RDMA drivers.
For instance in RDMA we want to hint and control the way the IOMMU mapping works in the DMA API to optimize the RDMA HW side. I can't do those optimizations if I'm not in control of the mapping.
Why? What is the technical background here?
The same is probably true on the GPU side too, you want IOVAs that have tidy alignment with your PTE structure, but only the importer understands its own HW to make the correct hints to the DMA API.
Yeah but then express those as requirements to the DMA API and not move all the important decisions into the driver where they are implemented over and over again and potentially broken half the time.
See drivers are supposed to be simple, small and stupid. They should be controlled by the core OS and not allowed to do whatever they want.
Driver developers cannot be trusted to always get everything right if you make it as complicated as this.
Regards, Christian.
Jason
On Thu, Jan 23, 2025 at 03:35:21PM +0100, Christian König wrote:
Sending it as text mail once more.
Am 23.01.25 um 15:32 schrieb Christian König:
Am 23.01.25 um 14:59 schrieb Jason Gunthorpe:
On Wed, Jan 22, 2025 at 03:59:11PM +0100, Christian König wrote:
For example we have cases where multiple devices are in the same IOMMU domain and re-use their DMA address mappings.
IMHO this is just another flavour of "private" address flow between two cooperating drivers.
Well that's the point. The importer is not cooperating here.
If the private address relies on a shared iommu_domain controlled by the driver, then yes, the importer MUST be cooperating. For instance, if you send the same private address into RDMA it will explode because it doesn't have any notion of shared iommu_domain mappings, and it certainly doesn't setup any such shared domains.
Hui? Why the heck should a driver own its iommu domain?
I don't know, you are the one saying the drivers have special shared iommu_domains so DMA BUF needs some special design to accommodate it.
I'm aware that DRM drivers do directly call into the iommu subsystem and do directly manage their own IOVA. I assumed this is what you were talking about. See below.
The domain is owned and assigned by the PCI subsystem under Linux.
That domain is *exclusively* owned by the DMA API and is only accessed via maps created by DMA API calls.
If you are using the DMA API correctly then all of this is abstracted and none of it matters to you. There is no concept of "shared domains" in the DMA API.
You call the DMA API, you get a dma_addr_t that is valid for a *single* device, you program it in HW. That is all. There is no reason to dig deeper than this.
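For reference, that contract is just the regular streaming DMA API; a minimal sketch (error paths trimmed, hw_queue_write() is a made-up driver helper):

    #include <linux/dma-mapping.h>

    /* Map a CPU buffer for DMA by one specific device. */
    dma_addr_t dma = dma_map_single(dev, buf, size, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, dma))
            return -ENOMEM;

    /* 'dma' is only meaningful for 'dev'; program it into that device's
     * HW and nothing else. */
    hw_queue_write(hwq, dma, size);

    dma_unmap_single(dev, dma, size, DMA_TO_DEVICE);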
The importer doesn't have the slightest idea that it is sharing its DMA addresses with the exporter.
Of course it does. The importer driver would have had to explicitly set this up! The normal kernel behavior is that all drivers get private iommu_domains controlled by the DMA API. If your driver is doing something else *it did it deliberately*.
As far as I know that is simply not correct. Currently IOMMU domains/groups are usually shared between devices.
No, the opposite. The iommu subsystem tries to maximally isolate devices up to the HW limit.
On server platforms every device is expected to get its own iommu domain.
Especially multi function devices get only a single IOMMU domain.
Only if the PCI HW doesn't support ACS.
This is all DMA API internal details you shouldn't even be talking about at the DMA BUF level. It is all hidden and simply does not matter to DMA BUF at all.
The new iommu architecture has the probing driver disable the DMA API and can then manipulate its iommu domain however it likes, safely. Ie the probing driver is aware of and participating in disabling the DMA API.
Why the heck should we do this?
That drivers manage all of that on their own sounds like a massive step in the wrong direction.
I am talking about DRM drivers that HAVE to manage their own for some reason I don't know. eg:
    drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c:  tdev->iommu.domain = iommu_domain_alloc(&platform_bus_type);
    drivers/gpu/drm/msm/msm_iommu.c:                     domain = iommu_paging_domain_alloc(dev);
    drivers/gpu/drm/rockchip/rockchip_drm_drv.c:         private->domain = iommu_paging_domain_alloc(private->iommu_dev);
    drivers/gpu/drm/tegra/drm.c:                         tegra->domain = iommu_paging_domain_alloc(dma_dev);
    drivers/gpu/host1x/dev.c:                            host->domain = iommu_paging_domain_alloc(host->dev);
Normal simple drivers should never be calling these functions!
If you are calling these functions you are not using the DMA API, and, yes, some cases like tegra host1x are actively sharing these special domains across multiple devices and drivers.
If you want to pass an IOVA in one of these special driver-created domains then it would be some private address in DMABUF that only works on drivers that have understood they attached to these manually created domains. No DMA API involvement here.
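A minimal sketch of that driver-owned-domain pattern, assuming the probing driver has deliberately taken the device away from DMA API control (the IOVA allocation and how the address is handed to a cooperating importer are left out / hypothetical):

    struct iommu_domain *dom = iommu_paging_domain_alloc(dev);
    if (IS_ERR(dom))
            return PTR_ERR(dom);

    /* The driver now owns the translation for this device. */
    ret = iommu_attach_device(dom, dev);
    if (ret) {
            iommu_domain_free(dom);
            return ret;
    }

    /* The driver manages its own IOVA space and mappings... */
    ret = iommu_map(dom, iova, paddr, size,
                    IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);

    /* ...and 'iova' is a private address: only importers that know they
     * are attached to this same driver-created domain may ever use it. */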
I still strongly think that the exporter should talk with the DMA API to setup the access path for the importer and *not* the importer directly.
It is contrary to the design of the new API which wants to co-optimize mapping and HW setup together as one unit.
Yeah and I'm really questioning this design goal. That sounds like totally going into the wrong direction just because of the RDMA drivers.
Actually it is storage that motivates this. It is just pointless to allocate a dma_addr_t list in the fast path when you don't need it. You can stream the dma_addr_t directly into HW structures that are necessary and already allocated.
For instance in RDMA we want to hint and control the way the IOMMU mapping works in the DMA API to optimize the RDMA HW side. I can't do those optimizations if I'm not in control of the mapping.
Why? What is the technical background here?
dma-iommu.c chooses an IOVA alignment based on its own reasoning that is not always compatible with the HW. The HW can optimize if the IOVA alignment meets certain restrictions. Much like page tables in a GPU.
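A toy illustration of why the importer cares: the HW can only use a large page size if the IOVA the DMA layer picked is aligned to it, so a poorly aligned IOVA forces the HW page tables down to small pages (the helper below only illustrates the selection logic, it is not kernel code):

    /* pgsz_bitmap: bit N set means the HW page tables support 2^N byte pages.
     * Return the largest HW page size usable for a mapping that starts at
     * 'iova' and is 'len' bytes long, or 0 if none fits. */
    static unsigned long best_hw_page_size(unsigned long pgsz_bitmap,
                                           unsigned long iova, unsigned long len)
    {
            /* The start address and the length limit the usable alignment. */
            unsigned long align = 1UL << __ffs(iova | len);
            /* Keep only the supported page sizes that are <= that alignment. */
            unsigned long usable = pgsz_bitmap & (align | (align - 1));

            return usable ? 1UL << __fls(usable) : 0;
    }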
The same is probably true on the GPU side too, you want IOVAs that have tidy alignment with your PTE structure, but only the importer understands its own HW to make the correct hints to the DMA API.
Yeah but then express those as requirements to the DMA API and not move all the important decisions into the driver where they are implemented over and over again and potentially broken half the time.
It would be in the DMA API, just the per-mapping portion of the API.
Same as the multipath, the ATS, and more. It is all per-mapping decisions of the executing HW, not global decisions or something like that.
Jason
Am 23.01.25 um 16:02 schrieb Jason Gunthorpe:
On Thu, Jan 23, 2025 at 03:35:21PM +0100, Christian König wrote:
Sending it as text mail once more.
Am 23.01.25 um 15:32 schrieb Christian König:
Am 23.01.25 um 14:59 schrieb Jason Gunthorpe:
On Wed, Jan 22, 2025 at 03:59:11PM +0100, Christian König wrote:
For example we have cases where multiple devices are in the same IOMMU domain and re-use their DMA address mappings.
IMHO this is just another flavour of "private" address flow between two cooperating drivers.
Well that's the point. The importer is not cooperating here.
If the private address relies on a shared iommu_domain controlled by the driver, then yes, the importer MUST be cooperating. For instance, if you send the same private address into RDMA it will explode because it doesn't have any notion of shared iommu_domain mappings, and it certainly doesn't setup any such shared domains.
Hui? Why the heck should a driver own its iommu domain?
I don't know, you are the one saying the drivers have special shared iommu_domains so DMA BUF needs some special design to accommodate it.
I'm aware that DRM drivers do directly call into the iommu subsystem and do directly manage their own IOVA. I assumed this is what you were talking about. See below.
No, no, there are many more cases where drivers simply assume that they are in the same iommu domain for different devices. E.g. that different PCI endpoints can use the same dma_addr_t.
For example those classic sound devices for HDMI audio on graphics cards work like this. It's been a very long time since I looked into that, but I think this is even a HW limitation.
In other words if the device handled by the generic ALSA driver and the GPU are not in the same iommu domain you run into trouble.
The domain is owned and assigned by the PCI subsystem under Linux.
That domain is *exclusively* owned by the DMA API and is only accessed via maps created by DMA API calls.
If you are using the DMA API correctly then all of this is abstracted and none of it matters to you. There is no concept of "shared domains" in the DMA API.
Well it might never have been documented, but I know of quite a bunch of different cases that assume that a DMA addr will just ultimately work for some other device/driver as well.
Off hand I know at least the generic ALSA driver case, some V4L driver (but that might use the same PCI endpoint, not 100% sure) and a multi GPU case which works like this.
You call the DMA API, you get a dma_addr_t that is valid for a *single* device, you program it in HW. That is all. There is no reason to dig deeper than this.
The importer doesn't have the slightest idea that it is sharing its DMA addresses with the exporter.
Of course it does. The importer driver would have had to explicitly set this up! The normal kernel behavior is that all drivers get private iommu_domains controlled by the DMA API. If your driver is doing something else *it did it deliberately*.
As far as I know that is simply not correct. Currently IOMMU domains/groups are usually shared between devices.
No, the opposite. The iommu subsystem tries to maximally isolate devices up to the HW limit.
On server platforms every device is expected to get its own iommu domain.
Especially multi function devices get only a single IOMMU domain.
Only if the PCI HW doesn't support ACS.
Ah, yes that can certainly be.
This is all DMA API internal details you shouldn't even be talking about at the DMA BUF level. It is all hidden and simply does not matter to DMA BUF at all.
Well we somehow need to support the existing use cases with the new API.
The new iommu architecture has the probing driver disable the DMA API and can then manipulate its iommu domain however it likes, safely. Ie the probing driver is aware of and participating in disabling the DMA API.
Why the heck should we do this?
That drivers manage all of that on their own sounds like a massive step in the wrong direction.
I am talking about DRM drivers that HAVE to manage their own for some reason I don't know. eg:
    drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c:  tdev->iommu.domain = iommu_domain_alloc(&platform_bus_type);
    drivers/gpu/drm/msm/msm_iommu.c:                     domain = iommu_paging_domain_alloc(dev);
    drivers/gpu/drm/rockchip/rockchip_drm_drv.c:         private->domain = iommu_paging_domain_alloc(private->iommu_dev);
    drivers/gpu/drm/tegra/drm.c:                         tegra->domain = iommu_paging_domain_alloc(dma_dev);
    drivers/gpu/host1x/dev.c:                            host->domain = iommu_paging_domain_alloc(host->dev);
Normal simple drivers should never be calling these functions!
If you are calling these functions you are not using the DMA API, and, yes, some cases like tegra host1x are actively sharing these special domains across multiple devices and drivers.
If you want to pass an IOVA in one of these special driver-created domains then it would be some private address in DMABUF that only works on drivers that have understood they attached to these manually created domains. No DMA API involvement here.
That won't fly like this. That would break at least the ALSA use case and potentially quite a bunch of others.
I still strongly think that the exporter should talk with the DMA API to setup the access path for the importer and *not* the importer directly.
It is contrary to the design of the new API which wants to co-optimize mapping and HW setup together as one unit.
Yeah and I'm really questioning this design goal. That sounds like totally going into the wrong direction just because of the RDMA drivers.
Actually it is storage that motivates this. It is just pointless to allocate a dma_addr_t list in the fast path when you don't need it. You can stream the dma_addr_t directly into HW structures that are necessary and already allocated.
That's what I can 100% agree on.
For GPUs it's basically the same, e.g. converting from the dma_addr_t to your native representation is just additional overhead nobody needs.
For instance in RDMA we want to hint and control the way the IOMMU mapping works in the DMA API to optimize the RDMA HW side. I can't do those optimizations if I'm not in control of the mapping.
Why? What is the technical background here?
dma-iommu.c chooses an IOVA alignment based on its own reasoning that is not always compatible with the HW. The HW can optimize if the IOVA alignment meets certain restrictions. Much like page tables in a GPU.
Yeah, but why can't we tell the DMA API those restrictions instead of letting the driver manage the address space themselves?
The same is probably true on the GPU side too, you want IOVAs that have tidy alignment with your PTE structure, but only the importer understands its own HW to make the correct hints to the DMA API.
Yeah but then express those as requirements to the DMA API and not move all the important decisions into the driver where they are implemented over and over again and potentially broken half the time.
It would be in the DMA API, just the per-mapping portion of the API.
Same as the multipath, the ATS, and more. It is all per-mapping decisions of the executing HW, not global decisions or something like that.
So the DMA API has some structure or similar to describe the necessary per-mapping properties?
Regards, Christian.
Jason
On Thu, Jan 23, 2025 at 04:48:29PM +0100, Christian König wrote:
No, no, there are many more cases where drivers simply assume that they are in the same iommu domain for different devices.
This is an illegal assumption and invalid way to use the DMA API. Do not do that, do not architect things in DMABUF to permit that.
The dma_addr_t out of the DMA API is only usable by the device passed in, period full stop. If you want to use it with two devices then call the DMA API twice.
E.g. that different PCI endpoints can use the same dma_addr_t.
For example those classic sound devices for HDMI audio on graphics cards work like this. In other words if the device handled by the generic ALSA driver and the GPU are not in the same iommu domain you run into trouble.
Yes, I recall this weird AMD issue as well. IIRC the solution is not clean or "correct". :( I vaguely recall it was caused by a HW bug...
Well it might never have been documented, but I know of quite a bunch of different cases that assume that a DMA addr will just ultimately work for some other device/driver as well.
Again, illegal assumption, breaks the abstraction.
This is all DMA API internal details you shouldn't even be talking about at the DMA BUF level. It is all hidden and simply does not matter to DMA BUF at all.
Well we somehow need to support the existing use cases with the new API.
Call the DMA API multiple times, once per device. That is the only correct way to handle this today. DMABUF is already architected like this, each and every attach should be dma mapping and generating a scatterlist for every unique importing device.
Improving it to somehow avoid the redundant DMA API map would require new DMA API work.
Do NOT randomly assume that devices share dma_addr_t, there is no architected way to ever discover this, it is a complete violation of all the API abstractions.
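That per-importing-device mapping is what the existing dma-buf attach flow already expresses; roughly (real dma-buf API, error handling omitted):

    #include <linux/dma-buf.h>

    /* Device A and device B each get their own attachment and mapping
     * of the same buffer. */
    struct dma_buf_attachment *a = dma_buf_attach(dmabuf, dev_a);
    struct dma_buf_attachment *b = dma_buf_attach(dmabuf, dev_b);

    /* Each map goes through the DMA API for that specific device, so the
     * resulting dma_addr_t values are only valid for dev_a / dev_b
     * respectively, even if they happen to be numerically identical. */
    struct sg_table *sgt_a = dma_buf_map_attachment_unlocked(a, DMA_BIDIRECTIONAL);
    struct sg_table *sgt_b = dma_buf_map_attachment_unlocked(b, DMA_BIDIRECTIONAL);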
If you want to pass an IOVA in one of these special driver-created domains then it would be some private address in DMABUF that only works on drivers that have understood they attached to these manually created domains. No DMA API involvement here.
That won't fly like this. That would break at least the ALSA use case and potentially quite a bunch of others.
Your AMD ALSA weirdness is not using custom iommu_domains (nor should it), it is a different problem.
dma-iommu.c chooses an IOVA alignment based on its own reasoning that is not always compatible with the HW. The HW can optimize if the IOVA alignment meets certain restrictions. Much like page tables in a GPU.
Yeah, but why can't we tell the DMA API those restrictions instead of letting the driver manage the address space themselves?
How do you propose to do this per-mapping operation without having the HW driver actually call the mapping operation?
Same as the multipath, the ATS, and more. It is all per-mapping decisions of the executing HW, not global decisions or something like that.
So the DMA API has some structure or similar to describe the necessary per-mapping properties?
Not fully yet (though some multipath is supported), but I want to slowly move in this direction to solve all of these problems we have :(
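Purely as a speculative sketch of that direction (nothing like this exists in the DMA API today), per-mapping properties could eventually be passed as something like:

    /* Hypothetical per-mapping hints an importer could hand to the DMA API. */
    struct dma_map_hints {
            unsigned long iova_align; /* preferred IOVA alignment, e.g. the HW PTE granule */
            bool use_ats;             /* route through ATS when available */
            bool multipath;           /* allow an alternate PCI path, e.g. P2P via a switch */
    };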
Jason