On 22.01.25 12:04, Simona Vetter wrote:
> On Tue, Jan 21, 2025 at 01:36:33PM -0400, Jason Gunthorpe wrote:
>> On Tue, Jan 21, 2025 at 05:11:32PM +0100, Simona Vetter wrote:
>>> On Mon, Jan 20, 2025 at 03:48:04PM -0400, Jason Gunthorpe wrote:
>>>> On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote:
>>>>> On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote:
>>>>>> On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:
>>>>>> What is going wrong with your email? You replied to Simona, but
>>>>>> Simona Vetter <simona.vetter(a)ffwll.ch> is dropped from the To/CC
>>>>>> list??? I added the address back, but seems like a weird thing to
>>>>>> happen.
>>>>> Might also be funny mailing list stuff, depending how you get these. I
>>>>> read mails over lore and pretty much ignore cc (unless it's not also on
>>>>> any list, since those tend to be security issues) because I get cc'ed on
>>>>> way too much stuff for that to be a useful signal.
>>>> Oh I see, you are sending a Mail-followup-to header that excludes your
>>>> address, so you don't get any emails at all.. My mutt is dropping you
>>>> as well.
I'm having all kinds of funny phenomena with AMD's mail servers since
coming back from Christmas vacation.
From the news it looks like Outlook on Windows has a new major security
issue where just viewing a mail can compromise the system, and my
educated guess is that our IT guys went into panic mode because of this
and changed something.
>> [SNIP]
>> I have been assuming that dmabuf mmap remains unchanged, that
>> exporters will continue to implement that mmap() callback as today.
That sounds really really good to me because that was my major concern
when you noted that you want to have PFNs to build up KVM page tables.
But you don't want to handle mmap() on your own, you basically don't
want to have a VMA for this stuff at all, correct?
>> My main interest has been what data structure is produced in the
>> attach APIs.
>>
>> Eg today we have a struct dma_buf_attachment that returns a sg_table.
>>
>> I'm expecting some kind of new data structure, let's call it "physical
>> list", that is some efficient coding of meta/addr/len tuples that works
>> well with the new DMA API. Matthew has been calling this thing phyr..
I would not use a data structure at all. Instead we should have
something like an iterator/cursor based approach similar to what the new
DMA API is doing.
>> So, I imagine, struct dma_buf_attachment gaining an optional
>> feature negotiation and then we have in dma_buf_attachment:
>>
>> union {
>>         struct sg_table *sgt;
>>         struct physical_list *phyr;
>> };
>>
>> That's basically it, an alternative to scatterlist that has a clean
>> architecture.
I would rather suggest that dma_buf_attachment() gets an offset and
size to map and returns a cursor object you can use to get your
address, length and access attributes.
And then you can iterate over this cursor and fill in your importer data
structure with the necessary information.
This way neither the exporter nor the importer needs to convert data
back and forth between their specific representations of the
information.
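
To sketch the idea (all function names and the cursor layout below are
invented for illustration, this is not a worked-out API):

struct dma_buf_cursor {
	u64 addr;		/* address of the current chunk */
	size_t len;		/* length of the current chunk */
	unsigned long attrs;	/* access attributes, e.g. snoop vs no-snoop */
};

/* The exporter sets up a cursor covering [offset, offset + size). */
int dma_buf_attach_cursor(struct dma_buf_attachment *attach, u64 offset,
			  size_t size, struct dma_buf_cursor *cur);

/* Advance to the next chunk, returns false when the range is exhausted. */
bool dma_buf_cursor_next(struct dma_buf_attachment *attach,
			 struct dma_buf_cursor *cur);

The importer then fills its own structure while iterating, e.g. with
some hypothetical importer_add_range(imp, cur.addr, cur.len, cur.attrs)
in a do/while loop, and no intermediate list ever needs to be allocated.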
>> Now, if you are asking if the current dmabuf mmap callback can be
>> improved with the above? Maybe? phyr should have the necessary
>> information inside it to populate a VMA - eventually even fully
>> correctly with all the right cachable/encrypted/forbidden/etc flags.
That won't work like this.
See, the exporter needs to be informed about page faults on the VMA so
it can eventually wait for operations to end and sync caches.
Otherwise we either potentially allow access to freed up or re-used
memory or run into issues with device cache coherency.
>> So, you could imagine that exporters could just have one routine to
>> generate the phyr list and that goes into the attachment, goes into
>> some common code to fill VMA PTEs, and some other common code that
>> will convert it into the DMABUF scatterlist. If performance is not a
>> concern with these data structure conversions it could be an appealing
>> simplification.
>>
>> And yes, I could imagine the meta information being descriptive enough
>> to support the private interconnect cases, the common code could
>> detect private meta information and just cleanly fail.
> I'm kinda leaning towards entirely separate dma-buf interfaces for the new
> phyr stuff, because I fear that adding that to the existing ones will only
> make the chaos worse. But that aside sounds all reasonable, and even that
> could just be too much worry on my side and mixing phyr into existing
> attachments (with a pile of importer/exporter flags probably) is fine.
I lean in the other direction.
Dmitry and Thomas have done a really good job at cleaning up all the
interaction between dynamic and static exporters / importers.
Especially that we now have consistent locking for map_dma_buf() and
unmap_dma_buf() should make that transition rather straightforward.
> For the existing dma-buf importers/exporters I'm kinda hoping for a pure
> dma_addr_t based list eventually. Going all the way to a phyr based
> approach for everyone might be too much churn, there's some real bad cruft
> there. It's not going to work for every case, but it covers a lot of them
> and might be less pain for existing importers.
The point is we have use cases that won't work without exchanging DMA
addresses any more.
For example, we have cases where multiple devices are in the same IOMMU
domain and re-use their DMA address mappings.
> But in theory it should be possible to use phyr everywhere eventually, as
> long as there's no obviously api-rules-breaking way to go from a phyr back to
> a struct page even when that exists.
I would rather say we should stick to DMA addresses as much as possible.
What we can do is to add an address space description to the addresses,
e.g. whether it's a PCIe bus address in IOMMU domain X, a device-private
bus address, or, in the case of sharing with iommufd and KVM, a PFN.
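
Roughly sketched (the names below are invented for illustration):

enum dma_addr_space {
	DMA_ADDR_PCI_BUS,	/* PCIe bus address, valid in one IOMMU domain */
	DMA_ADDR_DEVICE_PRIVATE,/* private interconnect address */
	DMA_ADDR_CPU_PFN,	/* raw PFN, e.g. for iommufd/KVM table building */
};

struct dma_addr_desc {
	u64 addr;
	enum dma_addr_space space;
	u32 iommu_domain;	/* only meaningful for DMA_ADDR_PCI_BUS */
};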
Regards,
Christian.
>>> At least the device mapping / dma_buf_attachment
>>> side should be doable with just the pfn and the new dma-api?
>> Yes, that would be my first goal post. Figure out some meta
>> information and a container data structure that allows struct
>> page-less P2P mapping through the new DMA API.
>>
>>>> I'm hoping we can get to something where we describe not just how the
>>>> pfns should be DMA mapped, but also can describe how they should be
>>>> CPU mapped. For instance that this PFN space is always mapped
>>>> uncachable, in CPU and in IOMMU.
>>> I was pondering whether dma_mmap and friends would be a good place to
>>> prototype this and go for a fully generic implementation. But then even
>>> those have _wc/_uncached variants.
>> Given that the inability to correctly DMA map P2P MMIO without struct
>> page is a current pain point and current source of hacks in dmabuf
>> exporters, I wanted to make resolving that a priority.
>>
>> However, if you mean what I described above for "fully generic [dmabuf
>> mmap] implementation", then we'd have the phyr datastructure as a
>> dependency to attempt that work.
>>
>> phyr, and particularly the meta information, has a number of
>> stakeholders. I was thinking of going first with rdma's memory
>> registration flow because we are now pretty close to being able to do
>> such a big change, and it can demonstrate most of the requirements.
>>
>> But that doesn't mean mmap couldn't go concurrently on the same agreed
>> datastructure if people are interested.
> Yeah cpu mmap needs a lot more, going with a very limited p2p use-case
> first only makes sense.
>
>>>> We also have current bugs in the iommu/vfio side where we are fudging
>>>> CC stuff, like assuming CPU memory is encrypted (not always true) and
>>>> that MMIO is non-encrypted (not always true)
>>> tbf CC pte flags I just don't grok at all. I've once tried to understand
>>> what current exporters and gpu drivers do and just gave up. But that's
>>> also a bit why I'm worried here because it's an enigma to me.
>> For CC, the secure world keeps track of whether each PFN
>> inside the VM is 'encrypted' or not. Any VM PTE (including the IOPTEs)
>> pointing at the PFN must match the secure world's view of
>> 'encrypted'. The VM can ask the secure world to change its view at
>> runtime.
>>
>> The way CC has been bolted on to the kernel so far largely hides this
>> from drivers, so it is hard to tell in driver code if the PFN you
>> have is 'encrypted' or not. Right now the general rule (that is not
>> always true) is that struct page CPU memory is encrypted and
>> everything else is decrypted.
>>
>> So right now, you can mostly ignore it and the above assumption
>> largely happens for you transparently.
>>
>> However, soon we will have encrypted P2P MMIO which will stress this
>> hiding strategy.
> It's already breaking with stuff like virtual gpu drivers, vmwgfx is
> fiddling around with these bits (at least last I tried to understand this
> all) and I think a few others do too.
>
>>>>> I thought iommuv2 (or whatever linux calls these) has full fault support
>>>>> and could support current move semantics. But yeah for iommu without
>>>>> fault support we need some kind of pin or a newly formalized revoke model.
>>>> No, this is HW dependent, including PCI device, and I'm aware of no HW
>>>> that fully implements this in a way that could be useful to implement
>>>> arbitrary move semantics for VFIO..
>>> Hm I thought we've had at least prototypes floating around of device fault
>>> repair, but I guess that only works with ATS/pasid stuff and not general
>>> iommu traffic from devices. Definitely needs some device cooperation since
>>> the timeouts of a full fault are almost endless.
>> Yes, exactly. What all real devices I'm aware of have done is make a
>> subset of their traffic work with ATS and PRI, but not all their
>> traffic. Without *all* traffic you can't make any generic assumption
>> in the iommu that a transient non-present won't be fatal to the
>> device.
>>
>> Stuff like dmabuf move semantics rely on transient non-present being
>> non-disruptive...
> Ah now I get it, at the iommu level you have to pessimistically assume
> whether a device can handle a fault, and none can for all traffic. I was
> thinking too much about the driver level where generally the dma-buf you
> importer are only used for the subset of device functions that can cope
> with faults on many devices.
>
> Cheers, Sima
On Thu, Jan 23, 2025 at 03:41:58PM +0800, Xu Yilun wrote:
> I don't have a complete idea yet. But the goal is not to make any
> existing driver seamlessly work with secure device. It is to provide a
> generic way for bind/attestation/accept, and may save driver's effort
> if they don't care about this startup process. There are plenty of
> operations that a driver can't do to a secure device, FLR is one of
> them. The TDISP SPEC has described some general rules but some are even
> device specific.
You can FLR a secure device, it just has to be re-secured and
re-attested after. Otherwise no VFIO for you.
> So I think a driver (including VFIO) expects changes to support trusted
> devices, but may not have to cover the bind/attestation/accept flow.
I expect changes, but not fundamental ones. VFIO will still have to
FLR devices as part of its security architecture.
The entire flow needs to have options for drivers to be involved,
somehow.
Jason
On Wed, Jan 22, 2025 at 12:04:19PM +0100, Simona Vetter wrote:
> I'm kinda leaning towards entirely separate dma-buf interfaces for the new
> phyr stuff, because I fear that adding that to the existing ones will only
> make the chaos worse.
Let's see when some patches come up; if importers have to deal with
several formats, a single attach interface would probably be simpler
for them..
> For the existing dma-buf importers/exporters I'm kinda hoping for a pure
> dma_addr_t based list eventually.
IMHO the least churn would be to have the old importers call a helper
(perhaps implicitly in the core code) to convert phyr into a dma
mapped scatterlist and then just keep using their existing code
exactly as is.
Performance improvement would come from importers switching to use the
new DMA API internally and avoiding the scatterlist allocation, but
even the extra allocation is not so bad.
Then, perhaps, and I really hesitate to say this, but perhaps to ease
the migration we could store a dma mapped list in a phyr using the
meta information. That would allow a dmabuf scatterlist exporter to be
converted to a phyr with a helper. The importer would have to have a
special path to detect the dma mapped mode and skip the normal
mapping.
The point would be to let maintained drivers use the new data
structures and flows easily and still interoperate with older
drivers. Otherwise we have to somehow build parallel scatterlist and
phyr code flows into importers :\
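
As a rough sketch of that helper, assuming hypothetical phyr accessors
(phyr_nents(), phyr_for_each_entry() and a struct phyr_entry carrying
page/offset/len are all invented here):

static struct sg_table *phyr_to_mapped_sgt(struct device *dev,
					   struct physical_list *phyr,
					   enum dma_data_direction dir)
{
	struct phyr_entry *e;
	struct scatterlist *sg;
	struct sg_table *sgt;
	int ret;

	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
	if (!sgt)
		return ERR_PTR(-ENOMEM);

	ret = sg_alloc_table(sgt, phyr_nents(phyr), GFP_KERNEL);
	if (ret)
		goto free_sgt;

	/* Flatten the phyr into scatterlist entries... */
	sg = sgt->sgl;
	phyr_for_each_entry(phyr, e) {
		sg_set_page(sg, e->page, e->len, e->offset);
		sg = sg_next(sg);
	}

	/* ...and DMA map it exactly like the importer is used to. */
	ret = dma_map_sgtable(dev, sgt, dir, 0);
	if (ret)
		goto free_table;

	return sgt;

free_table:
	sg_free_table(sgt);
free_sgt:
	kfree(sgt);
	return ERR_PTR(ret);
}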
> Going all the way to a phyr based
> approach for everyone might be too much churn, there's some real bad cruft
> there. It's not going to work for every case, but it covers a lot of them
> and might be less pain for existing importers.
I'd hope we can reach every case.. But I don't know what kind of
horrors you guys have :)
> But in theory it should be possible to use phyr everywhere eventually, as
> long as there's no obviously api-rules-breaking way to go from a phyr back to
> a struct page even when that exists.
I'd say getting a struct page is a perfectly safe operation, from a
phyr API perspective.
The dmabuf issue seems to be entirely about following the locking
rules, an importer is not allowed to get a struct page and switch from
reservation locking to page refcount locking.
I get the appeal for DRM of blocking struct page use because that
directly prevents the above. But, in RDMA I want to re-use the same
phyr datastructure for tracking pin_user_pages() CPU memory, and I
must get the struct page back so I can put_user_page() it when done.
Perhaps we can find some compromise where the phyr data structure has
some kind of flag 'disable struct page' and then the
phyr_entry_to_page() API will WARN_ON() or something like that.
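
Something like this sketch, with all the names invented:

#define PHYR_FLAG_NO_STRUCT_PAGE	BIT(0)

static inline struct page *phyr_entry_to_page(struct physical_list *phyr,
					      struct phyr_entry *e)
{
	/* DRM-style phyrs forbid going back to a struct page... */
	if (WARN_ON(phyr->flags & PHYR_FLAG_NO_STRUCT_PAGE))
		return NULL;
	/* ...while pin_user_pages() based phyrs allow it. */
	return pfn_to_page(e->pfn);
}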
> > However, soon we will have encrypted P2P MMIO which will stress this
> > hiding strategy.
>
> It's already breaking with stuff like virtual gpu drivers, vmwgfx is
> fiddling around with these bits (at least last I tried to understand this
> all) and I think a few others do too.
IMHO, most places I've seen touching this outside arch code feel very
hacky. :\
Jason
On Wed, Jan 22, 2025 at 12:32:56PM +0800, Xu Yilun wrote:
> On Tue, Jan 21, 2025 at 01:43:03PM -0400, Jason Gunthorpe wrote:
> > On Tue, Jun 25, 2024 at 05:12:10AM +0800, Xu Yilun wrote:
> >
> > > When VFIO works as a TEE user in VM, it means an attester (e.g. PCI
> > > subsystem) has already moved the device to RUN state. So VFIO & DPDK
> > > are all TEE users, no need to manipulate TDISP state between them.
> > > AFAICS, this is the most preferred TIO usage in CoCo-VM.
> >
> > No, unfortunately. Part of the motivation to have the devices be
> > unlocked when the VM starts is because there is an expectation that a
> > driver in the VM will need to do untrusted operations to boot up the
>
> I assume these operations are device specific.
Yes
> > device before it can be switched to the run state.
> >
> > So any vfio use case needs to imagine that VFIO starts with an
> > untrusted device, does stuff to it, then pushes everything through to
>
> I have concerns that VFIO has to do device-specific stuff. Our current
> expectation is that a specific device driver deals with the untrusted
> operations, then userspace writes a 'bind' device sysfs node which detaches
> the untrusted driver, does the attestation and accept, and tries to match
> a trusted driver (e.g. VFIO).
I don't see this as working; VFIO will FLR the device, which will
destroy anything that was done prior.
VFIO itself has to do the sequence and the VFIO userspace has to
contain the device specific stuff.
The bind/unbind dance for untrusted->trusted would need to be
internalized in VFIO without unbinding. The main motivation for the
bind/unbind flow was to manage the DMA API, which VFIO does not use.
Jason
On Tue, Jun 25, 2024 at 05:12:10AM +0800, Xu Yilun wrote:
> When VFIO works as a TEE user in VM, it means an attester (e.g. PCI
> subsystem) has already moved the device to RUN state. So VFIO & DPDK
> are all TEE users, no need to manipulate TDISP state between them.
> AFAICS, this is the most preferred TIO usage in CoCo-VM.
No, unfortunately. Part of the motivation to have the devices be
unlocked when the VM starts is because there is an expectation that a
driver in the VM will need to do untrusted operations to boot up the
device before it can be switched to the run state.
So any vfio use case needs to imagine that VFIO starts with an
untrusted device, does stuff to it, then pushes everything through to
run. The exact mirror of what a kernel driver should be able to do.
How exactly all this very complex stuff works, I have no idea, but
this is what I've understood is the target. :\
Jason
On Tue, Jan 21, 2025 at 05:11:32PM +0100, Simona Vetter wrote:
> On Mon, Jan 20, 2025 at 03:48:04PM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote:
> > > On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote:
> > > > On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:
> > > > What is going wrong with your email? You replied to Simona, but
> > > > Simona Vetter <simona.vetter(a)ffwll.ch> is dropped from the To/CC
> > > > list??? I added the address back, but seems like a weird thing to
> > > > happen.
> > >
> > > Might also be funny mailing list stuff, depending how you get these. I
> > > read mails over lore and pretty much ignore cc (unless it's not also on
> > > any list, since those tend to be security issues) because I get cc'ed on
> > > way too much stuff for that to be a useful signal.
> >
> > Oh I see, you are sending a Mail-followup-to header that excludes your
> > address, so you don't get any emails at all.. My mutt is dropping you
> > as well.
> >
> > > Yeah I'm not worried about cpu mmap locking semantics. drm/ttm is a pretty
> > > clear example that you can implement dma-buf mmap with the rules we have,
> > > except the unmap_mapping_range might need a bit fudging with a separate
> > > address_space.
> >
> > From my perspective the mmap thing is a bit of a side/DRM-only thing
> > as nothing I'm interested in wants to mmap dmabuf into a VMA.
>
> I guess we could just skip mmap on these pfn exporters, but it also means
> a bit more boilerplate.
I have been assuming that dmabuf mmap remains unchanged, that
exporters will continue to implement that mmap() callback as today.
My main interest has been what data structure is produced in the
attach APIs.
Eg today we have a struct dma_buf_attachment that returns a sg_table.
I'm expecting some kind of new data structure, let's call it "physical
list", that is some efficient coding of meta/addr/len tuples that works
well with the new DMA API. Matthew has been calling this thing phyr..
So, I imagine, struct dma_buf_attachment gaining an optional
feature negotiation and then we have in dma_buf_attachment:
union {
        struct sg_table *sgt;
        struct physical_list *phyr;
};
That's basically it, an alternative to scatterlist that has a clean
architecture.
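
As a sketch (the use_phyr negotiation flag is invented for
illustration):

struct dma_buf_attachment {
	/* ... existing members ... */
	bool use_phyr;		/* negotiated when the importer attaches */
	union {
		struct sg_table *sgt;		/* !use_phyr */
		struct physical_list *phyr;	/* use_phyr */
	};
};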
Now, if you are asking if the current dmabuf mmap callback can be
improved with the above? Maybe? phyr should have the necessary
information inside it to populate a VMA - eventually even fully
correctly with all the right cachable/encrypted/forbidden/etc flags.
So, you could imagine that exporters could just have one routine to
generate the phyr list and that goes into the attachment, goes into
some common code to fill VMA PTEs, and some other common code that
will convert it into the DMABUF scatterlist. If performance is not a
concern with these data structure conversions it could be an appealing
simplification.
And yes, I could imagine the meta information being descriptive enough
to support the private interconnect cases, the common code could
detect private meta information and just cleanly fail.
> At least the device mapping / dma_buf_attachment
> side should be doable with just the pfn and the new dma-api?
Yes, that would be my first goal post. Figure out some meta
information and a container data structure that allows struct
page-less P2P mapping through the new DMA API.
> > I'm hoping we can get to something where we describe not just how the
> > pfns should be DMA mapped, but also can describe how they should be
> > CPU mapped. For instance that this PFN space is always mapped
> > uncachable, in CPU and in IOMMU.
>
> I was pondering whether dma_mmap and friends would be a good place to
> prototype this and go for a fully generic implementation. But then even
> those have _wc/_uncached variants.
Given that the inability to correctly DMA map P2P MMIO without struct
page is a current pain point and current source of hacks in dmabuf
exporters, I wanted to make resolving that a priority.
However, if you mean what I described above for "fully generic [dmabuf
mmap] implementation", then we'd have the phyr datastructure as a
dependency to attempt that work.
phyr, and particularly the meta information, has a number of
stakeholders. I was thinking of going first with rdma's memory
registration flow because we are now pretty close to being able to do
such a big change, and it can demonstrate most of the requirements.
But that doesn't mean mmap couldn't go concurrently on the same agreed
datastructure if people are interested.
> > We also have current bugs in the iommu/vfio side where we are fudging
> > CC stuff, like assuming CPU memory is encrypted (not always true) and
> > that MMIO is non-encrypted (not always true)
>
> tbf CC pte flags I just don't grok at all. I've once tried to understand
> what current exporters and gpu drivers do and just gave up. But that's
> also a bit why I'm worried here because it's an enigma to me.
For CC, the secure world keeps track of whether each PFN
inside the VM is 'encrypted' or not. Any VM PTE (including the IOPTEs)
pointing at the PFN must match the secure world's view of
'encrypted'. The VM can ask the secure world to change its view at
runtime.
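
On x86, for example, that runtime flip is what the set_memory helpers
do; a trivial sketch:

#include <linux/set_memory.h>

/* Ask the secure world to mark a kernel page as shared/decrypted. */
static int make_page_shared(void *vaddr)
{
	return set_memory_decrypted((unsigned long)vaddr, 1);
}

/* And flip it back to private/encrypted when done. */
static int make_page_private(void *vaddr)
{
	return set_memory_encrypted((unsigned long)vaddr, 1);
}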
The way CC has been bolted on to the kernel so far largely hides this
from drivers, so it is hard to tell in driver code if the PFN you
have is 'encrypted' or not. Right now the general rule (that is not
always true) is that struct page CPU memory is encrypted and
everything else is decrypted.
So right now, you can mostly ignore it and the above assumption
largely happens for you transparently.
However, soon we will have encrypted P2P MMIO which will stress this
hiding strategy.
> > > I thought iommuv2 (or whatever linux calls these) has full fault support
> > > and could support current move semantics. But yeah for iommu without
> > > fault support we need some kind of pin or a newly formalized revoke model.
> >
> > No, this is HW dependent, including PCI device, and I'm aware of no HW
> > that fully implements this in a way that could be useful to implement
> > arbitrary move semantics for VFIO..
>
> Hm I thought we've had at least prototypes floating around of device fault
> repair, but I guess that only works with ATS/pasid stuff and not general
> iommu traffic from devices. Definitely needs some device cooperation since
> the timeouts of a full fault are almost endless.
Yes, exactly. What all real devices I'm aware of have done is make a
subset of their traffic work with ATS and PRI, but not all their
traffic. Without *all* traffic you can't make any generic assumption
in the iommu that a transient non-present won't be fatal to the
device.
Stuff like dmabuf move semantics rely on transient non-present being
non-disruptive...
Jason
On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote:
> On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:
> > What is going wrong with your email? You replied to Simona, but
> > Simona Vetter <simona.vetter(a)ffwll.ch> is dropped from the To/CC
> > list??? I added the address back, but seems like a weird thing to
> > happen.
>
> Might also be funny mailing list stuff, depending how you get these. I
> read mails over lore and pretty much ignore cc (unless it's not also on
> any list, since those tend to be security issues) because I get cc'ed on
> way too much stuff for that to be a useful signal.
Oh I see, you are sending a Mail-followup-to header that excludes your
address, so you don't get any emails at all.. My mutt is dropping you
as well.
> Yeah I'm not worried about cpu mmap locking semantics. drm/ttm is a pretty
> clear example that you can implement dma-buf mmap with the rules we have,
> except the unmap_mapping_range might need a bit fudging with a separate
> address_space.
From my perspective the mmap thing is a bit of a side/DRM-only thing
as nothing I'm interested in wants to mmap dmabuf into a VMA.
However, I think if you have locking rules that can fit into a VMA
fault path and link move_notify to unmap_mapping_range() then you've
got a pretty usable API.
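
Roughly the shape I mean, with the importer_obj bookkeeping invented
for the sketch:

struct importer_obj {
	struct dma_buf_attachment *attach;
	struct address_space *mapping;	/* backs the importer's VMA */
	unsigned long pfn_base;
	loff_t size;
};

static vm_fault_t importer_vma_fault(struct vm_fault *vmf)
{
	struct importer_obj *obj = vmf->vma->vm_private_data;
	vm_fault_t ret;

	/* PTEs are only installed while the reservation lock is held. */
	dma_resv_lock(obj->attach->dmabuf->resv, NULL);
	ret = vmf_insert_pfn(vmf->vma, vmf->address,
			     obj->pfn_base + vmf->pgoff);
	dma_resv_unlock(obj->attach->dmabuf->resv);
	return ret;
}

/* The exporter calls this with the reservation lock already held. */
static void importer_move_notify(struct dma_buf_attachment *attach)
{
	struct importer_obj *obj = attach->importer_priv;

	/* Shoot down the CPU PTEs, the next fault revalidates. */
	unmap_mapping_range(obj->mapping, 0, obj->size, 1);
}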
> For cpu mmaps I'm more worried about the arch bits in the pte, stuff like
> caching mode or encrypted memory bits and things like that. There's
> vma->vm_pgprot, but it's a mess. But maybe this all is an incentive to
> clean up that mess a bit.
I'm convinced we need meta-data along with pfns, there is too much
stuff that needs more information than just the address. Cachability,
CC encryption, exporting device, etc. This is a topic to partially
cross when we talk about how to fully remove struct page requirements
from the new DMA API.
I'm hoping we can get to something where we describe not just how the
pfns should be DMA mapped, but also can describe how they should be
CPU mapped. For instance that this PFN space is always mapped
uncachable, in CPU and in IOMMU.
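
Sketching the kind of per-range meta information I mean (the flag
names are invented):

#define PFN_META_UNCACHED	BIT(0)	/* map UC in both CPU and IOMMU PTEs */
#define PFN_META_WC		BIT(1)	/* write-combined CPU mapping allowed */
#define PFN_META_ENCRYPTED	BIT(2)	/* CC: PTEs must carry the encrypted bit */
#define PFN_META_PRIVATE	BIT(3)	/* private interconnect, generic code fails */

struct pfn_range {
	unsigned long pfn;
	unsigned long nr_pages;
	unsigned long meta;	/* PFN_META_* */
};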
We also have current bugs in the iommu/vfio side where we are fudging
CC stuff, like assuming CPU memory is encrypted (not always true) and
that MMIO is non-encrypted (not always true)
> I thought iommuv2 (or whatever linux calls these) has full fault support
> and could support current move semantics. But yeah for iommu without
> fault support we need some kind of pin or a newly formalized revoke model.
No, this is HW dependent, including PCI device, and I'm aware of no HW
that fully implements this in a way that could be useful to implement
arbitrary move semantics for VFIO..
Jason
On 17.01.25 15:42, Simona Vetter wrote:
> On Wed, Jan 15, 2025 at 11:06:53AM +0100, Christian König wrote:
>> [SNIP]
>>> Anything missing?
>> Well as far as I can see this use case is not a good fit for the DMA-buf
>> interfaces in the first place. DMA-buf deals with devices and buffer
>> exchange.
>>
>> What's necessary here instead is to give an importing VM full access to some
>> memory for its specific use case.
>>
>> That full access includes CPU and DMA mappings, modifying caching
>> attributes, potentially setting encryption keys for specific ranges etc....
>> etc...
>>
>> In other words we have a lot of things the importer here should be able to
>> do which we don't want most of the DMA-buf importers to do.
> This proposal isn't about forcing existing exporters to allow importers to
> do new stuff. That stays as-is, because it would break things.
>
> It's about adding yet another interface to get at the underlying data, and
> we have tons of those already. The only difference is that if we don't
> butcher the design, we'll be able to implement all the existing dma-buf
> interfaces on top of this new pfn interface, for some neat maximal
> compatibility.
That sounds like you missed my concern.
When an exporter and an importer agree that they want to exchange PFNs
instead of DMA addresses then that is perfectly fine.
The problems start when you define the semantics of how those PFNs, DMA
addresses, private bus addresses, whatever, are exchanged differently from
what we have documented for DMA-buf.
These semantics are very well defined for DMA-buf now, because that is
really important; otherwise things usually seem to work under testing
(e.g. without memory pressure) and then totally fall apart in production
environments.
In other words we have defined what lock you need to hold when calling
functions, what a DMA fence is, when exchanged addresses are valid etc...
> But fundamentally there's never been an expectation that you can take any
> arbitrary dma-buf and pass it any arbitrary importer, and that is must
> work. The fundamental promise is that if it _does_ work, then
> - it's zero copy
> - and fast, or as fast as we can make it
>
> I don't see this any different than all the much more specific proposals
> and existing code, where a subset of importers/exporters have special
> rules so that e.g. gpu interconnect or vfio uuid based sharing works.
> pfn-based sharing is just yet another flavor that exists to get the max
> amount of speed out of interconnects.
Please take another look at what is proposed here. The function is
called dma_buf_get_pfn_*unlocked* !
This is not following DMA-buf semantics for exchanging addresses and
keeping them valid, but rather something more like userptrs.
Inserting PFNs into CPU (or probably also IOMMU) page tables has
different semantics than what DMA-buf usually does, because as soon as
the address is written into the page table it is made public. So you
need some kind of mechanism to make sure that this address you made
public stays valid as long as it is public.
The usual I/O operations we encapsulate with DMA-fences have
fundamentally different semantics, because we have the lock which
enforces that stuff stays valid, and then a DMA-fence which notes
how long the stuff needs to stay valid for an operation to complete.
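
In (simplified) code the pattern looks roughly like this for a dynamic
importer; the helper name is made up and unmapping is not shown:

struct sg_table *map_and_fence(struct dma_buf_attachment *attach,
			       struct dma_fence *op_done)
{
	struct dma_buf *dmabuf = attach->dmabuf;
	struct sg_table *sgt;
	int ret;

	dma_resv_lock(dmabuf->resv, NULL);
	ret = dma_resv_reserve_fences(dmabuf->resv, 1);
	if (ret) {
		dma_resv_unlock(dmabuf->resv);
		return ERR_PTR(ret);
	}
	/* The addresses stay valid while the reservation lock is held... */
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	/* ...and the fence extends that validity until the DMA completes. */
	if (!IS_ERR(sgt))
		dma_resv_add_fence(dmabuf->resv, op_done, DMA_RESV_USAGE_READ);
	dma_resv_unlock(dmabuf->resv);
	return sgt;
}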
Regards,
Christian.
>
> Cheers, Sima
>
>> The semantics for things like pin vs revocable vs dynamic/moveable seem
>> similar, but that's basically it.
>>
>> As far as I know the TEE subsystem also represents its allocations as file
>> descriptors. If I'm not completely mistaken this use case most likely fits
>> better there.
>>
>>> I feel like this is small enough that m-l archives is good enough. For
>>> some of the bigger projects we do in graphics we sometimes create entries
>>> in our kerneldoc with wip design consensus and things like that. But
>>> feels like overkill here.
>>>
>>>> My general desire is to move all of RDMA's MR process away from
>>>> scatterlist and work using only the new DMA API. This will save *huge*
>>>> amounts of memory in common workloads and be the basis for non-struct
>>>> page DMA support, including P2P.
>>> Yeah a more memory efficient structure than the scatterlist would be
>>> really nice. That would even benefit the very special dma-buf exporters
>>> where you cannot get a pfn and only the dma_addr_t, although most of those
>>> (all maybe even?) have contig buffers, so your scatterlist has only one
>>> entry. But it would definitely be nice from a design pov.
>> Completely agree on that part.
>>
>> Scatterlists have some design flaws, especially mixing the input and
>> output parameters of the DMA API into the same structure.
>>
>> Additionally, DMA addresses are basically missing which bus they belong
>> to and details of how the access should be made (e.g. snoop vs no-snoop
>> etc...).
>>
>>> Aside: A way to more efficiently create compressed scatterlists would be
>>> neat too, because a lot of drivers hand-roll that and it's a bit brittle
>>> and kinda silly to duplicate. With compressed I mean just a single entry
>>> for a contig range, in practice thanks to huge pages/folios and allocators
>>> trying to hand out contig ranges if there's plenty of memory that saves a
>>> lot of memory too. But currently it's a bit a pain to construct these
>>> efficiently, mostly it's just a two-pass approach and then trying to free
>>> surplus memory or krealloc to fit. Also I don't have good ideas here, but
>>> dma-api folks might have some from looking at too many things that create
>>> scatterlists.
>> I mailed with Christoph about that a while back as well and we both agreed
>> that it would probably be a good idea to start defining a data structure to
>> better encapsulate DMA addresses.
>>
>> It's just that nobody had time for that yet and/or I wasn't looped in in the
>> final discussion about it.
>>
>> Regards,
>> Christian.
>>
>>> -Sima
On 21.06.24 00:02, Xu Yilun wrote:
> On Thu, Jan 16, 2025 at 04:13:13PM +0100, Christian König wrote:
>> On 15.01.25 18:09, Jason Gunthorpe wrote:
>>
>> On Wed, Jan 15, 2025 at 05:34:23PM +0100, Christian König wrote:
>>
>> Granted, let me try to improve this.
>> Here is a real world example of one of the issues we ran into and why
>> CPU mappings of importers are redirected to the exporter.
>> We have a good bunch of different exporters who track the CPU mappings
>> of their backing store using address_space objects in one way or
>> another and then uses unmap_mapping_range() to invalidate those CPU
>> mappings.
>> But when importers get the PFNs of the backing store they can look
>> behind the curtain and directly insert this PFN into the CPU page
>> tables.
>> We had literally tons of cases like this where driver developers caused
>> access-after-free issues because the importer created CPU mappings on
>> their own without the exporter knowing about it.
>> This is just one example of what we ran into. Additionally, the
>> whole synchronization between drivers was overhauled as
>> well because we found that we can't trust importers to always do the
>> right thing.
>>
>> But this, fundamentally, is importers creating attachments and then
>> *ignoring the lifetime rules of DMABUF*. If you created an attachment,
>> got a move and *ignored the move* because you put the PFN in your own
>> VMA, then you are not following the attachment lifetime rules!
>>
>> Move notify is solely for informing the importer that they need to
>> refresh their DMA mappings and eventually block for ongoing DMA to end.
>>
>> These semantics don't work well for CPU mappings because you need to hold
>> the reservation lock to make sure that the information stays valid, and you
>> can't hold a lock while returning from a page fault.
> Dealing with CPU mapping and resource invalidation is a little hard, but is
> resolvable by using other types of locks. And I guess for now dma-buf
> exporters should always handle this CPU mapping vs. invalidation contention if
> they support mmap().
>
> It is resolvable, so with some invalidation notify a decent importer could
> also handle the contention well.
That doesn't work like this.
See, page table updates under DMA-buf work by using the same locking
approach for both the validation and the invalidation side. In other words
we hold the same lock while inserting and removing entries into/from the
page tables.
That this is supposed to be an unlocked API means that you can only use it
with pre-allocated and hard-pinned memory, without any chance to invalidate
it while running. Otherwise you can never be sure of the validity of the
address information you got from the exporter.
> IIUC the only concern now is that importer device drivers can more easily do
> something wrong, so CPU mapping things are moved to the exporter. But most of
> the exporters are also device drivers, so why are they smarter?
Exporters always use their invalidation code path, no matter if they are
exporting their buffers for others to use or if they are standalone.
If you do the invalidation on the importer side you always need both
exporter and importer around to test it.
Additionally, we have many more importers than exporters. E.g. a
lot of simple drivers only import DMA-heap buffers and never export
anything.
> And there are increasing mapping needs; today exporters help handle the CPU
> primary mapping, but tomorrow should they also help with all other mappings?
> Clearly that is not feasible. So maybe conditionally give trust to some
> importers.
Why should that be necessary? Exporters *must* know what somebody does
with their buffers.
If you have a use case the exporter doesn't support in its mapping
operation then that use case most likely doesn't work in the first place.
For example, direct I/O is enabled/disabled by exporters on their CPU
mappings based on whether that works correctly for them. And an importer
simply doesn't know if it should use vm_insert_pfn() or vm_insert_page().
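
To illustrate with the modern vmf_* variants (the exporter_buf layout
is invented): only the exporter knows which insert is safe for its
backing store:

struct exporter_buf {
	bool page_backed;
	struct page **pages;	/* if page_backed */
	unsigned long *pfns;	/* if !page_backed */
};

static vm_fault_t exporter_fault(struct vm_fault *vmf)
{
	struct exporter_buf *buf = vmf->vma->vm_private_data;

	if (buf->page_backed)
		/* refcounted pages, GUP and therefore direct I/O can work */
		return vmf_insert_page(vmf->vma, vmf->address,
				       buf->pages[vmf->pgoff]);

	/* raw PFN mapping (VM_PFNMAP), GUP/direct I/O fails cleanly */
	return vmf_insert_pfn(vmf->vma, vmf->address,
			      buf->pfns[vmf->pgoff]);
}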
We could of course implement that logic in each importer to choose
between the different approaches, but then each importer gains logic it
only exercises with a specific exporter. And that doesn't seem to be a
good idea at all.
Regards,
Christian.
>
> Thanks,
> Yilun
Hi Jyothi,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 55bcd2e0d04c1171d382badef1def1fd04ef66c5]
url: https://github.com/intel-lab-lkp/linux/commits/Jyothi-Kumar-Seerapu/dmaengi…
base: 55bcd2e0d04c1171d382badef1def1fd04ef66c5
patch link: https://lore.kernel.org/r/20250120095753.25539-3-quic_jseerapu%40quicinc.com
patch subject: [PATCH v5 2/2] i2c: i2c-qcom-geni: Add Block event interrupt support
config: arc-randconfig-001-20250120 (https://download.01.org/0day-ci/archive/20250120/202501202159.wLRVO16t-lkp@…)
compiler: arceb-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250120/202501202159.wLRVO16t-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202501202159.wLRVO16t-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/i2c/busses/i2c-qcom-geni.c:599: warning: Excess function parameter 'dev' description in 'geni_i2c_gpi_multi_desc_unmap'
Kconfig warnings: (for reference only)
WARNING: unmet direct dependencies detected for OMAP2PLUS_MBOX
Depends on [n]: MAILBOX [=y] && (ARCH_OMAP2PLUS || ARCH_K3)
Selected by [m]:
- TI_K3_M4_REMOTEPROC [=m] && REMOTEPROC [=y] && (ARCH_K3 || COMPILE_TEST [=y])
vim +599 drivers/i2c/busses/i2c-qcom-geni.c
589
590 /**
591 * geni_i2c_gpi_multi_desc_unmap() - unmaps the buffers post multi message TX transfers
592 * @dev: pointer to the corresponding dev node
593 * @gi2c: i2c dev handle
594 * @msgs: i2c messages array
595 * @peripheral: pointer to the gpi_i2c_config
596 */
597 static void geni_i2c_gpi_multi_desc_unmap(struct geni_i2c_dev *gi2c, struct i2c_msg msgs[],
598 struct gpi_i2c_config *peripheral)
> 599 {
600 u32 msg_xfer_cnt, wr_idx = 0;
601 struct geni_i2c_gpi_multi_desc_xfer *tx_multi_xfer = &gi2c->i2c_multi_desc_config;
602
603 /*
604 * In error case, need to unmap all messages based on the msg_idx_cnt.
605 * Non-error case unmap all the processed messages.
606 */
607 if (gi2c->err)
608 msg_xfer_cnt = tx_multi_xfer->msg_idx_cnt;
609 else
610 msg_xfer_cnt = tx_multi_xfer->irq_cnt * QCOM_I2C_GPI_NUM_MSGS_PER_IRQ;
611
612 /* Unmap the processed DMA buffers based on the received interrupt count */
613 for (; tx_multi_xfer->unmap_msg_cnt < msg_xfer_cnt; tx_multi_xfer->unmap_msg_cnt++) {
614 if (tx_multi_xfer->unmap_msg_cnt == gi2c->num_msgs)
615 break;
616 wr_idx = tx_multi_xfer->unmap_msg_cnt % QCOM_I2C_GPI_MAX_NUM_MSGS;
617 geni_i2c_gpi_unmap(gi2c, &msgs[tx_multi_xfer->unmap_msg_cnt],
618 tx_multi_xfer->dma_buf[wr_idx],
619 tx_multi_xfer->dma_addr[wr_idx],
620 NULL, (dma_addr_t)NULL);
621 tx_multi_xfer->freed_msg_cnt++;
622 }
623 }
624
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki