On Tue, Jan 21, 2025 at 05:11:32PM +0100, Simona Vetter wrote:
> On Mon, Jan 20, 2025 at 03:48:04PM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote:
> > > On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote:
> > > > On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:
> > > > What is going wrong with your email? You replied to Simona, but
> > > > Simona Vetter <simona.vetter(a)ffwll.ch> is dropped from the To/CC
> > > > list??? I added the address back, but seems like a weird thing to
> > > > happen.
> > >
> > > Might also be funny mailing list stuff, depending how you get these. I
> > > read mails over lore and pretty much ignore cc (unless it's not also on
> > > any list, since those tend to be security issues) because I get cc'ed on
> > > way too much stuff for that to be a useful signal.
> >
> > Oh I see, you are sending a Mail-followup-to header that excludes your
> > address, so you don't get any emails at all.. My mutt is dropping you
> > as well.
> >
> > > Yeah I'm not worried about cpu mmap locking semantics. drm/ttm is a pretty
> > > clear example that you can implement dma-buf mmap with the rules we have,
> > > except the unmap_mapping_range might need a bit fudging with a separate
> > > address_space.
> >
> > From my perspective the mmap thing is a bit of a side/DRM-only thing
> > as nothing I'm interested in wants to mmap dmabuf into a VMA.
>
> I guess we could just skip mmap on these pfn exporters, but it also means
> a bit more boilerplate.
I have been assuming that dmabuf mmap remains unchanged, that
exporters will continue to implement that mmap() callback as today.
My main interest has been what data structure is produced in the
attach APIs.
E.g. today we have a struct dma_buf_attachment that returns an sg_table.
I'm expecting some kind of new data structure, let's call it "physical
list", that is some efficient coding of meta/addr/len tuples that works
well with the new DMA API. Matthew has been calling this thing phyr.
So, I imagine, struct dma_buf_attachment gaining an optional
feature negotiation and then we have in dma_buf_attachment:
union {
        struct sg_table *sgt;
        struct physical_list *phyr;
};
That's basically it, an alternative to scatterlist that has a clean
architecture.
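
To make that concrete, here is a rough sketch of what one entry of such a
"physical list" could carry - every name below is invented for illustration,
nothing like this exists today:

struct phys_range {
        phys_addr_t             addr;       /* start of the range */
        size_t                  len;        /* length in bytes */
        unsigned int            flags;      /* meta: P2P, encrypted, uncachable, ... */
        struct p2p_provider     *provider;  /* which device/bus owns this range */
};

struct physical_list {
        size_t                  nr_ranges;
        struct phys_range       ranges[];
};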
Now, if you are asking whether the current dmabuf mmap callback can be
improved with the above? Maybe. phyr should have the necessary
information inside it to populate a VMA - eventually even fully
correctly with all the right cacheable/encrypted/forbidden/etc flags.
So, you could imagine that exporters could just have one routine to
generate the phyr list and that goes into the attachment, goes into
some common code to fill VMA PTEs, and some other common code that
will convert it into the DMABUF scatterlist. If performance is not a
concern with these data structure conversions it could be an appealing
simplification.
And yes, I could imagine the meta information being descriptive enough
to support the private interconnect cases, the common code could
detect private meta information and just cleanly fail.
> At least the device mapping / dma_buf_attachment
> side should be doable with just the pfn and the new dma-api?
Yes, that would be my first goal post. Figure out some meta
information and a container data structure that allows struct
page-less P2P mapping through the new DMA API.
> > I'm hoping we can get to something where we describe not just how the
> > pfns should be DMA mapped, but also can describe how they should be
> > CPU mapped. For instance that this PFN space is always mapped
> > uncachable, in CPU and in IOMMU.
>
> I was pondering whether dma_mmap and friends would be a good place to
> prototype this and go for a fully generic implementation. But then even
> those have _wc/_uncached variants.
Given that the inability to correctly DMA map P2P MMIO without struct
page is a current pain point and current source of hacks in dmabuf
exporters, I wanted to make resolving that a priority.
However, if you mean what I described above for "fully generic [dmabuf
mmap] implementation", then we'd have the phyr datastructure as a
dependency to attempt that work.
phyr, and particularly the meta information, has a number of
stakeholders. I was thinking of going first with rdma's memory
registration flow because we are now pretty close to being able to do
such a big change, and it can demonstrate most of the requirements.
But that doesn't mean mmap couldn't go concurrently on the same agreed
datastructure if people are interested.
> > We also have current bugs in the iommu/vfio side where we are fudging
> > CC stuff, like assuming CPU memory is encrypted (not always true) and
> > that MMIO is non-encrypted (not always true)
>
> tbf CC pte flags I just don't grok at all. I've once tried to understand
> what current exporters and gpu drivers do and just gave up. But that's
> also a bit why I'm worried here because it's an enigma to me.
For CC, inside the secure world, is some information if each PFN
inside the VM is 'encrypted' or not. Any VM PTE (including the IOPTEs)
pointing at the PFN must match the secure world's view of
'encrypted'. The VM can ask the secure world to change its view at
runtime.
The way CC has been bolted onto the kernel so far largely hides this
from drivers, so it is hard to tell in driver code whether the PFN you
have is 'encrypted' or not. Right now the general rule (that is not
always true) is that struct page CPU memory is encrypted and
everything else is decrypted.
So right now, you can mostly ignore it and the above assumption
largely happens for you transparently.
However, soon we will have encrypted P2P MMIO which will stress this
hiding strategy.
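
As a toy illustration of the "PTE must match" rule above: a fault handler
that could query the per-PFN state might look like this, where
pfn_is_encrypted() is a made-up query (exactly the thing drivers cannot ask
today) while the pgprot helpers and vmf_insert_pfn_prot() already exist:

static vm_fault_t cc_aware_fault(struct vm_fault *vmf, unsigned long pfn)
{
        pgprot_t prot = vmf->vma->vm_page_prot;

        if (pfn_is_encrypted(pfn))              /* hypothetical query */
                prot = pgprot_encrypted(prot);  /* set the arch's encryption bit */
        else
                prot = pgprot_decrypted(prot);  /* clear it / mark shared */

        return vmf_insert_pfn_prot(vmf->vma, vmf->address, pfn, prot);
}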
> > > I thought iommuv2 (or whatever linux calls these) has full fault support
> > > and could support current move semantics. But yeah for iommu without
> > > fault support we need some kind of pin or a newly formalized revoke model.
> >
> > No, this is HW dependent, including PCI device, and I'm aware of no HW
> > that fully implements this in a way that could be useful to implement
> > arbitrary move semantics for VFIO..
>
> Hm I thought we've had at least prototypes floating around of device fault
> repair, but I guess that only works with ATS/pasid stuff and not general
> iommu traffic from devices. Definitely needs some device cooperation since
> the timeouts of a full fault are almost endless.
Yes, exactly. What all real devices I'm aware of have done is make a
subset of their traffic work with ATS and PRI, but not all their
traffic. Without *all* traffic you can't make any generic assumption
in the iommu that a transient non-present won't be fatal to the
device.
Stuff like dmabuf move semantics rely on transient non-present being
non-disruptive...
Jason
On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote:
> On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:
> > What is going wrong with your email? You replied to Simona, but
> > Simona Vetter <simona.vetter(a)ffwll.ch> is dropped from the To/CC
> > list??? I added the address back, but seems like a weird thing to
> > happen.
>
> Might also be funny mailing list stuff, depending how you get these. I
> read mails over lore and pretty much ignore cc (unless it's not also on
> any list, since those tend to be security issues) because I get cc'ed on
> way too much stuff for that to be a useful signal.
Oh I see, you are sending a Mail-followup-to header that excludes your
address, so you don't get any emails at all.. My mutt is dropping you
as well.
> Yeah I'm not worried about cpu mmap locking semantics. drm/ttm is a pretty
> clear example that you can implement dma-buf mmap with the rules we have,
> except the unmap_mapping_range might need a bit fudging with a separate
> address_space.
From my perspective the mmap thing is a bit of a side/DRM-only thing
as nothing I'm interested in wants to mmap dmabuf into a VMA.
However, I think if you have locking rules that can fit into a VMA
fault path and link move_notify to unmap_mapping_range() then you've
got a pretty usable API.
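
A minimal sketch of that exporter-side pairing; struct my_buffer and its
fields are placeholders, and note that dma_buf_move_notify() has to be
called with the reservation lock held:

static void my_exporter_evict(struct my_buffer *buf)
{
        dma_resv_assert_held(buf->dmabuf->resv);

        /* zap all CPU PTEs so the next CPU access faults again */
        unmap_mapping_range(buf->mapping, 0, buf->size, 1);

        /* tell dynamic importers to drop and rebuild their DMA mappings */
        dma_buf_move_notify(buf->dmabuf);
}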
> For cpu mmaps I'm more worried about the arch bits in the pte, stuff like
> caching mode or encrypted memory bits and things like that. There's
> vma->vm_pgprot, but it's a mess. But maybe this all is an incentive to
> clean up that mess a bit.
I'm convinced we need meta-data along with pfns, there is too much
stuff that needs more information than just the address. Cachability,
CC encryption, exporting device, etc. This is a topic to partially
cross when we talk about how to fully remove struct page requirements
from the new DMA API.
I'm hoping we can get to something where we describe not just how the
pfns should be DMA mapped, but also can describe how they should be
CPU mapped. For instance that this PFN space is always mapped
uncachable, in CPU and in IOMMU.
We also have current bugs in the iommu/vfio side where we are fudging
CC stuff, like assuming CPU memory is encrypted (not always true) and
that MMIO is non-encrypted (not always true)
> I thought iommuv2 (or whatever linux calls these) has full fault support
> and could support current move semantics. But yeah for iommu without
> fault support we need some kind of pin or a newly formalized revoke model.
No, this is HW dependent, including PCI device, and I'm aware of no HW
that fully implements this in a way that could be useful to implement
arbitrary move semantics for VFIO..
Jason
On 17.01.25 at 15:42, Simona Vetter wrote:
> On Wed, Jan 15, 2025 at 11:06:53AM +0100, Christian König wrote:
>> [SNIP]
>>> Anything missing?
>> Well as far as I can see this use case is not a good fit for the DMA-buf
>> interfaces in the first place. DMA-buf deals with devices and buffer
>> exchange.
>>
>> What's necessary here instead is to give an importing VM full access on some
>> memory for their specific use case.
>>
>> That full access includes CPU and DMA mappings, modifying caching
>> attributes, potentially setting encryption keys for specific ranges etc....
>> etc...
>>
>> In other words we have a lot of things the importer here should be able to
>> do which we don't want most of the DMA-buf importers to do.
> This proposal isn't about forcing existing exporters to allow importers to
> do new stuff. That stays as-is, because it would break things.
>
> It's about adding yet another interface to get at the underlying data, and
> we have tons of those already. The only difference is that if we don't
> butcher the design, we'll be able to implement all the existing dma-buf
> interfaces on top of this new pfn interface, for some neat maximal
> compatibility.
That sounds like you missed my concern.
When an exporter and an importer agree that they want to exchange PFNs
instead of DMA addresses then that is perfectly fine.
The problems start when you define the semantics of how those PFNs, DMA
addresses, private bus addresses, whatever, are exchanged differently from
what we have documented for DMA-buf.
These semantics are very well defined for DMA-buf now, because that is
really important; otherwise things usually seem to work under testing
(e.g. without memory pressure) and then totally fall apart in production
environments.
In other words we have defined what lock you need to hold when calling
functions, what a DMA fence is, when exchanged addresses are valid etc...
> But fundamentally there's never been an expectation that you can take any
> arbitrary dma-buf and pass it to any arbitrary importer, and that it must
> work. The fundamental promise is that if it _does_ work, then
> - it's zero copy
> - and fast, or as fast as we can make it
>
> I don't see this as any different from all the much more specific proposals
> and existing code, where a subset of importers/exporters have special
> rules so that e.g. gpu interconnect or vfio uuid based sharing works.
> pfn-based sharing is just yet another flavor that exists to get the max
> amount of speed out of interconnects.
Please take another look at what is proposed here. The function is
called dma_buf_get_pfn_*unlocked* !
This is not following DMA-buf semantics for exchanging addresses and
keeping them valid, but rather something more like userptrs.
Inserting PFNs into CPU (or probably also IOMMU) page tables has
different semantics than what DMA-buf usually does, because as soon as
the address is written into the page table it is made public. So you
need some kind of mechanism to make sure that this address you made public
stays valid as long as it is public.
The usual I/O operations we encapsulate with DMA fences have
fundamentally different semantics, because we have the lock which
enforces that stuff stays valid and then a DMA fence which notes
how long the stuff needs to stay valid for an operation to complete.
Regards,
Christian.
>
> Cheers, Sima
>
>> The semantics for things like pin vs revocable vs dynamic/moveable seems
>> similar, but that's basically it.
>>
>> As far as I know the TEE subsystem also represents its allocations as file
>> descriptors. If I'm not completely mistaken this use case most likely fits
>> better there.
>>
>>> I feel like this is small enough that m-l archives is good enough. For
>>> some of the bigger projects we do in graphics we sometimes create entries
>>> in our kerneldoc with wip design consensus and things like that. But
>>> feels like overkill here.
>>>
>>>> My general desire is to move all of RDMA's MR process away from
>>>> scatterlist and work using only the new DMA API. This will save *huge*
>>>> amounts of memory in common workloads and be the basis for non-struct
>>>> page DMA support, including P2P.
>>> Yeah a more memory efficient structure than the scatterlist would be
>>> really nice. That would even benefit the very special dma-buf exporters
>>> where you cannot get a pfn and only the dma_addr_t, although most of those
>>> (all maybe even?) have contig buffers, so your scatterlist has only one
>>> entry. But it would definitely be nice from a design pov.
>> Completely agree on that part.
>>
>> Scatterlists have some design flaws, especially mixing the input and output
>> parameters of the DMA API into the same structure.
>>
>> Additionally, DMA addresses are basically missing information about which bus
>> they belong to and details of how the access should be made (e.g. snoop vs
>> no-snoop etc.).
>>
>>> Aside: A way to more efficiently create compressed scatterlists would be
>>> neat too, because a lot of drivers hand-roll that and it's a bit brittle
>>> and kinda silly to duplicate. With compressed I mean just a single entry
>>> for a contig range, in practice thanks to huge pages/folios and allocators
>>> trying to hand out contig ranges if there's plenty of memory that saves a
>>> lot of memory too. But currently it's a bit a pain to construct these
>>> efficiently, mostly it's just a two-pass approach and then trying to free
>>> surplus memory or krealloc to fit. Also I don't have good ideas here, but
>>> dma-api folks might have some from looking at too many things that create
>>> scatterlists.
>> I mailed with Christoph about that a while back as well and we both agreed
>> that it would probably be a good idea to start defining a data structure to
>> better encapsulate DMA addresses.
>>
>> It's just that nobody had time for that yet and/or I wasn't looped in in the
>> final discussion about it.
>>
>> Regards,
>> Christian.
>>
>>> -Sima
On 21.06.24 at 00:02, Xu Yilun wrote:
> On Thu, Jan 16, 2025 at 04:13:13PM +0100, Christian König wrote:
>> On 15.01.25 at 18:09, Jason Gunthorpe wrote:
>>
>> On Wed, Jan 15, 2025 at 05:34:23PM +0100, Christian König wrote:
>>
>> Granted, let me try to improve this.
>> Here is a real world example of one of the issues we ran into and why
>> CPU mappings of importers are redirected to the exporter.
>> We have a good bunch of different exporters who track the CPU mappings
>> of their backing store using address_space objects in one way or
>> another and then uses unmap_mapping_range() to invalidate those CPU
>> mappings.
>> But when importers get the PFNs of the backing store they can look
>> behind the curtain and directly insert this PFN into the CPU page
>> tables.
>> We had literally tons of cases like this where drivers developers cause
>> access after free issues because the importer created a CPU mappings on
>> their own without the exporter knowing about it.
>> This is just one example of what we ran into. Additional to that
>> basically the whole synchronization between drivers was overhauled as
>> well because we found that we can't trust importers to always do the
>> right thing.
>>
>> But this, fundamentally, is importers creating attachments and then
>> *ignoring the lifetime rules of DMABUF*. If you created an attachment,
>> got a move and *ignored the move* because you put the PFN in your own
>> VMA, then you are not following the attachment lifetime rules!
>>
>> Move notify is solely for informing the importer that they need to
>> re-fresh their DMA mappings and eventually block for ongoing DMA to end.
>>
>> These semantics don't work well for CPU mappings because you need to hold
>> the reservation lock to make sure that the information stays valid, and you
>> can't hold a lock while returning from a page fault.
> Dealing with CPU mapping and resource invalidation is a little hard, but it is
> resolvable by using other types of locks. And I guess for now dma-buf
> exporters should always handle this CPU mapping vs. invalidation contention if
> they support mmap().
>
> Since it is resolvable, with some invalidation notification a decent importer
> could also handle the contention well.
That doesn't work like this.
See, page table updates under DMA-buf work by using the same locking
approach for both the validation and invalidation side. In other words
we hold the same lock while inserting and removing entries into/from the
page tables.
That this here should be an unlocked API means you can only use it with
pre-allocated and hard-pinned memory without any chance to invalidate it
while running. Otherwise you can never be sure of the validity of the
address information you got from the exporter.
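
A minimal sketch of that locking pattern; only struct my_importer and the
importer_* page-table helpers are hypothetical, the dma-buf/dma-resv calls
are used as they exist today:

struct my_importer {
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;
};

static int importer_map(struct my_importer *imp)
{
        struct dma_resv *resv = imp->attach->dmabuf->resv;
        int ret = 0;

        dma_resv_lock(resv, NULL);
        imp->sgt = dma_buf_map_attachment(imp->attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(imp->sgt))
                ret = PTR_ERR(imp->sgt);
        else
                importer_fill_page_tables(imp);        /* hypothetical */
        dma_resv_unlock(resv);
        return ret;
}

/* the exporter calls this with the very same reservation lock already held */
static void importer_move_notify(struct dma_buf_attachment *attach)
{
        struct my_importer *imp = attach->importer_priv;

        dma_resv_assert_held(attach->dmabuf->resv);
        importer_zap_page_tables(imp);                 /* hypothetical */
        dma_buf_unmap_attachment(attach, imp->sgt, DMA_BIDIRECTIONAL);
        imp->sgt = NULL;
}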
> IIUC now the only concern is importer device drivers are easier to do
> something wrong, so move CPU mapping things to exporter. But most of the
> exporters are also device drivers, why they are smarter?
Exporters always use their invalidation code path no matter if they are
exporting their buffers for others to use or if they are standalone.
If you do the invalidation on the importer side you always need both
exporter and importer around to test it.
Additionally, we have many more importers than exporters. E.g. a
lot of simple drivers only import DMA-heap buffers and never export
anything.
> And there are increasing mapping needs: today exporters help handle the CPU primary
> mapping, tomorrow should they also help with all other mappings? Clearly that is
> not feasible. So maybe conditionally give trust to some importers.
Why should that be necessary? Exporters *must* know what somebody does
with their buffers.
If you have a use case the exporter doesn't support in its mapping
operation then that use case most likely doesn't work in the first place.
For example, direct I/O is enabled/disabled by exporters on their CPU
mappings based on whether that works correctly for them. An importer simply
doesn't know whether it should use vm_insert_pfn() or vm_insert_page().
We could of course implement that logic in each importer to choose
between the different approaches, but then each importer gains logic it
only exercises with a specific exporter. And that doesn't seem to be a
good idea at all.
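
A rough sketch of that decision in an exporter's own fault handler - struct
my_heap_buffer and its fields are invented for the example:

static vm_fault_t my_exporter_fault(struct vm_fault *vmf)
{
        struct my_heap_buffer *buf = vmf->vma->vm_private_data;

        if (vmf->pgoff >= buf->pagecount)
                return VM_FAULT_SIGBUS;

        if (buf->has_struct_pages)
                /* page-backed store: refcounted, GUP/direct I/O can work */
                return vmf_insert_page(vmf->vma, vmf->address,
                                       buf->pages[vmf->pgoff]);

        /* raw PFN (e.g. MMIO or carved-out memory): mapped as special */
        return vmf_insert_pfn(vmf->vma, vmf->address,
                              buf->pfns[vmf->pgoff]);
}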
Regards,
Christian.
>
> Thanks,
> Yilun
Hi Jyothi,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 55bcd2e0d04c1171d382badef1def1fd04ef66c5]
url: https://github.com/intel-lab-lkp/linux/commits/Jyothi-Kumar-Seerapu/dmaengi…
base: 55bcd2e0d04c1171d382badef1def1fd04ef66c5
patch link: https://lore.kernel.org/r/20250120095753.25539-3-quic_jseerapu%40quicinc.com
patch subject: [PATCH v5 2/2] i2c: i2c-qcom-geni: Add Block event interrupt support
config: arc-randconfig-001-20250120 (https://download.01.org/0day-ci/archive/20250120/202501202159.wLRVO16t-lkp@…)
compiler: arceb-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250120/202501202159.wLRVO16t-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202501202159.wLRVO16t-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/i2c/busses/i2c-qcom-geni.c:599: warning: Excess function parameter 'dev' description in 'geni_i2c_gpi_multi_desc_unmap'
Kconfig warnings: (for reference only)
WARNING: unmet direct dependencies detected for OMAP2PLUS_MBOX
Depends on [n]: MAILBOX [=y] && (ARCH_OMAP2PLUS || ARCH_K3)
Selected by [m]:
- TI_K3_M4_REMOTEPROC [=m] && REMOTEPROC [=y] && (ARCH_K3 || COMPILE_TEST [=y])
vim +599 drivers/i2c/busses/i2c-qcom-geni.c
589
590 /**
591 * geni_i2c_gpi_multi_desc_unmap() - unmaps the buffers post multi message TX transfers
592 * @dev: pointer to the corresponding dev node
593 * @gi2c: i2c dev handle
594 * @msgs: i2c messages array
595 * @peripheral: pointer to the gpi_i2c_config
596 */
597 static void geni_i2c_gpi_multi_desc_unmap(struct geni_i2c_dev *gi2c, struct i2c_msg msgs[],
598 struct gpi_i2c_config *peripheral)
> 599 {
600 u32 msg_xfer_cnt, wr_idx = 0;
601 struct geni_i2c_gpi_multi_desc_xfer *tx_multi_xfer = &gi2c->i2c_multi_desc_config;
602
603 /*
604 * In error case, need to unmap all messages based on the msg_idx_cnt.
605 * Non-error case unmap all the processed messages.
606 */
607 if (gi2c->err)
608 msg_xfer_cnt = tx_multi_xfer->msg_idx_cnt;
609 else
610 msg_xfer_cnt = tx_multi_xfer->irq_cnt * QCOM_I2C_GPI_NUM_MSGS_PER_IRQ;
611
612 /* Unmap the processed DMA buffers based on the received interrupt count */
613 for (; tx_multi_xfer->unmap_msg_cnt < msg_xfer_cnt; tx_multi_xfer->unmap_msg_cnt++) {
614 if (tx_multi_xfer->unmap_msg_cnt == gi2c->num_msgs)
615 break;
616 wr_idx = tx_multi_xfer->unmap_msg_cnt % QCOM_I2C_GPI_MAX_NUM_MSGS;
617 geni_i2c_gpi_unmap(gi2c, &msgs[tx_multi_xfer->unmap_msg_cnt],
618 tx_multi_xfer->dma_buf[wr_idx],
619 tx_multi_xfer->dma_addr[wr_idx],
620 NULL, (dma_addr_t)NULL);
621 tx_multi_xfer->freed_msg_cnt++;
622 }
623 }
624
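
A likely fix is simply dropping the stale @dev line from the kerneldoc,
since the function no longer takes a dev parameter, e.g.:

/**
 * geni_i2c_gpi_multi_desc_unmap() - unmaps the buffers post multi message TX transfers
 * @gi2c: i2c dev handle
 * @msgs: i2c messages array
 * @peripheral: pointer to the gpi_i2c_config
 */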
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Mon, Jun 24, 2024 at 03:59:53AM +0800, Xu Yilun wrote:
> > But it also seems to me that VFIO should be able to support putting
> > the device into the RUN state
>
> Firstly I think VFIO should support putting device into *LOCKED* state.
> From LOCKED to RUN, there are many evidence fetching and attestation
> things that only guest cares. I don't think VFIO needs to opt-in.
VFIO is not just about running VMs. If someone wants to run DPDK on
VFIO they should be able to get the device into a RUN state and work
with secure memory without requiring a KVM. Yes there are many steps
to this, but we should imagine how it can work.
> > without involving KVM or cVMs.
>
> It may not be feasible for all vendors.
It must be. A CC guest with an in-kernel driver can definitely get the
PCI device into RUN, so VFIO running in the guest should be able to as
well.
> I believe AMD would have one firmware call that requires cVM handle
> *AND* move device into LOCKED state. It really depends on firmware
> implementation.
IMHO, you would not use the secure firmware if you are not using VMs.
> Yes, the secure EPT is in the secure world and managed by TDX firmware.
> Now a SW Mirror Secure EPT is introduced in KVM and managed by KVM
> directly, and KVM will finally use firmware calls to propagate Mirror
> Secure EPT changes to secure EPT.
If the secure world managed it then the secure world can have rules
that work with the IOMMU as well..
Jason
On Fri, Jan 17, 2025 at 09:57:40AM +0800, Baolu Lu wrote:
> On 1/15/25 21:01, Jason Gunthorpe wrote:
> > On Wed, Jan 15, 2025 at 11:57:05PM +1100, Alexey Kardashevskiy wrote:
> > > On 15/1/25 00:35, Jason Gunthorpe wrote:
> > > > On Tue, Jun 18, 2024 at 07:28:43AM +0800, Xu Yilun wrote:
> > > >
> > > > > > is needed so the secure world can prepare anything it needs prior to
> > > > > > starting the VM.
> > > > > OK. From Dan's patchset there are some touch point for vendor tsm
> > > > > drivers to do secure world preparation. e.g. pci_tsm_ops::probe().
> > > > >
> > > > > Maybe we could move to Dan's thread for discussion.
> > > > >
> > > > > https://lore.kernel.org/linux-
> > > > > coco/173343739517.1074769.13134786548545925484.stgit@dwillia2-
> > > > > xfh.jf.intel.com/
> > > > I think Dan's series is different, any uapi from that series should
> > > > not be used in the VMM case. We need proper vfio APIs for the VMM to
> > > > use. I would expect VFIO to be calling some of that infrastructure.
> > > Something like this experiment?
> > >
> > > https://github.com/aik/linux/commit/
> > > ce052512fb8784e19745d4cb222e23cabc57792e
> > Yeah, maybe, though I don't know which of vfio/iommufd/kvm should be
> > hosting those APIs, the above does seem to be a reasonable direction.
> >
> > When the various fds are closed I would expect the kernel to unbind
> > and restore the device back.
>
> I am curious about the value of tsm binding against an iommufd_vdevice
> instead of the physical iommufd_device.
Interesting question
> It is likely that the kvm pointer should be passed to iommufd during the
> creation of a viommu object.
Yes, I fully expect this
> If my recollection is correct, the arm
> smmu-v3 needs it to obtain the vmid to setup the userspace event queue:
Right now it will use a VMID unrelated to KVM. BTM support on ARM will
require syncing the VMID with KVM.
AMD and Intel may require the KVM for some reason as well.
For CC I'm expecting the KVM fd to be the handle for the cVM, so any
RPCs that want to call into the secure world need the KVM FD to get
the cVM's identifier. Ie a "bind to cVM" RPC will need the PCI
information and the cVM's handle.
From that perspective it does make sense that any cVM related APIs,
like "bind to cVM" would be against the VDEVICE where we have a link
to the VIOMMU which has the KVM. On the iommufd side the VIOMMU is
part of the object hierarchy, but does not necessarily have to force a
vIOMMU to appear in the cVM.
But it also seems to me that VFIO should be able to support putting
the device into the RUN state without involving KVM or cVMs.
> Intel TDX connect implementation also needs a reference to the kvm
> pointer to obtain the secure EPT information. This is crucial because
> the CPU's page table must be shared with the iommu.
I thought kvm folks were NAKing this sharing entirely? Or is the
secure EPT in the secure world and not directly managed by Linux?
AFAIK AMD is going to mirror the iommu page table like today.
ARM, I suspect, will not have an "EPT" under Linux control, so
whatever happens will be hidden in their secure world.
Jason
On 08.01.25 at 20:22, Xu Yilun wrote:
> On Wed, Jan 08, 2025 at 07:44:54PM +0100, Simona Vetter wrote:
>> On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote:
>>> On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
>>>> On 08.01.25 at 15:58, Jason Gunthorpe wrote:
>>>>> I have imagined a staged approach were DMABUF gets a new API that
>>>>> works with the new DMA API to do importer mapping with "P2P source
>>>>> information" and a gradual conversion.
>>>> To make it clear as maintainer of that subsystem I would reject such a step
>>>> with all I have.
>>> This is unexpected, so you want to just leave dmabuf broken? Do you
>>> have any plan to fix it, to fix the misuse of the DMA API, and all
>>> the problems I listed below? This is a big deal, it is causing real
>>> problems today.
>>>
>>> If it going to be like this I think we will stop trying to use dmabuf
>>> and do something simpler for vfio/kvm/iommufd :(
>> As the gal who help edit the og dma-buf spec 13 years ago, I think adding
>> pfn isn't a terrible idea. By design, dma-buf is the "everything is
>> optional" interface. And in the beginning, even consistent locking was
>> optional, but we've managed to fix that by now :-/
Well you were also the person who mangled the struct page pointers in
the scatterlist because people were abusing this and getting a bloody
nose :)
>> Where I do agree with Christian is that stuffing pfn support into the
>> dma_buf_attachment interfaces feels a bit much wrong.
> So it could a dmabuf interface like mmap/vmap()? I was also wondering
> about that. But finally I start to use dma_buf_attachment interface
> because of leveraging existing buffer pin and move_notify.
Exactly, that's the point: sharing PFNs doesn't work with the pin and
move_notify interfaces because of the MMU notifier approach Sima mentioned.
>>>> We have already gone down that road and it didn't worked at all and
>>>> was a really big pain to pull people back from it.
>>> Nobody has really seriously tried to improve the DMA API before, so I
>>> don't think this is true at all.
>> Aside, I really hope this finally happens!
Sorry, my fault. I was not talking about the DMA API, but rather about
people trying to look behind the curtain of DMA-buf backing stores.
In other words, all the fun we had with scatterlists and people trying
to modify the struct pages inside of them.
Improving the DMA API is something I really really hope for as well.
>>>>> 3) Importing devices need to know if they are working with PCI P2P
>>>>> addresses during mapping because they need to do things like turn on
>>>>> ATS on their DMA. As for multi-path we have the same hacks inside mlx5
>>>>> today that assume DMABUFs are always P2P because we cannot determine
>>>>> if things are P2P or not after being DMA mapped.
>>>> Why would you need ATS on PCI P2P and not for system memory accesses?
>>> ATS has a significant performance cost. It is mandatory for PCI P2P,
>>> but ideally should be avoided for CPU memory.
>> Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p
>> stuff a bit I guess ...
Huh? Why should ATS be mandatory for PCI P2P?
We have tons of production systems using PCI P2P without ATS. And this is
the first time I've heard that.
>>>>> 5) iommufd and kvm are both using CPU addresses without DMA. No
>>>>> exporter mapping is possible
>>>> We have customers using both KVM and XEN with DMA-buf, so I can clearly
>>>> confirm that this isn't true.
>>> Today they are mmaping the dma-buf into a VMA and then using KVM's
>>> follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
>>> dma-buf must have a CPU PFN.
>>>
>>> Here Xu implements basically the same path, except without the VMA
>>> indirection, and it suddenly not OK? Illogical.
>> So the big difference is that for follow_pfn() you need mmu_notifier since
>> the mmap might move around, whereas with pfn smashed into
>> dma_buf_attachment you need dma_resv_lock rules, and the move_notify
>> callback if you go dynamic.
>>
>> So I guess my first question is, which locking rules do you want here for
>> pfn importers?
> follow_pfn() is unwanted for private MMIO, so dma_resv_lock.
As Sima explained you either have follow_pfn() and mmu_notifier or you
have DMA addresses and dma_resv lock / dma_fence.
Just giving out PFNs without some lifetime associated with them is one
of the major problems we faced before and really not something you can do.
>> If mmu notifiers is fine, then I think the current approach of follow_pfn
>> should be ok. But if you instead dma_resv_lock rules (or the cpu mmap
>> somehow is an issue itself), then I think the clean design is to create a new
> cpu mmap() is an issue, this series is aimed to eliminate userspace
> mapping for private MMIO resources.
Why?
>> separate access mechanism just for that. It would be the 5th or so (kernel
>> vmap, userspace mmap, dma_buf_attach and driver private stuff like
>> virtio_dma_buf.c where you access your buffer with a uuid), so really not
>> a big deal.
> OK, will think more about that.
Please note that we have follow_pfn() + mmu_notifier working for KVM/XEN
with MMIO mappings and P2P. And that required exactly zero DMA-buf
changes :)
I don't fully understand your use case, but I think it's quite likely
that we already have that working.
Regards,
Christian.
>
> Thanks,
> Yilun
>
>> And for non-contrived exporters we might be able to implement the other
>> access methods in terms of the pfn method generically, so this wouldn't
>> even be a terrible maintenance burden going forward. And meanwhile all the
>> contrived exporters just keep working as-is.
>>
>> The other part is that cpu mmap is optional, and there's plenty of strange
>> exporters who don't implement. But you can dma map the attachment into
>> plenty devices. This tends to mostly be a thing on SoC devices with some
>> very funky memory. But I guess you don't care about these use-case, so
>> should be ok.
>>
>> I couldn't come up with a good name for these pfn users, maybe
>> dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe
>> some of these new p2p source specifiers (or a list of those which are
>> allowed, no idea how this would need to fit into the new dma api).
>>
>> Cheers, Sima
>> --
>> Simona Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
On 16.01.25 at 02:46, Zhaoyang Huang wrote:
> On Wed, Jan 15, 2025 at 7:49 PM Christian König
> <christian.koenig(a)amd.com> wrote:
>> On 15.01.25 at 07:18, zhaoyang.huang wrote:
>>> From: Zhaoyang Huang<zhaoyang.huang(a)unisoc.com>
>>>
>>> When using dma-buf as a memory pool for a VMM, vmf_insert_pfn() will
>>> apply PTE_SPECIAL to the pte, which makes vm_normal_page() report a
>>> bad_pte and return NULL. This commit suggests replacing
>>> vmf_insert_pfn() with vmf_insert_page().
>> Setting PTE_SPECIAL is completely intentional here to prevent
>> get_user_pages() from working on DMA-buf mappings.
> ok. May I ask the reason?
Drivers using this interface own the backing store for their specific
use cases. There are a couple of things get_user_pages(),
pin_user_pages(), direct I/O etc. do which usually clash with those use
cases. So that is intentionally completely disabled.
We have the possibility to create a DMA-buf from memfd object and you
can then do direct I/O to the memfd and still use the DMA-buf with GPUs
or V4L for example.
>> So absolutely clear NAK to this patch here.
>>
>> What exactly are you trying to do?
> I would like pKVM to have the guest kernel fault in its second-stage
> page fault (ARM64's memory virtualization method) on a dma-buf which
> uses pin_user_pages().
Yeah, exactly, that's one of the use cases which we intentionally prevent
here.
The backing stores drivers use don't care about the pin count of the
memory and happily give it back to memory pools and/or swap it with
device-local memory if necessary.
When this happens the ARM VM wouldn't be informed of the change and
would potentially access the wrong address.
So sorry, but this approach won't work.
You could try with the memfd+DMA-buf approach I mentioned earlier, but
that won't give you all functionality on all DMA-buf supporting devices.
For example GPUs usually can't scan out to a monitor from such buffers
because of hardware limitations.
Regards,
Christian.
>> Regards,
>> Christian.
>>
>>> [ 103.402787] kvm [5276]: gfn(ipa)=0x80000 hva=0x7d4a400000 write_fault=0
>>> [ 103.403822] BUG: Bad page map in process crosvm_vcpu0 pte:168000140000f43 pmd:8000000c1ca0003
>>> [ 103.405144] addr:0000007d4a400000 vm_flags:040400fb anon_vma:0000000000000000 mapping:ffffff8085163df0 index:0
>>> [ 103.406536]file:dmabuf fault:cma_heap_vm_fault [cma_heap] mmap:dma_buf_mmap_internal read_folio:0x0
>>> [ 103.407877] CPU: 3 PID: 5276 Comm: crosvm_vcpu0 Tainted: G W OE 6.6.46-android15-8-g8bab72b63c20-dirty-4k #1 1e474a12dac4553a3ebba3a911f3b744176a5d2d
>>> [ 103.409818] Hardware name: Unisoc UMS9632-base Board (DT)
>>> [ 103.410613] Call trace:
>>> [ 103.411038] dump_backtrace+0xf4/0x140
>>> [ 103.411641] show_stack+0x20/0x30
>>> [ 103.412184] dump_stack_lvl+0x60/0x84
>>> [ 103.412766] dump_stack+0x18/0x24
>>> [ 103.413304] print_bad_pte+0x1b8/0x1cc
>>> [ 103.413909] vm_normal_page+0xc8/0xd0
>>> [ 103.414491] follow_page_pte+0xb0/0x304
>>> [ 103.415096] follow_page_mask+0x108/0x240
>>> [ 103.415721] __get_user_pages+0x168/0x4ac
>>> [ 103.416342] __gup_longterm_locked+0x15c/0x864
>>> [ 103.417023] pin_user_pages+0x70/0xcc
>>> [ 103.417609] pkvm_mem_abort+0xf8/0x5c0
>>> [ 103.418207] kvm_handle_guest_abort+0x3e0/0x3e4
>>> [ 103.418906] handle_exit+0xac/0x33c
>>> [ 103.419472] kvm_arch_vcpu_ioctl_run+0x48c/0x8d8
>>> [ 103.420176] kvm_vcpu_ioctl+0x504/0x5bc
>>> [ 103.420785] __arm64_sys_ioctl+0xb0/0xec
>>> [ 103.421401] invoke_syscall+0x60/0x11c
>>> [ 103.422000] el0_svc_common+0xb4/0xe8
>>> [ 103.422590] do_el0_svc+0x24/0x30
>>> [ 103.423131] el0_svc+0x3c/0x70
>>> [ 103.423640] el0t_64_sync_handler+0x68/0xbc
>>> [ 103.424288] el0t_64_sync+0x1a8/0x1ac
>>>
>>> Signed-off-by: Xiwei Wang<xiwei.wang1(a)unisoc.com>
>>> Signed-off-by: Aijun Sun<aijun.sun(a)unisoc.com>
>>> Signed-off-by: Zhaoyang Huang<zhaoyang.huang(a)unisoc.com>
>>> ---
>>> drivers/dma-buf/heaps/cma_heap.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c
>>> index c384004b918e..b301fb63f16b 100644
>>> --- a/drivers/dma-buf/heaps/cma_heap.c
>>> +++ b/drivers/dma-buf/heaps/cma_heap.c
>>> @@ -168,7 +168,7 @@ static vm_fault_t cma_heap_vm_fault(struct vm_fault *vmf)
>>> if (vmf->pgoff > buffer->pagecount)
>>> return VM_FAULT_SIGBUS;
>>>
>>> - return vmf_insert_pfn(vma, vmf->address, page_to_pfn(buffer->pages[vmf->pgoff]));
>>> + return vmf_insert_page(vma, vmf->address, buffer->pages[vmf->pgoff]);
>>> }
>>>
>>> static const struct vm_operations_struct dma_heap_vm_ops = {
On Wed, Jan 15, 2025 at 09:55:29AM +0100, Simona Vetter wrote:
> I think for 90% of exporters pfn would fit, but there's some really funny
> ones where you cannot get a cpu pfn by design. So we need to keep the
> pfn-less interfaces around. But ideally for the pfn-capable exporters we'd
> have helpers/common code that just implements all the other interfaces.
There is no way to have a DMA address without a PFN in Linux right now.
How would you generate them? That implies you have an IOMMU that can
generate IOVAs for something that doesn't have a physical address at
all.
Or do you mean some that don't have pages associated with them, and
thus have pfn_valid fail on them? They still have a PFN, just not
one that is valid to use in most of the Linux MM.
On Wed, Jan 15, 2025 at 11:57:05PM +1100, Alexey Kardashevskiy wrote:
> On 15/1/25 00:35, Jason Gunthorpe wrote:
> > On Tue, Jun 18, 2024 at 07:28:43AM +0800, Xu Yilun wrote:
> >
> > > > is needed so the secure world can prepare anything it needs prior to
> > > > starting the VM.
> > >
> > > OK. From Dan's patchset there are some touch point for vendor tsm
> > > drivers to do secure world preparation. e.g. pci_tsm_ops::probe().
> > >
> > > Maybe we could move to Dan's thread for discussion.
> > >
> > > https://lore.kernel.org/linux-coco/173343739517.1074769.1313478654854592548…
> >
> > I think Dan's series is different, any uapi from that series should
> > not be used in the VMM case. We need proper vfio APIs for the VMM to
> > use. I would expect VFIO to be calling some of that infrastructure.
>
> Something like this experiment?
>
> https://github.com/aik/linux/commit/ce052512fb8784e19745d4cb222e23cabc57792e
Yeah, maybe, though I don't know which of vfio/iommufd/kvm should be
hosting those APIs, the above does seem to be a reasonable direction.
When the various fds are closed I would expect the kernel to unbind
and restore the device back.
Jason
On 15.01.25 at 09:55, Simona Vetter wrote:
>>> If we add something
>>> new, we need clear rules and not just "here's the kvm code that uses it".
>>> That's how we've done dma-buf at first, and it was a terrible mess of
>>> mismatched expectations.
>> Yes, that would be wrong. It should be self defined within dmabuf and
>> kvm should adopt to it, move semantics and all.
> Ack.
>
> I feel like we have a plan here.
I think I have to object a bit on that.
> Summary from my side:
>
> - Sort out pin vs revocable vs dynamic/moveable semantics, make sure
> importers have no surprises.
>
> - Adopt whatever new dma-api datastructures pops out of the dma-api
> reworks.
>
> - Add pfn based memory access as yet another optional access method, with
> helpers so that exporters who support this get all the others for free.
>
> I don't see a strict ordering between these, imo should be driven by
> actual users of the dma-buf api.
>
> Already done:
>
> - dmem cgroup so that we can resource control device pinnings just landed
> in drm-next for next merge window. So that part is imo sorted and we can
> charge ahead with pinning into device memory without all the concerns
> we've had years ago when discussing that for p2p dma-buf support.
>
> But there might be some work so that we can p2p pin without requiring
> dynamic attachments only, I haven't checked whether that needs
> adjustment in dma-buf.c code or just in exporters.
>
> Anything missing?
Well as far as I can see this use case is not a good fit for the DMA-buf
interfaces in the first place. DMA-buf deals with devices and buffer
exchange.
What's necessary here instead is to give an importing VM full access on
some memory for their specific use case.
That full access includes CPU and DMA mappings, modifying caching
attributes, potentially setting encryption keys for specific ranges
etc.... etc...
In other words we have a lot of things the importer here should be able
to do which we don't want most of the DMA-buf importers to do.
The semantics for things like pin vs revocable vs dynamic/moveable seem
similar, but that's basically it.
As far as I know the TEE subsystem also represents its allocations as
file descriptors. If I'm not completely mistaken this use case most
likely fits better there.
> I feel like this is small enough that m-l archives is good enough. For
> some of the bigger projects we do in graphics we sometimes create entries
> in our kerneldoc with wip design consensus and things like that. But
> feels like overkill here.
>
>> My general desire is to move all of RDMA's MR process away from
>> scatterlist and work using only the new DMA API. This will save *huge*
>> amounts of memory in common workloads and be the basis for non-struct
>> page DMA support, including P2P.
> Yeah a more memory efficient structure than the scatterlist would be
> really nice. That would even benefit the very special dma-buf exporters
> where you cannot get a pfn and only the dma_addr_t, although most of those
> (all maybe even?) have contig buffers, so your scatterlist has only one
> entry. But it would definitely be nice from a design pov.
Completely agree on that part.
Scatterlists have some design flaws, especially mixing the input and
output parameters of the DMA API into the same structure.
Additionally, DMA addresses are basically missing information about which
bus they belong to and details of how the access should be made (e.g. snoop
vs no-snoop etc.).
> Aside: A way to more efficiently create compressed scatterlists would be
> neat too, because a lot of drivers hand-roll that and it's a bit brittle
> and kinda silly to duplicate. With compressed I mean just a single entry
> for a contig range, in practice thanks to huge pages/folios and allocators
> trying to hand out contig ranges if there's plenty of memory that saves a
> lot of memory too. But currently it's a bit a pain to construct these
> efficiently, mostly it's just a two-pass approach and then trying to free
> surplus memory or krealloc to fit. Also I don't have good ideas here, but
> dma-api folks might have some from looking at too many things that create
> scatterlists.
I mailed with Christoph about that a while back as well and we both
agreed that it would probably be a good idea to start defining a data
structure to better encapsulate DMA addresses.
It's just that nobody had time for that yet and/or I wasn't looped in in
the final discussion about it.
Regards,
Christian.
> -Sima
On Tue, Jan 14, 2025 at 03:44:04PM +0100, Simona Vetter wrote:
> E.g. if a compositor gets a dma-buf it assumes that by just binding that
> it will not risk gpu context destruction (unless you're out of memory and
> everything is on fire anyway, and it's ok to die). But if a nasty client
> app supplies a revocable dma-buf, then it can shoot down the higher
> privileged compositor gpu workload with precision. Which is not great, so
> maybe existing dynamic gpu importers should reject revocable dma-buf.
> That's at least what I had in mind as a potential issue.
I see, so it is not that they can't handle a non-present fault, it is
just that the non-present fault effectively turns into a crash of the
context and you want to avoid the crash. It makes sense to me to
negotiate this as part of the API.
> > This is similar to the structure BIO has, and it composes nicely with
> > a future pin_user_pages() and memfd_pin_folios().
>
> Since you mention pin here, I think that's another aspect of the revocable
> vs dynamic question. Dynamic buffers are expected to sometimes just move
> around for no reason, and importers must be able to cope.
Yes, and we have importers that can tolerate dynamic and those that
can't. Though those that can't tolerate it can often implement revoke.
I view your list as a cascade:
1) Fully pinned, can never be changed so long as the attachment is present
2) Fully pinned, but can be revoked. Revoked is a fatal condition and
the importer is allowed to experience an error
3) Fully dynamic and always present. Support for move, and
restartable fault, is required
Today in RDMA we ask the exporter if it is 1 or 3 and allow different
things. I've seen the GPU side start to offer 1 more often as it has
significant performance wins.
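
Purely as an illustration, that cascade could one day be spelled out as an
explicit, negotiable attachment property - none of these names exist today:

enum dma_buf_attach_lifetime {
        DMA_BUF_LIFETIME_PINNED,        /* 1) fixed while the attachment exists */
        DMA_BUF_LIFETIME_REVOCABLE,     /* 2) pinned, but revoke is a fatal error */
        DMA_BUF_LIFETIME_DYNAMIC,       /* 3) move_notify + restartable faults */
};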
> For revocable exporters/importers I'd expect that movement is not
> happening, meaning it's pinned until the single terminal revocation. And
> maybe I read the kvm stuff wrong, but it reads more like the latter to me
> when crawling through the pfn code.
kvm should be fully faultable and it should be able to handle move. It
handles move today using the mmu notifiers after all.
kvm would need to interact with the dmabuf reservations on its page
fault path.
iommufd cannot be faultable and it would only support revoke. For VFIO
revoke would not be fully terminal as VFIO can unrevoke too
(sigh). If we make revoke special I'd like to eventually include
unrevoke for this reason.
> Once we have the lifetime rules nailed then there's the other issue of how
> to describe the memory, and my take for that is that once the dma-api has
> a clear answer we'll just blindly adopt that one and done.
This is what I hope, we are not there yet, first Leon's series needs
to get merged then we can start on making the DMA API P2P safe without
any struct page. From there it should be clear what direction things
go in.
DMABUF would return pfns annotated with whatever matches the DMA API,
and the importer would be able to inspect the PFNs to learn
information like their P2Pness, CPU mappability or whatever.
I'm pushing for the extra struct, and Christoph has been thinking
about searching a maple tree on the PFN. We need to see what works best.
> And currently with either dynamic attachments and dma_addr_t or through
> fishing the pfn from the cpu pagetables there's some very clearly defined
> lifetime and locking rules (which kvm might get wrong, I've seen some
> discussions fly by where it wasn't doing a perfect job with reflecting pte
> changes, but that was about access attributes iirc).
Wouldn't surprise me, mmu notifiers are very complex all around. We've
had bugs already where the mm doesn't signal the notifiers at the
right points.
> If we add something
> new, we need clear rules and not just "here's the kvm code that uses it".
> That's how we've done dma-buf at first, and it was a terrible mess of
> mismatched expecations.
Yes, that would be wrong. It should be self defined within dmabuf and
kvm should adopt to it, move semantics and all.
My general desire is to move all of RDMA's MR process away from
scatterlist and work using only the new DMA API. This will save *huge*
amounts of memory in common workloads and be the basis for non-struct
page DMA support, including P2P.
Jason
On Tue, Jun 18, 2024 at 07:28:43AM +0800, Xu Yilun wrote:
> > is needed so the secure world can prepare anything it needs prior to
> > starting the VM.
>
> OK. From Dan's patchset there are some touch point for vendor tsm
> drivers to do secure world preparation. e.g. pci_tsm_ops::probe().
>
> Maybe we could move to Dan's thread for discussion.
>
> https://lore.kernel.org/linux-coco/173343739517.1074769.1313478654854592548…
I think Dan's series is different, any uapi from that series should
not be used in the VMM case. We need proper vfio APIs for the VMM to
use. I would expect VFIO to be calling some of that infrastructure.
Really, I don't have a clear sense of how this will look yet. AMD
provided some patches along these lines; I have not seen ARM and Intel
proposals yet, nor do I sense there is alignment.
> > Setting up secure vIOMMU emulation, for instance. I
>
> I think this could be done at VM late bind time.
The vIOMMU needs to be set up before the VM boots
> > secure. This should all be pre-arranged as possible before starting
>
> But our current implementation is not to prepare as much as possible,
> but only necessary, so most of the secure work for vPCI function is done
> at late bind time.
That's fine too, but both options need to be valid.
Jason
On Sat, Jan 11, 2025 at 11:48:06AM +0800, Xu Yilun wrote:
> > > > can be sure what is the correct UAPI. In other words, make the
> > > > VFIO device into a CC device should also prevent mmaping it and so on.
> > >
> > > My idea is prevent mmaping first, then allow VFIO device into CC dev (TDI).
> >
> > I think you need to start the TDI process much earlier. Some arches
> > are going to need work to prepare the TDI before the VM is started.
>
> Could you elaborate more on that? AFAICS Intel & AMD are all good on
> "late bind", but not sure for other architectures.
I'm not sure about this, the topic has been confused a bit, and people
often seem to misunderstand what the full scenario actually is. :\
What I'm talking about here is that you will tell the secure world to
create a vPCI function that has the potential to be secure "TDI run"
down the road. The VM will decide when it reaches the run state. This
is needed so the secure world can prepare anything it needs prior to
starting the VM. Setting up secure vIOMMU emulation, for instance. I
expect ARM will need this, I'd be surprised if AMD actually doesn't in
the full scenario with secure viommu.
It should not be a surprise to the secure world after the VM has
started that suddenly it learns about a vPCI function that wants to be
secure. This should all be pre-arranged as far as possible before starting
the VM, even if a lot of steps happen after the VM starts running (or
maybe don't happen at all).
Jason
On Fri, Jan 10, 2025 at 08:34:55PM +0100, Simona Vetter wrote:
> So if I'm getting this right, what you need from a functional pov is a
> dma_buf_tdx_mmap? Because due to tdx restrictions, the normal dma_buf_mmap
> is not going to work I guess?
Don't want something TDX specific!
There is a general desire, and CC is one, but there are other
motivations like performance, to stop using VMAs and mmaps as a way to
exchange memory between two entities. Instead we want to use FDs.
We now have memfd and guestmemfd that are usable with
memfd_pin_folios() - this covers pinnable CPU memory.
And for a long time we had DMABUF which is for all the other wild
stuff, and it supports movable memory too.
So, the normal DMABUF semantics with reservation locking and move
notifiers seem workable to me here. They are broadly similar enough to
the mmu notifier locking that they can serve the same job of updating
page tables.
> Also another thing that's a bit tricky is that kvm kinda has a 3rd dma-buf
> memory model:
> - permanently pinned dma-buf, they never move
> - dynamic dma-buf, they move through ->move_notify and importers can remap
> - revocable dma-buf, which thus far only exist for pci mmio resources
I would like to see the importers be able to discover which one is
going to be used, because we have RDMA cases where we can support 1
and 3 but not 2.
revocable doesn't require page faulting as it is a terminal condition.
> Since we're leaning even more on that 3rd model I'm wondering whether we
> should make it something official. Because the existing dynamic importers
> do very much assume that re-acquiring the memory after move_notify will
> work. But for the revocable use-case the entire point is that it will
> never work.
> I feel like that's a concept we need to make explicit, so that dynamic
> importers can reject such memory if necessary.
It strikes me as strange that HW can do page faulting, so it can
support #2, but it can't handle a non-present fault?
> So yeah there's a bunch of tricky lifetime questions that need to be
> sorted out with proper design I think, and the current "let's just use pfn
> directly" proposal hides them all under the rug.
I don't think these two things are connected. The lifetime model that
KVM needs to work with the EPT, and that VFIO needs for its MMIO,
definitely should be reviewed and evaluated.
But it is completely orthogonal to allowing iommufd and kvm to access
the CPU PFN to use in their mapping flows, instead of the
dma_addr_t.
What I want to get to is a replacement for scatter list in DMABUF that
is an array of arrays, roughly like:
struct memory_chunks {
        struct memory_p2p_provider *provider;
        struct bio_vec addrs[];
};

int (*dmabuf_get_memory)(struct memory_chunks **chunks, size_t *num_chunks);
This can represent all forms of memory: P2P, private, CPU, etc and
would be efficient with the new DMA API.
This is similar to the structure BIO has, and it composes nicely with
a future pin_user_pages() and memfd_pin_folios().
Jason
On Fri, Jan 10, 2025 at 08:24:22PM +0100, Simona Vetter wrote:
> On Thu, Jan 09, 2025 at 01:56:02AM +0800, Xu Yilun wrote:
> > > > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > > > exporter mapping is possible
> > > >
> > > > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > > > confirm that this isn't true.
> > >
> > > Today they are mmaping the dma-buf into a VMA and then using KVM's
> > > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> > > dma-buf must have a CPU PFN.
> >
> > Yes, the final target for KVM is still the CPU PFN, just with the help
> > of CPU mapping table.
> >
> > I also found the xen gntdev-dmabuf is calculating pfn from mapped
> > sgt.
>
> See the comment, it's ok because it's a fake device with fake iommu and
> the xen code has special knowledge to peek behind the curtain.
/*
 * Now convert sgt to array of gfns without accessing underlying pages.
 * It is not allowed to access the underlying struct page of an sg table
 * exported by DMA-buf, but since we deal with special Xen dma device here
 * (not a normal physical one) look at the dma addresses in the sg table
 * and then calculate gfns directly from them.
 */
for_each_sgtable_dma_page(sgt, &sg_iter, 0) {
        dma_addr_t addr = sg_page_iter_dma_address(&sg_iter);
        unsigned long pfn = bfn_to_pfn(XEN_PFN_DOWN(dma_to_phys(dev, addr)));
*barf*
Can we please all agree that is a horrible abuse of the DMA API and
lets not point it as some acceptable "solution"? KVM and iommufd do
not have fake struct devices with fake iommus.
Jason
On Fri, Jan 10, 2025 at 12:40:28AM +0800, Xu Yilun wrote:
> So then we face with the shared <-> private device conversion in CoCo VM,
> and in turn shared <-> private MMIO conversion. MMIO region has only one
> physical backend so it is a bit like in-place conversion which is
> complicated. I wanna simply the MMIO conversion routine based on the fact
> that VMM never needs to access assigned MMIO for feature emulation, so
> always disallow userspace MMIO mapping during the whole lifecycle. That's
> why the flag is introduced.
The VMM can simply not map it for these cases. As part of the TDI
flow the kernel can validate it is not mapped.
> > can be sure what is the correct UAPI. In other words, make the
> > VFIO device into a CC device should also prevent mmaping it and so on.
>
> My idea is to prevent mmaping first, then allow turning the VFIO device into a CC dev (TDI).
I think you need to start the TDI process much earlier. Some arches
are going to need work to prepare the TDI before the VM is started.
The other issue here is that Intel is somewhat different from others
and when we build uapi for TDI it has to accommodate everyone.
> Yes. It carries out the idea of "KVM maps MMIO resources without first
> mapping them into the host" even for a normal VM. That's why I think it could
> be an independent patchset.
Yes, just remove this patch and the other TDI-focused stuff. Keep just the
infrastructure to move to FD-based mapping instead of a VMA.
Jason
On Thu, Jan 09, 2025 at 12:57:58AM +0800, Xu Yilun wrote:
> On Wed, Jan 08, 2025 at 09:30:26AM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 07, 2025 at 10:27:15PM +0800, Xu Yilun wrote:
> > > Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as
> > > for private assignment. For these private assigned devices, disallow
> > > host accessing their MMIO resources.
> >
> > Why? Shouldn't the VMM simply not call mmap? Why does the kernel have
> > to enforce this?
>
> Hmm... maybe I should not say 'host', but rather 'userspace'.
>
> I think the kernel part of the VMM (KVM) has the responsibility to enforce the
> correct behavior of the userspace part of the VMM (QEMU). QEMU has no way to
> touch private memory/MMIO, intentionally or accidentally. IIUC that's one
> of the motivations guest_memfd was introduced for private memory. Private
> MMIO follows.
Okay, but then why is it a flag like that? I'm expecting a much
broader system here to make the VFIO device into a confidential device
(like setting up the TDI) where we'd have to enforce the private things,
communicate with some secure world to assign it, and so on.
I want to see a fuller solution to the CC problem in VFIO before we
can be sure what is the correct UAPI. In other words, making the
VFIO device into a CC device should also prevent mmaping it and so on.
So, I would take this out and defer VFIO enforcement to a series which
does fuller CC enablement of VFIO.
The precursor work should just be avoiding requiring a VMA when
installing VFIO MMIO into the KVM and IOMMU stage 2 mappings. Ie by
using a FD to get the CPU pfns into iommufd and kvm as you are
showing.
This works just fine for non-CC devices anyhow and is the necessary
building block for making a TDI interface in VFIO.
Jason
Am 09.01.25 um 14:37 schrieb Philipp Stanner:
> From: Philipp Stanner <pstanner(a)redhat.com>
>
> drm_sched_backend_ops.run_job() returns a dma_fence for the scheduler.
> That fence is signalled by the driver once the hardware completed the
> associated job. The scheduler does not increment the reference count on
> that fence, but implicitly expects to inherit this fence from run_job().
>
> This is relatively subtle and prone to misunderstandings.
>
> This implies that, to keep a reference for itself, a driver needs to
> call dma_fence_get() in addition to dma_fence_init() in that callback.
>
> It's further complicated by the fact that the scheduler even decrements
> the refcount in drm_sched_run_job_work() since it created a new
> reference in drm_sched_fence_scheduled(). It does, however, still use
> its pointer to the fence after calling dma_fence_put() - which is safe
> because of the aforementioned new reference, but actually still violates
> the refcounting rules.
>
> Improve the explanatory comment for that decrement.
>
> Move the call to dma_fence_put() to the position behind the last usage
> of the fence.
>
> Document the necessity to increment the reference count in
> drm_sched_backend_ops.run_job().
>
> Signed-off-by: Philipp Stanner <pstanner(a)redhat.com>
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 10 +++++++---
> include/drm/gpu_scheduler.h | 19 +++++++++++++++----
> 2 files changed, 22 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 57da84908752..5f46c01eb01e 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1218,15 +1218,19 @@ static void drm_sched_run_job_work(struct work_struct *w)
> drm_sched_fence_scheduled(s_fence, fence);
>
> if (!IS_ERR_OR_NULL(fence)) {
> - /* Drop for original kref_init of the fence */
> - dma_fence_put(fence);
> -
> r = dma_fence_add_callback(fence, &sched_job->cb,
> drm_sched_job_done_cb);
> if (r == -ENOENT)
> drm_sched_job_done(sched_job, fence->error);
> else if (r)
> DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r);
> +
> + /*
> + * s_fence took a new reference to fence in the call to
> + * drm_sched_fence_scheduled() above. The reference passed by
> + * run_job() above is now not needed any longer. Drop it.
> + */
> + dma_fence_put(fence);
> } else {
> drm_sched_job_done(sched_job, IS_ERR(fence) ?
> PTR_ERR(fence) : 0);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 95e17504e46a..d5cd2a78f27c 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -420,10 +420,21 @@ struct drm_sched_backend_ops {
> struct drm_sched_entity *s_entity);
>
> /**
> - * @run_job: Called to execute the job once all of the dependencies
> - * have been resolved. This may be called multiple times, if
> - * timedout_job() has happened and drm_sched_job_recovery()
> - * decides to try it again.
> + * @run_job: Called to execute the job once all of the dependencies
> + * have been resolved. This may be called multiple times, if
> + * timedout_job() has happened and drm_sched_job_recovery() decides to
> + * try it again.
I just came to realize that this hack of calling run_job multiple
times won't work any more with this patch here.
The background is that you can't allocate memory for a newly returned
fence, and as far as I know no driver pre-allocates multiple HW fences
for a job.
So at least amdgpu used to re-use the same HW fence as the return value
over and over again, just re-initializing the reference count. I removed that
hack from amdgpu, but just FYI it could be that other drivers did the same.
Apart from that concern I think that this patch is really the right
thing and that driver hacks relying on the order of dropping references
are fundamentally broken approaches.
So feel free to add Reviewed-by: Christian König <christian.koenig(a)amd.com>.
Regards,
Christian.
> + *
> + * @sched_job: the job to run
> + *
> + * Returns: dma_fence the driver must signal once the hardware has
> + * completed the job ("hardware fence").
> + *
> + * Note that the scheduler expects to 'inherit' its own reference to
> + * this fence from the callback. It does not invoke an extra
> + * dma_fence_get() on it. Consequently, this callback must take a
> + * reference for the scheduler, and additional ones for the driver's
> + * respective needs.
> */
> struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>
Am 07.01.25 um 15:27 schrieb Xu Yilun:
> Introduce a new API for dma-buf importers, and also add a dma_buf_ops
> callback for dma-buf exporters. This API is for subsystem importers who
> map the dma-buf to some user defined address space, e.g. for IOMMUFD to
> map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to
> map the dma-buf to GPA via KVM MMU (e.g. EPT).
>
> Currently dma-buf is only used to get the DMA address for a device's default
> domain by using the kernel DMA APIs. But for these new use-cases, importers
> only need the pfn of the dma-buf resource to build their own mapping
> tables.
As far as I can see I have to fundamentally reject this whole approach.
It's an intentional DMA-buf design decision that we don't expose struct pages or
PFNs to the importer. Essentially DMA-buf only transports DMA addresses.
In other words the mapping is done by the exporter and *not* the importer.
What we certainly can do is annotate those DMA addresses to better
specify in which domain they are applicable, e.g. whether they are PCIe bus
addresses or some inter-device bus addresses etc...
But moving the functionality to map the pages/PFNs to DMA addresses into
the importer is an absolutely clear NO-GO.
Regards,
Christian.
> So the map_dma_buf() callback is not mandatory for exporters
> anymore. Also the importers could choose not to provide
> struct device *dev on dma_buf_attach() if they don't call
> dma_buf_map_attachment().
>
> Like dma_buf_map_attachment(), the importer should first call
> dma_buf_attach/dynamic_attach() and then call dma_buf_get_pfn_unlocked().
> If the importer chooses to do dynamic attach, it should also handle the
> dma-buf move notification.
>
> Only the unlocked version of dma_buf_get_pfn is implemented for now,
> just because no locked version is used for now.
>
> Signed-off-by: Xu Yilun <yilun.xu(a)linux.intel.com>
>
> ---
> IIUC, only get_pfn() is needed but no put_pfn(). The whole dma-buf is
> de/referenced at dma-buf attach/detach time.
>
> Specifically, for static attachment, the exporter should always make the
> memory resource available/pinned on the first dma_buf_attach(), and
> release/unpin the memory resource on the last dma_buf_detach(). For dynamic
> attachment, the exporter could populate & invalidate the memory
> resource at any time; that's OK as long as the importers follow dma-buf
> move notification. So no pinning is needed for get_pfn() and no
> put_pfn() is needed.
> ---
> drivers/dma-buf/dma-buf.c | 90 +++++++++++++++++++++++++++++++--------
> include/linux/dma-buf.h | 13 ++++++
> 2 files changed, 86 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 7eeee3a38202..83d1448b6dcc 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -630,10 +630,10 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> size_t alloc_size = sizeof(struct dma_buf);
> int ret;
>
> - if (WARN_ON(!exp_info->priv || !exp_info->ops
> - || !exp_info->ops->map_dma_buf
> - || !exp_info->ops->unmap_dma_buf
> - || !exp_info->ops->release))
> + if (WARN_ON(!exp_info->priv || !exp_info->ops ||
> + (!!exp_info->ops->map_dma_buf != !!exp_info->ops->unmap_dma_buf) ||
> + (!exp_info->ops->map_dma_buf && !exp_info->ops->get_pfn) ||
> + !exp_info->ops->release))
> return ERR_PTR(-EINVAL);
>
> if (WARN_ON(exp_info->ops->cache_sgt_mapping &&
> @@ -909,7 +909,10 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
> struct dma_buf_attachment *attach;
> int ret;
>
> - if (WARN_ON(!dmabuf || !dev))
> + if (WARN_ON(!dmabuf))
> + return ERR_PTR(-EINVAL);
> +
> + if (WARN_ON(dmabuf->ops->map_dma_buf && !dev))
> return ERR_PTR(-EINVAL);
>
> if (WARN_ON(importer_ops && !importer_ops->move_notify))
> @@ -941,7 +944,7 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
> */
> if (dma_buf_attachment_is_dynamic(attach) !=
> dma_buf_is_dynamic(dmabuf)) {
> - struct sg_table *sgt;
> + struct sg_table *sgt = NULL;
>
> dma_resv_lock(attach->dmabuf->resv, NULL);
> if (dma_buf_is_dynamic(attach->dmabuf)) {
> @@ -950,13 +953,16 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
> goto err_unlock;
> }
>
> - sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
> - if (!sgt)
> - sgt = ERR_PTR(-ENOMEM);
> - if (IS_ERR(sgt)) {
> - ret = PTR_ERR(sgt);
> - goto err_unpin;
> + if (dmabuf->ops->map_dma_buf) {
> + sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
> + if (!sgt)
> + sgt = ERR_PTR(-ENOMEM);
> + if (IS_ERR(sgt)) {
> + ret = PTR_ERR(sgt);
> + goto err_unpin;
> + }
> }
> +
> dma_resv_unlock(attach->dmabuf->resv);
> attach->sgt = sgt;
> attach->dir = DMA_BIDIRECTIONAL;
> @@ -1119,7 +1125,8 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
>
> might_sleep();
>
> - if (WARN_ON(!attach || !attach->dmabuf))
> + if (WARN_ON(!attach || !attach->dmabuf ||
> + !attach->dmabuf->ops->map_dma_buf))
> return ERR_PTR(-EINVAL);
>
> dma_resv_assert_held(attach->dmabuf->resv);
> @@ -1195,7 +1202,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
>
> might_sleep();
>
> - if (WARN_ON(!attach || !attach->dmabuf))
> + if (WARN_ON(!attach || !attach->dmabuf ||
> + !attach->dmabuf->ops->map_dma_buf))
> return ERR_PTR(-EINVAL);
>
> dma_resv_lock(attach->dmabuf->resv, NULL);
> @@ -1222,7 +1230,8 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
> {
> might_sleep();
>
> - if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
> + if (WARN_ON(!attach || !attach->dmabuf ||
> + !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
> return;
>
> dma_resv_assert_held(attach->dmabuf->resv);
> @@ -1254,7 +1263,8 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
> {
> might_sleep();
>
> - if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
> + if (WARN_ON(!attach || !attach->dmabuf ||
> + !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
> return;
>
> dma_resv_lock(attach->dmabuf->resv, NULL);
> @@ -1263,6 +1273,52 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
> }
> EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
>
> +/**
> + * dma_buf_get_pfn_unlocked -
> + * @attach: [in] attachment to get pfn from
> + * @pgoff: [in] page offset of the buffer against the start of dma_buf
> + * @pfn: [out] returns the pfn of the buffer
> + * @max_order [out] returns the max mapping order of the buffer
> + */
> +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
> + pgoff_t pgoff, u64 *pfn, int *max_order)
> +{
> + struct dma_buf *dmabuf = attach->dmabuf;
> + int ret;
> +
> + if (WARN_ON(!attach || !attach->dmabuf ||
> + !attach->dmabuf->ops->get_pfn))
> + return -EINVAL;
> +
> + /*
> + * Open:
> + *
> + * When dma_buf is dynamic but dma_buf move is disabled, the buffer
> + * should be pinned before use, See dma_buf_map_attachment() for
> + * reference.
> + *
> + * But for now no pin is intended inside dma_buf_get_pfn(), otherwise
> + * need another API to unpin the dma_buf. So just fail out this case.
> + */
> + if (dma_buf_is_dynamic(attach->dmabuf) &&
> + !IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY))
> + return -ENOENT;
> +
> + dma_resv_lock(attach->dmabuf->resv, NULL);
> + ret = dmabuf->ops->get_pfn(attach, pgoff, pfn, max_order);
> + /*
> + * Open:
> + *
> + * Is dma_resv_wait_timeout() needed? I assume no. The DMA buffer
> + * content synchronization could be done when the buffer is to be
> + * mapped by importer.
> + */
> + dma_resv_unlock(attach->dmabuf->resv);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_get_pfn_unlocked, "DMA_BUF");
> +
> /**
> * dma_buf_move_notify - notify attachments that DMA-buf is moving
> *
> @@ -1662,7 +1718,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
> attach_count = 0;
>
> list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
> - seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
> + seq_printf(s, "\t%s\n", attach_obj->dev ? dev_name(attach_obj->dev) : NULL);
> attach_count++;
> }
> dma_resv_unlock(buf_obj->resv);
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index 36216d28d8bd..b16183edfb3a 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -194,6 +194,17 @@ struct dma_buf_ops {
> * if the call would block.
> */
>
> + /**
> + * @get_pfn:
> + *
> + * This is called by dma_buf_get_pfn(). It is used to get the pfn
> + * of the buffer positioned by the page offset against the start of
> + * the dma_buf. It can only be called if @attach has been called
> + * successfully.
> + */
> + int (*get_pfn)(struct dma_buf_attachment *attach, pgoff_t pgoff,
> + u64 *pfn, int *max_order);
> +
> /**
> * @release:
> *
> @@ -629,6 +640,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
> void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
> struct sg_table *sg_table,
> enum dma_data_direction direction);
> +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
> + pgoff_t pgoff, u64 *pfn, int *max_order);
>
> int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
> unsigned long);
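For reference, a rough sketch of the attach/get_pfn call sequence described in
the commit message above, as an importer such as iommufd or KVM might use it
(the example_* pieces are invented for illustration and error handling is
abbreviated):

/* Purely illustrative importer flow for the proposed API. */
static int example_import_pfns(struct dma_buf *dmabuf)
{
	struct dma_buf_attachment *attach;
	pgoff_t pgoff, nr_pages = dmabuf->size >> PAGE_SHIFT;
	u64 pfn;
	int max_order, ret = 0;

	/* No struct device needed when dma_buf_map_attachment() is never called. */
	attach = dma_buf_attach(dmabuf, NULL);
	if (IS_ERR(attach))
		return PTR_ERR(attach);

	for (pgoff = 0; pgoff < nr_pages; pgoff++) {
		ret = dma_buf_get_pfn_unlocked(attach, pgoff, &pfn, &max_order);
		if (ret)
			break;

		/*
		 * Install the pfn into the importer's own mapping (EPT, IOVA
		 * page table, ...); max_order could be used to install larger
		 * mappings and skip ahead.
		 */
		example_install_pfn(pgoff, pfn, max_order);
	}

	dma_buf_detach(dmabuf, attach);
	return ret;
}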
On Wed, 8 Jan 2025 at 22:27, Simona Vetter <simona.vetter(a)ffwll.ch> wrote:
>
> On Tue, Dec 24, 2024 at 12:05:19PM +0530, Sumit Garg wrote:
> > Hi Simona,
> >
> > On Wed, 18 Dec 2024 at 16:36, Simona Vetter <simona.vetter(a)ffwll.ch> wrote:
> > >
> > > On Tue, Dec 17, 2024 at 11:07:36AM +0100, Jens Wiklander wrote:
> > > > Hi,
> > > >
> > > > This patch set allocates the restricted DMA-bufs via the TEE subsystem.
> > > >
> > > > The TEE subsystem handles the DMA-buf allocations since it is the TEE
> > > > (OP-TEE, AMD-TEE, TS-TEE, or perhaps a future QCOMTEE) which sets up the
> > > > restrictions for the memory used for the DMA-bufs.
> > > >
> > > > I've added a new IOCTL, TEE_IOC_RSTMEM_ALLOC, to allocate the restricted
> > > > DMA-bufs. This IOCTL reaches the backend TEE driver, allowing it to choose
> > > > how to allocate the restricted physical memory.
> > > >
> > > > TEE_IOC_RSTMEM_ALLOC takes in addition to a size and flags parameters also
> > > > a use-case parameter. This is used by the backend TEE driver to decide on
> > > > allocation policy and which devices should be able to access the memory.
> > > >
> > > > Three use-cases (Secure Video Playback, Trusted UI, and Secure Video
> > > > Recording) have been identified so far to serve as examples of what can be
> > > > expected. More use-cases can be added in the userspace ABI, but it's up to
> > > > the backend TEE drivers to provide the implementation.
> > > >
> > > > Each use-case has its own restricted memory pool since different use-cases
> > > > require isolation from different parts of the system. A restricted memory
> > > > pool can be based on a static carveout instantiated while probing the TEE
> > > > backend driver, or dynamically allocated from CMA and made restricted as
> > > > needed by the TEE.
> > > >
> > > > This can be tested on QEMU with the following steps:
> > > > repo init -u https://github.com/jenswi-linaro/manifest.git -m qemu_v8.xml \
> > > > -b prototype/sdp-v4
> > > > repo sync -j8
> > > > cd build
> > > > make toolchains -j$(nproc)
> > > > make SPMC_AT_EL=1 all -j$(nproc)
> > > > make SPMC_AT_EL=1 run-only
> > > > # login and at the prompt:
> > > > xtest --sdp-basic
> > > >
> > > > The SPMC_AT_EL=1 parameter configures the build with FF-A and an SPMC at
> > > > S-EL1 inside OP-TEE. The parameter can be changed into SPMC_AT_EL=n to test
> > > > without FF-A using the original SMC ABI instead. Please remember to do
> > > > %rm -rf ../trusted-firmware-a/build/qemu
> > > > for TF-A to be rebuilt properly using the new configuration.
> > > >
> > > > https://optee.readthedocs.io/en/latest/building/prerequisites.html
> > > > list dependencies needed to build the above.
> > > >
> > > > The tests are pretty basic, mostly checking that a Trusted Application in
> > > > the secure world can access and manipulate the memory. There are also some
> > > > negative tests for out of bounds buffers etc.
> > >
> > > I think I've dropped this on earlier encrypted dma-buf discussions for
> > > TEE, but can't find one right now ...
> >
> > Thanks for raising this query.
> >
> > >
> > > Do we have some open source userspace for this? To my knowledge we have
> > > two implementations of encrypted/content protected dma-buf in upstream
> > > right now in the amd and intel gpu drivers, and unless I'm mistaken they
> > > both have some minimal userspace supporting EXT_protected_textures:
> >
> > First of all, to clarify: the support Jens is adding here for restricted
> > shared memory allocation in the TEE subsystem is meant to be
> > generic and not specific to only the secure media pipeline use-case. Here
> > we not only have open source test applications but also open
> > source firmware (OP-TEE as a Trusted OS) [1] supporting this as a
> > core feature, where we maintain a stable and extensible ABI between the
> > kernel and the OP-TEE core.
> >
> > Restricted memory is a feature enforced by hardware-specific firewalls
> > where a particular TEE implementation governs which particular block
> > of memory is accessible to a particular peripheral or a CPU running in
> > a higher privileged mode than the Linux kernel. There can be numerous
> > use-cases surrounding that, as follows:
> >
> > - Secure media pipeline where the content gets decrypted and stored
> > in a restricted buffer which is then accessible only to media display
> > pipeline peripherals.
> > - Trusted user interface where a peripheral takes input from the user
> > and stores it in a restricted buffer which then is accessible to TEE
> > implementation only.
> > - Another possible use-case can be for the TEE implementation to store
> > key material in a restricted buffer which is only accessible to the
> > hardware crypto accelerator.
> >
> > I am sure there will be more use-cases related to this feature, but
> > those will only be possible once we provide a stable and extensible
> > restricted memory interface between the Linux user-space and the secure
> > world user-space (normally referred to as Trusted Applications).
> >
> > [1] https://github.com/OP-TEE/optee_os/pull/7159
> >
> > >
> > > https://github.com/KhronosGroup/OpenGL-Registry/blob/main/extensions/EXT/EX…
> > >
> > > It's not great, but it does just barely clear the bar in my opinion. I
> > > guess something in gstreamer or similar video pipeline framework would
> > > also do the job.
> > >
> > > Especially with the context of the uapi discussion in the v1/RFC thread I
> > > think we need more than a bare-bones testcase to make sure this works in
> > > actual use.
> >
> > Currently the TEE subsystem already supports a stable ABI for the shared
> > memory allocator between Linux user-space and secure world user-space,
> > see [2]. And the stable ABI for restricted memory is, along the
> > same lines, meant to be a vendor-neutral abstraction for user-space
> > access. The current test cases not only exercise the interface but also
> > serve as regression tests.
> >
> > I am also in favour of end-to-end open source use-cases. But I fear that
> > without progressing in a step-wise manner, as with this proposal, we
> > would rather force developers to upstream all the software pieces in
> > one go, which will be kind of a chicken-and-egg situation. I am sure
> > once this feature lands the Mediatek folks will be interested to port
> > their secure video playback patchset [3] on top of it. Similarly, other
> > silicon vendors like NXP, Qcom etc. will be motivated to do the same.
> >
> > [2] https://docs.kernel.org/userspace-api/tee.html
> > [3] https://lore.kernel.org/linux-arm-kernel/20240515112308.10171-1-yong.wu@med…
>
> We get entire opengl/vulkan driver stacks ready before we merge new drm
> drivers, I really don't think this is too hard from a technical pov. And I
> think the mediatek patches had the same issue of lacking userspace for it,
> so that's not moving things forward.
> -Sima
>
Okay, fair enough, I think I get your point. Currently we are missing
support for at least one peripheral acting as the consumer of these
restricted DMA-bufs. So I discussed with Jens offline that we can try
a crypto peripheral use-case first, which can simply be
demonstrated using the current OP-TEE client user-space.
Also, within the crypto peripheral use-case we can target symmetric
crypto first, which already has a concept of a hardware-backed
symmetric key [1]. IOW, we should be able to come up with a generic
symmetric crypto algorithm which can be supported by different crypto
accelerators using a TEE-backed restricted key DMA buffer.
[1] https://www.youtube.com/watch?v=GbcpwUBFGDw
-Sumit