Hi,
On Thu, 13 Feb 2025 at 12:40, Boris Brezillon
<boris.brezillon(a)collabora.com> wrote:
> On Thu, 13 Feb 2025 14:46:01 +0530 Sumit Garg <sumit.garg(a)linaro.org> wrote:
> > Yeah but all the prior vendor specific secure/restricted DMA heaps
> > relied on DT information.
>
> Right, but there's nothing in the DMA heap provider API forcing that.
Yeah. DMA heaps are just a way to allocate memory from a specific
place. It allows people to settle on having a single way to do
allocations from weird platform-specific places; the only weird
platform-specific part userspace needs to deal with is figuring out
the name to use. The rest is at least a unified API: the point of
dma-heaps was exactly to have a single coherent API for userspace, not
to create one API for ZONE_CMA and DT ranges and everyone else doing
their own thing.
> > Rather than that it's better
> > for the user to directly ask the TEE device to allocate restricted
> > memory without worrying about how the memory restriction gets
> > enforced.
>
> If the consensus is that restricted/protected memory allocation should
> always be routed to the TEE, sure, but I had the feeling this wasn't as
> clear as that. OTOH, using a dma-heap to expose the TEE-SDP
> implementation provides the same benefits, without making potential
> future non-TEE based implementations a pain for users. The dma-heap
> ioctl being common to all implementations, it just becomes a
> configuration matter if we want to change the heap we rely on for
> protected/restricted buffer allocation. And because heaps have
> unique/well-known names, users can still default to (or rely solely on)
> the TEE-SPD implementation if they want.
>
> > There have been several attempts with DMA heaps in the past which all
> > resulted in a very vendor specific vertically integrated solution. But
> > the solution with TEE subsystem aims to make it generic and vendor
> > agnostic.
>
> Just because all previous protected/restricted dma-heap effort
> failed to make it upstream, doesn't mean dma-heap is the wrong way of
> exposing this feature IMHO.
To be fair, having a TEE implementation does give us a much better
chance of having a sensible cross-vendor plan. And the fact it's
already (sort of accidentally and only on one platform AFAICT) ready
for a 'test' interface, where we can still exercise protected
allocation paths but without having to go through all the
platform-specific setup that is inaccessible to most people, is also
really great! That's probably been the biggest barrier to having this
tested outside of IHVs and OEMs.
But just because TEE is one good backend implementation, doesn't mean
it should be the userspace ABI. Why should userspace care that TEE has
mediated the allocation instead of it being a predefined range within
DT? How does userspace pick which TEE device to use? What advantage
does userspace get from having to have a different codepath to get a
different handle to memory? What about x86?
I think this proposal is looking at it from the wrong direction.
Instead of working upwards from the implementation to userspace, start
with userspace and work downwards. The interesting property to focus
on is allocating memory, not that EL1 is involved behind the scenes.
Cheers,
Daniel
On Thu, 13 Feb 2025 at 14:06, Boris Brezillon
<boris.brezillon(a)collabora.com> wrote:
>
> On Thu, 13 Feb 2025 12:11:52 +0530
> Sumit Garg <sumit.garg(a)linaro.org> wrote:
>
> > Hi Boris,
> >
> > On Thu, 13 Feb 2025 at 01:26, Boris Brezillon
> > <boris.brezillon(a)collabora.com> wrote:
> > >
> > > +Florent, who's working on protected-mode support in Panthor.
> > >
> > > Hi Jens,
> > >
> > > On Tue, 17 Dec 2024 11:07:36 +0100
> > > Jens Wiklander <jens.wiklander(a)linaro.org> wrote:
> > >
> > > > Hi,
> > > >
> > > > This patch set allocates the restricted DMA-bufs via the TEE subsystem.
> > >
> > > We're currently working on protected-mode support for Panthor [1] and it
> > > looks like your series (and the OP-TEE implementation that goes with
> > > it) would allow us to have a fully upstream/open solution for the
> > > protected content use case we're trying to support. I need a bit more
> > > time to play with the implementation but this looks very promising
> > > (especially the lend rstmem feature, which might help us allocate our
> > > FW sections that are supposed to execute code accessing protected
> > > content).
> >
> > Glad to hear that; if you can demonstrate an open source use case
> > based on this series, it will help to land it. We would really love
> > to see support for restricted DMA-buf consumers, be it GPU, crypto
> > accelerator, media pipeline, etc.
> >
> > >
> > > >
> > > > The TEE subsystem handles the DMA-buf allocations since it is the TEE
> > > > (OP-TEE, AMD-TEE, TS-TEE, or perhaps a future QCOMTEE) which sets up the
> > > > restrictions for the memory used for the DMA-bufs.
> > > >
> > > > I've added a new IOCTL, TEE_IOC_RSTMEM_ALLOC, to allocate the restricted
> > > > DMA-bufs. This IOCTL reaches the backend TEE driver, allowing it to choose
> > > > how to allocate the restricted physical memory.
> > >
> > > I'll probably have more questions soon, but here's one to start: any
> > > particular reason you didn't go for a dma-heap to expose restricted
> > > buffer allocation to userspace? I see you already have a cdev you can
> > > take ioctl()s from, but my understanding was that dma-heap was the
> > > standard solution for these device-agnostic/central allocators.
> >
> > This series started with the DMA heap approach here [1], but later
> > discussions [2] led us here. To point out specifically:
> >
> > - DMA heaps require reliance on DT to discover static restricted
> > region carve-outs, whereas via the TEE implementation driver (e.g.
> > OP-TEE) those can be discovered dynamically.
>
> Hm, the system heap [1] doesn't rely on any DT information AFAICT.
Yeah, but all the prior vendor-specific secure/restricted DMA heaps
relied on DT information.
> The dynamic allocation scheme, where the TEE implementation allocates a
> chunk of protected memory for us would have a similar behavior, I guess.
In a dynamic scheme, the allocation will still be from CMA or the
system heap, depending on TEE implementation capabilities, but the
restriction will be enforced via interaction with the TEE.
>
> > - Dynamic allocation of buffers and making them restricted requires
> > vendor-specific driver hooks with DMA heaps, whereas the TEE subsystem
> > abstracts that out, with the underlying TEE implementation (e.g.
> > OP-TEE) managing the dynamic buffer restriction.
>
> Yeah, the lend rstmem feature is clearly something tee specific, and I
> think that's okay to assume the user knows the protection request
> should go through the tee subsystem in that case.
Yeah, but how will the user discover that? Rather than that, it's
better for the user to directly ask the TEE device to allocate
restricted memory without worrying about how the memory restriction
gets enforced.
>
> > - The TEE subsystem already has a well-defined user-space interface
> > for managing shared memory buffers with the TEE, and restricted DMA
> > buffers will be yet another interface managed along similar lines.
>
> Okay, so the very reason I'm asking about the dma-buf heap interface is
> because there might be cases where the protected/restricted allocation
> doesn't go through the TEE (Mediatek has a TEE-free implementation
> for instance, but I realize vendor implementations are probably not the
> best selling point :-/).
You can always have a system with memory and peripheral access
permissions set up during boot (or even pre-configured hardware as a
special case) prior to booting the kernel too. But even that is
typically configured by a TEE implementation during boot, so calling
it a TEE-free implementation seems over-simplified and not a scalable
solution. However, this patchset [1] from Mediatek requires runtime
TEE interaction too.
[1] https://lore.kernel.org/linux-arm-kernel/20240515112308.10171-1-yong.wu@med…
> If we expose things as a dma-heap, we have
> a solution where integrators can pick the dma-heap they think is
> relevant for protected buffer allocations without the various drivers
> (GPU, video codec, ...) having to implement a dispatch function for all
> possible implementations. The same goes for userspace allocations,
> where passing a dma-heap name is simpler than supporting different
> ioctl()s based on the allocation backend.
There have been several attempts with DMA heaps in the past, which all
resulted in very vendor-specific, vertically integrated solutions. But
the solution with the TEE subsystem aims to be generic and
vendor-agnostic.
>
> [1]https://elixir.bootlin.com/linux/v6.13.2/source/drivers/dma-buf/heaps/sys…
-Sumit
Hi Boris,
On Thu, 13 Feb 2025 at 01:26, Boris Brezillon
<boris.brezillon(a)collabora.com> wrote:
>
> +Florent, who's working on protected-mode support in Panthor.
>
> Hi Jens,
>
> On Tue, 17 Dec 2024 11:07:36 +0100
> Jens Wiklander <jens.wiklander(a)linaro.org> wrote:
>
> > Hi,
> >
> > This patch set allocates the restricted DMA-bufs via the TEE subsystem.
>
> We're currently working on protected-mode support for Panthor [1] and it
> looks like your series (and the OP-TEE implementation that goes with
> it) would allow us to have a fully upstream/open solution for the
> protected content use case we're trying to support. I need a bit more
> time to play with the implementation but this looks very promising
> (especially the lend rstmem feature, which might help us allocate our
> FW sections that are supposed to execute code accessing protected
> content).
Glad to hear that; if you can demonstrate an open source use case
based on this series, it will help to land it. We would really love to
see support for restricted DMA-buf consumers, be it GPU, crypto
accelerator, media pipeline, etc.
>
> >
> > The TEE subsystem handles the DMA-buf allocations since it is the TEE
> > (OP-TEE, AMD-TEE, TS-TEE, or perhaps a future QCOMTEE) which sets up the
> > restrictions for the memory used for the DMA-bufs.
> >
> > I've added a new IOCTL, TEE_IOC_RSTMEM_ALLOC, to allocate the restricted
> > DMA-bufs. This IOCTL reaches the backend TEE driver, allowing it to choose
> > how to allocate the restricted physical memory.
>
> I'll probably have more questions soon, but here's one to start: any
> particular reason you didn't go for a dma-heap to expose restricted
> buffer allocation to userspace? I see you already have a cdev you can
> take ioctl()s from, but my understanding was that dma-heap was the
> standard solution for these device-agnostic/central allocators.
This series started with the DMA heap approach here [1], but later
discussions [2] led us here. To point out specifically:
- DMA heaps require reliance on DT to discover static restricted
region carve-outs, whereas via the TEE implementation driver (e.g.
OP-TEE) those can be discovered dynamically.
- Dynamic allocation of buffers and making them restricted requires
vendor-specific driver hooks with DMA heaps, whereas the TEE subsystem
abstracts that out, with the underlying TEE implementation (e.g.
OP-TEE) managing the dynamic buffer restriction.
- The TEE subsystem already has a well-defined user-space interface
for managing shared memory buffers with the TEE, and restricted DMA
buffers will be yet another interface managed along similar lines.
[1] https://lore.kernel.org/lkml/mzur3odofwwrdqnystozjgf3qtvb73wqjm6g2vf5dfsqie…
[2] https://lore.kernel.org/lkml/CAFA6WYPtp3H5JhxzgH9=z2EvNL7Kdku3EmG1aDkTS-gjF…
-Sumit
>
> Regards,
>
> Boris
>
> [1]https://lwn.net/ml/all/cover.1738228114.git.florent.tomasin@arm.com/#t
On Wed, Feb 12, 2025 at 10:29:32AM +0000, Florent Tomasin wrote:
>
>
> On 12/02/2025 10:01, Maxime Ripard wrote:
> > On Wed, Feb 12, 2025 at 09:49:56AM +0000, Florent Tomasin wrote:
> >> Note that the CMA patches were initially shared to help reproduce my
> >> environment of development, I can isolate them in a separate patch
> >> series and include a reference or "base-commit:" tag to it in the
> >> Panthor protected mode RFC, to help progress this review in another
> >> thread. It will avoid overlapping these two topics:
> >>
> >> - Multiple standalone CMA heaps support
> >> - Panthor protected mode handling
> >
> > You keep insisting on using CMA here, but it's really not clear to me
> > why you would need CMA in the first place.
> >
> > By CMA, do you mean the CMA allocator, and thus would provide buffers
> > through the usual dma_alloc_* API, or would any allocator providing
> > physically contiguous memory work?
>
> You are correct, only the CMA allocator is relevant. I needed a way to
> sub-allocate from a carved-out memory region.
I'm still confused, sorry. You're saying that you require CMA but...
> > In the latter case, would something like this work:
> > https://lore.kernel.org/all/20240515-dma-buf-ecc-heap-v1-1-54cbbd049511@ker…
>
> Thanks for sharing this link, I was not aware previous work was done
> on this aspect. The new carveout heap introduced in the series could
> probably be a good alternative. I will play-around with it and share
> some updates.
... you seem to be ok with a driver that doesn't use it?
Maxime
On Wed, Feb 12, 2025 at 09:49:56AM +0000, Florent Tomasin wrote:
> Note that the CMA patches were initially shared to help reproduce my
> environment of development, I can isolate them in a separate patch
> series and include a reference or "base-commit:" tag to it in the
> Panthor protected mode RFC, to help progress this review in another
> thread. It will avoid overlapping these two topics:
>
> - Multiple standalone CMA heaps support
> - Panthor protected mode handling
You keep insisting on using CMA here, but it's really not clear to me
why you would need CMA in the first place.
By CMA, do you mean the CMA allocator, and thus would provide buffers
through the usual dma_alloc_* API, or would any allocator providing
physically contiguous memory work?
In the latter case, would something like this work:
https://lore.kernel.org/all/20240515-dma-buf-ecc-heap-v1-1-54cbbd049511@ker…
Maxime
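For reference, the reserved-memory alternative Maxime asks about would look roughly like this in DT. This is a sketch only: the node names, addresses, and sizes are invented for illustration.

```dts
/* Illustrative only: names, addresses, and sizes are made up. */
reserved-memory {
	#address-cells = <2>;
	#size-cells = <2>;
	ranges;

	protected_pool: protected@a0000000 {
		compatible = "shared-dma-pool";
		reg = <0x0 0xa0000000 0x0 0x4000000>; /* 64 MiB carve-out */
		no-map; /* firewalled region, not for normal kernel use */
	};
};

gpu: gpu@2d000000 {
	/* ... usual GPU properties ... */
	memory-region = <&protected_pool>;
};
```

A consuming driver would then claim the region (e.g. via of_reserved_mem_device_init()) and allocate from it with the usual dma_alloc_* API; the open question in the thread is how such a static region gets shared across GPU/codec/display and userspace.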
Hi Boris,
On Fri, Feb 07, 2025 at 04:02:53PM +0100, Boris Brezillon wrote:
> Sorry for joining the party late, a couple of comments to back Akash
> and Nicolas' concerns.
>
> On Wed, 05 Feb 2025 13:14:14 -0500
> Nicolas Dufresne <nicolas(a)ndufresne.ca> wrote:
>
> > Le mercredi 05 février 2025 à 15:52 +0100, Maxime Ripard a écrit :
> > > On Mon, Feb 03, 2025 at 04:43:23PM +0000, Florent Tomasin wrote:
> > > > Hi Maxime, Nicolas
> > > >
> > > > On 30/01/2025 17:47, Nicolas Dufresne wrote:
> > > > > Le jeudi 30 janvier 2025 à 17:38 +0100, Maxime Ripard a écrit :
> > > > > > Hi Nicolas,
> > > > > >
> > > > > > On Thu, Jan 30, 2025 at 10:59:56AM -0500, Nicolas Dufresne wrote:
> > > > > > > Le jeudi 30 janvier 2025 à 14:46 +0100, Maxime Ripard a écrit :
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I started to review it, but it's probably best to discuss it here.
> > > > > > > >
> > > > > > > > On Thu, Jan 30, 2025 at 01:08:56PM +0000, Florent Tomasin wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > This is a patch series covering the support for protected mode execution in
> > > > > > > > > Mali Panthor CSF kernel driver.
> > > > > > > > >
> > > > > > > > > The Mali CSF GPUs come with the support for protected mode execution at the
> > > > > > > > > HW level. This feature requires two main changes in the kernel driver:
> > > > > > > > >
> > > > > > > > > 1) Configure the GPU with a protected buffer. The system must provide a DMA
> > > > > > > > > heap from which the driver can allocate a protected buffer.
> > > > > > > > > It can be a carved-out memory or dynamically allocated protected memory region.
> > > > > > > > > Some system includes a trusted FW which is in charge of the protected memory.
> > > > > > > > > Since this problem is integration specific, the Mali Panthor CSF kernel
> > > > > > > > > driver must import the protected memory from a device specific exporter.
> > > > > > > >
> > > > > > > > Why do you need a heap for it in the first place? My understanding of
> > > > > > > > your series is that you have a carved out memory region somewhere, and
> > > > > > > > you want to allocate from that carved out memory region your buffers.
> > > > > > > >
> > > > > > > > How is that any different from using a reserved-memory region, adding
> > > > > > > > the reserved-memory property to the GPU device and doing all your
> > > > > > > > allocation through the usual dma_alloc_* API?
> > > > > > >
> > > > > > > How do you then multiplex this region so it can be shared between
> > > > > > > GPU/Camera/Display/Codec drivers and also userspace ?
> > > > > >
> > > > > > You could point all the devices to the same reserved memory region, and
> > > > > > they would all allocate from there, including for their userspace-facing
> > > > > > allocations.
> > > > >
> > > > > I get that using a memory region is somewhat more of a HW description, and
> > > > > aligned with what a DT is supposed to describe. One of the challenges is that
> > > > > the Mediatek heap proposal ends up calling into their TEE, meaning knowing the
> > > > > region is not that useful. You actually need the TEE APP GUID and its IPC
> > > > > protocol. If we can tell drivers to use a heap instead, we can abstract that
> > > > > SoC-specific complexity. I believe each allocated address has to be mapped to
> > > > > a zone, and that can only be done in the secure application. I can imagine
> > > > > similar needs when the protection is done using some sort of VM / hypervisor.
> > > > >
> > > > > Nicolas
> > > > >
> > > >
> > > > The idea in this design is to abstract the heap management from the
> > > > Panthor kernel driver (which consumes a DMA buffer from it).
> > > >
> > > > In a system, an integrator would have implemented a secure heap driver,
> > > > and could be based on TEE or a carved-out memory with restricted access,
> > > > or else. This heap driver would be responsible of implementing the
> > > > logic to: allocate, free, refcount, etc.
> > > >
> > > > The heap would be retrieved by the Panthor kernel driver in order to
> > > > allocate protected memory to load the FW and allow the GPU to enter/exit
> > > > protected mode. This memory would not belong to a user space process.
> > > > The driver allocates it at the time of loading the FW and initialization
> > > > of the GPU HW. This is a device globally owned protected memory.
> > >
> > > The thing is, it's really not clear why you absolutely need to have the
> > > Panthor driver involved there. It won't be transparent to userspace,
> > > since you'd need an extra flag at allocation time, and the buffers
> > > behave differently. If userspace has to be aware of it, what's the
> > > advantage to your approach compared to just exposing a heap for those
> > > secure buffers, and letting userspace allocate its buffers from there?
> >
> > Unless I'm mistaken, the Panthor driver loads its own firmware. Since loading
> > the firmware requires placing the data in a protected memory region, and this
> > aspect has no exposure to userspace, how can Panthor not be implicated?
>
> Right, the very reason we need protected memory early is because some
> FW sections need to be allocated from the protected pool, otherwise the
> TEE will fault as soon as the FW enters the so-called 'protected mode'.
How does that work if you don't have some way to allocate the protected
memory? You can still submit jobs to the GPU, but you can't submit /
execute "protected jobs"?
> Now, it's not impossible to work around this limitation. For instance,
> we could load the FW without this protected section by default (what we
> do right now), and then provide a DRM_PANTHOR_ENABLE_FW_PROT_MODE
> ioctl that would take a GEM object imported from a dmabuf allocated
> from the protected dma-heap by userspace. We can then reset the FW and
> allow it to operate in protected mode after that point.
Urgh, I'd rather avoid that dance if possible :)
> This approach has two downsides though:
>
> 1. We have no way of checking that the memory we're passed is actually
> suitable for FW execution in a protected context. If we're passed
> random memory, this will likely hang the platform as soon as we enter
> protected mode.
It's a current limitation of dma-buf in general, and you'd have the same
issue right now if someone imports a buffer, or misconfigures the heap
as a !protected heap.
I'd really like to have some way to store some metadata in dma_buf, if
only to tell that the buffer is protected.
I suspect you'd also need that if you do things like do protected video
playback through a codec, get a protected frame, and want to import that
into the GPU. Depending on how you allocate it, either the codec or the
GPU or both will want to make sure it's protected.
> 2. If the driver has already booted the FW and exposed a DRI node, we might
> have GPU workloads running, and doing a FW reset might incur a slight
> delay in GPU jobs execution.
>
> I think #1 is a more general issue that applies to suspend buffers
> allocated for GPU contexts too. If we expose ioctls where we take
> protected memory buffers that can possibly lead to crashes if they are
> not real protected memory regions, and we have no way to ensure the
> memory is protected, we probably want to restrict these ioctls/modes to
> some high-privilege CAP_SYS_.
>
> For #2, that's probably something we can live with, since it's a
> one-shot thing. If it becomes an issue, we can even make sure we enable
> the FW protected-mode before the GPU starts being used for real.
>
> This being said, I think the problem applies outside Panthor, and it
> might be that the video codec can't reset the FW/HW block to switch to
> protected mode as easily as Panthor.
>
> Note that there's also downsides to the reserved-memory node approach,
> where some bootloader stage would ask the secure FW to reserve a
> portion of mem and pass this through the DT. This sort of things tend to
> be an integration mess, where you need all the pieces of the stack (TEE,
> u-boot, MTK dma-heap driver, gbm, ...) to be at a certain version to
> work properly. If we go the ioctl() way, we restrict the scope to the
> TEE, gbm/mesa and the protected-dma-heap driver, which is still a lot,
> but we've ripped the bootloader out of the equation at least.
Yeah. I also think there's two discussions in parallel here:
1) Being able to allocate protected buffers from the driver
2) Exposing an interface to allocate those to userspace
I'm not really convinced we need 2, but 1 is obviously needed from what
you're saying.
Maxime
On 03/02/2025 16:31, Florent Tomasin wrote:
> Hi Krzysztof
>
> On 30/01/2025 13:25, Krzysztof Kozlowski wrote:
>> On 30/01/2025 14:08, Florent Tomasin wrote:
>>> Allow mali-valhall-csf driver to retrieve a protected
>>> heap at probe time by passing the name of the heap
>>> as attribute to the device tree GPU node.
>>
>> Please wrap commit message according to Linux coding style / submission
>> process (neither too early nor over the limit):
>> https://elixir.bootlin.com/linux/v6.4-rc1/source/Documentation/process/subm…
> Apologies, I think I made quite a few other mistakes in the style of the
> patches I sent. I will work on improving this aspect, appreciated.
>
>> Why this cannot be passed by phandle, just like all reserved regions?
>>
>> From where do you take these protected heaps? Firmware? This would
>> explain why no relation is here (no probe ordering, no device links,
>> nothing connecting separate devices).
>
> The protected heap is generally obtained from a firmware (TEE) and could
> sometimes be a carved-out memory with restricted access.
Which is a reserved memory, isn't it?
>
> The Panthor CSF kernel driver does not own or manage the protected heap
> and is instead a consumer of it (assuming the heap is made available by
> the system integrator).
>
> I initially used a phandle, but then I realised it would introduce a new
> API to share the heap across kernel drivers. In addition, I found this
> patch series:
> -
> https://lore.kernel.org/lkml/20230911023038.30649-1-yong.wu@mediatek.com/#t
>
> which introduces a DMA Heap API to the rest of the kernel to find a
> heap by name:
> - dma_heap_find()
>
> I then decided to follow that approach to help isolate the heap
> management from the GPU driver code. In the Panthor driver, if the
> heap is not found at probe time, the driver will defer the probe until
> the exporter made it available.
I'm not really talking about the driver here, but even the above
Mediatek patchset uses reserved memory bindings.
You explained some things about the driver, yet you did not answer the
question. This looks like reserved memory. If it is not, provide
arguments why this binding cannot be a reserved-memory region, and why
the hardware is not a carve-out memory.
Best regards,
Krzysztof
Le vendredi 07 février 2025 à 16:02 +0100, Boris Brezillon a écrit :
> Sorry for joining the party late, a couple of comments to back Akash
> and Nicolas' concerns.
>
> On Wed, 05 Feb 2025 13:14:14 -0500
> Nicolas Dufresne <nicolas(a)ndufresne.ca> wrote:
>
> > Le mercredi 05 février 2025 à 15:52 +0100, Maxime Ripard a écrit :
> > > On Mon, Feb 03, 2025 at 04:43:23PM +0000, Florent Tomasin wrote:
> > > > Hi Maxime, Nicolas
> > > >
> > > > On 30/01/2025 17:47, Nicolas Dufresne wrote:
> > > > > Le jeudi 30 janvier 2025 à 17:38 +0100, Maxime Ripard a écrit :
> > > > > > Hi Nicolas,
> > > > > >
> > > > > > On Thu, Jan 30, 2025 at 10:59:56AM -0500, Nicolas Dufresne wrote:
> > > > > > > Le jeudi 30 janvier 2025 à 14:46 +0100, Maxime Ripard a écrit :
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I started to review it, but it's probably best to discuss it here.
> > > > > > > >
> > > > > > > > On Thu, Jan 30, 2025 at 01:08:56PM +0000, Florent Tomasin wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > This is a patch series covering the support for protected mode execution in
> > > > > > > > > Mali Panthor CSF kernel driver.
> > > > > > > > >
> > > > > > > > > The Mali CSF GPUs come with the support for protected mode execution at the
> > > > > > > > > HW level. This feature requires two main changes in the kernel driver:
> > > > > > > > >
> > > > > > > > > 1) Configure the GPU with a protected buffer. The system must provide a DMA
> > > > > > > > > heap from which the driver can allocate a protected buffer.
> > > > > > > > > It can be a carved-out memory or dynamically allocated protected memory region.
> > > > > > > > > Some system includes a trusted FW which is in charge of the protected memory.
> > > > > > > > > Since this problem is integration specific, the Mali Panthor CSF kernel
> > > > > > > > > driver must import the protected memory from a device specific exporter.
> > > > > > > >
> > > > > > > > Why do you need a heap for it in the first place? My understanding of
> > > > > > > > your series is that you have a carved out memory region somewhere, and
> > > > > > > > you want to allocate from that carved out memory region your buffers.
> > > > > > > >
> > > > > > > > How is that any different from using a reserved-memory region, adding
> > > > > > > > the reserved-memory property to the GPU device and doing all your
> > > > > > > > allocation through the usual dma_alloc_* API?
> > > > > > >
> > > > > > > How do you then multiplex this region so it can be shared between
> > > > > > > GPU/Camera/Display/Codec drivers and also userspace ?
> > > > > >
> > > > > > You could point all the devices to the same reserved memory region, and
> > > > > > they would all allocate from there, including for their userspace-facing
> > > > > > allocations.
> > > > >
> > > > > I get that using a memory region is somewhat more of a HW description, and
> > > > > aligned with what a DT is supposed to describe. One of the challenges is that
> > > > > the Mediatek heap proposal ends up calling into their TEE, meaning knowing the
> > > > > region is not that useful. You actually need the TEE APP GUID and its IPC
> > > > > protocol. If we can tell drivers to use a heap instead, we can abstract that
> > > > > SoC-specific complexity. I believe each allocated address has to be mapped to
> > > > > a zone, and that can only be done in the secure application. I can imagine
> > > > > similar needs when the protection is done using some sort of VM / hypervisor.
> > > > >
> > > > > Nicolas
> > > > >
> > > >
> > > > The idea in this design is to abstract the heap management from the
> > > > Panthor kernel driver (which consumes a DMA buffer from it).
> > > >
> > > > In a system, an integrator would have implemented a secure heap driver,
> > > > and could be based on TEE or a carved-out memory with restricted access,
> > > > or else. This heap driver would be responsible of implementing the
> > > > logic to: allocate, free, refcount, etc.
> > > >
> > > > The heap would be retrieved by the Panthor kernel driver in order to
> > > > allocate protected memory to load the FW and allow the GPU to enter/exit
> > > > protected mode. This memory would not belong to a user space process.
> > > > The driver allocates it at the time of loading the FW and initialization
> > > > of the GPU HW. This is a device globally owned protected memory.
> > >
> > > The thing is, it's really not clear why you absolutely need to have the
> > > Panthor driver involved there. It won't be transparent to userspace,
> > > since you'd need an extra flag at allocation time, and the buffers
> > > behave differently. If userspace has to be aware of it, what's the
> > > advantage to your approach compared to just exposing a heap for those
> > > secure buffers, and letting userspace allocate its buffers from there?
> >
> > Unless I'm mistaken, the Panthor driver loads its own firmware. Since loading
> > the firmware requires placing the data in a protected memory region, and this
> > aspect has no exposure to userspace, how can Panthor not be implicated?
>
> Right, the very reason we need protected memory early is because some
> FW sections need to be allocated from the protected pool, otherwise the
> TEE will fault as soon as the FW enters the so-called 'protected mode'.
>
> Now, it's not impossible to work around this limitation. For instance,
> we could load the FW without this protected section by default (what we
> do right now), and then provide a DRM_PANTHOR_ENABLE_FW_PROT_MODE
> ioctl that would take a GEM object imported from a dmabuf allocated
> from the protected dma-heap by userspace. We can then reset the FW and
> allow it to operate in protected mode after that point. This approach
> has two downsides though:
>
> 1. We have no way of checking that the memory we're passed is actually
> suitable for FW execution in a protected context. If we're passed
> random memory, this will likely hang the platform as soon as we enter
> protected mode.
>
> 2. If the driver has already booted the FW and exposed a DRI node, we might
> have GPU workloads running, and doing a FW reset might incur a slight
> delay in GPU jobs execution.
>
> I think #1 is a more general issue that applies to suspend buffers
> allocated for GPU contexts too. If we expose ioctls where we take
> protected memory buffers that can possibly lead to crashes if they are
> not real protected memory regions, and we have no way to ensure the
> memory is protected, we probably want to restrict these ioctls/modes to
> some high-privilege CAP_SYS_.
>
> For #2, that's probably something we can live with, since it's a
> one-shot thing. If it becomes an issue, we can even make sure we enable
> the FW protected-mode before the GPU starts being used for real.
>
> This being said, I think the problem applies outside Panthor, and it
> might be that the video codec can't reset the FW/HW block to switch to
> protected mode as easily as Panthor.
Overall, the reset-and-reboot method is pretty ugly in my opinion. But to stick
with the pure rationale, rebooting the SCP on MTK is much harder, since it's not
specific to a single HW/driver.
For other codecs like Samsung MFC, Venus/Iris, Chips&Media, etc., that approach
seems plausible, but we still can't trust the buffer, which to me is not
acceptable.
>
> Note that there's also downsides to the reserved-memory node approach,
> where some bootloader stage would ask the secure FW to reserve a
> portion of mem and pass this through the DT. This sort of things tend to
> be an integration mess, where you need all the pieces of the stack (TEE,
> u-boot, MTK dma-heap driver, gbm, ...) to be at a certain version to
> work properly. If we go the ioctl() way, we restrict the scope to the
> TEE, gbm/mesa and the protected-dma-heap driver, which is still a lot,
> but we've ripped the bootloader out of the equation at least.
>
> Regards,
>
> Boris
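For illustration only: the workaround described above (boot the FW without its
protected section, then hand the driver a protected buffer through a dedicated
ioctl) could be sketched as below. The ioctl name comes from the discussion; the
struct layout, field names and ioctl number are hypothetical and exist nowhere
in the actual Panthor uAPI:

```c
/* Hypothetical uAPI sketch -- not part of the actual Panthor driver.
 * Userspace allocates a buffer from the protected dma-heap, imports the
 * dma-buf as a GEM object, and passes the handle here. The driver then
 * resets the FW and reloads it with its protected section placed in
 * that buffer, enabling protected mode from that point on. */
struct drm_panthor_enable_fw_prot_mode {
	/* GEM handle of a buffer imported from the protected dma-heap */
	__u32 prot_section_handle;
	__u32 pad;
};

#define DRM_IOCTL_PANTHOR_ENABLE_FW_PROT_MODE \
	DRM_IOWR(DRM_COMMAND_BASE + 0x80, struct drm_panthor_enable_fw_prot_mode)
```

Downside #1 still applies: the kernel has no way to verify that the handle
really refers to protected memory, hence the suggestion to gate such an ioctl
behind a privileged capability.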
On Mon, Feb 03, 2025 at 04:43:23PM +0000, Florent Tomasin wrote:
> Hi Maxime, Nicolas
>
> On 30/01/2025 17:47, Nicolas Dufresne wrote:
> > Le jeudi 30 janvier 2025 à 17:38 +0100, Maxime Ripard a écrit :
> >> Hi Nicolas,
> >>
> >> On Thu, Jan 30, 2025 at 10:59:56AM -0500, Nicolas Dufresne wrote:
> >>> Le jeudi 30 janvier 2025 à 14:46 +0100, Maxime Ripard a écrit :
> >>>> Hi,
> >>>>
> >>>> I started to review it, but it's probably best to discuss it here.
> >>>>
> >>>> On Thu, Jan 30, 2025 at 01:08:56PM +0000, Florent Tomasin wrote:
> >>>>> Hi,
> >>>>>
> >>>>> This is a patch series covering the support for protected mode execution in
> >>>>> Mali Panthor CSF kernel driver.
> >>>>>
> >>>>> The Mali CSF GPUs come with the support for protected mode execution at the
> >>>>> HW level. This feature requires two main changes in the kernel driver:
> >>>>>
> >>>>> 1) Configure the GPU with a protected buffer. The system must provide a DMA
> >>>>> heap from which the driver can allocate a protected buffer.
> It can be a carved-out memory region or a dynamically allocated protected
> memory region. Some systems include a trusted FW which is in charge of the
> protected memory.
> >>>>> Since this problem is integration specific, the Mali Panthor CSF kernel
> >>>>> driver must import the protected memory from a device specific exporter.
> >>>>
> >>>> Why do you need a heap for it in the first place? My understanding of
> >>>> your series is that you have a carved out memory region somewhere, and
> >>>> you want to allocate from that carved out memory region your buffers.
> >>>>
> >>>> How is that any different from using a reserved-memory region, adding
> >>>> the reserved-memory property to the GPU device and doing all your
> >>>> allocation through the usual dma_alloc_* API?
> >>>
> >>> How do you then multiplex this region so it can be shared between
> >>> GPU/Camera/Display/Codec drivers and also userspace ?
> >>
> >> You could point all the devices to the same reserved memory region, and
> >> they would all allocate from there, including for their userspace-facing
> >> allocations.
> >
> > I get that using a memory region is somewhat more of a HW description, and
> > aligned with what a DT is supposed to describe. One of the challenges is that
> > the Mediatek heap proposal ends up calling into their TEE, meaning knowing the
> > region is not that useful. You actually need the TEE APP GUID and its IPC
> > protocol. If we can tell drivers to use a heap instead, we can abstract that
> > SoC-specific complexity. I believe each allocated address has to be mapped to
> > a zone, and that can only be done in the secure application. I can imagine
> > similar needs when the protection is done using some sort of a VM / hypervisor.
> >
> > Nicolas
> >
>
> The idea in this design is to abstract the heap management from the
> Panthor kernel driver (which consumes a DMA buffer from it).
>
> In a system, an integrator would have implemented a secure heap driver,
> which could be based on a TEE or a carved-out memory region with
> restricted access, or something else. This heap driver would be
> responsible for implementing the logic to allocate, free, refcount, etc.
>
> The heap would be retrieved by the Panthor kernel driver in order to
> allocate protected memory to load the FW and allow the GPU to enter/exit
> protected mode. This memory would not belong to a user space process.
> The driver allocates it at the time of loading the FW and initialization
> of the GPU HW. This is a device globally owned protected memory.
The thing is, it's really not clear why you absolutely need to have the
Panthor driver involved there. It won't be transparent to userspace,
since you'd need an extra flag at allocation time, and the buffers
behave differently. If userspace has to be aware of it, what's the
advantage to your approach compared to just exposing a heap for those
secure buffers, and letting userspace allocate its buffers from there?
> When I came across this patch series:
> -
> https://lore.kernel.org/lkml/20230911023038.30649-1-yong.wu@mediatek.com/#t
> I found it could help abstract the interface between the secure heap and
> the integration of protected memory in Panthor.
>
> A kernel driver would have to find the heap: `dma_heap_find()`, then
> request allocation of a DMA buffer from it. The heap driver would deal
> with the specificities of the protected memory on the system.
Sure, but we still have to address *why* it would be a good idea for the
driver to do it in the first place. The mediatek series had the same
feedback.
Maxime
Le lundi 03 février 2025 à 16:43 +0000, Florent Tomasin a écrit :
> Hi Maxime, Nicolas
>
> On 30/01/2025 17:47, Nicolas Dufresne wrote:
> > Le jeudi 30 janvier 2025 à 17:38 +0100, Maxime Ripard a écrit :
> > > Hi Nicolas,
> > >
> > > On Thu, Jan 30, 2025 at 10:59:56AM -0500, Nicolas Dufresne wrote:
> > > > Le jeudi 30 janvier 2025 à 14:46 +0100, Maxime Ripard a écrit :
> > > > > Hi,
> > > > >
> > > > > I started to review it, but it's probably best to discuss it here.
> > > > >
> > > > > On Thu, Jan 30, 2025 at 01:08:56PM +0000, Florent Tomasin wrote:
> > > > > > Hi,
> > > > > >
> > > > > > This is a patch series covering the support for protected mode execution in
> > > > > > Mali Panthor CSF kernel driver.
> > > > > >
> > > > > > The Mali CSF GPUs come with the support for protected mode execution at the
> > > > > > HW level. This feature requires two main changes in the kernel driver:
> > > > > >
> > > > > > 1) Configure the GPU with a protected buffer. The system must provide a DMA
> > > > > > heap from which the driver can allocate a protected buffer.
> > > > > > It can be a carved-out memory region or a dynamically allocated
> > > > > > protected memory region. Some systems include a trusted FW which is
> > > > > > in charge of the protected memory.
> > > > > > Since this problem is integration specific, the Mali Panthor CSF kernel
> > > > > > driver must import the protected memory from a device specific exporter.
> > > > >
> > > > > Why do you need a heap for it in the first place? My understanding of
> > > > > your series is that you have a carved out memory region somewhere, and
> > > > > you want to allocate from that carved out memory region your buffers.
> > > > >
> > > > > How is that any different from using a reserved-memory region, adding
> > > > > the reserved-memory property to the GPU device and doing all your
> > > > > allocation through the usual dma_alloc_* API?
> > > >
> > > > How do you then multiplex this region so it can be shared between
> > > > GPU/Camera/Display/Codec drivers and also userspace ?
> > >
> > > You could point all the devices to the same reserved memory region, and
> > > they would all allocate from there, including for their userspace-facing
> > > allocations.
> >
> > I get that using a memory region is somewhat more of a HW description, and
> > aligned with what a DT is supposed to describe. One of the challenges is that
> > the Mediatek heap proposal ends up calling into their TEE, meaning knowing the
> > region is not that useful. You actually need the TEE APP GUID and its IPC
> > protocol. If we can tell drivers to use a heap instead, we can abstract that
> > SoC-specific complexity. I believe each allocated address has to be mapped to
> > a zone, and that can only be done in the secure application. I can imagine
> > similar needs when the protection is done using some sort of a VM / hypervisor.
> >
> > Nicolas
> >
>
> The idea in this design is to abstract the heap management from the
> Panthor kernel driver (which consumes a DMA buffer from it).
>
> In a system, an integrator would have implemented a secure heap driver,
> which could be based on a TEE or a carved-out memory region with
> restricted access, or something else. This heap driver would be
> responsible for implementing the logic to allocate, free, refcount, etc.
>
> The heap would be retrieved by the Panthor kernel driver in order to
> allocate protected memory to load the FW and allow the GPU to enter/exit
> protected mode. This memory would not belong to a user space process.
> The driver allocates it at the time of loading the FW and initialization
> of the GPU HW. This is a device globally owned protected memory.
This use case also applies well to codecs. The Mediatek SCP firmware needs to be
loaded into restricted memory too; it's a very similar scenario, plus Mediatek
chips often include a Mali. On top of that, V4L2 codecs (in general) do need to
allocate internal scratch buffers for the IP to write to, for things like motion
vectors, reconstruction frames, entropy statistics, etc. The IP will only be
able to write if the memory is restricted.
Nicolas
>
> When I came across this patch series:
> -
> https://lore.kernel.org/lkml/20230911023038.30649-1-yong.wu@mediatek.com/#t
> I found it could help abstract the interface between the secure heap and
> the integration of protected memory in Panthor.
>
> A kernel driver would have to find the heap: `dma_heap_find()`, then
> request allocation of a DMA buffer from it. The heap driver would deal
> with the specificities of the protected memory on the system.
>
> > >
> > > > Also, how the secure memory is allocted / obtained is a process that
> > > > can vary a lot between SoC, so implementation details assumption
> > > > should not be coded in the driver.
> > >
> > > But yeah, we agree there, it's also the point I was trying to make :)
> > >
> > > Maxime
> >
>
> Agree with your point, the Panthor kernel driver may not be aware of the
> heap management logic. As an alternative to the DMA heap API used here,
> I also tried to expose the heap by passing the phandle of a "heap"
> device to Panthor. The reference to the DMA heap was stored as a private
> data of the heap device as a new type: `struct dma_heap_import_info`,
> similar to the existing type: `struct dma_heap_export_info`.
> This made me think it could be problematic, as the private data type
> would have to be cast before being accessed from the importer driver. I
> worried about misuse of the types with this approach.
>
> Regards,
> Florent
Hi Florent,
Le lundi 03 février 2025 à 13:36 +0000, Florent Tomasin a écrit :
>
> On 30/01/2025 13:28, Maxime Ripard wrote:
> > Hi,
> >
> > On Thu, Jan 30, 2025 at 01:08:57PM +0000, Florent Tomasin wrote:
> > > Introduce a CMA Heap dt-binding allowing custom
> > > CMA heap registrations.
> > >
> > > * Note to the reviewers:
> > > The patch was used for the development of the protected mode
Just to avoid divergence in nomenclature, and because this is not a new subject,
perhaps you should also adhere to the name "restricted". Both Linaro and
Mediatek have moved from "secure" to that name in their proposals. As you are
the third to propose this (at least among the proposals CCed on linux-media), I
would have expected your cover letter to summarize how the other requirements
have been blended into your proposal.
regards,
Nicolas
> > > feature in Panthor CSF kernel driver and is not initially thought
> > > to land in the Linux kernel. It is mostly relevant if someone
> > > wants to reproduce the environment of testing. Please, raise
> > > interest if you think the patch has value in the Linux kernel.
> > >
> > > Signed-off-by: Florent Tomasin <florent.tomasin(a)arm.com>
> > > ---
> > > .../devicetree/bindings/dma/linux,cma.yml | 43 +++++++++++++++++++
> > > 1 file changed, 43 insertions(+)
> > > create mode 100644 Documentation/devicetree/bindings/dma/linux,cma.yml
> > >
> > > diff --git a/Documentation/devicetree/bindings/dma/linux,cma.yml b/Documentation/devicetree/bindings/dma/linux,cma.yml
> > > new file mode 100644
> > > index 000000000000..c532e016bbe5
> > > --- /dev/null
> > > +++ b/Documentation/devicetree/bindings/dma/linux,cma.yml
> > > @@ -0,0 +1,43 @@
> > > +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> > > +%YAML 1.2
> > > +---
> > > +$id: http://devicetree.org/schemas/dma/linux-cma.yaml#
> > > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > > +
> > > +title: Custom Linux CMA heap
> > > +
> > > +description:
> > > + The custom Linux CMA heap device tree node allows registering
> > > + of multiple CMA heaps.
> > > +
> > > + The CMA heap name will match the node name of the "memory-region".
> > > +
> > > +properties:
> > > + compatible:
> > > + enum:
> > > + - linux,cma
> > > +
> > > + memory-region:
> > > + maxItems: 1
> > > + description: |
> > > + Phandle to the reserved memory node associated with the CMA Heap.
> > > + The reserved memory node must follow this binding convention:
> > > + - Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
> > > +
> > > +examples:
> > > + - |
> > > + reserved-memory {
> > > + #address-cells = <2>;
> > > + #size-cells = <2>;
> > > +
> > > + custom_cma_heap: custom-cma-heap {
> > > + compatible = "shared-dma-pool";
> > > + reg = <0x0 0x90600000 0x0 0x1000000>;
> > > + reusable;
> > > + };
> > > + };
> > > +
> > > + device_cma_heap: device-cma-heap {
> > > + compatible = "linux,cma";
> > > + memory-region = <&custom_cma_heap>;
> > > + };
> >
> > Isn't it redundant with the linux,cma-default shared-dma-pool property
> > already?
> >
> > Maxime
>
> Hi Maxime,
>
> Please correct me if my understanding is wrong,
>
> The existing properties, linux,cma-default and shared-dma-pool, do not
> allow the creation of multiple standalone CMA heaps; they will create
> a single CMA heap: `dma_contiguous_default_area`? Other CMA heaps will
> be bound to a driver.
>
> I introduced "linux,cma" to allow creating multiple standalone CMA
> heaps, with the intention of validating the protected mode support on
> Mali CSF GPUs. It was included in the RFC in case there is interest in
> this approach.
>
> Since the Panthor CSF kernel driver does not own or manage a heap,
> I needed a way to create a standalone heap. The idea here is for the
> kernel driver to be an importer. I relied on a patch series to retrieve
> the heap and allocate a DMA buffer from it:
> - dma_heap_find()
> - dma_heap_buffer_alloc()
> - dma_heap_put()
>
> Ref:
> https://lore.kernel.org/lkml/20230911023038.30649-1-yong.wu@mediatek.com/#t
>
>
> Since the protected/secure memory management is integration specific,
> I needed a generic way for Panthor to allocate from such heap.
>
> In some scenarios it might be carved-out memory, in others a FW will
> reside in the system (TEE) and require a secure heap driver to allocate
> memory (e.g. a similar approach is followed by MTK). Such a driver would
> implement the allocation and free logic.
>
> Florent
>
>
>
I can't see patch #1 in my inbox for some reason, but I already know
what it does from your repository.
Feel free to add Reviewed-by: Christian König <christian.koenig(a)amd.com>
to the entire series.
Regards,
Christian.
Am 31.01.25 um 12:02 schrieb Pierre-Eric Pelloux-Prayer:
> Hi,
>
> The initial goal of this series was to improve the drm and amdgpu
> trace events to be able to expose more of the inner workings of
> the scheduler and drivers to developers via tools.
>
> Then, the series evolved to become focused only on gpu_scheduler.
> The changes around vblank events will be part of a different
> series, as well as the amdgpu ones.
>
> Moreover Sima suggested to make some trace events stable uAPI,
> so tools can rely on them long term.
>
> The first patches extend and cleanup the gpu scheduler events.
>
> The last one adds a documentation entry in drm-uapi.rst.
>
> Changes since v6:
> * Addressed comments from Philipp, Tvrtko, Steven
> * The commit storing drm_client_id in sched_fence adds an extra
> field which looks like a duplicate of the owner field. Currently
> amdgpu uses the owner field with some magic values, so we have to
> have 2 separate fields for now, but ultimately one could be removed.
> Similarly storing the drm_client_id in the sched_entity is not
> really possible as there's nothing preventing a driver to use a
> sched_entity with multiple drm_file instances.
>
>
> Useful links:
> - userspace tool using the updated events:
> https://gitlab.freedesktop.org/tomstdenis/umr/-/merge_requests/37
> - v6:
> https://lists.freedesktop.org/archives/dri-devel/2024-November/477644.html
>
> Pierre-Eric Pelloux-Prayer (7):
> drm/debugfs: output client_id in in drm_clients_info
> drm/sched: store the drm client_id in drm_sched_fence
> drm/sched: add device name to the drm_sched_process_job event
> drm/sched: cleanup gpu_scheduler trace events
> drm/sched: trace dependencies for gpu jobs
> drm/sched: add the drm_client_id to the drm_sched_run/exec_job events
> drm/doc: document some tracepoints as uAPI
>
> Documentation/gpu/drm-uapi.rst | 19 +++
> drivers/accel/amdxdna/aie2_ctx.c | 3 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 3 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 3 +-
> drivers/gpu/drm/drm_debugfs.c | 10 +-
> drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c | 2 +-
> drivers/gpu/drm/imagination/pvr_job.c | 2 +-
> drivers/gpu/drm/imagination/pvr_queue.c | 5 +-
> drivers/gpu/drm/imagination/pvr_queue.h | 2 +-
> drivers/gpu/drm/lima/lima_gem.c | 2 +-
> drivers/gpu/drm/lima/lima_sched.c | 6 +-
> drivers/gpu/drm/lima/lima_sched.h | 3 +-
> drivers/gpu/drm/msm/msm_gem_submit.c | 8 +-
> drivers/gpu/drm/nouveau/nouveau_sched.c | 3 +-
> drivers/gpu/drm/panfrost/panfrost_drv.c | 2 +-
> drivers/gpu/drm/panthor/panthor_drv.c | 3 +-
> drivers/gpu/drm/panthor/panthor_mmu.c | 2 +-
> drivers/gpu/drm/panthor/panthor_sched.c | 5 +-
> drivers/gpu/drm/panthor/panthor_sched.h | 3 +-
> .../gpu/drm/scheduler/gpu_scheduler_trace.h | 123 ++++++++++++++----
> drivers/gpu/drm/scheduler/sched_entity.c | 8 +-
> drivers/gpu/drm/scheduler/sched_fence.c | 4 +-
> drivers/gpu/drm/scheduler/sched_main.c | 8 +-
> drivers/gpu/drm/v3d/v3d_submit.c | 2 +-
> drivers/gpu/drm/xe/xe_sched_job.c | 3 +-
> include/drm/gpu_scheduler.h | 12 +-
> 28 files changed, 192 insertions(+), 64 deletions(-)
>
On Thu, Jan 30, 2025 at 01:08:57PM +0000, Florent Tomasin wrote:
> Introduce a CMA Heap dt-binding allowing custom
> CMA heap registrations.
>
> * Note to the reviewers:
> The patch was used for the development of the protected mode
> feature in Panthor CSF kernel driver and is not initially thought
> to land in the Linux kernel. It is mostly relevant if someone
> wants to reproduce the environment of testing. Please, raise
> interest if you think the patch has value in the Linux kernel.
Why would Panthor need CMA? It has an MMU.
In any case, I agree with Maxime that this is redundant.
Rob
Hi,
I started to review it, but it's probably best to discuss it here.
On Thu, Jan 30, 2025 at 01:08:56PM +0000, Florent Tomasin wrote:
> Hi,
>
> This is a patch series covering the support for protected mode execution in
> Mali Panthor CSF kernel driver.
>
> The Mali CSF GPUs come with the support for protected mode execution at the
> HW level. This feature requires two main changes in the kernel driver:
>
> 1) Configure the GPU with a protected buffer. The system must provide a DMA
> heap from which the driver can allocate a protected buffer.
> It can be a carved-out memory or dynamically allocated protected memory region.
> Some system includes a trusted FW which is in charge of the protected memory.
> Since this problem is integration specific, the Mali Panthor CSF kernel
> driver must import the protected memory from a device specific exporter.
Why do you need a heap for it in the first place? My understanding of
your series is that you have a carved out memory region somewhere, and
you want to allocate from that carved out memory region your buffers.
How is that any different from using a reserved-memory region, adding
the reserved-memory property to the GPU device and doing all your
allocation through the usual dma_alloc_* API?
Or do you expect to have another dma-buf heap that would call into the
firmware to perform the allocations?
The semantics of the CMA heap allocations are a concern too.
Another question is how would you expect something like a secure
video-playback pipeline to operate with that kind of interface. Assuming
you have a secure codec, you would likely get that protected buffer from
the codec, right? So the most likely scenario would be to then import
that dma-buf into the GPU driver, but not allocate the buffer from it.
Overall, I think a better abstraction would be to have a heap indeed to
allocate your protected buffers from, and then import them in the
devices that need them. The responsibility would be on the userspace to
do so, but it already kind of does with your design anyway.
Maxime
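The importer side Maxime sketches (allocate from a heap or codec, then import
into the GPU driver) maps onto the existing in-kernel dma-buf API roughly as
follows. This is an abbreviated sketch with error handling dropped, not code
from any of the drivers discussed:

```c
/* Kernel-side sketch of the "import, don't allocate" model: the driver
 * receives a dma-buf fd from userspace and only attaches/maps it.
 * Error handling trimmed for brevity. */
struct dma_buf *buf = dma_buf_get(fd);             /* fd from userspace */
struct dma_buf_attachment *att = dma_buf_attach(buf, dev);
struct sg_table *sgt = dma_buf_map_attachment(att, DMA_BIDIRECTIONAL);

/* ... program the device with the addresses in sgt ... */

dma_buf_unmap_attachment(att, sgt, DMA_BIDIRECTIONAL);
dma_buf_detach(buf, att);
dma_buf_put(buf);
```

With this model the driver never needs to know how the protected memory was
set aside; only the heap (or the exporting codec) does.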
Hi,
On Thu, Jan 30, 2025 at 01:08:58PM +0000, Florent Tomasin wrote:
> This patch introduces a cma-heap probe function, allowing
> users to register custom cma heaps in the device tree.
>
> A "memory-region" is bound to the cma heap at probe time
> allowing allocation of DMA buffers from that heap.
>
> Use cases:
> - registration of carved out secure heaps. Some devices
> are implementing secure memory by reserving a specific
> memory regions for that purpose. For example, this is the
> case of platforms making use of early version of
> ARM TrustZone.
In such a case, the CMA heap would de-facto become un-mappable for
userspace, right?
> - registration of multiple memory regions at different
> locations for efficiency or HW integration reasons.
> For example, a peripheral may expect to share data at a
> specific location in RAM. This information could have been
> programmed by a FW prior to the kernel boot.
How would you differentiate between them?
Maxime
On 30/01/2025 14:08, Florent Tomasin wrote:
> Allow mali-valhall-csf driver to retrieve a protected
> heap at probe time by passing the name of the heap
> as attribute to the device tree GPU node.
Please wrap the commit message according to the Linux coding style / submission
process (neither too early nor over the limit):
https://elixir.bootlin.com/linux/v6.4-rc1/source/Documentation/process/subm…
Why can this not be passed by phandle, just like all reserved regions?
Where do you take these protected heaps from? Firmware? That would
explain why there is no relation here (no probe ordering, no device links,
nothing connecting the separate devices).
Best regards,
Krzysztof
Even the kerneldoc says that with a zero timeout the function should not
wait for anything, but still return 1 to indicate that the fences are
signaled now.
Unfortunately that isn't what was implemented, instead of only returning
1 we also waited for at least one jiffies.
Fix that by adjusting the handling to what the function is actually
documented to do.
v2: improve code readability
Reported-by: Marek Olšák <marek.olsak(a)amd.com>
Reported-by: Lucas Stach <l.stach(a)pengutronix.de>
Signed-off-by: Christian König <christian.koenig(a)amd.com>
Cc: <stable(a)vger.kernel.org>
---
drivers/dma-buf/dma-resv.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index 5f8d010516f0..c78cdae3deaf 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -684,11 +684,13 @@ long dma_resv_wait_timeout(struct dma_resv *obj, enum dma_resv_usage usage,
dma_resv_iter_begin(&cursor, obj, usage);
dma_resv_for_each_fence_unlocked(&cursor, fence) {
- ret = dma_fence_wait_timeout(fence, intr, ret);
- if (ret <= 0) {
- dma_resv_iter_end(&cursor);
- return ret;
- }
+ ret = dma_fence_wait_timeout(fence, intr, timeout);
+ if (ret <= 0)
+ break;
+
+ /* Even for zero timeout the return value is 1 */
+ if (timeout)
+ timeout = ret;
}
dma_resv_iter_end(&cursor);
--
2.34.1
Even the kerneldoc says that with a zero timeout the function should not
wait for anything, but still return 1 to indicate that the fences are
signaled now.
Unfortunately that isn't what was implemented, instead of only returning
1 we also waited for at least one jiffies.
Fix that by adjusting the handling to what the function is actually
documented to do.
Reported-by: Marek Olšák <marek.olsak(a)amd.com>
Reported-by: Lucas Stach <l.stach(a)pengutronix.de>
Signed-off-by: Christian König <christian.koenig(a)amd.com>
Cc: <stable(a)vger.kernel.org>
---
drivers/dma-buf/dma-resv.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index 5f8d010516f0..ae92d9d2394d 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -684,11 +684,12 @@ long dma_resv_wait_timeout(struct dma_resv *obj, enum dma_resv_usage usage,
dma_resv_iter_begin(&cursor, obj, usage);
dma_resv_for_each_fence_unlocked(&cursor, fence) {
- ret = dma_fence_wait_timeout(fence, intr, ret);
- if (ret <= 0) {
- dma_resv_iter_end(&cursor);
- return ret;
- }
+ ret = dma_fence_wait_timeout(fence, intr, timeout);
+ if (ret <= 0)
+ break;
+
+ /* Even for zero timeout the return value is 1 */
+ timeout = min(timeout, ret);
}
dma_resv_iter_end(&cursor);
--
2.34.1
Am 22.01.25 um 12:04 schrieb Simona Vetter:
> On Tue, Jan 21, 2025 at 01:36:33PM -0400, Jason Gunthorpe wrote:
>> On Tue, Jan 21, 2025 at 05:11:32PM +0100, Simona Vetter wrote:
>>> On Mon, Jan 20, 2025 at 03:48:04PM -0400, Jason Gunthorpe wrote:
>>>> On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote:
>>>>> On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote:
>>>>>> On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:
>>>>>> What is going wrong with your email? You replied to Simona, but
>>>>>> Simona Vetter <simona.vetter(a)ffwll.ch> is dropped from the To/CC
>>>>>> list??? I added the address back, but seems like a weird thing to
>>>>>> happen.
>>>>> Might also be funny mailing list stuff, depending how you get these. I
>>>>> read mails over lore and pretty much ignore cc (unless it's not also on
>>>>> any list, since those tend to be security issues) because I get cc'ed on
>>>>> way too much stuff for that to be a useful signal.
>>>> Oh I see, you are sending a Mail-followup-to header that excludes your
>>>> address, so you don't get any emails at all.. My mutt is dropping you
>>>> as well.
I'm having all kinds of funny phenomena with AMD's mail servers since
coming back from Xmas vacation.
From the news it looks like Outlook on Windows has a new major security
issue where just viewing a mail can compromise the system and my
educated guess is that our IT guys went into panic mode because of this
and has changed something.
>> [SNIP]
>> I have been assuming that dmabuf mmap remains unchanged, that
>> exporters will continue to implement that mmap() callback as today.
That sounds really really good to me because that was my major concern
when you noted that you want to have PFNs to build up KVM page tables.
But you don't want to handle mmap() on your own, you basically don't
want to have a VMA for this stuff at all, correct?
>> My main interest has been what data structure is produced in the
>> attach APIs.
>>
>> Eg today we have a struct dma_buf_attachment that returns a sg_table.
>>
>> I'm expecting some kind of new data structure, lets call it "physical
>> list" that is some efficient coding of meta/addr/len tuples that works
>> well with the new DMA API. Matthew has been calling this thing phyr..
I would not use a data structure at all. Instead we should have
something like an iterator/cursor based approach similar to what the new
DMA API is doing.
>> So, I imagine, struct dma_buf_attachment gaining an optional
>> feature negotiation and then we have in dma_buf_attachment:
>>
>> union {
>> struct sg_table *sgt;
>> struct physical_list *phyr;
>> };
>>
>> That's basicaly it, an alternative to scatterlist that has a clean
>> architecture.
I would rather suggest that something like dma_buf_attachment() takes the
offset and size to map and returns a cursor object you can use to get the
address, length and access attributes.
And then you can iterate over this cursor and fill in your importer data
structure with the necessary information.
This way neither the exporter nor the importer need to convert their
data back and forth between their specific representations of the
information.
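To make the idea concrete, such a cursor interface might look something like the
sketch below; every name here is hypothetical, as nothing like this exists in
the dma-buf API today:

```c
/* Hypothetical interface sketch -- mirrors the iterator style of the
 * new DMA API: the exporter hands out a cursor instead of a
 * materialized sg_table, and the importer walks it to fill its own
 * data structure. */
struct dma_buf_map_cursor {
	dma_addr_t addr;   /* current chunk bus/DMA address */
	size_t len;        /* current chunk length */
	u32 attrs;         /* access attributes (cachable, encrypted, ...) */
	/* exporter-private iteration state follows */
};

/* Exporter provides begin/next/end; the importer loops: */
dma_buf_attach_map_begin(attach, offset, size, &cur);
while (dma_buf_attach_map_next(&cur)) {
	/* copy cur.addr/cur.len/cur.attrs into the importer's own
	 * representation (IOMMU page tables, KVM memslots, ...) */
}
dma_buf_attach_map_end(&cur);
```

Neither side ever builds an intermediate list, which is the point Christian
makes above about avoiding back-and-forth conversions.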
>> Now, if you are asking if the current dmabuf mmap callback can be
>> improved with the above? Maybe? phyr should have the neccessary
>> information inside it to populate a VMA - eventually even fully
>> correctly with all the right cachable/encrypted/forbidden/etc flags.
That won't work like this.
See the exporter needs to be informed about page faults on the VMA to
eventually wait for operations to end and sync caches.
Otherwise we either potentially allow access to freed up or re-used
memory or run into issues with device cache coherency.
>> So, you could imagine that exporters could just have one routine to
>> generate the phyr list and that goes into the attachment, goes into
>> some common code to fill VMA PTEs, and some other common code that
>> will convert it into the DMABUF scatterlist. If performance is not a
>> concern with these data structure conversions it could be an appealing
>> simplification.
>>
>> And yes, I could imagine the meta information being descriptive enough
>> to support the private interconnect cases, the common code could
>> detect private meta information and just cleanly fail.
> I'm kinda leaning towards entirely separate dma-buf interfaces for the new
> phyr stuff, because I fear that adding that to the existing ones will only
> make the chaos worse. But that aside sounds all reasonable, and even that
> could just be too much worry on my side and mixing phyr into existing
> attachments (with a pile of importer/exporter flags probably) is fine.
I lean in the other direction.
Dmitry and Thomas have done a really good job at cleaning up all the
interaction between dynamic and static exporters / importers.
Especially that we now have consistent locking for map_dma_buf() and
unmap_dma_buf() should make that transition rather straight forward.
> For the existing dma-buf importers/exporters I'm kinda hoping for a pure
> dma_addr_t based list eventually. Going all the way to a phyr based
> approach for everyone might be too much churn, there's some real bad cruft
> there. It's not going to work for every case, but it covers a lot of them
> and might be less pain for existing importers.
The point is we have use cases that won't work without exchanging DMA
addresses any more.
For example we have cases with multiple devices are in the same IOMMU
domain and re-using their DMA address mappings.
> But in theory it should be possible to use phyr everywhere eventually, as
> long as there's no obviously api-rules-breaking way to go from a phyr back to
> a struct page even when that exists.
I would rather say we should stick to DMA addresses as much as possible.
What we can do is add an address space description to the addresses,
e.g. whether it's a PCIe bus addr in IOMMU domain X, a device-private
bus addr, or, in the case of sharing with iommufd and KVM, PFNs.
Regards,
Christian.
>>> At least the device mapping / dma_buf_attachment
>>> side should be doable with just the pfn and the new dma-api?
>> Yes, that would be my first goal post. Figure out some meta
>> information and a container data structure that allows struct
>> page-less P2P mapping through the new DMA API.
>>
>>>> I'm hoping we can get to something where we describe not just how the
>>>> pfns should be DMA mapped, but also can describe how they should be
>>>> CPU mapped. For instance that this PFN space is always mapped
>>>> uncachable, in CPU and in IOMMU.
>>> I was pondering whether dma_mmap and friends would be a good place to
>>> prototype this and go for a fully generic implementation. But then even
>>> those have _wc/_uncached variants.
>> Given that the inability to correctly DMA map P2P MMIO without struct
>> page is a current pain point and current source of hacks in dmabuf
>> exporters, I wanted to make resolving that a priority.
>>
>> However, if you mean what I described above for "fully generic [dmabuf
>> mmap] implementation", then we'd have the phyr datastructure as a
>> dependency to attempt that work.
>>
>> phyr, and particularly the meta information, has a number of
>> stakeholders. I was thinking of going first with rdma's memory
>> registration flow because we are now pretty close to being able to do
>> such a big change, and it can demonstrate most of the requirements.
>>
>> But that doesn't mean mmap couldn't go concurrently on the same agreed
>> datastructure if people are interested.
> Yeah cpu mmap needs a lot more, going with a very limited p2p use-case
> first only makes sense.
>
>>>> We also have current bugs in the iommu/vfio side where we are fudging
>>>> CC stuff, like assuming CPU memory is encrypted (not always true) and
>>>> that MMIO is non-encrypted (not always true)
>>> tbf CC pte flags I just don't grok at all. I've once tried to understand
>>> what current exporters and gpu drivers do and just gave up. But that's
>>> also a bit why I'm worried here because it's an enigma to me.
>> For CC, the secure world keeps track of whether each PFN
>> inside the VM is 'encrypted' or not. Any VM PTE (including the IOPTEs)
>> pointing at the PFN must match the secure world's view of
>> 'encrypted'. The VM can ask the secure world to change its view at
>> runtime.
>>
>> The way CC has been bolted on to the kernel so far largely hides this
>> from drivers, so it is hard to tell in driver code whether the PFN you
>> have is 'encrypted' or not. Right now the general rule (that is not
>> always true) is that struct page CPU memory is encrypted and
>> everything else is decrypted.
>>
>> So right now, you can mostly ignore it and the above assumption
>> largely happens for you transparently.
>>
>> However, soon we will have encrypted P2P MMIO which will stress this
>> hiding strategy.
> It's already breaking with stuff like virtual gpu drivers, vmwgfx is
> fiddling around with these bits (at least last I tried to understand this
> all) and I think a few others do too.
>
>>>>> I thought iommuv2 (or whatever linux calls these) has full fault support
>>>>> and could support current move semantics. But yeah for iommu without
>>>>> fault support we need some kind of pin or a newly formalized revoke model.
>>>> No, this is HW dependent, including PCI device, and I'm aware of no HW
>>>> that fully implements this in a way that could be useful to implement
>>>> arbitrary move semantics for VFIO..
>>> Hm I thought we've had at least prototypes floating around of device fault
>>> repair, but I guess that only works with ATS/pasid stuff and not general
>>> iommu traffic from devices. Definitely needs some device cooperation since
>>> the timeouts of a full fault are almost endless.
>> Yes, exactly. What all real devices I'm aware have done is make a
>> subset of their traffic work with ATS and PRI, but not all their
>> traffic. Without *all* traffic you can't make any generic assumption
>> in the iommu that a transient non-present won't be fatal to the
>> device.
>>
>> Stuff like dmabuf move semantics rely on transient non-present being
>> non-disruptive...
> Ah now I get it, at the iommu level you have to pessimistically assume
> whether a device can handle a fault, and none can for all traffic. I was
> thinking too much about the driver level where generally the dma-bufs you
> import are only used for the subset of device functions that can cope
> with faults on many devices.
>
> Cheers, Sima
On Thu, Jan 23, 2025 at 03:41:58PM +0800, Xu Yilun wrote:
> I don't have a complete idea yet. But the goal is not to make any
> existing driver seamlessly work with secure device. It is to provide a
> generic way for bind/attestation/accept, and may save driver's effort
> if they don't care about this startup process. There are plenty of
> operations that a driver can't do to a secure device, FLR is one of
> them. The TDISP SPEC has described some general rules but some are even
> device specific.
You can FLR a secure device; it just has to be re-secured and
re-attested afterwards. Otherwise no VFIO for you.
> So I think a driver (including VFIO) expects change to support trusted
> device, but may not have to cover bind/attestation/accept flow.
I expect changes, but not fundamental ones. VFIO will still have to
FLR devices as part of its security architecture.
The entire flow needs to have options for drivers to be involved in
the flow, somehow.
Jason
On Wed, Jan 22, 2025 at 12:04:19PM +0100, Simona Vetter wrote:
> I'm kinda leaning towards entirely separate dma-buf interfaces for the new
> phyr stuff, because I fear that adding that to the existing ones will only
> make the chaos worse.
Let's see when some patches come up; if importers have to deal with
several formats, a single attach interface would probably be simpler
for them.
> For the existing dma-buf importers/exporters I'm kinda hoping for a pure
> dma_addr_t based list eventually.
IMHO the least churn would be to have the old importers call a helper
(perhaps implicitly in the core code) to convert phyr into a dma
mapped scatterlist and then just keep using their existing code
exactly as is.
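A strawman of such a compatibility helper: flatten a phyr-style range
list into a scatterlist-like array so unconverted importers keep
running their existing code. The struct and function names below are
illustrative only, not existing kernel API, and userspace C stands in
for the real scatterlist machinery:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical phyr entry: an already dma-mapped address range. */
struct phyr_range {
	uint64_t dma_addr;
	uint32_t len;
};

/* Minimal stand-in for a dma-mapped scatterlist entry. */
struct sg_entry {
	uint64_t dma_address;
	uint32_t length;
};

/*
 * Convert a phyr range list into the legacy scatterlist layout.
 * In the kernel this could live in dma-buf core code and be invoked
 * implicitly for importers that have not been converted yet.
 */
static struct sg_entry *phyr_to_sg(const struct phyr_range *ranges, size_t n)
{
	struct sg_entry *sg = calloc(n, sizeof(*sg));
	size_t i;

	if (!sg)
		return NULL;
	for (i = 0; i < n; i++) {
		sg[i].dma_address = ranges[i].dma_addr;
		sg[i].length = ranges[i].len;
	}
	return sg;
}
```

The cost of this path is exactly the extra allocation and copy mentioned
below; converted importers would consume the range list directly.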
Performance improvements would come from importers switching to use the
new DMA API internally and avoiding the scatterlist allocation, but
even the extra allocation is not so bad.
Then, and I really hesitate to say this, but perhaps to ease
the migration we could store a dma mapped list in a phyr using the
meta information. That would allow a dmabuf scatterlist exporter to be
converted to a phyr with a helper. The importer would have to have a
special path to detect the dma mapped mode and skip the normal
mapping.
The point would be to let maintained drivers use the new data
structures and flows easily and still interoperate with older
drivers. Otherwise we have to somehow build parallel scatterlist and
phyr code flows into importers :\
> Going all the way to a phyr based
> approach for everyone might be too much churn, there's some real bad cruft
> there. It's not going to work for every case, but it covers a lot of them
> and might be less pain for existing importers.
I'd hope we can reach every case, but I don't know what kind of
horrors you guys have :)
> But in theory it should be possible to use phyr everywhere eventually, as
> long as there's no obviously api-rules-breaking way to go from a phyr back to
> a struct page even when that exists.
I'd say getting a struct page is a perfectly safe operation, from a
phyr API perspective.
The dmabuf issue seems to be entirely about following the locking
rules, an importer is not allowed to get a struct page and switch from
reservation locking to page refcount locking.
I get the appeal for DRM of blocking struct page use because that
directly prevents the above. But, in RDMA I want to re-use the same
phyr datastructure for tracking pin_user_pages() CPU memory, and I
must get the struct page back so I can put_user_page() it when done.
Perhaps we can find some compromise where the phyr data structure has
some kind of flag 'disable struct page' and then the
phyr_entry_to_page() API will WARN_ON() or something like that.
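A strawman of that compromise, with every name hypothetical and
userspace stand-ins for WARN_ON() and pfn_to_page():

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Exporters that must not let importers fall back to page-refcount
 * locking (e.g. DRM, with its reservation-lock rules) set this flag;
 * pin_user_pages()-backed phyrs for RDMA leave it clear so the
 * struct page can be recovered for unpinning.
 */
#define PHYR_FLAG_NO_STRUCT_PAGE 0x1u

struct phyr {
	uint64_t pfn;
	uint32_t flags;
};

static void *phyr_entry_to_page(const struct phyr *p)
{
	if (p->flags & PHYR_FLAG_NO_STRUCT_PAGE) {
		/* kernel code would WARN_ON() here */
		fprintf(stderr, "WARN: struct page access disallowed\n");
		return NULL;
	}
	/* stand-in for pfn_to_page() */
	return (void *)(uintptr_t)(p->pfn << 12);
}
```

That way both camps share one data structure: the flag enforces DRM's
locking discipline while keeping the RDMA unpin path possible.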
> > However, soon we will have encrypted P2P MMIO which will stress this
> > hiding strategy.
>
> It's already breaking with stuff like virtual gpu drivers, vmwgfx is
> fiddling around with these bits (at least last I tried to understand this
> all) and I think a few others do too.
IMHO, most places I've seen touching this outside arch code feel very
hacky. :\
Jason
On Wed, Jan 22, 2025 at 12:32:56PM +0800, Xu Yilun wrote:
> On Tue, Jan 21, 2025 at 01:43:03PM -0400, Jason Gunthorpe wrote:
> > On Tue, Jun 25, 2024 at 05:12:10AM +0800, Xu Yilun wrote:
> >
> > > When VFIO works as a TEE user in VM, it means an attester (e.g. PCI
> > > subsystem) has already moved the device to RUN state. So VFIO & DPDK
> > > are all TEE users, no need to manipulate TDISP state between them.
> > > AFAICS, this is the most preferred TIO usage in CoCo-VM.
> >
> > No, unfortunately. Part of the motivation to have the devices be
> > unlocked when the VM starts is because there is an expectation that a
> > driver in the VM will need to do untrusted operations to boot up the
>
> I assume these operations are device specific.
Yes
> > device before it can be switched to the run state.
> >
> > So any vfio use case needs to imagine that VFIO starts with an
> > untrusted device, does stuff to it, then pushes everything through to
>
> I have concerns that VFIO has to do device specific stuff. Our current
> expectation is that a specific device driver deals with the untrusted
> operations, then the user writes a 'bind' device sysfs node which
> detaches the driver used for the untrusted phase, does the attestation
> and accept, and then tries to match a driver for the trusted phase
> (e.g. VFIO).
I don't see this as working, VFIO will FLR the device which will
destroy anything that was done prior.
VFIO itself has to do the sequence and the VFIO userspace has to
contain the device specific stuff.
The bind/unbind dance for untrusted->trusted would need to be
internalized in VFIO without unbinding. The main motivation for the
bind/unbind flow was to manage the DMA API, which VFIO does not use.
Jason
On Tue, Jun 25, 2024 at 05:12:10AM +0800, Xu Yilun wrote:
> When VFIO works as a TEE user in VM, it means an attester (e.g. PCI
> subsystem) has already moved the device to RUN state. So VFIO & DPDK
> are all TEE users, no need to manipulate TDISP state between them.
> AFAICS, this is the most preferred TIO usage in CoCo-VM.
No, unfortunately. Part of the motivation to have the devices be
unlocked when the VM starts is because there is an expectation that a
driver in the VM will need to do untrusted operations to boot up the
device before it can be switched to the run state.
So any vfio use case needs to imagine that VFIO starts with an
untrusted device, does stuff to it, then pushes everything through to
run. The exact mirror as what a kernel driver should be able to do.
How exactly all this very complex stuff works, I have no idea, but
this is what I've understood is the target. :\
Jason