On Tue, Jan 14, 2025 at 03:44:04PM +0100, Simona Vetter wrote:
> E.g. if a compositor gets a dma-buf it assumes that by just binding that
> it will not risk gpu context destruction (unless you're out of memory and
> everything is on fire anyway, and it's ok to die). But if a nasty client
> app supplies a revocable dma-buf, then it can shot down the higher
> priviledged compositor gpu workload with precision. Which is not great, so
> maybe existing dynamic gpu importers should reject revocable dma-buf.
> That's at least what I had in mind as a potential issue.
I see, so it is not that they can't handle a non-present fault it is
just that the non-present effectively turns into a crash of the
context and you want to avoid the crash. It makes sense to me to
negotiate this as part of the API.
> > This is similar to the structure BIO has, and it composes nicely with
> > a future pin_user_pages() and memfd_pin_folios().
>
> Since you mention pin here, I think that's another aspect of the revocable
> vs dynamic question. Dynamic buffers are expected to sometimes just move
> around for no reason, and importers must be able to cope.
Yes, and we have importers that can tolerate dynamic and those that
can't. Though those that can't tolerate it can often implement revoke.
I view your list as a cascade:
1) Fully pinned can never be changed so long as the attach is present
2) Fully pinned, but can be revoked. Revoked is a fatal condition and
the importer is allowed to experience an error
3) Fully dynamic and always present. Support for move, and
restartable fault, is required
Today in RDMA we ask the exporter if it is 1 or 3 and allow different
things. I've seen the GPU side start to offer 1 more often as it has
significant performance wins.
> For recovable exporters/importers I'd expect that movement is not
> happening, meaning it's pinned until the single terminal revocation. And
> maybe I read the kvm stuff wrong, but it reads more like the latter to me
> when crawling through the pfn code.
kvm should be fully faultable and it should be able handle move. It
handles move today using the mmu notifiers after all.
kvm would need to interact with the dmabuf reservations on its page
fault path.
iommufd cannot be faultable and it would only support revoke. For VFIO
revoke would not be fully terminal as VFIO can unrevoke too
(sigh). If we make revoke special I'd like to eventually include
unrevoke for this reason.
> Once we have the lifetime rules nailed then there's the other issue of how
> to describe the memory, and my take for that is that once the dma-api has
> a clear answer we'll just blindly adopt that one and done.
This is what I hope, we are not there yet, first Leon's series needs
to get merged then we can start on making the DMA API P2P safe without
any struct page. From there it should be clear what direction things
go in.
DMABUF would return pfns annotated with whatever matches the DMA API,
and the importer would be able to inspect the PFNs to learn
information like their P2Pness, CPU mappability or whatever.
I'm pushing for the extra struct, and Christoph has been thinking
about searching a maple tree on the PFN. We need to see what works best.
> And currently with either dynamic attachments and dma_addr_t or through
> fishing the pfn from the cpu pagetables there's some very clearly defined
> lifetime and locking rules (which kvm might get wrong, I've seen some
> discussions fly by where it wasn't doing a perfect job with reflecting pte
> changes, but that was about access attributes iirc).
Wouldn't surprise me, mmu notifiers are very complex all around. We've
had bugs already where the mm doesn't signal the notifiers at the
right points.
> If we add something
> new, we need clear rules and not just "here's the kvm code that uses it".
> That's how we've done dma-buf at first, and it was a terrible mess of
> mismatched expecations.
Yes, that would be wrong. It should be self defined within dmabuf and
kvm should adopt to it, move semantics and all.
My general desire is to move all of RDMA's MR process away from
scatterlist and work using only the new DMA API. This will save *huge*
amounts of memory in common workloads and be the basis for non-struct
page DMA support, including P2P.
Jason
On Tue, Jun 18, 2024 at 07:28:43AM +0800, Xu Yilun wrote:
> > is needed so the secure world can prepare anything it needs prior to
> > starting the VM.
>
> OK. From Dan's patchset there are some touch point for vendor tsm
> drivers to do secure world preparation. e.g. pci_tsm_ops::probe().
>
> Maybe we could move to Dan's thread for discussion.
>
> https://lore.kernel.org/linux-coco/173343739517.1074769.1313478654854592548…
I think Dan's series is different, any uapi from that series should
not be used in the VMM case. We need proper vfio APIs for the VMM to
use. I would expect VFIO to be calling some of that infrastructure.
Really, I don't see a clear sense of how this will look yet. AMD
provided some patches along these lines, I have not seem ARM and Intel
proposals yet, not do I sense there is alignment.
> > Setting up secure vIOMMU emulation, for instance. I
>
> I think this could be done at VM late bind time.
The vIOMMU needs to be setup before the VM boots
> > secure. This should all be pre-arranged as possible before starting
>
> But our current implementation is not to prepare as much as possible,
> but only necessary, so most of the secure work for vPCI function is done
> at late bind time.
That's fine too, but both options need to be valid.
Jason
On Sat, Jan 11, 2025 at 11:48:06AM +0800, Xu Yilun wrote:
> > > > can be sure what is the correct UAPI. In other words, make the
> > > > VFIO device into a CC device should also prevent mmaping it and so on.
> > >
> > > My idea is prevent mmaping first, then allow VFIO device into CC dev (TDI).
> >
> > I think you need to start the TDI process much earlier. Some arches
> > are going to need work to prepare the TDI before the VM is started.
>
> Could you elaborate more on that? AFAICS Intel & AMD are all good on
> "late bind", but not sure for other architectures.
I'm not sure about this, the topic has been confused a bit, and people
often seem to misunderstand what the full scenario actually is. :\
What I'm talking abou there is that you will tell the secure world to
create vPCI function that has the potential to be secure "TDI run"
down the road. The VM will decide when it reaches the run state. This
is needed so the secure world can prepare anything it needs prior to
starting the VM. Setting up secure vIOMMU emulation, for instance. I
expect ARM will need this, I'd be surprised if AMD actually doesn't in
the full scenario with secure viommu.
It should not be a surprise to the secure world after the VM has
started that suddenly it learns about a vPCI function that wants to be
secure. This should all be pre-arranged as possible before starting
the VM, even if alot of steps happen after the VM starts running (or
maybe don't happen at all).
Jason
On Fri, Jan 10, 2025 at 08:34:55PM +0100, Simona Vetter wrote:
> So if I'm getting this right, what you need from a functional pov is a
> dma_buf_tdx_mmap? Because due to tdx restrictions, the normal dma_buf_mmap
> is not going to work I guess?
Don't want something TDX specific!
There is a general desire, and CC is one, but there are other
motivations like performance, to stop using VMAs and mmaps as a way to
exchanage memory between two entities. Instead we want to use FDs.
We now have memfd and guestmemfd that are usable with
memfd_pin_folios() - this covers pinnable CPU memory.
And for a long time we had DMABUF which is for all the other wild
stuff, and it supports movable memory too.
So, the normal DMABUF semantics with reservation locking and move
notifiers seem workable to me here. They are broadly similar enough to
the mmu notifier locking that they can serve the same job of updating
page tables.
> Also another thing that's a bit tricky is that kvm kinda has a 3rd dma-buf
> memory model:
> - permanently pinned dma-buf, they never move
> - dynamic dma-buf, they move through ->move_notify and importers can remap
> - revocable dma-buf, which thus far only exist for pci mmio resources
I would like to see the importers be able to discover which one is
going to be used, because we have RDMA cases where we can support 1
and 3 but not 2.
revocable doesn't require page faulting as it is a terminal condition.
> Since we're leaning even more on that 3rd model I'm wondering whether we
> should make it something official. Because the existing dynamic importers
> do very much assume that re-acquiring the memory after move_notify will
> work. But for the revocable use-case the entire point is that it will
> never work.
> I feel like that's a concept we need to make explicit, so that dynamic
> importers can reject such memory if necessary.
It strikes me as strange that HW can do page faulting, so it can
support #2, but it can't handle a non-present fault?
> So yeah there's a bunch of tricky lifetime questions that need to be
> sorted out with proper design I think, and the current "let's just use pfn
> directly" proposal hides them all under the rug.
I don't think these two things are connected. The lifetime model that
KVM needs to work with the EPT, and that VFIO needs for it's MMIO,
definately should be reviewed and evaluated.
But it is completely orthogonal to allowing iommufd and kvm to access
the CPU PFN to use in their mapping flows, instead of the
dma_addr_t.
What I want to get to is a replacement for scatter list in DMABUF that
is an array of arrays, roughly like:
struct memory_chunks {
struct memory_p2p_provider *provider;
struct bio_vec addrs[];
};
int (*dmabuf_get_memory)(struct memory_chunks **chunks, size_t *num_chunks);
This can represent all forms of memory: P2P, private, CPU, etc and
would be efficient with the new DMA API.
This is similar to the structure BIO has, and it composes nicely with
a future pin_user_pages() and memfd_pin_folios().
Jason
On Fri, Jan 10, 2025 at 08:24:22PM +0100, Simona Vetter wrote:
> On Thu, Jan 09, 2025 at 01:56:02AM +0800, Xu Yilun wrote:
> > > > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > > > exporter mapping is possible
> > > >
> > > > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > > > confirm that this isn't true.
> > >
> > > Today they are mmaping the dma-buf into a VMA and then using KVM's
> > > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> > > dma-buf must have a CPU PFN.
> >
> > Yes, the final target for KVM is still the CPU PFN, just with the help
> > of CPU mapping table.
> >
> > I also found the xen gntdev-dmabuf is calculating pfn from mapped
> > sgt.
>
> See the comment, it's ok because it's a fake device with fake iommu and
> the xen code has special knowledge to peek behind the curtain.
/*
* Now convert sgt to array of gfns without accessing underlying pages.
* It is not allowed to access the underlying struct page of an sg table
* exported by DMA-buf, but since we deal with special Xen dma device here
* (not a normal physical one) look at the dma addresses in the sg table
* and then calculate gfns directly from them.
*/
for_each_sgtable_dma_page(sgt, &sg_iter, 0) {
dma_addr_t addr = sg_page_iter_dma_address(&sg_iter);
unsigned long pfn = bfn_to_pfn(XEN_PFN_DOWN(dma_to_phys(dev, addr)));
*barf*
Can we please all agree that is a horrible abuse of the DMA API and
lets not point it as some acceptable "solution"? KVM and iommufd do
not have fake struct devices with fake iommus.
Jason
On Fri, Jan 10, 2025 at 12:40:28AM +0800, Xu Yilun wrote:
> So then we face with the shared <-> private device conversion in CoCo VM,
> and in turn shared <-> private MMIO conversion. MMIO region has only one
> physical backend so it is a bit like in-place conversion which is
> complicated. I wanna simply the MMIO conversion routine based on the fact
> that VMM never needs to access assigned MMIO for feature emulation, so
> always disallow userspace MMIO mapping during the whole lifecycle. That's
> why the flag is introduced.
The VMM can simply not map it if for these cases. As part of the TDI
flow the kernel can validate it is not mapped.
> > can be sure what is the correct UAPI. In other words, make the
> > VFIO device into a CC device should also prevent mmaping it and so on.
>
> My idea is prevent mmaping first, then allow VFIO device into CC dev (TDI).
I think you need to start the TDI process much earlier. Some arches
are going to need work to prepare the TDI before the VM is started.
The other issue here is that Intel is somewhat different from others
and when we build uapi for TDI it has to accommodate everyone.
> Yes. It carries out the idea of "KVM maps MMIO resources without firstly
> mapping into the host" even for normal VM. That's why I think it could
> be an independent patchset.
Yes, just remove this patch and other TDI focused stuff. Just
infrastructure to move to FD based mapping instead of VMA.
Jason
On Thu, Jan 09, 2025 at 12:57:58AM +0800, Xu Yilun wrote:
> On Wed, Jan 08, 2025 at 09:30:26AM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 07, 2025 at 10:27:15PM +0800, Xu Yilun wrote:
> > > Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as
> > > for private assignment. For these private assigned devices, disallow
> > > host accessing their MMIO resources.
> >
> > Why? Shouldn't the VMM simply not call mmap? Why does the kernel have
> > to enforce this?
>
> MM.. maybe I should not say 'host', instead 'userspace'.
>
> I think the kernel part VMM (KVM) has the responsibility to enforce the
> correct behavior of the userspace part VMM (QEMU). QEMU has no way to
> touch private memory/MMIO intentionally or accidently. IIUC that's one
> of the initiative guest_memfd is introduced for private memory. Private
> MMIO follows.
Okay, but then why is it a flag like that? I'm expecting a much
broader system here to make the VFIO device into a confidential device
(like setup the TDI) where we'd have to enforce the private things,
communicate with some secure world to assign it, and so on.
I want to see a fuller solution to the CC problem in VFIO before we
can be sure what is the correct UAPI. In other words, make the
VFIO device into a CC device should also prevent mmaping it and so on.
So, I would take this out and defer VFIO enforcment to a series which
does fuller CC enablement of VFIO.
The precursor work should just be avoiding requiring a VMA when
installing VFIO MMIO into the KVM and IOMMU stage 2 mappings. Ie by
using a FD to get the CPU pfns into iommufd and kvm as you are
showing.
This works just fine for non-CC devices anyhow and is the necessary
building block for making a TDI interface in VFIO.
Jason
Am 09.01.25 um 14:37 schrieb Philipp Stanner:
> From: Philipp Stanner <pstanner(a)redhat.com>
>
> drm_sched_backend_ops.run_job() returns a dma_fence for the scheduler.
> That fence is signalled by the driver once the hardware completed the
> associated job. The scheduler does not increment the reference count on
> that fence, but implicitly expects to inherit this fence from run_job().
>
> This is relatively subtle and prone to misunderstandings.
>
> This implies that, to keep a reference for itself, a driver needs to
> call dma_fence_get() in addition to dma_fence_init() in that callback.
>
> It's further complicated by the fact that the scheduler even decrements
> the refcount in drm_sched_run_job_work() since it created a new
> reference in drm_sched_fence_scheduled(). It does, however, still use
> its pointer to the fence after calling dma_fence_put() - which is safe
> because of the aforementioned new reference, but actually still violates
> the refcounting rules.
>
> Improve the explanatory comment for that decrement.
>
> Move the call to dma_fence_put() to the position behind the last usage
> of the fence.
>
> Document the necessity to increment the reference count in
> drm_sched_backend_ops.run_job().
>
> Signed-off-by: Philipp Stanner <pstanner(a)redhat.com>
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 10 +++++++---
> include/drm/gpu_scheduler.h | 19 +++++++++++++++----
> 2 files changed, 22 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 57da84908752..5f46c01eb01e 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1218,15 +1218,19 @@ static void drm_sched_run_job_work(struct work_struct *w)
> drm_sched_fence_scheduled(s_fence, fence);
>
> if (!IS_ERR_OR_NULL(fence)) {
> - /* Drop for original kref_init of the fence */
> - dma_fence_put(fence);
> -
> r = dma_fence_add_callback(fence, &sched_job->cb,
> drm_sched_job_done_cb);
> if (r == -ENOENT)
> drm_sched_job_done(sched_job, fence->error);
> else if (r)
> DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r);
> +
> + /*
> + * s_fence took a new reference to fence in the call to
> + * drm_sched_fence_scheduled() above. The reference passed by
> + * run_job() above is now not needed any longer. Drop it.
> + */
> + dma_fence_put(fence);
> } else {
> drm_sched_job_done(sched_job, IS_ERR(fence) ?
> PTR_ERR(fence) : 0);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 95e17504e46a..d5cd2a78f27c 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -420,10 +420,21 @@ struct drm_sched_backend_ops {
> struct drm_sched_entity *s_entity);
>
> /**
> - * @run_job: Called to execute the job once all of the dependencies
> - * have been resolved. This may be called multiple times, if
> - * timedout_job() has happened and drm_sched_job_recovery()
> - * decides to try it again.
> + * @run_job: Called to execute the job once all of the dependencies
> + * have been resolved. This may be called multiple times, if
> + * timedout_job() has happened and drm_sched_job_recovery() decides to
> + * try it again.
I just came to realize that this hack with calling run_job multiple
times won't work any more with this patch here.
The background is that you can't allocate memory for a newly returned
fence and as far as I know no driver pre allocates multiple HW fences
for a job.
So at least amdgpu used to re-use the same HW fence as return over and
over again, just re-initialized the reference count. I removed that hack
from amdgpu, but just FYI it could be that other driver did the same.
Apart from that concern I think that this patch is really the right
thing and that driver hacks relying on the order of dropping references
are fundamentally broken approaches.
So feel free to add Reviewed-by: Christian König <christian.koenig(a)amd.com>.
Regards,
Christian.
> + *
> + * @sched_job: the job to run
> + *
> + * Returns: dma_fence the driver must signal once the hardware has
> + * completed the job ("hardware fence").
> + *
> + * Note that the scheduler expects to 'inherit' its own reference to
> + * this fence from the callback. It does not invoke an extra
> + * dma_fence_get() on it. Consequently, this callback must take a
> + * reference for the scheduler, and additional ones for the driver's
> + * respective needs.
> */
> struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>
Am 07.01.25 um 15:27 schrieb Xu Yilun:
> Introduce a new API for dma-buf importer, also add a dma_buf_ops
> callback for dma-buf exporter. This API is for subsystem importers who
> map the dma-buf to some user defined address space, e.g. for IOMMUFD to
> map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to
> map the dma-buf to GPA via KVM MMU (e.g. EPT).
>
> Currently dma-buf is only used to get DMA address for device's default
> domain by using kernel DMA APIs. But for these new use-cases, importers
> only need the pfn of the dma-buf resource to build their own mapping
> tables.
As far as I can see I have to fundamentally reject this whole approach.
It's intentional DMA-buf design that we don't expose struct pages nor
PFNs to the importer. Essentially DMA-buf only transports DMA addresses.
In other words the mapping is done by the exporter and *not* the importer.
What we certainly can do is to annotate those DMA addresses to a better
specify in which domain they are applicable, e.g. if they are PCIe bus
addresses or some inter device bus addresses etc...
But moving the functionality to map the pages/PFNs to DMA addresses into
the importer is an absolutely clear NO-GO.
Regards,
Christian.
> So the map_dma_buf() callback is not mandatory for exporters
> anymore. Also the importers could choose not to provide
> struct device *dev on dma_buf_attach() if they don't call
> dma_buf_map_attachment().
>
> Like dma_buf_map_attachment(), the importer should firstly call
> dma_buf_attach/dynamic_attach() then call dma_buf_get_pfn_unlocked().
> If the importer choose to do dynamic attach, it also should handle the
> dma-buf move notification.
>
> Only the unlocked version of dma_buf_get_pfn is implemented for now,
> just because no locked version is used for now.
>
> Signed-off-by: Xu Yilun <yilun.xu(a)linux.intel.com>
>
> ---
> IIUC, Only get_pfn() is needed but no put_pfn(). The whole dma-buf is
> de/referenced at dma-buf attach/detach time.
>
> Specifically, for static attachment, the exporter should always make
> memory resource available/pinned on first dma_buf_attach(), and
> release/unpin memory resource on last dma_buf_detach(). For dynamic
> attachment, the exporter could populate & invalidate the memory
> resource at any time, it's OK as long as the importers follow dma-buf
> move notification. So no pinning is needed for get_pfn() and no
> put_pfn() is needed.
> ---
> drivers/dma-buf/dma-buf.c | 90 +++++++++++++++++++++++++++++++--------
> include/linux/dma-buf.h | 13 ++++++
> 2 files changed, 86 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 7eeee3a38202..83d1448b6dcc 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -630,10 +630,10 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> size_t alloc_size = sizeof(struct dma_buf);
> int ret;
>
> - if (WARN_ON(!exp_info->priv || !exp_info->ops
> - || !exp_info->ops->map_dma_buf
> - || !exp_info->ops->unmap_dma_buf
> - || !exp_info->ops->release))
> + if (WARN_ON(!exp_info->priv || !exp_info->ops ||
> + (!!exp_info->ops->map_dma_buf != !!exp_info->ops->unmap_dma_buf) ||
> + (!exp_info->ops->map_dma_buf && !exp_info->ops->get_pfn) ||
> + !exp_info->ops->release))
> return ERR_PTR(-EINVAL);
>
> if (WARN_ON(exp_info->ops->cache_sgt_mapping &&
> @@ -909,7 +909,10 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
> struct dma_buf_attachment *attach;
> int ret;
>
> - if (WARN_ON(!dmabuf || !dev))
> + if (WARN_ON(!dmabuf))
> + return ERR_PTR(-EINVAL);
> +
> + if (WARN_ON(dmabuf->ops->map_dma_buf && !dev))
> return ERR_PTR(-EINVAL);
>
> if (WARN_ON(importer_ops && !importer_ops->move_notify))
> @@ -941,7 +944,7 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
> */
> if (dma_buf_attachment_is_dynamic(attach) !=
> dma_buf_is_dynamic(dmabuf)) {
> - struct sg_table *sgt;
> + struct sg_table *sgt = NULL;
>
> dma_resv_lock(attach->dmabuf->resv, NULL);
> if (dma_buf_is_dynamic(attach->dmabuf)) {
> @@ -950,13 +953,16 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
> goto err_unlock;
> }
>
> - sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
> - if (!sgt)
> - sgt = ERR_PTR(-ENOMEM);
> - if (IS_ERR(sgt)) {
> - ret = PTR_ERR(sgt);
> - goto err_unpin;
> + if (dmabuf->ops->map_dma_buf) {
> + sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
> + if (!sgt)
> + sgt = ERR_PTR(-ENOMEM);
> + if (IS_ERR(sgt)) {
> + ret = PTR_ERR(sgt);
> + goto err_unpin;
> + }
> }
> +
> dma_resv_unlock(attach->dmabuf->resv);
> attach->sgt = sgt;
> attach->dir = DMA_BIDIRECTIONAL;
> @@ -1119,7 +1125,8 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
>
> might_sleep();
>
> - if (WARN_ON(!attach || !attach->dmabuf))
> + if (WARN_ON(!attach || !attach->dmabuf ||
> + !attach->dmabuf->ops->map_dma_buf))
> return ERR_PTR(-EINVAL);
>
> dma_resv_assert_held(attach->dmabuf->resv);
> @@ -1195,7 +1202,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
>
> might_sleep();
>
> - if (WARN_ON(!attach || !attach->dmabuf))
> + if (WARN_ON(!attach || !attach->dmabuf ||
> + !attach->dmabuf->ops->map_dma_buf))
> return ERR_PTR(-EINVAL);
>
> dma_resv_lock(attach->dmabuf->resv, NULL);
> @@ -1222,7 +1230,8 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
> {
> might_sleep();
>
> - if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
> + if (WARN_ON(!attach || !attach->dmabuf ||
> + !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
> return;
>
> dma_resv_assert_held(attach->dmabuf->resv);
> @@ -1254,7 +1263,8 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
> {
> might_sleep();
>
> - if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
> + if (WARN_ON(!attach || !attach->dmabuf ||
> + !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
> return;
>
> dma_resv_lock(attach->dmabuf->resv, NULL);
> @@ -1263,6 +1273,52 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
> }
> EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
>
> +/**
> + * dma_buf_get_pfn_unlocked -
> + * @attach: [in] attachment to get pfn from
> + * @pgoff: [in] page offset of the buffer against the start of dma_buf
> + * @pfn: [out] returns the pfn of the buffer
> + * @max_order [out] returns the max mapping order of the buffer
> + */
> +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
> + pgoff_t pgoff, u64 *pfn, int *max_order)
> +{
> + struct dma_buf *dmabuf = attach->dmabuf;
> + int ret;
> +
> + if (WARN_ON(!attach || !attach->dmabuf ||
> + !attach->dmabuf->ops->get_pfn))
> + return -EINVAL;
> +
> + /*
> + * Open:
> + *
> + * When dma_buf is dynamic but dma_buf move is disabled, the buffer
> + * should be pinned before use, See dma_buf_map_attachment() for
> + * reference.
> + *
> + * But for now no pin is intended inside dma_buf_get_pfn(), otherwise
> + * need another API to unpin the dma_buf. So just fail out this case.
> + */
> + if (dma_buf_is_dynamic(attach->dmabuf) &&
> + !IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY))
> + return -ENOENT;
> +
> + dma_resv_lock(attach->dmabuf->resv, NULL);
> + ret = dmabuf->ops->get_pfn(attach, pgoff, pfn, max_order);
> + /*
> + * Open:
> + *
> + * Is dma_resv_wait_timeout() needed? I assume no. The DMA buffer
> + * content synchronization could be done when the buffer is to be
> + * mapped by importer.
> + */
> + dma_resv_unlock(attach->dmabuf->resv);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_get_pfn_unlocked, "DMA_BUF");
> +
> /**
> * dma_buf_move_notify - notify attachments that DMA-buf is moving
> *
> @@ -1662,7 +1718,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
> attach_count = 0;
>
> list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
> - seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
> + seq_printf(s, "\t%s\n", attach_obj->dev ? dev_name(attach_obj->dev) : NULL);
> attach_count++;
> }
> dma_resv_unlock(buf_obj->resv);
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index 36216d28d8bd..b16183edfb3a 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -194,6 +194,17 @@ struct dma_buf_ops {
> * if the call would block.
> */
>
> + /**
> + * @get_pfn:
> + *
> + * This is called by dma_buf_get_pfn(). It is used to get the pfn
> + * of the buffer positioned by the page offset against the start of
> + * the dma_buf. It can only be called if @attach has been called
> + * successfully.
> + */
> + int (*get_pfn)(struct dma_buf_attachment *attach, pgoff_t pgoff,
> + u64 *pfn, int *max_order);
> +
> /**
> * @release:
> *
> @@ -629,6 +640,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
> void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
> struct sg_table *sg_table,
> enum dma_data_direction direction);
> +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
> + pgoff_t pgoff, u64 *pfn, int *max_order);
>
> int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
> unsigned long);
On Wed, 8 Jan 2025 at 22:27, Simona Vetter <simona.vetter(a)ffwll.ch> wrote:
>
> On Tue, Dec 24, 2024 at 12:05:19PM +0530, Sumit Garg wrote:
> > Hi Simona,
> >
> > On Wed, 18 Dec 2024 at 16:36, Simona Vetter <simona.vetter(a)ffwll.ch> wrote:
> > >
> > > On Tue, Dec 17, 2024 at 11:07:36AM +0100, Jens Wiklander wrote:
> > > > Hi,
> > > >
> > > > This patch set allocates the restricted DMA-bufs via the TEE subsystem.
> > > >
> > > > The TEE subsystem handles the DMA-buf allocations since it is the TEE
> > > > (OP-TEE, AMD-TEE, TS-TEE, or perhaps a future QCOMTEE) which sets up the
> > > > restrictions for the memory used for the DMA-bufs.
> > > >
> > > > I've added a new IOCTL, TEE_IOC_RSTMEM_ALLOC, to allocate the restricted
> > > > DMA-bufs. This IOCTL reaches the backend TEE driver, allowing it to choose
> > > > how to allocate the restricted physical memory.
> > > >
> > > > TEE_IOC_RSTMEM_ALLOC takes in addition to a size and flags parameters also
> > > > a use-case parameter. This is used by the backend TEE driver to decide on
> > > > allocation policy and which devices should be able to access the memory.
> > > >
> > > > Three use-cases (Secure Video Playback, Trusted UI, and Secure Video
> > > > Recording) has been identified so far to serve as examples of what can be
> > > > expected. More use-cases can be added in userspace ABI, but it's up to the
> > > > backend TEE drivers to provide the implementation.
> > > >
> > > > Each use-case has it's own restricted memory pool since different use-cases
> > > > requires isolation from different parts of the system. A restricted memory
> > > > pool can be based on a static carveout instantiated while probing the TEE
> > > > backend driver, or dynamically allocated from CMA and made restricted as
> > > > needed by the TEE.
> > > >
> > > > This can be tested on QEMU with the following steps:
> > > > repo init -u https://github.com/jenswi-linaro/manifest.git -m qemu_v8.xml \
> > > > -b prototype/sdp-v4
> > > > repo sync -j8
> > > > cd build
> > > > make toolchains -j$(nproc)
> > > > make SPMC_AT_EL=1 all -j$(nproc)
> > > > make SPMC_AT_EL=1 run-only
> > > > # login and at the prompt:
> > > > xtest --sdp-basic
> > > >
> > > > The SPMC_AT_EL=1 parameter configures the build with FF-A and an SPMC at
> > > > S-EL1 inside OP-TEE. The parameter can be changed into SPMC_AT_EL=n to test
> > > > without FF-A using the original SMC ABI instead. Please remember to do
> > > > %rm -rf ../trusted-firmware-a/build/qemu
> > > > for TF-A to be rebuilt properly using the new configuration.
> > > >
> > > > https://optee.readthedocs.io/en/latest/building/prerequisites.html
> > > > list dependencies needed to build the above.
> > > >
> > > > The tests are pretty basic, mostly checking that a Trusted Application in
> > > > the secure world can access and manipulate the memory. There are also some
> > > > negative tests for out of bounds buffers etc.
> > >
> > > I think I've dropped this on earlier encrypted dma-buf discussions for
> > > TEE, but can't find one right now ...
> >
> > Thanks for raising this query.
> >
> > >
> > > Do we have some open source userspace for this? To my knowledge we have
> > > two implementations of encrypted/content protected dma-buf in upstream
> > > right now in the amd and intel gpu drivers, and unless I'm mistaken they
> > > both have some minimal userspace supporting EXT_protected_textures:
> >
> > First of all to clarify the support Jens is adding here for allocating
> > restricted shared memory allocation in TEE subsystem is meant to be
> > generic and not specific to only secure media pipeline use-case. Then
> > here we not only have open source test applications but rather open
> > source firmware too (OP-TEE as a Trusted OS) [1] supporting this as a
> > core feature where we maintain a stable and extensible ABI among the
> > kernel and the OP-TEE core.
> >
> > Restricted memory is a feature enforced by hardware specific firewalls
> > where a particular TEE implementation governs which particular block
> > of memory is accessible to a particular peripheral or a CPU running in
> > a higher privileged mode than the Linux kernel. There can be numeric
> > use-cases surrounding that as follows:
> >
> > - Secure media pipeline where the contents gets decrypted and stored
> > in a restricted buffer which are then accessible only to media display
> > pipeline peripherals.
> > - Trusted user interface where a peripheral takes input from the user
> > and stores it in a restricted buffer which then is accessible to TEE
> > implementation only.
> > - Another possible use-case can be for the TEE implementation to store
> > key material in a restricted buffer which is only accessible to the
> > hardware crypto accelerator.
> >
> > I am sure there will be more use-cases related to this feature but
> > those will only be possible once we provide a stable and extensible
> > restricted memory interface among the Linux user-space and the secure
> > world user-space (normally referred to as Trusted Applications).
> >
> > [1] https://github.com/OP-TEE/optee_os/pull/7159
> >
> > >
> > > https://github.com/KhronosGroup/OpenGL-Registry/blob/main/extensions/EXT/EX…
> > >
> > > It's not great, but it does just barely clear the bar in my opinion. I
> > > guess something in gstreamer or similar video pipeline framework would
> > > also do the job.
> > >
> > > Especially with the context of the uapi discussion in the v1/RFC thread I
> > > think we need more than a bare-bones testcase to make sure this works in
> > > actual use.
> >
> > Currently the TEE subsystem already supports a stable ABI for shared
> > memory allocator among Linux user-space and secure world user-space
> > here [2]. And the stable ABI for restricted memory is also along the
> > same lines meant to be a vendor neutral abstraction for the user-space
> > access. The current test cases not only test the interface but also
> > perform regression tests too.
> >
> > I am also in favour of end to end open source use-cases. But I fear
> > without progressing in a step wise manner as with this proposal we
> > would rather force developers to upstream all the software pieces in
> > one go which will be kind of a chicken and egg situation. I am sure
> > once this feature lands Mediatek folks will be interested to port
> > their secure video playback patchset [3] on top of it. Similarly other
> > silicon vendors like NXP, Qcom etc. will be motivated to do the same.
> >
> > [2] https://docs.kernel.org/userspace-api/tee.html
> > [3] https://lore.kernel.org/linux-arm-kernel/20240515112308.10171-1-yong.wu@med…
>
> We get entire opengl/vulkan driver stacks ready before we merge new drm
> drivers, I really don't think this is too hard from a technical pov. And I
> think the mediatek patches had the same issue of lacking userspace for it,
> so that's not moving things forward.
> -Sima
>
Okay fair enough, I think I get your point. Currently we are missing
at least one peripheral support being the consumer for these
restricted DMA-bufs. So I discussed with Jens offline that we can try
with a crypto peripheral use-case first which can simply be
demonstrated using the current OP-TEE client user-space.
Also, in crypto peripheral use-case we can target the symmetric crypto
use-case first which already has a concept of hardware backed
symmetric key [1]. IOW, we should be able to come up with a generic
symmetric crypto algorithm which can be supported by different crypto
accelerators using a TEE backed restricted key DMA buffer.
[1] https://www.youtube.com/watch?v=GbcpwUBFGDw
-Sumit
On Tue, Jan 07, 2025 at 10:27:15PM +0800, Xu Yilun wrote:
> Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as
> for private assignment. For these private assigned devices, disallow
> host accessing their MMIO resources.
Why? Shouldn't the VMM simply not call mmap? Why does the kernel have
to enforce this?
Jason
6.12-stable review patch. If anyone has any objections, please let me know.
------------------
From: John Harrison <John.C.Harrison(a)Intel.com>
[ Upstream commit 5dce85fecb87751ec94526e1ac516dd7871e2e0c ]
Adding lockdep checking to the coredump code showed that there was an
existing violation. The dev_coredumpm_timeout() call is used to
register the dump with the base coredump subsystem. However, that
makes multiple memory allocations, only some of which use the GFP_
flags passed in. So that also needs to be deferred to the worker
function where it is safe to allocate with arbitrary flags.
In order to not add protoypes for the callback functions, moving the
_timeout call also means moving the worker thread function to later in
the file.
v2: Rebased after other changes to the worker function.
Fixes: e799485044cb ("drm/xe: Introduce the dev_coredump infrastructure.")
Cc: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Cc: Matthew Brost <matthew.brost(a)intel.com>
Cc: Jani Nikula <jani.nikula(a)linux.intel.com>
Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: Francois Dugast <francois.dugast(a)intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Cc: Lucas De Marchi <lucas.demarchi(a)intel.com>
Cc: "Thomas Hellström" <thomas.hellstrom(a)linux.intel.com>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: "Christian König" <christian.koenig(a)amd.com>
Cc: intel-xe(a)lists.freedesktop.org
Cc: linux-media(a)vger.kernel.org
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: <stable(a)vger.kernel.org> # v6.8+
Signed-off-by: John Harrison <John.C.Harrison(a)Intel.com>
Reviewed-by: Matthew Brost <matthew.brost(a)intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241128210824.3302147-3-John…
(cherry picked from commit 90f51a7f4ec1004fc4ddfbc6d1f1068d85ef4771)
Signed-off-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/gpu/drm/xe/xe_devcoredump.c | 73 +++++++++++++++--------------
1 file changed, 39 insertions(+), 34 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
index c40c91e27f71..c18e463092af 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.c
+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
@@ -144,36 +144,6 @@ static void xe_devcoredump_snapshot_free(struct xe_devcoredump_snapshot *ss)
ss->vm = NULL;
}
-static void xe_devcoredump_deferred_snap_work(struct work_struct *work)
-{
- struct xe_devcoredump_snapshot *ss = container_of(work, typeof(*ss), work);
- struct xe_devcoredump *coredump = container_of(ss, typeof(*coredump), snapshot);
- struct xe_device *xe = coredump_to_xe(coredump);
- unsigned int fw_ref;
-
- xe_pm_runtime_get(xe);
-
- /* keep going if fw fails as we still want to save the memory and SW data */
- fw_ref = xe_force_wake_get(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL);
- if (!xe_force_wake_ref_has_domain(fw_ref, XE_FORCEWAKE_ALL))
- xe_gt_info(ss->gt, "failed to get forcewake for coredump capture\n");
- xe_vm_snapshot_capture_delayed(ss->vm);
- xe_guc_exec_queue_snapshot_capture_delayed(ss->ge);
- xe_force_wake_put(gt_to_fw(ss->gt), fw_ref);
-
- xe_pm_runtime_put(xe);
-
- /* Calculate devcoredump size */
- ss->read.size = __xe_devcoredump_read(NULL, INT_MAX, coredump);
-
- ss->read.buffer = kvmalloc(ss->read.size, GFP_USER);
- if (!ss->read.buffer)
- return;
-
- __xe_devcoredump_read(ss->read.buffer, ss->read.size, coredump);
- xe_devcoredump_snapshot_free(ss);
-}
-
static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
size_t count, void *data, size_t datalen)
{
@@ -222,6 +192,45 @@ static void xe_devcoredump_free(void *data)
"Xe device coredump has been deleted.\n");
}
+static void xe_devcoredump_deferred_snap_work(struct work_struct *work)
+{
+ struct xe_devcoredump_snapshot *ss = container_of(work, typeof(*ss), work);
+ struct xe_devcoredump *coredump = container_of(ss, typeof(*coredump), snapshot);
+ struct xe_device *xe = coredump_to_xe(coredump);
+ unsigned int fw_ref;
+
+ /*
+ * NB: Despite passing a GFP_ flags parameter here, more allocations are done
+ * internally using GFP_KERNEL expliictly. Hence this call must be in the worker
+ * thread and not in the initial capture call.
+ */
+ dev_coredumpm_timeout(gt_to_xe(ss->gt)->drm.dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
+ xe_devcoredump_read, xe_devcoredump_free,
+ XE_COREDUMP_TIMEOUT_JIFFIES);
+
+ xe_pm_runtime_get(xe);
+
+ /* keep going if fw fails as we still want to save the memory and SW data */
+ fw_ref = xe_force_wake_get(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL);
+ if (!xe_force_wake_ref_has_domain(fw_ref, XE_FORCEWAKE_ALL))
+ xe_gt_info(ss->gt, "failed to get forcewake for coredump capture\n");
+ xe_vm_snapshot_capture_delayed(ss->vm);
+ xe_guc_exec_queue_snapshot_capture_delayed(ss->ge);
+ xe_force_wake_put(gt_to_fw(ss->gt), fw_ref);
+
+ xe_pm_runtime_put(xe);
+
+ /* Calculate devcoredump size */
+ ss->read.size = __xe_devcoredump_read(NULL, INT_MAX, coredump);
+
+ ss->read.buffer = kvmalloc(ss->read.size, GFP_USER);
+ if (!ss->read.buffer)
+ return;
+
+ __xe_devcoredump_read(ss->read.buffer, ss->read.size, coredump);
+ xe_devcoredump_snapshot_free(ss);
+}
+
static void devcoredump_snapshot(struct xe_devcoredump *coredump,
struct xe_sched_job *job)
{
@@ -305,10 +314,6 @@ void xe_devcoredump(struct xe_sched_job *job)
drm_info(&xe->drm, "Xe device coredump has been created\n");
drm_info(&xe->drm, "Check your /sys/class/drm/card%d/device/devcoredump/data\n",
xe->drm.primary->index);
-
- dev_coredumpm_timeout(xe->drm.dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
- xe_devcoredump_read, xe_devcoredump_free,
- XE_COREDUMP_TIMEOUT_JIFFIES);
}
static void xe_driver_devcoredump_fini(void *arg)
--
2.39.5
Hi Lukas,
On Tue, 24 Dec 2024 at 14:58, Lukas Wunner <lukas(a)wunner.de> wrote:
>
> On Tue, Dec 24, 2024 at 12:05:19PM +0530, Sumit Garg wrote:
> > Restricted memory is a feature enforced by hardware specific firewalls
> > where a particular TEE implementation governs which particular block
> > of memory is accessible to a particular peripheral or a CPU running in
> > a higher privileged mode than the Linux kernel.
> [...]
> > - Another possible use-case can be for the TEE implementation to store
> > key material in a restricted buffer which is only accessible to the
> > hardware crypto accelerator.
>
> Just a heads-up:
>
> For RSA sign/verify operations using rsassa-pkcs1 encoding,
> the message to be signed/verified (which I understand could
> be located in restricted memory) is prepended by a padding.
>
> The crypto subsystem does the prepending of the padding in software.
> The actual signature generation/verification (which is an RSA encrypt
> or decrypt operation) may be performed in hardware by a crypto
> accelerator.
>
> Before commit 8552cb04e083 ("crypto: rsassa-pkcs1 - Copy source
> data for SG list"), the kernel constructed a scatterlist
> consisting of the padding on the one hand, and of the message
> to be signed/verified on the other hand. I believe this worked
> for use cases where the message is located in restricted memory.
>
> However since that commit, the kernel kmalloc's a new buffer and
> copies the message to be signed/verified into it. The argument
> was that although the *kernel* may be able to access the data,
> the crypto accelerator may *not* be able to do so. In particular,
> portions of the padding are located in the kernel's .rodata section
> which is a valid virtual address on x86 but not on arm64 and
> which may be inaccessible to a crypto accelerator.
>
> However in the case of restricted memory, the situation is exactly
> the opposite: The kernel may *not* be able to access the data,
> but the crypto accelerator can access it just fine.
>
> I did raise a concern about this to the maintainer, but to no avail:
> https://lore.kernel.org/r/Z1Kym1-9ka8kGHrM@wunner.de/
Herbert's point is valid that there isn't any point for mapping
restricted memory in the kernel virtual address space as any kernel
access to that space can lead to platform specific hardware error
scenarios. And for that reason we simply disallow dma_buf_mmap() and
don't support dma_buf_vmap() for DMA-bufs holding TEE restricted
memory. The only consumers for those DMA-bufs will be the DMA capable
peripherals granted access permissions by the TEE implementation. IOW,
kernel role here will be to just provide the DMA-buf infrastructure
for buffers to be set up by TEE and then setting up DMA addresses for
peripherals to access them. The hardware crypto accelerator can be one
such peripheral.
>
> This is the alternative solution I would have preferred:
> https://lore.kernel.org/r/3de5d373c86dcaa5abc36f501c1398c4fbf05f2f.17328651…
>
> > I am also in favour of end to end open source use-cases. But I fear
> > without progressing in a step wise manner as with this proposal we
> > would rather force developers to upstream all the software pieces in
> > one go which will be kind of a chicken and egg situation. I am sure
> > once this feature lands Mediatek folks will be interested to port
> > their secure video playback patchset [3] on top of it. Similarly other
> > silicon vendors like NXP, Qcom etc. will be motivated to do the same.
>
> The crypto use case may be easier to bring up than the video decoding
> use case because you don't need to implement a huge amount of
> user space code.
Agree, if you already have such an existing hardware use-case then
please feel free to build up on this patch-set.
-Sumit
>
> Thanks,
>
> Lukas
On Tue, Dec 24, 2024 at 10:32:41AM +0100, Lukas Wunner wrote:
> On Tue, Dec 24, 2024 at 10:28:31AM +0100, Lukas Wunner wrote:
> > I did raise a concern about this to the maintainer, but to no avail:
> > https://lore.kernel.org/r/Z1Kym1-9ka8kGHrM@wunner.de/
>
> Sorry, wrong link. This is the one I meant to copy-paste... :(
>
> https://lore.kernel.org/r/Z0rPxCGdD7r8HFKb@wunner.de/
Herbert asked a logical question, which got no response from your side.
--
With best wishes
Dmitry
Hi Simona,
On Wed, 18 Dec 2024 at 16:36, Simona Vetter <simona.vetter(a)ffwll.ch> wrote:
>
> On Tue, Dec 17, 2024 at 11:07:36AM +0100, Jens Wiklander wrote:
> > Hi,
> >
> > This patch set allocates the restricted DMA-bufs via the TEE subsystem.
> >
> > The TEE subsystem handles the DMA-buf allocations since it is the TEE
> > (OP-TEE, AMD-TEE, TS-TEE, or perhaps a future QCOMTEE) which sets up the
> > restrictions for the memory used for the DMA-bufs.
> >
> > I've added a new IOCTL, TEE_IOC_RSTMEM_ALLOC, to allocate the restricted
> > DMA-bufs. This IOCTL reaches the backend TEE driver, allowing it to choose
> > how to allocate the restricted physical memory.
> >
> > TEE_IOC_RSTMEM_ALLOC takes in addition to a size and flags parameters also
> > a use-case parameter. This is used by the backend TEE driver to decide on
> > allocation policy and which devices should be able to access the memory.
> >
> > Three use-cases (Secure Video Playback, Trusted UI, and Secure Video
> > Recording) has been identified so far to serve as examples of what can be
> > expected. More use-cases can be added in userspace ABI, but it's up to the
> > backend TEE drivers to provide the implementation.
> >
> > Each use-case has it's own restricted memory pool since different use-cases
> > requires isolation from different parts of the system. A restricted memory
> > pool can be based on a static carveout instantiated while probing the TEE
> > backend driver, or dynamically allocated from CMA and made restricted as
> > needed by the TEE.
> >
> > This can be tested on QEMU with the following steps:
> > repo init -u https://github.com/jenswi-linaro/manifest.git -m qemu_v8.xml \
> > -b prototype/sdp-v4
> > repo sync -j8
> > cd build
> > make toolchains -j$(nproc)
> > make SPMC_AT_EL=1 all -j$(nproc)
> > make SPMC_AT_EL=1 run-only
> > # login and at the prompt:
> > xtest --sdp-basic
> >
> > The SPMC_AT_EL=1 parameter configures the build with FF-A and an SPMC at
> > S-EL1 inside OP-TEE. The parameter can be changed into SPMC_AT_EL=n to test
> > without FF-A using the original SMC ABI instead. Please remember to do
> > %rm -rf ../trusted-firmware-a/build/qemu
> > for TF-A to be rebuilt properly using the new configuration.
> >
> > https://optee.readthedocs.io/en/latest/building/prerequisites.html
> > list dependencies needed to build the above.
> >
> > The tests are pretty basic, mostly checking that a Trusted Application in
> > the secure world can access and manipulate the memory. There are also some
> > negative tests for out of bounds buffers etc.
>
> I think I've dropped this on earlier encrypted dma-buf discussions for
> TEE, but can't find one right now ...
Thanks for raising this query.
>
> Do we have some open source userspace for this? To my knowledge we have
> two implementations of encrypted/content protected dma-buf in upstream
> right now in the amd and intel gpu drivers, and unless I'm mistaken they
> both have some minimal userspace supporting EXT_protected_textures:
First of all to clarify the support Jens is adding here for allocating
restricted shared memory allocation in TEE subsystem is meant to be
generic and not specific to only secure media pipeline use-case. Then
here we not only have open source test applications but rather open
source firmware too (OP-TEE as a Trusted OS) [1] supporting this as a
core feature where we maintain a stable and extensible ABI among the
kernel and the OP-TEE core.
Restricted memory is a feature enforced by hardware specific firewalls
where a particular TEE implementation governs which particular block
of memory is accessible to a particular peripheral or a CPU running in
a higher privileged mode than the Linux kernel. There can be numeric
use-cases surrounding that as follows:
- Secure media pipeline where the contents gets decrypted and stored
in a restricted buffer which are then accessible only to media display
pipeline peripherals.
- Trusted user interface where a peripheral takes input from the user
and stores it in a restricted buffer which then is accessible to TEE
implementation only.
- Another possible use-case can be for the TEE implementation to store
key material in a restricted buffer which is only accessible to the
hardware crypto accelerator.
I am sure there will be more use-cases related to this feature but
those will only be possible once we provide a stable and extensible
restricted memory interface among the Linux user-space and the secure
world user-space (normally referred to as Trusted Applications).
[1] https://github.com/OP-TEE/optee_os/pull/7159
>
> https://github.com/KhronosGroup/OpenGL-Registry/blob/main/extensions/EXT/EX…
>
> It's not great, but it does just barely clear the bar in my opinion. I
> guess something in gstreamer or similar video pipeline framework would
> also do the job.
>
> Especially with the context of the uapi discussion in the v1/RFC thread I
> think we need more than a bare-bones testcase to make sure this works in
> actual use.
Currently the TEE subsystem already supports a stable ABI for shared
memory allocator among Linux user-space and secure world user-space
here [2]. And the stable ABI for restricted memory is also along the
same lines meant to be a vendor neutral abstraction for the user-space
access. The current test cases not only test the interface but also
perform regression tests too.
I am also in favour of end to end open source use-cases. But I fear
without progressing in a step wise manner as with this proposal we
would rather force developers to upstream all the software pieces in
one go which will be kind of a chicken and egg situation. I am sure
once this feature lands Mediatek folks will be interested to port
their secure video playback patchset [3] on top of it. Similarly other
silicon vendors like NXP, Qcom etc. will be motivated to do the same.
[2] https://docs.kernel.org/userspace-api/tee.html
[3] https://lore.kernel.org/linux-arm-kernel/20240515112308.10171-1-yong.wu@med…
-Sumit
>
> Cheers, Sima
>
> >
> > Thanks,
> > Jens
> >
> > Changes since V3:
> > * Make the use_case and flags field in struct tee_shm u32's instead of
> > u16's
> > * Add more description for TEE_IOC_RSTMEM_ALLOC in the header file
> > * Import namespace DMA_BUF in module tee, reported by lkp(a)intel.com
> > * Added a note in the commit message for "optee: account for direction
> > while converting parameters" why it's needed
> > * Factor out dynamic restricted memory allocation from
> > "optee: support restricted memory allocation" into two new commits
> > "optee: FF-A: dynamic restricted memory allocation" and
> > "optee: smc abi: dynamic restricted memory allocation"
> > * Guard CMA usage with #ifdef CONFIG_CMA, effectively disabling dynamic
> > restricted memory allocate if CMA isn't configured
> >
> > Changes since the V2 RFC:
> > * Based on v6.12
> > * Replaced the flags for SVP and Trusted UID memory with a u32 field with
> > unique id for each use case
> > * Added dynamic allocation of restricted memory pools
> > * Added OP-TEE ABI both with and without FF-A for dynamic restricted memory
> > * Added support for FF-A with FFA_LEND
> >
> > Changes since the V1 RFC:
> > * Based on v6.11
> > * Complete rewrite, replacing the restricted heap with TEE_IOC_RSTMEM_ALLOC
> >
> > Changes since Olivier's post [2]:
> > * Based on Yong Wu's post [1] where much of dma-buf handling is done in
> > the generic restricted heap
> > * Simplifications and cleanup
> > * New commit message for "dma-buf: heaps: add Linaro restricted dmabuf heap
> > support"
> > * Replaced the word "secure" with "restricted" where applicable
> >
> > Jens Wiklander (6):
> > tee: add restricted memory allocation
> > optee: account for direction while converting parameters
> > optee: sync secure world ABI headers
> > optee: support restricted memory allocation
> > optee: FF-A: dynamic restricted memory allocation
> > optee: smc abi: dynamic restricted memory allocation
> >
> > drivers/tee/Makefile | 1 +
> > drivers/tee/optee/Makefile | 1 +
> > drivers/tee/optee/call.c | 10 +-
> > drivers/tee/optee/core.c | 1 +
> > drivers/tee/optee/ffa_abi.c | 178 +++++++++++++-
> > drivers/tee/optee/optee_ffa.h | 27 ++-
> > drivers/tee/optee/optee_msg.h | 65 ++++-
> > drivers/tee/optee/optee_private.h | 75 ++++--
> > drivers/tee/optee/optee_smc.h | 71 +++++-
> > drivers/tee/optee/rpc.c | 31 ++-
> > drivers/tee/optee/rstmem.c | 388 ++++++++++++++++++++++++++++++
> > drivers/tee/optee/smc_abi.c | 213 ++++++++++++++--
> > drivers/tee/tee_core.c | 38 ++-
> > drivers/tee/tee_private.h | 2 +
> > drivers/tee/tee_rstmem.c | 201 ++++++++++++++++
> > drivers/tee/tee_shm.c | 2 +
> > drivers/tee/tee_shm_pool.c | 69 +++++-
> > include/linux/tee_core.h | 15 ++
> > include/linux/tee_drv.h | 2 +
> > include/uapi/linux/tee.h | 44 +++-
> > 20 files changed, 1358 insertions(+), 76 deletions(-)
> > create mode 100644 drivers/tee/optee/rstmem.c
> > create mode 100644 drivers/tee/tee_rstmem.c
> >
> >
> > base-commit: fac04efc5c793dccbd07e2d59af9f90b7fc0dca4
> > --
> > 2.43.0
> >
>
> --
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Hi,
This patch set allocates the restricted DMA-bufs via the TEE subsystem.
The TEE subsystem handles the DMA-buf allocations since it is the TEE
(OP-TEE, AMD-TEE, TS-TEE, or perhaps a future QCOMTEE) which sets up the
restrictions for the memory used for the DMA-bufs.
I've added a new IOCTL, TEE_IOC_RSTMEM_ALLOC, to allocate the restricted
DMA-bufs. This IOCTL reaches the backend TEE driver, allowing it to choose
how to allocate the restricted physical memory.
TEE_IOC_RSTMEM_ALLOC takes in addition to a size and flags parameters also
a use-case parameter. This is used by the backend TEE driver to decide on
allocation policy and which devices should be able to access the memory.
Three use-cases (Secure Video Playback, Trusted UI, and Secure Video
Recording) has been identified so far to serve as examples of what can be
expected. More use-cases can be added in userspace ABI, but it's up to the
backend TEE drivers to provide the implementation.
Each use-case has it's own restricted memory pool since different use-cases
requires isolation from different parts of the system. A restricted memory
pool can be based on a static carveout instantiated while probing the TEE
backend driver, or dynamically allocated from CMA and made restricted as
needed by the TEE.
This can be tested on QEMU with the following steps:
repo init -u https://github.com/jenswi-linaro/manifest.git -m qemu_v8.xml \
-b prototype/sdp-v4
repo sync -j8
cd build
make toolchains -j$(nproc)
make SPMC_AT_EL=1 all -j$(nproc)
make SPMC_AT_EL=1 run-only
# login and at the prompt:
xtest --sdp-basic
The SPMC_AT_EL=1 parameter configures the build with FF-A and an SPMC at
S-EL1 inside OP-TEE. The parameter can be changed into SPMC_AT_EL=n to test
without FF-A using the original SMC ABI instead. Please remember to do
%rm -rf ../trusted-firmware-a/build/qemu
for TF-A to be rebuilt properly using the new configuration.
https://optee.readthedocs.io/en/latest/building/prerequisites.html
list dependencies needed to build the above.
The tests are pretty basic, mostly checking that a Trusted Application in
the secure world can access and manipulate the memory. There are also some
negative tests for out of bounds buffers etc.
Thanks,
Jens
Changes since V3:
* Make the use_case and flags field in struct tee_shm u32's instead of
u16's
* Add more description for TEE_IOC_RSTMEM_ALLOC in the header file
* Import namespace DMA_BUF in module tee, reported by lkp(a)intel.com
* Added a note in the commit message for "optee: account for direction
while converting parameters" why it's needed
* Factor out dynamic restricted memory allocation from
"optee: support restricted memory allocation" into two new commits
"optee: FF-A: dynamic restricted memory allocation" and
"optee: smc abi: dynamic restricted memory allocation"
* Guard CMA usage with #ifdef CONFIG_CMA, effectively disabling dynamic
restricted memory allocate if CMA isn't configured
Changes since the V2 RFC:
* Based on v6.12
* Replaced the flags for SVP and Trusted UID memory with a u32 field with
unique id for each use case
* Added dynamic allocation of restricted memory pools
* Added OP-TEE ABI both with and without FF-A for dynamic restricted memory
* Added support for FF-A with FFA_LEND
Changes since the V1 RFC:
* Based on v6.11
* Complete rewrite, replacing the restricted heap with TEE_IOC_RSTMEM_ALLOC
Changes since Olivier's post [2]:
* Based on Yong Wu's post [1] where much of dma-buf handling is done in
the generic restricted heap
* Simplifications and cleanup
* New commit message for "dma-buf: heaps: add Linaro restricted dmabuf heap
support"
* Replaced the word "secure" with "restricted" where applicable
Jens Wiklander (6):
tee: add restricted memory allocation
optee: account for direction while converting parameters
optee: sync secure world ABI headers
optee: support restricted memory allocation
optee: FF-A: dynamic restricted memory allocation
optee: smc abi: dynamic restricted memory allocation
drivers/tee/Makefile | 1 +
drivers/tee/optee/Makefile | 1 +
drivers/tee/optee/call.c | 10 +-
drivers/tee/optee/core.c | 1 +
drivers/tee/optee/ffa_abi.c | 178 +++++++++++++-
drivers/tee/optee/optee_ffa.h | 27 ++-
drivers/tee/optee/optee_msg.h | 65 ++++-
drivers/tee/optee/optee_private.h | 75 ++++--
drivers/tee/optee/optee_smc.h | 71 +++++-
drivers/tee/optee/rpc.c | 31 ++-
drivers/tee/optee/rstmem.c | 388 ++++++++++++++++++++++++++++++
drivers/tee/optee/smc_abi.c | 213 ++++++++++++++--
drivers/tee/tee_core.c | 38 ++-
drivers/tee/tee_private.h | 2 +
drivers/tee/tee_rstmem.c | 201 ++++++++++++++++
drivers/tee/tee_shm.c | 2 +
drivers/tee/tee_shm_pool.c | 69 +++++-
include/linux/tee_core.h | 15 ++
include/linux/tee_drv.h | 2 +
include/uapi/linux/tee.h | 44 +++-
20 files changed, 1358 insertions(+), 76 deletions(-)
create mode 100644 drivers/tee/optee/rstmem.c
create mode 100644 drivers/tee/tee_rstmem.c
base-commit: fac04efc5c793dccbd07e2d59af9f90b7fc0dca4
--
2.43.0
Am 20.12.24 um 15:51 schrieb Danilo Krummrich:
> On Fri, Dec 20, 2024 at 03:11:34PM +0100, Philipp Stanner wrote:
>> On Fri, 2024-12-20 at 14:25 +0100, Christian König wrote:
>>> Am 20.12.24 um 14:18 schrieb Danilo Krummrich:
>>>> On Fri, Dec 20, 2024 at 01:53:34PM +0100, Christian König wrote:
>>>>> Am 20.12.24 um 13:45 schrieb Philipp Stanner:
>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> index 7ce25281c74c..d6f8df39d848 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> + *
>>>>>> + * @sched_job: the job to run
>>>>>> + *
>>>>>> + * Returns: dma_fence the driver must signal once the
>>>>>> hardware has
>>>>>> + * completed the job ("hardware fence").
>>>>>> + *
>>>>>> + * Note that the scheduler expects to 'inherit' its
>>>>>> own reference to
>>>>>> + * this fence from the callback. It does not invoke an
>>>>>> extra
>>>>>> + * dma_fence_get() on it. Consequently, this callback
>>>>>> must return a
>>>>>> + * fence whose refcount is at least 2: One for the
>>>>>> scheduler's
>>>>>> + * reference returned here, another one for the
>>>>>> reference kept by the
>>>>>> + * driver.
>>>>> Well the driver actually doesn't need any extra reference. The
>>>>> scheduler
>>>>> just needs to guarantee that this reference isn't dropped before
>>>>> it is
>>>>> signaled.
>>>> I think he means the reference the driver's fence context has to
>>>> have in order
>>>> to signal that thing eventually.
>>> Yeah, but this is usually a weak reference. IIRC most drivers don't
>>> increment the reference count for the reference they keep to signal a
>>> fence.
>>>
>>> It's expected that the consumers of the dma_fence keep the fence
>>> alive
>>> at least until it is signaled.
>> So are you saying that the driver having an extra reference (without
>> having obtained it with dma_fence_get()) is not an issue because the
>> driver is the one who will signal the fence [and then be done with it]?
> It's never a "real" issue if you have multiple pointers to a reference counted
> object as long as you can ensure that you hold at least one reference for the
> time you have pointers to the object.
Well, I'm not saying that this isn't an issue. I'm just pointing out
that this is the current practice :)
> But, that's bad design. For every pointer to an object a separate reference
> should be taken.
Yeah, completely agree. Weak references are usually a bad idea if you
don't absolutely need them for something.
Regards,
Christian.
Am 20.12.24 um 14:18 schrieb Danilo Krummrich:
> On Fri, Dec 20, 2024 at 01:53:34PM +0100, Christian König wrote:
>> Am 20.12.24 um 13:45 schrieb Philipp Stanner:
>>> From: Philipp Stanner <pstanner(a)redhat.com>
>>>
>>> drm_sched_backend_ops.run_job() returns a dma_fence for the scheduler.
>>> That fence is signalled by the driver once the hardware completed the
>>> associated job. The scheduler does not increment the reference count on
>>> that fence, but implicitly expects to inherit this fence from run_job().
>>>
>>> This is relatively subtle and prone to misunderstandings.
>>>
>>> This implies that, to keep a reference for itself, a driver needs to
>>> call dma_fence_get() in addition to dma_fence_init() in that callback.
>>>
>>> It's further complicated by the fact that the scheduler even decrements
>>> the refcount in drm_sched_run_job_work() since it created a new
>>> reference in drm_sched_fence_scheduled(). It does, however, still use
>>> its pointer to the fence after calling dma_fence_put() - which is safe
>>> because of the aforementioned new reference, but actually still violates
>>> the refcounting rules.
>>>
>>> Improve the explanatory comment for that decrement.
>>>
>>> Move the call to dma_fence_put() to the position behind the last usage
>>> of the fence.
>>>
>>> Document the necessity to increment the reference count in
>>> drm_sched_backend_ops.run_job().
>>>
>>> Cc: Christian König <christian.koenig(a)amd.com>
>>> Cc: Tvrtko Ursulin <tursulin(a)ursulin.net>
>>> Cc: Andrey Grodzovsky <andrey.grodzovsky(a)amd.com>
>>> Signed-off-by: Philipp Stanner <pstanner(a)redhat.com>
>>> ---
>>> drivers/gpu/drm/scheduler/sched_main.c | 10 +++++++---
>>> include/drm/gpu_scheduler.h | 20 ++++++++++++++++----
>>> 2 files changed, 23 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 7ce25281c74c..d6f8df39d848 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> + *
>>> + * @sched_job: the job to run
>>> + *
>>> + * Returns: dma_fence the driver must signal once the hardware has
>>> + * completed the job ("hardware fence").
>>> + *
>>> + * Note that the scheduler expects to 'inherit' its own reference to
>>> + * this fence from the callback. It does not invoke an extra
>>> + * dma_fence_get() on it. Consequently, this callback must return a
>>> + * fence whose refcount is at least 2: One for the scheduler's
>>> + * reference returned here, another one for the reference kept by the
>>> + * driver.
>> Well the driver actually doesn't need any extra reference. The scheduler
>> just needs to guarantee that this reference isn't dropped before it is
>> signaled.
> I think he means the reference the driver's fence context has to have in order
> to signal that thing eventually.
Yeah, but this is usually a weak reference. IIRC most drivers don't
increment the reference count for the reference they keep to signal a fence.
It's expected that the consumers of the dma_fence keep the fence alive
at least until it is signaled. That's why we have this nice warning in
dma_fence_release().
On the other hand I completely agree it would be more defensive if
drivers increment the reference count for the reference they keep for
signaling.
So if we want to document that the fence reference count should at least
be 2 we somehow need to enforce this with a warning for example.
Regards,
Christian.
>
>> Regards,
>> Christian.
>>
>>> */
>>> struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
Am 20.12.24 um 13:45 schrieb Philipp Stanner:
> From: Philipp Stanner <pstanner(a)redhat.com>
>
> drm_sched_backend_ops.run_job() returns a dma_fence for the scheduler.
> That fence is signalled by the driver once the hardware completed the
> associated job. The scheduler does not increment the reference count on
> that fence, but implicitly expects to inherit this fence from run_job().
>
> This is relatively subtle and prone to misunderstandings.
>
> This implies that, to keep a reference for itself, a driver needs to
> call dma_fence_get() in addition to dma_fence_init() in that callback.
>
> It's further complicated by the fact that the scheduler even decrements
> the refcount in drm_sched_run_job_work() since it created a new
> reference in drm_sched_fence_scheduled(). It does, however, still use
> its pointer to the fence after calling dma_fence_put() - which is safe
> because of the aforementioned new reference, but actually still violates
> the refcounting rules.
>
> Improve the explanatory comment for that decrement.
>
> Move the call to dma_fence_put() to the position behind the last usage
> of the fence.
>
> Document the necessity to increment the reference count in
> drm_sched_backend_ops.run_job().
>
> Cc: Christian König <christian.koenig(a)amd.com>
> Cc: Tvrtko Ursulin <tursulin(a)ursulin.net>
> Cc: Andrey Grodzovsky <andrey.grodzovsky(a)amd.com>
> Signed-off-by: Philipp Stanner <pstanner(a)redhat.com>
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 10 +++++++---
> include/drm/gpu_scheduler.h | 20 ++++++++++++++++----
> 2 files changed, 23 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 7ce25281c74c..d6f8df39d848 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1218,15 +1218,19 @@ static void drm_sched_run_job_work(struct work_struct *w)
> drm_sched_fence_scheduled(s_fence, fence);
>
> if (!IS_ERR_OR_NULL(fence)) {
> - /* Drop for original kref_init of the fence */
> - dma_fence_put(fence);
> -
> r = dma_fence_add_callback(fence, &sched_job->cb,
> drm_sched_job_done_cb);
> if (r == -ENOENT)
> drm_sched_job_done(sched_job, fence->error);
> else if (r)
> DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r);
> +
> + /*
> + * s_fence took a new reference to fence in the call to
> + * drm_sched_fence_scheduled() above. The reference passed by
> + * run_job() above is now not needed any longer. Drop it.
> + */
> + dma_fence_put(fence);
> } else {
> drm_sched_job_done(sched_job, IS_ERR(fence) ?
> PTR_ERR(fence) : 0);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 95e17504e46a..a1f5c9a14278 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -420,10 +420,22 @@ struct drm_sched_backend_ops {
> struct drm_sched_entity *s_entity);
>
> /**
> - * @run_job: Called to execute the job once all of the dependencies
> - * have been resolved. This may be called multiple times, if
> - * timedout_job() has happened and drm_sched_job_recovery()
> - * decides to try it again.
> + * @run_job: Called to execute the job once all of the dependencies
> + * have been resolved. This may be called multiple times, if
> + * timedout_job() has happened and drm_sched_job_recovery() decides to
> + * try it again.
Maybe we should improve that here as well while at it.
That drm_sched_job_recovery() can call this multiple times actually goes
against the dma_fence rules since drivers can't easily allocate a new HW
fence.
Something like "The deprecated drm_sched_job_recovery() function might
call this again, but it is strongly advised to not be used because it
violates dma_fence memory allocations rules."
On the other hand can of course be a separate patch.
> + *
> + * @sched_job: the job to run
> + *
> + * Returns: dma_fence the driver must signal once the hardware has
> + * completed the job ("hardware fence").
> + *
> + * Note that the scheduler expects to 'inherit' its own reference to
> + * this fence from the callback. It does not invoke an extra
> + * dma_fence_get() on it. Consequently, this callback must return a
> + * fence whose refcount is at least 2: One for the scheduler's
> + * reference returned here, another one for the reference kept by the
> + * driver.
Well the driver actually doesn't need any extra reference. The scheduler
just needs to guarantee that this reference isn't dropped before it is
signaled.
Regards,
Christian.
> */
> struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>
6.1-stable review patch. If anyone has any objections, please let me know.
------------------
From: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
commit fe52c649438b8489c9456681d93a9b3de3d38263 upstream.
One alternative to the fix Christian proposed in
https://lore.kernel.org/dri-devel/20241024124159.4519-3-christian.koenig@am…
is to replace the rather complex open coded sorting loops with the kernel
standard sort followed by a context squashing pass.
Proposed advantage of this would be readability but one concern Christian
raised was that there could be many fences, that they are typically mostly
sorted, and so the kernel's heap sort would be much worse by the proposed
algorithm.
I had a look running some games and vkcube to see what are the typical
number of input fences. Tested scenarios:
1) Hogwarts Legacy under Gamescope
450 calls per second to __dma_fence_unwrap_merge.
Percentages per number of fences buckets, before and after checking for
signalled status, sorting and flattening:
N Before After
0 0.91%
1 69.40%
2-3 28.72% 9.4% (90.6% resolved to one fence)
4-5 0.93%
6-9 0.03%
10+
2) Cyberpunk 2077 under Gamescope
1050 calls per second, amounting to 0.01% CPU time according to perf top.
N Before After
0 1.13%
1 52.30%
2-3 40.34% 55.57%
4-5 1.46% 0.50%
6-9 2.44%
10+ 2.34%
3) vkcube under Plasma
90 calls per second.
N Before After
0
1
2-3 100% 0% (Ie. all resolved to a single fence)
4-5
6-9
10+
In the case of vkcube all invocations in the 2-3 bucket were actually
just two input fences.
>From these numbers it looks like the heap sort should not be a
disadvantage, given how the dominant case is <= 2 input fences which heap
sort solves with just one compare and swap. (And for the case of one input
fence we have a fast path in the previous patch.)
A complementary possibility is to implement a different sorting algorithm
under the same API as the kernel's sort() and so keep the simplicity,
potentially moving the new sort under lib/ if it would be found more
widely useful.
v2:
* Hold on to fence references and reduce commentary. (Christian)
* Record and use latest signaled timestamp in the 2nd loop too.
* Consolidate zero or one fences fast paths.
v3:
* Reverse the seqno sort order for a simpler squashing pass. (Christian)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
Fixes: 245a4a7b531c ("dma-buf: generalize dma_fence unwrap & merging v3")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3617
Cc: Christian König <christian.koenig(a)amd.com>
Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: Gustavo Padovan <gustavo(a)padovan.org>
Cc: Friedrich Vock <friedrich.vock(a)gmx.de>
Cc: linux-media(a)vger.kernel.org
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: <stable(a)vger.kernel.org> # v6.0+
Reviewed-by: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Christian König <christian.koenig(a)amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241115102153.1980-3-tursuli…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/dma-buf/dma-fence-unwrap.c | 128 ++++++++++++++---------------
1 file changed, 61 insertions(+), 67 deletions(-)
diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
index b19d0adf6086..6345062731f1 100644
--- a/drivers/dma-buf/dma-fence-unwrap.c
+++ b/drivers/dma-buf/dma-fence-unwrap.c
@@ -12,6 +12,7 @@
#include <linux/dma-fence-chain.h>
#include <linux/dma-fence-unwrap.h>
#include <linux/slab.h>
+#include <linux/sort.h>
/* Internal helper to start new array iteration, don't use directly */
static struct dma_fence *
@@ -59,6 +60,25 @@ struct dma_fence *dma_fence_unwrap_next(struct dma_fence_unwrap *cursor)
}
EXPORT_SYMBOL_GPL(dma_fence_unwrap_next);
+
+static int fence_cmp(const void *_a, const void *_b)
+{
+ struct dma_fence *a = *(struct dma_fence **)_a;
+ struct dma_fence *b = *(struct dma_fence **)_b;
+
+ if (a->context < b->context)
+ return -1;
+ else if (a->context > b->context)
+ return 1;
+
+ if (dma_fence_is_later(b, a))
+ return 1;
+ else if (dma_fence_is_later(a, b))
+ return -1;
+
+ return 0;
+}
+
/* Implementation for the dma_fence_merge() marco, don't use directly */
struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
struct dma_fence **fences,
@@ -67,8 +87,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
struct dma_fence_array *result;
struct dma_fence *tmp, **array;
ktime_t timestamp;
- unsigned int i;
- size_t count;
+ int i, j, count;
count = 0;
timestamp = ns_to_ktime(0);
@@ -96,80 +115,55 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
if (!array)
return NULL;
- /*
- * This trashes the input fence array and uses it as position for the
- * following merge loop. This works because the dma_fence_merge()
- * wrapper macro is creating this temporary array on the stack together
- * with the iterators.
- */
- for (i = 0; i < num_fences; ++i)
- fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
-
count = 0;
- do {
- unsigned int sel;
-
-restart:
- tmp = NULL;
- for (i = 0; i < num_fences; ++i) {
- struct dma_fence *next;
-
- while (fences[i] && dma_fence_is_signaled(fences[i]))
- fences[i] = dma_fence_unwrap_next(&iter[i]);
-
- next = fences[i];
- if (!next)
- continue;
-
- /*
- * We can't guarantee that inpute fences are ordered by
- * context, but it is still quite likely when this
- * function is used multiple times. So attempt to order
- * the fences by context as we pass over them and merge
- * fences with the same context.
- */
- if (!tmp || tmp->context > next->context) {
- tmp = next;
- sel = i;
-
- } else if (tmp->context < next->context) {
- continue;
-
- } else if (dma_fence_is_later(tmp, next)) {
- fences[i] = dma_fence_unwrap_next(&iter[i]);
- goto restart;
+ for (i = 0; i < num_fences; ++i) {
+ dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
+ if (!dma_fence_is_signaled(tmp)) {
+ array[count++] = dma_fence_get(tmp);
} else {
- fences[sel] = dma_fence_unwrap_next(&iter[sel]);
- goto restart;
+ ktime_t t = dma_fence_timestamp(tmp);
+
+ if (ktime_after(t, timestamp))
+ timestamp = t;
}
}
-
- if (tmp) {
- array[count++] = dma_fence_get(tmp);
- fences[sel] = dma_fence_unwrap_next(&iter[sel]);
- }
- } while (tmp);
-
- if (count == 0) {
- tmp = dma_fence_allocate_private_stub(ktime_get());
- goto return_tmp;
}
- if (count == 1) {
- tmp = array[0];
- goto return_tmp;
- }
+ if (count == 0 || count == 1)
+ goto return_fastpath;
- result = dma_fence_array_create(count, array,
- dma_fence_context_alloc(1),
- 1, false);
- if (!result) {
- for (i = 0; i < count; i++)
+ sort(array, count, sizeof(*array), fence_cmp, NULL);
+
+ /*
+ * Only keep the most recent fence for each context.
+ */
+ j = 0;
+ for (i = 1; i < count; i++) {
+ if (array[i]->context == array[j]->context)
dma_fence_put(array[i]);
- tmp = NULL;
- goto return_tmp;
+ else
+ array[++j] = array[i];
}
- return &result->base;
+ count = ++j;
+
+ if (count > 1) {
+ result = dma_fence_array_create(count, array,
+ dma_fence_context_alloc(1),
+ 1, false);
+ if (!result) {
+ for (i = 0; i < count; i++)
+ dma_fence_put(array[i]);
+ tmp = NULL;
+ goto return_tmp;
+ }
+ return &result->base;
+ }
+
+return_fastpath:
+ if (count == 0)
+ tmp = dma_fence_allocate_private_stub(timestamp);
+ else
+ tmp = array[0];
return_tmp:
kfree(array);
--
2.47.1
6.6-stable review patch. If anyone has any objections, please let me know.
------------------
From: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
commit fe52c649438b8489c9456681d93a9b3de3d38263 upstream.
One alternative to the fix Christian proposed in
https://lore.kernel.org/dri-devel/20241024124159.4519-3-christian.koenig@am…
is to replace the rather complex open coded sorting loops with the kernel
standard sort followed by a context squashing pass.
Proposed advantage of this would be readability but one concern Christian
raised was that there could be many fences, that they are typically mostly
sorted, and so the kernel's heap sort would be much worse by the proposed
algorithm.
I had a look running some games and vkcube to see what are the typical
number of input fences. Tested scenarios:
1) Hogwarts Legacy under Gamescope
450 calls per second to __dma_fence_unwrap_merge.
Percentages per number of fences buckets, before and after checking for
signalled status, sorting and flattening:
N Before After
0 0.91%
1 69.40%
2-3 28.72% 9.4% (90.6% resolved to one fence)
4-5 0.93%
6-9 0.03%
10+
2) Cyberpunk 2077 under Gamescope
1050 calls per second, amounting to 0.01% CPU time according to perf top.
N Before After
0 1.13%
1 52.30%
2-3 40.34% 55.57%
4-5 1.46% 0.50%
6-9 2.44%
10+ 2.34%
3) vkcube under Plasma
90 calls per second.
N Before After
0
1
2-3 100% 0% (Ie. all resolved to a single fence)
4-5
6-9
10+
In the case of vkcube all invocations in the 2-3 bucket were actually
just two input fences.
>From these numbers it looks like the heap sort should not be a
disadvantage, given how the dominant case is <= 2 input fences which heap
sort solves with just one compare and swap. (And for the case of one input
fence we have a fast path in the previous patch.)
A complementary possibility is to implement a different sorting algorithm
under the same API as the kernel's sort() and so keep the simplicity,
potentially moving the new sort under lib/ if it would be found more
widely useful.
v2:
* Hold on to fence references and reduce commentary. (Christian)
* Record and use latest signaled timestamp in the 2nd loop too.
* Consolidate zero or one fences fast paths.
v3:
* Reverse the seqno sort order for a simpler squashing pass. (Christian)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
Fixes: 245a4a7b531c ("dma-buf: generalize dma_fence unwrap & merging v3")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3617
Cc: Christian König <christian.koenig(a)amd.com>
Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: Gustavo Padovan <gustavo(a)padovan.org>
Cc: Friedrich Vock <friedrich.vock(a)gmx.de>
Cc: linux-media(a)vger.kernel.org
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: <stable(a)vger.kernel.org> # v6.0+
Reviewed-by: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Christian König <christian.koenig(a)amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241115102153.1980-3-tursuli…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/dma-buf/dma-fence-unwrap.c | 126 +++++++++++++++++--------------------
1 file changed, 60 insertions(+), 66 deletions(-)
--- a/drivers/dma-buf/dma-fence-unwrap.c
+++ b/drivers/dma-buf/dma-fence-unwrap.c
@@ -12,6 +12,7 @@
#include <linux/dma-fence-chain.h>
#include <linux/dma-fence-unwrap.h>
#include <linux/slab.h>
+#include <linux/sort.h>
/* Internal helper to start new array iteration, don't use directly */
static struct dma_fence *
@@ -59,6 +60,25 @@ struct dma_fence *dma_fence_unwrap_next(
}
EXPORT_SYMBOL_GPL(dma_fence_unwrap_next);
+
+static int fence_cmp(const void *_a, const void *_b)
+{
+ struct dma_fence *a = *(struct dma_fence **)_a;
+ struct dma_fence *b = *(struct dma_fence **)_b;
+
+ if (a->context < b->context)
+ return -1;
+ else if (a->context > b->context)
+ return 1;
+
+ if (dma_fence_is_later(b, a))
+ return 1;
+ else if (dma_fence_is_later(a, b))
+ return -1;
+
+ return 0;
+}
+
/* Implementation for the dma_fence_merge() marco, don't use directly */
struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
struct dma_fence **fences,
@@ -67,8 +87,7 @@ struct dma_fence *__dma_fence_unwrap_mer
struct dma_fence_array *result;
struct dma_fence *tmp, **array;
ktime_t timestamp;
- unsigned int i;
- size_t count;
+ int i, j, count;
count = 0;
timestamp = ns_to_ktime(0);
@@ -96,80 +115,55 @@ struct dma_fence *__dma_fence_unwrap_mer
if (!array)
return NULL;
- /*
- * This trashes the input fence array and uses it as position for the
- * following merge loop. This works because the dma_fence_merge()
- * wrapper macro is creating this temporary array on the stack together
- * with the iterators.
- */
- for (i = 0; i < num_fences; ++i)
- fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
-
count = 0;
- do {
- unsigned int sel;
-
-restart:
- tmp = NULL;
- for (i = 0; i < num_fences; ++i) {
- struct dma_fence *next;
-
- while (fences[i] && dma_fence_is_signaled(fences[i]))
- fences[i] = dma_fence_unwrap_next(&iter[i]);
-
- next = fences[i];
- if (!next)
- continue;
-
- /*
- * We can't guarantee that inpute fences are ordered by
- * context, but it is still quite likely when this
- * function is used multiple times. So attempt to order
- * the fences by context as we pass over them and merge
- * fences with the same context.
- */
- if (!tmp || tmp->context > next->context) {
- tmp = next;
- sel = i;
-
- } else if (tmp->context < next->context) {
- continue;
-
- } else if (dma_fence_is_later(tmp, next)) {
- fences[i] = dma_fence_unwrap_next(&iter[i]);
- goto restart;
+ for (i = 0; i < num_fences; ++i) {
+ dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
+ if (!dma_fence_is_signaled(tmp)) {
+ array[count++] = dma_fence_get(tmp);
} else {
- fences[sel] = dma_fence_unwrap_next(&iter[sel]);
- goto restart;
+ ktime_t t = dma_fence_timestamp(tmp);
+
+ if (ktime_after(t, timestamp))
+ timestamp = t;
}
}
+ }
- if (tmp) {
- array[count++] = dma_fence_get(tmp);
- fences[sel] = dma_fence_unwrap_next(&iter[sel]);
- }
- } while (tmp);
+ if (count == 0 || count == 1)
+ goto return_fastpath;
- if (count == 0) {
- tmp = dma_fence_allocate_private_stub(ktime_get());
- goto return_tmp;
- }
+ sort(array, count, sizeof(*array), fence_cmp, NULL);
- if (count == 1) {
- tmp = array[0];
- goto return_tmp;
+ /*
+ * Only keep the most recent fence for each context.
+ */
+ j = 0;
+ for (i = 1; i < count; i++) {
+ if (array[i]->context == array[j]->context)
+ dma_fence_put(array[i]);
+ else
+ array[++j] = array[i];
}
+ count = ++j;
- result = dma_fence_array_create(count, array,
- dma_fence_context_alloc(1),
- 1, false);
- if (!result) {
- for (i = 0; i < count; i++)
- dma_fence_put(array[i]);
- tmp = NULL;
- goto return_tmp;
+ if (count > 1) {
+ result = dma_fence_array_create(count, array,
+ dma_fence_context_alloc(1),
+ 1, false);
+ if (!result) {
+ for (i = 0; i < count; i++)
+ dma_fence_put(array[i]);
+ tmp = NULL;
+ goto return_tmp;
+ }
+ return &result->base;
}
- return &result->base;
+
+return_fastpath:
+ if (count == 0)
+ tmp = dma_fence_allocate_private_stub(timestamp);
+ else
+ tmp = array[0];
return_tmp:
kfree(array);
6.12-stable review patch. If anyone has any objections, please let me know.
------------------
From: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
commit fe52c649438b8489c9456681d93a9b3de3d38263 upstream.
One alternative to the fix Christian proposed in
https://lore.kernel.org/dri-devel/20241024124159.4519-3-christian.koenig@am…
is to replace the rather complex open coded sorting loops with the kernel
standard sort followed by a context squashing pass.
Proposed advantage of this would be readability but one concern Christian
raised was that there could be many fences, that they are typically mostly
sorted, and so the kernel's heap sort would be much worse by the proposed
algorithm.
I had a look running some games and vkcube to see what are the typical
number of input fences. Tested scenarios:
1) Hogwarts Legacy under Gamescope
450 calls per second to __dma_fence_unwrap_merge.
Percentages per number of fences buckets, before and after checking for
signalled status, sorting and flattening:
N Before After
0 0.91%
1 69.40%
2-3 28.72% 9.4% (90.6% resolved to one fence)
4-5 0.93%
6-9 0.03%
10+
2) Cyberpunk 2077 under Gamescope
1050 calls per second, amounting to 0.01% CPU time according to perf top.
N Before After
0 1.13%
1 52.30%
2-3 40.34% 55.57%
4-5 1.46% 0.50%
6-9 2.44%
10+ 2.34%
3) vkcube under Plasma
90 calls per second.
N Before After
0
1
2-3 100% 0% (Ie. all resolved to a single fence)
4-5
6-9
10+
In the case of vkcube all invocations in the 2-3 bucket were actually
just two input fences.
>From these numbers it looks like the heap sort should not be a
disadvantage, given how the dominant case is <= 2 input fences which heap
sort solves with just one compare and swap. (And for the case of one input
fence we have a fast path in the previous patch.)
A complementary possibility is to implement a different sorting algorithm
under the same API as the kernel's sort() and so keep the simplicity,
potentially moving the new sort under lib/ if it would be found more
widely useful.
v2:
* Hold on to fence references and reduce commentary. (Christian)
* Record and use latest signaled timestamp in the 2nd loop too.
* Consolidate zero or one fences fast paths.
v3:
* Reverse the seqno sort order for a simpler squashing pass. (Christian)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
Fixes: 245a4a7b531c ("dma-buf: generalize dma_fence unwrap & merging v3")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3617
Cc: Christian König <christian.koenig(a)amd.com>
Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: Gustavo Padovan <gustavo(a)padovan.org>
Cc: Friedrich Vock <friedrich.vock(a)gmx.de>
Cc: linux-media(a)vger.kernel.org
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: <stable(a)vger.kernel.org> # v6.0+
Reviewed-by: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Christian König <christian.koenig(a)amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241115102153.1980-3-tursuli…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/dma-buf/dma-fence-unwrap.c | 126 +++++++++++++++++--------------------
1 file changed, 60 insertions(+), 66 deletions(-)
--- a/drivers/dma-buf/dma-fence-unwrap.c
+++ b/drivers/dma-buf/dma-fence-unwrap.c
@@ -12,6 +12,7 @@
#include <linux/dma-fence-chain.h>
#include <linux/dma-fence-unwrap.h>
#include <linux/slab.h>
+#include <linux/sort.h>
/* Internal helper to start new array iteration, don't use directly */
static struct dma_fence *
@@ -59,6 +60,25 @@ struct dma_fence *dma_fence_unwrap_next(
}
EXPORT_SYMBOL_GPL(dma_fence_unwrap_next);
+
+static int fence_cmp(const void *_a, const void *_b)
+{
+ struct dma_fence *a = *(struct dma_fence **)_a;
+ struct dma_fence *b = *(struct dma_fence **)_b;
+
+ if (a->context < b->context)
+ return -1;
+ else if (a->context > b->context)
+ return 1;
+
+ if (dma_fence_is_later(b, a))
+ return 1;
+ else if (dma_fence_is_later(a, b))
+ return -1;
+
+ return 0;
+}
+
/* Implementation for the dma_fence_merge() marco, don't use directly */
struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
struct dma_fence **fences,
@@ -67,8 +87,7 @@ struct dma_fence *__dma_fence_unwrap_mer
struct dma_fence_array *result;
struct dma_fence *tmp, **array;
ktime_t timestamp;
- unsigned int i;
- size_t count;
+ int i, j, count;
count = 0;
timestamp = ns_to_ktime(0);
@@ -96,80 +115,55 @@ struct dma_fence *__dma_fence_unwrap_mer
if (!array)
return NULL;
- /*
- * This trashes the input fence array and uses it as position for the
- * following merge loop. This works because the dma_fence_merge()
- * wrapper macro is creating this temporary array on the stack together
- * with the iterators.
- */
- for (i = 0; i < num_fences; ++i)
- fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
-
count = 0;
- do {
- unsigned int sel;
-
-restart:
- tmp = NULL;
- for (i = 0; i < num_fences; ++i) {
- struct dma_fence *next;
-
- while (fences[i] && dma_fence_is_signaled(fences[i]))
- fences[i] = dma_fence_unwrap_next(&iter[i]);
-
- next = fences[i];
- if (!next)
- continue;
-
- /*
- * We can't guarantee that inpute fences are ordered by
- * context, but it is still quite likely when this
- * function is used multiple times. So attempt to order
- * the fences by context as we pass over them and merge
- * fences with the same context.
- */
- if (!tmp || tmp->context > next->context) {
- tmp = next;
- sel = i;
-
- } else if (tmp->context < next->context) {
- continue;
-
- } else if (dma_fence_is_later(tmp, next)) {
- fences[i] = dma_fence_unwrap_next(&iter[i]);
- goto restart;
+ for (i = 0; i < num_fences; ++i) {
+ dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
+ if (!dma_fence_is_signaled(tmp)) {
+ array[count++] = dma_fence_get(tmp);
} else {
- fences[sel] = dma_fence_unwrap_next(&iter[sel]);
- goto restart;
+ ktime_t t = dma_fence_timestamp(tmp);
+
+ if (ktime_after(t, timestamp))
+ timestamp = t;
}
}
+ }
- if (tmp) {
- array[count++] = dma_fence_get(tmp);
- fences[sel] = dma_fence_unwrap_next(&iter[sel]);
- }
- } while (tmp);
+ if (count == 0 || count == 1)
+ goto return_fastpath;
- if (count == 0) {
- tmp = dma_fence_allocate_private_stub(ktime_get());
- goto return_tmp;
- }
+ sort(array, count, sizeof(*array), fence_cmp, NULL);
- if (count == 1) {
- tmp = array[0];
- goto return_tmp;
+ /*
+ * Only keep the most recent fence for each context.
+ */
+ j = 0;
+ for (i = 1; i < count; i++) {
+ if (array[i]->context == array[j]->context)
+ dma_fence_put(array[i]);
+ else
+ array[++j] = array[i];
}
+ count = ++j;
- result = dma_fence_array_create(count, array,
- dma_fence_context_alloc(1),
- 1, false);
- if (!result) {
- for (i = 0; i < count; i++)
- dma_fence_put(array[i]);
- tmp = NULL;
- goto return_tmp;
+ if (count > 1) {
+ result = dma_fence_array_create(count, array,
+ dma_fence_context_alloc(1),
+ 1, false);
+ if (!result) {
+ for (i = 0; i < count; i++)
+ dma_fence_put(array[i]);
+ tmp = NULL;
+ goto return_tmp;
+ }
+ return &result->base;
}
- return &result->base;
+
+return_fastpath:
+ if (count == 0)
+ tmp = dma_fence_allocate_private_stub(timestamp);
+ else
+ tmp = array[0];
return_tmp:
kfree(array);