On Fri, May 16, 2025 at 02:19:45PM +0800, Xu Yilun wrote:
> > I don't know why you'd disable a viommu while the VM is running,
> > doesn't make sense.
>
> Here it means remove the CC setup for viommu, shared setup is still
> kept.
That might make sense for the vPCI function, but not the vIOMMU. A
secure vIOMMU needs to be running at all times while the guest is
running. Perhaps it has no devices it can be used with, but its
functionality has to be there because a driver in the VM will be
connected to it.
At most "bind" should only tell the already existing secure vIOMMU
that it is allowed to translate for a specific vPCI function.
Jason
On 5/16/25 11:49, wangtao wrote:
>>>> Please try using udmabuf with sendfile() as confirmed to be working by T.J.
>>> [wangtao] Using buffer IO with dmabuf file read/write requires one memory copy.
>>> Direct IO removes this copy to enable zero-copy. The sendfile system
>>> call reduces memory copies from two (read/write) to one. However, with
>>> udmabuf, sendfile still keeps at least one copy, failing zero-copy.
>>
>>
>> Then please work on fixing this.
> [wangtao] What needs fixing? Does sendfile achieve zero-copy?
> sendfile reduces memory copies (from 2 to 1) for network sockets,
> but still requires one copy and cannot achieve zero copies.
Well, why not? sendfile() is the designated Linux uAPI for moving data between two files; maybe splice() is also appropriate.
The memory file descriptor and your destination file are both files, so those uAPIs apply.
Now what you suggest is to add a new IOCTL to do this in a very specific manner just for the system DMA-buf heap, and as far as I can see that is in general a complete no-go.
I mean, I understand why you do this. Instead of improving the existing functionality you're just hacking something together because it is simple for you.
It might be possible to implement that generically for DMA-buf heaps if the udmabuf allocation overhead can't be reduced, but that is then just the second step.
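A minimal userspace sketch of that path (assuming in_fd is the source
file and mem_fd is the memfd backing the udmabuf; error handling
elided):

  #include <sys/sendfile.h>

  /* Move len bytes from the source file straight into the udmabuf's
   * backing memory, skipping the userspace bounce buffer. */
  off_t off = 0;
  ssize_t n = sendfile(mem_fd, in_fd, &off, len);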
Regards,
Christian.
Hi!
I previously discussed this with Simona on IRC but would like to get
some feedback also from a wider audience:
We're planning to share dma-bufs using a fast interconnect in a way
similar to pcie-p2p:
The rough plan is to identify dma-bufs capable of sharing this way by
looking at the address of either the dma-buf ops and / or the
importer_ops to conclude it's a device using the same driver (or
possibly child driver) and then take special action when the dma-
addresses are obtained. Nothing visible outside of the xe driver or its
child driver.
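For illustration, the check could be as small as this (a sketch with a
made-up xe_dmabuf_ops symbol standing in for the driver's exporter
ops):

  static bool xe_is_same_driver_buf(struct dma_buf *dmabuf)
  {
          /* Same ops pointer implies the exporter is this driver (or a
           * child of it), so interconnect addresses can be used. */
          return dmabuf->ops == &xe_dmabuf_ops;
  }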
Are there any absolute "DON'T"s or recommendations to keep in mind WRT
this approach?
Thanks,
Thomas
On Fri, May 16, 2025 at 02:02:29AM +0800, Xu Yilun wrote:
> > IMHO, I think it might be helpful that you can picture out what are the
> > minimum requirements (function/life cycle) to the current IOMMUFD TSM
> > bind architecture:
> >
> > 1.host tsm_bind (preparation) is in IOMMUFD, triggered by QEMU handling
> > the TVM-HOST call.
> > 2. TDI acceptance is handled in guest_request() to accept the TDI after
> > the validation in the TVM.
>
> I'll try my best to brainstorm and make a flow in ASCII.
>
> (*) means new feature
>
>
> Guest Guest TSM QEMU VFIO IOMMUFD host TSM KVM
> ----- --------- ---- ---- ------- -------- ---
> 1. *Connect(IDE)
> 2. Init vdev
open /dev/vfio/XX as a VFIO action
Then VFIO attaches to IOMMUFD as an iommufd action creating the idev
> 3. *create dmabuf
> 4. *export dmabuf
> 5. create memslot
> 6. *import dmabuf
> 7. setup shared DMA
> 8. create hwpt
> 9. attach hwpt
> 10. kvm run
> 11.enum shared dev
> 12.*Connect(Bind)
> 13. *GHCI Bind
> 14. *Bind
> 15. CC viommu alloc
> 16. vdevice alloc
viommu and vdevice creation happen before KVM run. The vPCI function
is visible to the guest from the very start, even though it is in T=0
mode. If a platform does not require any special CC steps prior to KVM
run then it just has a NOP for these functions.
What you have here is some new BIND operation against the already
existing vdevice as we discussed earlier.
> 16. *attach vdev
> 17. *setup CC viommu
> 18. *tsm_bind
> 19. *bind
> 20.*Attest
> 21. *GHCI get CC info
> 22. *get CC info
> 23. *vdev guest req
> 24. *guest req
> 25.*Accept
> 26. *GHCI accept MMIO/DMA
> 27. *accept MMIO/DMA
> 28. *vdev guest req
> 29. *guest req
> 30. *map private MMIO
> 31. *GHCI start tdi
> 32. *start tdi
> 33. *vdev guest req
> 34. *guest req
This seems reasonable. You want some generic RPC scheme to carry
messages from the VM to the TSM, tunneled through the iommufd vdevice
(because the vdevice has the vPCI ID, the KVM ID, the vIOMMU ID and so
on).
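Purely as an illustration of the shape such an RPC could take (all of
these names are hypothetical; nothing like this exists yet):

  struct iommu_vdevice_guest_req {
          __u32 size;
          __u32 vdevice_id;   /* resolves the vPCI ID, KVM and vIOMMU */
          __u64 req_uptr;     /* opaque TSM request blob from the VM */
          __u64 resp_uptr;    /* opaque TSM response blob back to the VM */
  };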
> 35.Workload...
> 36.*disconnect(Unbind)
> 37. *GHCI unbind
> 38. *Unbind
> 39. *detach vdev
unbind vdev. vdev remains until kvm is stopped.
> 40. *tsm_unbind
> 41. *TDX stop tdi
> 42. *TDX disable mmio cb
> 43. *cb dmabuf revoke
> 44. *unmap private MMIO
> 45. *TDX disable dma cb
> 46. *cb disable CC viommu
I don't know why you'd disable a viommu while the VM is running,
doesn't make sense.
> 47. *TDX tdi free
> 48. *enable mmio
> 49. *cb dmabuf recover
> 50.workable shared dev
This is a nice chart; it would be good to see a comparable chart for
AMD and ARM.
Jason
On Fri, May 16, 2025 at 12:04:04AM +0800, Xu Yilun wrote:
> > arches this was mostly invisible to the hypervisor?
>
> Attest & Accept can be invisible to hypervisor, or host just help pass
> data blobs between guest, firmware & device.
>
> Bind cannot be host agnostic, host should be aware not to touch device
> after Bind.
I'm not sure this is fully true; this could be an Intel thing. When the
vPCI is created the host can already know it shouldn't touch the PCI
device anymore and the secure world would enforce that when it gets a
bind command.
The fact it hasn't been locked out immediately at vPCI creation time
is sort of a detail that doesn't matter, IMHO.
> > It might be reasonable to have VFIO reach into iommufd to do that on
> > an already existing iommufd VDEVICE object. A little weird, but we
> > could probably make that work.
>
> Mm, Are you proposing an uAPI in VFIO, and a kAPI from VFIO -> IOMMUFD like:
>
> ioctl(vfio_fd, VFIO_DEVICE_ATTACH_VDEV, vdev_id)
> -> iommufd_device_attach_vdev()
> -> tsm_tdi_bind()
Not ATTACH, you wanted BIND. You could have a VFIO_DEVICE_BIND(iommufd
vdevice id)
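Hypothetically, something like this (made-up names, only to show the
shape of a VFIO uAPI driving an iommufd kAPI):

  struct vfio_device_bind_vdev {
          __u32 argsz;
          __u32 flags;
          __u32 vdevice_id;   /* already existing iommufd VDEVICE */
  };

  ioctl(vfio_fd, VFIO_DEVICE_BIND, &bind)
    -> iommufd_vdevice_bind()    /* kAPI from VFIO into iommufd */
       -> tsm_tdi_bind()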
> > sees VFIO remove the S-EPT and release the KVM, then have iommufd
> > destroy the VDEVICE object.
>
> Regarding VM destroy, TDX Connect has more enforcement, VM could only be
> destroyed after all assigned CC vPCI devices are destroyed.
And KVM destroys the VM?
> Nowadays, VFIO already holds KVM reference, so we need
>
> close(vfio_fd)
> -> iommufd_device_detach_vdev()
This doesn't happen, though; it destroys the normal device (idev) which
the vdevice is stacked on top of. You'd have to make normal device
destruction trigger vdevice destruction.
> -> tsm_tdi_unbind()
> -> tdi stop
> -> callback to VFIO, dmabuf_move_notify(revoke)
> -> KVM unmap MMIO
> -> tdi metadata remove
This omits the viommu. It won't get destroyed until the iommufd
closes, so iommufd will be holding the kvm and it will do the final
put.
Jason
On 5/15/25 16:03, wangtao wrote:
> [wangtao] My Test Configuration (CPU 1GHz, 5-test average):
> Allocation: 32x32MB buffer creation
> - dmabuf 53ms vs. udmabuf 694ms (10X slower)
> - Note: shmem shows excessive allocation time
Yeah, that is something already noted by others as well. But that is orthogonal.
>
> Read 1024MB File:
> - dmabuf direct 326ms vs. udmabuf direct 461ms (40% slower)
> - Note: pin_user_pages_fast consumes majority CPU cycles
>
> Key function call timing: See details below.
Those aren't valid, you are comparing different functionalities here.
Please try using udmabuf with sendfile() as confirmed to be working by T.J.
Regards,
Christian.
On Wed, May 14, 2025 at 2:00 PM Song Liu <song(a)kernel.org> wrote:
>
> On Tue, May 13, 2025 at 9:36 AM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > Use the same test buffers as the traditional iterator and a new BPF map
> > to verify the test buffers can be found with the open coded dmabuf
> > iterator.
> >
> > Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
> > Acked-by: Christian König <christian.koenig(a)amd.com>
> > Acked-by: Song Liu <song(a)kernel.org>
> > ---
> > .../testing/selftests/bpf/bpf_experimental.h | 5 +++
> > .../selftests/bpf/prog_tests/dmabuf_iter.c | 41 +++++++++++++++++++
> > .../testing/selftests/bpf/progs/dmabuf_iter.c | 38 +++++++++++++++++
> > 3 files changed, 84 insertions(+)
> >
> > diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
> > index 6535c8ae3c46..5e512a1d09d1 100644
> > --- a/tools/testing/selftests/bpf/bpf_experimental.h
> > +++ b/tools/testing/selftests/bpf/bpf_experimental.h
> > @@ -591,4 +591,9 @@ extern int bpf_iter_kmem_cache_new(struct bpf_iter_kmem_cache *it) __weak __ksym
> > extern struct kmem_cache *bpf_iter_kmem_cache_next(struct bpf_iter_kmem_cache *it) __weak __ksym;
> > extern void bpf_iter_kmem_cache_destroy(struct bpf_iter_kmem_cache *it) __weak __ksym;
> >
> > +struct bpf_iter_dmabuf;
> > +extern int bpf_iter_dmabuf_new(struct bpf_iter_dmabuf *it) __weak __ksym;
> > +extern struct dma_buf *bpf_iter_dmabuf_next(struct bpf_iter_dmabuf *it) __weak __ksym;
> > +extern void bpf_iter_dmabuf_destroy(struct bpf_iter_dmabuf *it) __weak __ksym;
> > +
> > #endif
> > diff --git a/tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c b/tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
> > index dc740bd0e2bd..6c2b0c3dbcd8 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
> > @@ -219,14 +219,52 @@ static void subtest_dmabuf_iter_check_default_iter(struct dmabuf_iter *skel)
> > close(iter_fd);
> > }
> >
> > +static void subtest_dmabuf_iter_check_open_coded(struct dmabuf_iter *skel, int map_fd)
> > +{
> > + LIBBPF_OPTS(bpf_test_run_opts, topts);
> > + char key[DMA_BUF_NAME_LEN];
> > + int err, fd;
> > + bool found;
> > +
> > + /* No need to attach it, just run it directly */
> > + fd = bpf_program__fd(skel->progs.iter_dmabuf_for_each);
> > +
> > + err = bpf_prog_test_run_opts(fd, &topts);
> > + if (!ASSERT_OK(err, "test_run_opts err"))
> > + return;
> > + if (!ASSERT_OK(topts.retval, "test_run_opts retval"))
> > + return;
> > +
> > + if (!ASSERT_OK(bpf_map_get_next_key(map_fd, NULL, key), "get next key"))
> > + return;
> > +
> > + do {
> > + ASSERT_OK(bpf_map_lookup_elem(map_fd, key, &found), "lookup");
> > + ASSERT_TRUE(found, "found test buffer");
>
> This check failed once in the CI, on s390:
>
> Error: #89/3 dmabuf_iter/open_coded
> 9309 subtest_dmabuf_iter_check_open_coded:PASS:test_run_opts err 0 nsec
> 9310 subtest_dmabuf_iter_check_open_coded:PASS:test_run_opts retval 0 nsec
> 9311 subtest_dmabuf_iter_check_open_coded:PASS:get next key 0 nsec
> 9312 subtest_dmabuf_iter_check_open_coded:PASS:lookup 0 nsec
> 9313 subtest_dmabuf_iter_check_open_coded:FAIL:found test buffer
> unexpected found test buffer: got FALSE
>
> But it passed in the rerun. It is probably a bit flakey. Maybe we need some
> barrier somewhere.
>
> Here is the failure:
>
> https://github.com/kernel-patches/bpf/actions/runs/15002058808/job/42234864…
>
> To see the log, you need to log in GitHub.
>
> Thanks,
> Song
Thanks, yeah I have been trying to run this locally today but still
working on setting up an environment for it. Daniel Xu thoughtfully
suggested I use a github PR to trigger CI, but I tried that last week
without success: https://github.com/kernel-patches/bpf/pull/8910
I'm not sure if this is the cause (doesn't show up on the runs that
pass) but I have no idea why that would be intermittently failing:
libbpf: Error in bpf_create_map_xattr(testbuf_hash): -EINVAL. Retrying
without BTF.
> > + } while (bpf_map_get_next_key(map_fd, key, key));
> > +}
>
> [...]
On Wed, May 14, 2025 at 1:53 PM Song Liu <song(a)kernel.org> wrote:
>
> On Tue, May 13, 2025 at 9:36 AM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > This test creates a udmabuf, and a dmabuf from the system dmabuf heap,
> > and uses a BPF program that prints dmabuf metadata with the new
> > dmabuf_iter to verify they can be found.
> >
> > Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
> > Acked-by: Christian König <christian.koenig(a)amd.com>
>
> Acked-by: Song Liu <song(a)kernel.org>
Thanks.
>
> With one more comment below.
>
> [...]
>
> > diff --git a/tools/testing/selftests/bpf/progs/dmabuf_iter.c b/tools/testing/selftests/bpf/progs/dmabuf_iter.c
> > new file mode 100644
> > index 000000000000..2a1b5397196d
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/dmabuf_iter.c
> > @@ -0,0 +1,53 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright (c) 2025 Google LLC */
> > +#include <vmlinux.h>
> > +#include <bpf/bpf_core_read.h>
> > +#include <bpf/bpf_helpers.h>
> > +
> > +/* From uapi/linux/dma-buf.h */
> > +#define DMA_BUF_NAME_LEN 32
> > +
> > +char _license[] SEC("license") = "GPL";
> > +
> > +/*
> > + * Fields output by this iterator are delimited by newlines. Convert any
> > + * newlines in user-provided printed strings to spaces.
> > + */
> > +static void sanitize_string(char *src, size_t size)
> > +{
> > + for (char *c = src; *c && (size_t)(c - src) < size; ++c)
>
> We should do the size check first, right? IOW:
>
> for (char *c = src; (size_t)(c - src) < size && *c; ++c)
Yeah if you call the function with size = 0, which is kinda
questionable and not possible with the non-zero array size that is
tied to immutable UAPI. Let's change it like you suggest.
>
> > + if (*c == '\n')
> > + *c = ' ';
> > +}
> > +
> [...]
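For reference, the agreed fix is just the quoted function with the
bound check moved in front of the dereference:

  static void sanitize_string(char *src, size_t size)
  {
          /* Check the bound before reading *c so size == 0 is safe. */
          for (char *c = src; (size_t)(c - src) < size && *c; ++c)
                  if (*c == '\n')
                          *c = ' ';
  }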
From: Rob Clark <robdclark(a)chromium.org>
Conversion to DRM GPU VA Manager[1], and adding support for Vulkan Sparse
Memory[2] in the form of:
1. A new VM_BIND submitqueue type for executing VM MSM_SUBMIT_BO_OP_MAP/
MAP_NULL/UNMAP commands
2. A new VM_BIND ioctl to allow submitting batches of one or more
MAP/MAP_NULL/UNMAP commands to a VM_BIND submitqueue
I did not implement support for synchronous VM_BIND commands. Since
userspace could just immediately wait for the `SUBMIT` to complete, I don't
think we need this extra complexity in the kernel. Synchronous/immediate
VM_BIND operations could be implemented with a 2nd VM_BIND submitqueue.
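For example, userspace can get synchronous semantics by attaching an
out-fence syncobj to the VM_BIND submit and blocking on it (a sketch
using libdrm's drmSyncobjWait(); the submit itself is elided since its
uAPI is exactly what this series defines):

  /* Block until the VM_BIND batch signals its out-fence syncobj. */
  drmSyncobjWait(drm_fd, &out_syncobj, 1, INT64_MAX,
                 DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT, NULL);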
The corresponding mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32533
Changes in v4:
- Various locking/etc fixes
- Optimize the pgtable preallocation. If userspace sorts the VM_BIND ops
  then the kernel detects ops that fall into the same 2MB last-level PTD
  to avoid duplicate page preallocation (see the sketch after this list).
- Add way to throttle pushing jobs to the scheduler, to cap the amount of
potentially temporary prealloc'd pgtable pages.
- Add vm_log to devcoredump for debugging. If the vm_log_shift module
param is set, keep a log of the last 1<<vm_log_shift VM updates for
easier debugging of faults/crashes.
- Link to v3: https://lore.kernel.org/all/20250428205619.227835-1-robdclark@gmail.com/
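The dedup test from the v4 notes boils down to comparing the 2MB bucket
of consecutive, sorted ops (a sketch assuming a 4KB granule, where one
last-level table covers 1 << 21 bytes of VA):

  static bool same_last_level_ptd(u64 iova_a, u64 iova_b)
  {
          /* Ops landing in the same 2MB range share one last-level
           * PTD, so only the first needs a page preallocated. */
          return (iova_a >> 21) == (iova_b >> 21);
  }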
Changes in v3:
- Switched to separate VM_BIND ioctl. This makes the UABI a bit
  cleaner, but OTOH the userspace code was cleaner when the end result
  of either type of VkQueue led to the same ioctl. So I'm a bit on
  the fence.
- Switched to doing the gpuvm bookkeeping synchronously, and only
deferring the pgtable updates. This avoids needing to hold any resv
locks in the fence signaling path, resolving the last shrinker related
lockdep complaints. OTOH it means userspace can trigger invalid
pgtable updates with multiple VM_BIND queues. In this case, we ensure
that unmaps happen completely (to prevent userspace from using this to
access free'd pages), mark the context as unusable, and move on with
life.
- Link to v2: https://lore.kernel.org/all/20250319145425.51935-1-robdclark@gmail.com/
Changes in v2:
- Dropped Bibek Kumar Patro's arm-smmu patches[3], which have since been
merged.
- Pre-allocate all the things, and drop HACK patch which disabled shrinker.
This includes ensuring that vm_bo objects are allocated up front, pre-
allocating VMA objects, and pre-allocating pages used for pgtable updates.
The latter utilizes io_pgtable_cfg callbacks for pgtable alloc/free, that
were initially added for panthor.
- Add back support for BO dumping for devcoredump.
- Link to v1 (RFC): https://lore.kernel.org/dri-devel/20241207161651.410556-1-robdclark@gmail.c…
[1] https://www.kernel.org/doc/html/next/gpu/drm-mm.html#drm-gpuvm
[2] https://docs.vulkan.org/spec/latest/chapters/sparsemem.html
[3] https://patchwork.kernel.org/project/linux-arm-kernel/list/?series=909700
Rob Clark (40):
drm/gpuvm: Don't require obj lock in destructor path
drm/gpuvm: Allow VAs to hold soft reference to BOs
drm/gem: Add ww_acquire_ctx support to drm_gem_lru_scan()
drm/sched: Add enqueue credit limit
iommu/io-pgtable-arm: Add quirk to quiet WARN_ON()
drm/msm: Rename msm_file_private -> msm_context
drm/msm: Improve msm_context comments
drm/msm: Rename msm_gem_address_space -> msm_gem_vm
drm/msm: Remove vram carveout support
drm/msm: Collapse vma allocation and initialization
drm/msm: Collapse vma close and delete
drm/msm: Don't close VMAs on purge
drm/msm: drm_gpuvm conversion
drm/msm: Convert vm locking
drm/msm: Use drm_gpuvm types more
drm/msm: Split out helper to get iommu prot flags
drm/msm: Add mmu support for non-zero offset
drm/msm: Add PRR support
drm/msm: Rename msm_gem_vma_purge() -> _unmap()
drm/msm: Drop queued submits on lastclose()
drm/msm: Lazily create context VM
drm/msm: Add opt-in for VM_BIND
drm/msm: Mark VM as unusable on GPU hangs
drm/msm: Add _NO_SHARE flag
drm/msm: Crashdump prep for sparse mappings
drm/msm: rd dumping prep for sparse mappings
drm/msm: Crashdec support for sparse
drm/msm: rd dumping support for sparse
drm/msm: Extract out syncobj helpers
drm/msm: Use DMA_RESV_USAGE_BOOKKEEP/KERNEL
drm/msm: Add VM_BIND submitqueue
drm/msm: Support IO_PGTABLE_QUIRK_NO_WARN_ON
drm/msm: Support pgtable preallocation
drm/msm: Split out map/unmap ops
drm/msm: Add VM_BIND ioctl
drm/msm: Add VM logging for VM_BIND updates
drm/msm: Add VMA unmap reason
drm/msm: Add mmu prealloc tracepoint
drm/msm: use trylock for debugfs
drm/msm: Bump UAPI version
drivers/gpu/drm/drm_gem.c | 14 +-
drivers/gpu/drm/drm_gpuvm.c | 15 +-
drivers/gpu/drm/msm/Kconfig | 1 +
drivers/gpu/drm/msm/Makefile | 1 +
drivers/gpu/drm/msm/adreno/a2xx_gpu.c | 25 +-
drivers/gpu/drm/msm/adreno/a2xx_gpummu.c | 5 +-
drivers/gpu/drm/msm/adreno/a3xx_gpu.c | 17 +-
drivers/gpu/drm/msm/adreno/a4xx_gpu.c | 17 +-
drivers/gpu/drm/msm/adreno/a5xx_debugfs.c | 4 +-
drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 22 +-
drivers/gpu/drm/msm/adreno/a5xx_power.c | 2 +-
drivers/gpu/drm/msm/adreno/a5xx_preempt.c | 10 +-
drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 32 +-
drivers/gpu/drm/msm/adreno/a6xx_gmu.h | 2 +-
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 49 +-
drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 6 +-
drivers/gpu/drm/msm/adreno/a6xx_preempt.c | 10 +-
drivers/gpu/drm/msm/adreno/adreno_device.c | 4 -
drivers/gpu/drm/msm/adreno/adreno_gpu.c | 99 +-
drivers/gpu/drm/msm/adreno/adreno_gpu.h | 23 +-
.../drm/msm/disp/dpu1/dpu_encoder_phys_wb.c | 14 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_formats.c | 18 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_formats.h | 2 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_kms.c | 18 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_plane.c | 14 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_plane.h | 4 +-
drivers/gpu/drm/msm/disp/mdp4/mdp4_crtc.c | 6 +-
drivers/gpu/drm/msm/disp/mdp4/mdp4_kms.c | 28 +-
drivers/gpu/drm/msm/disp/mdp4/mdp4_plane.c | 12 +-
drivers/gpu/drm/msm/disp/mdp5/mdp5_crtc.c | 4 +-
drivers/gpu/drm/msm/disp/mdp5/mdp5_kms.c | 19 +-
drivers/gpu/drm/msm/disp/mdp5/mdp5_plane.c | 12 +-
drivers/gpu/drm/msm/dsi/dsi_host.c | 14 +-
drivers/gpu/drm/msm/msm_drv.c | 184 +--
drivers/gpu/drm/msm/msm_drv.h | 35 +-
drivers/gpu/drm/msm/msm_fb.c | 18 +-
drivers/gpu/drm/msm/msm_fbdev.c | 2 +-
drivers/gpu/drm/msm/msm_gem.c | 494 +++---
drivers/gpu/drm/msm/msm_gem.h | 247 ++-
drivers/gpu/drm/msm/msm_gem_prime.c | 15 +
drivers/gpu/drm/msm/msm_gem_shrinker.c | 104 +-
drivers/gpu/drm/msm/msm_gem_submit.c | 295 ++--
drivers/gpu/drm/msm/msm_gem_vma.c | 1471 ++++++++++++++++-
drivers/gpu/drm/msm/msm_gpu.c | 214 ++-
drivers/gpu/drm/msm/msm_gpu.h | 144 +-
drivers/gpu/drm/msm/msm_gpu_trace.h | 14 +
drivers/gpu/drm/msm/msm_iommu.c | 302 +++-
drivers/gpu/drm/msm/msm_kms.c | 18 +-
drivers/gpu/drm/msm/msm_kms.h | 2 +-
drivers/gpu/drm/msm/msm_mmu.h | 38 +-
drivers/gpu/drm/msm/msm_rd.c | 62 +-
drivers/gpu/drm/msm/msm_ringbuffer.c | 10 +-
drivers/gpu/drm/msm/msm_submitqueue.c | 96 +-
drivers/gpu/drm/msm/msm_syncobj.c | 172 ++
drivers/gpu/drm/msm/msm_syncobj.h | 37 +
drivers/gpu/drm/scheduler/sched_entity.c | 16 +-
drivers/gpu/drm/scheduler/sched_main.c | 3 +
drivers/iommu/io-pgtable-arm.c | 27 +-
include/drm/drm_gem.h | 10 +-
include/drm/drm_gpuvm.h | 12 +-
include/drm/gpu_scheduler.h | 13 +-
include/linux/io-pgtable.h | 8 +
include/uapi/drm/msm_drm.h | 149 +-
63 files changed, 3484 insertions(+), 1251 deletions(-)
create mode 100644 drivers/gpu/drm/msm/msm_syncobj.c
create mode 100644 drivers/gpu/drm/msm/msm_syncobj.h
--
2.49.0
On Wed, May 14, 2025 at 03:02:53PM +0800, Xu Yilun wrote:
> > We have an awkward fit for what CCA people are doing to the various
> > Linux APIs. Looking somewhat maximally across all the arches a "bind"
> > for a CC vPCI device creation operation does:
> >
> > - Setup the CPU page tables for the VM to have access to the MMIO
>
> This is guest side thing, is it? Anything host need to opt-in?
CPU hypervisor page tables.
> > - Revoke hypervisor access to the MMIO
>
> VFIO could choose never to mmap MMIO, so in this case nothing to do?
Yes, if you do it that way.
> > - Setup the vIOMMU to understand the vPCI device
> > - Take over control of some of the IOVA translation, at least for T=1,
> > and route to the vIOMMU
> > - Register the vPCI with any attestation functions the VM might use
> > - Do some DOE stuff to manage/validate TDSIP/etc
>
> Intel TDX Connect has an extra requirement for "unbind":
>
> - Revoke KVM page table (S-EPT) for the MMIO only after TDISP
> CONFIG_UNLOCK
Maybe you could express this as the S-EPT always has the MMIO mapped
into it as long as the vPCI function is installed to the VM? Is KVM
responsible for the S-EPT?
> Another thing is, seems your term "bind" includes all steps for
> shared -> private conversion.
Well, I was talking about vPCI creation. I understand that during the
vPCI lifecycle the VM will do "bind" "unbind" which are more or less
switching the device into a T=1 mode. Though I understood on some
arches this was mostly invisible to the hypervisor?
> But in my mind, "bind" only includes
> putting device in TDISP LOCK state & corresponding host setups required
> by firmware. I.e. "bind" means the host locks down the CC setup, waiting for
> guest attestation.
So we will need to have some other API for this that modifies the vPCI
object.
It might be reasonable to have VFIO reach into iommufd to do that on
an already existing iommufd VDEVICE object. A little weird, but we
could probably make that work.
But you have some weird ordering issues here: if the S-EPT has to have
the VFIO MMIO, then you have to have a close() destruction order that
sees VFIO remove the S-EPT and release the KVM, then have iommufd
destroy the VDEVICE object.
> > It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
> > stays in VFIO.
>
> I'm not sure if Alexey's patch [1] illustrates your idea. It calls
> tsm_tdi_bind() which directly does device stuff, and impacts MMIO.
> VFIO doesn't know about this.
>
> I have to interpret this as VFIO firstly hand over device CC features
> and MMIO resources to IOMMUFD, so VFIO never cares about them.
>
> [1] https://lore.kernel.org/all/20250218111017.491719-15-aik@amd.com/
There is also the PCI layer involved here and maybe PCI should be
participating in managing some of this. Like it makes a bit of sense
that PCI would block the FLR on platforms that require this?
Jason
On Wed, May 14, 2025 at 7:58 AM Tvrtko Ursulin
<tvrtko.ursulin(a)igalia.com> wrote:
>
>
> On 14/05/2025 14:57, Rob Clark wrote:
> > On Wed, May 14, 2025 at 3:01 AM Tvrtko Ursulin
> > <tvrtko.ursulin(a)igalia.com> wrote:
> >>
> >>
> >> On 13/05/2025 15:16, Rob Clark wrote:
> >>> On Fri, May 9, 2025 at 8:34 AM Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com> wrote:
> >>>>
> >>>> Dma-fence objects currently suffer from a potential use after free problem
> >>>> where fences exported to userspace and other drivers can outlive the
> >>>> exporting driver, or the associated data structures.
> >>>>
> >>>> The discussion on how to address this concluded that adding reference
> >>>> counting to all the involved objects is not desirable, since it would need
> >>>> to be very wide reaching and could cause unloadable drivers if another
> >>>> entity would be holding onto a signaled fence reference potentially
> >>>> indefinitely.
> >>>>
> >>>> This patch enables the safe access by introducing and documenting a
> >>>> contract between fence exporters and users. It documents a set of
> >>>> constraints and adds helpers which a) drivers with potential to suffer from
> >>>> the use after free must use and b) users of the dma-fence API must use as
> >>>> well.
> >>>>
> >>>> Premise of the design has multiple sides:
> >>>>
> >>>> 1. Drivers (fence exporters) MUST ensure a RCU grace period between
> >>>> signalling a fence and freeing the driver private data associated with it.
> >>>>
> >>>> The grace period does not have to follow the signalling immediately but
> >>>> HAS to happen before data is freed.
> >>>>
> >>>> 2. Users of the dma-fence API marked with such requirement MUST contain
> >>>> the complete access to the data within a single code block guarded by the
> >>>> new dma_fence_access_begin() and dma_fence_access_end() helpers.
> >>>>
> >>>> The combination of the two ensures that whoever sees the
> >>>> DMA_FENCE_FLAG_SIGNALED_BIT not set is guaranteed to have access to a
> >>>> valid fence->lock and valid data potentially accessed by the fence->ops
> >>>> virtual functions, until the call to dma_fence_access_end().
> >>>>
> >>>> 3. Module unload (fence->ops) disappearing is for now explicitly not
> >>>> handled. That would require a more complex protection, possibly needing
> >>>> SRCU instead of RCU to handle callers such as dma_fence_wait_timeout(),
> >>>> where race between dma_fence_enable_sw_signaling, signalling, and
> >>>> dereference of fence->ops->wait() would need a sleeping SRCU context.
> >>>>
> >>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
> >>>> ---
> >>>> drivers/dma-buf/dma-fence.c | 69 +++++++++++++++++++++++++++++++++++++
> >>>> include/linux/dma-fence.h | 32 ++++++++++++-----
> >>>> 2 files changed, 93 insertions(+), 8 deletions(-)
> >>>>
> >>>> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> >>>> index dc2456f68685..cfe1d7b79c22 100644
> >>>> --- a/drivers/dma-buf/dma-fence.c
> >>>> +++ b/drivers/dma-buf/dma-fence.c
> >>>> @@ -533,6 +533,7 @@ void dma_fence_release(struct kref *kref)
> >>>> struct dma_fence *fence =
> >>>> container_of(kref, struct dma_fence, refcount);
> >>>>
> >>>> + dma_fence_access_begin();
> >>>> trace_dma_fence_destroy(fence);
> >>>>
> >>>> if (WARN(!list_empty(&fence->cb_list) &&
> >>>> @@ -560,6 +561,8 @@ void dma_fence_release(struct kref *kref)
> >>>> fence->ops->release(fence);
> >>>> else
> >>>> dma_fence_free(fence);
> >>>> +
> >>>> + dma_fence_access_end();
> >>>> }
> >>>> EXPORT_SYMBOL(dma_fence_release);
> >>>>
> >>>> @@ -982,11 +985,13 @@ EXPORT_SYMBOL(dma_fence_set_deadline);
> >>>> */
> >>>> void dma_fence_describe(struct dma_fence *fence, struct seq_file *seq)
> >>>> {
> >>>> + dma_fence_access_begin();
> >>>> seq_printf(seq, "%s %s seq %llu %ssignalled\n",
> >>>> dma_fence_driver_name(fence),
> >>>> dma_fence_timeline_name(fence),
> >>>> fence->seqno,
> >>>> dma_fence_is_signaled(fence) ? "" : "un");
> >>>> + dma_fence_access_end();
> >>>> }
> >>>> EXPORT_SYMBOL(dma_fence_describe);
> >>>>
> >>>> @@ -1033,3 +1038,67 @@ dma_fence_init64(struct dma_fence *fence, const struct dma_fence_ops *ops,
> >>>> __set_bit(DMA_FENCE_FLAG_SEQNO64_BIT, &fence->flags);
> >>>> }
> >>>> EXPORT_SYMBOL(dma_fence_init64);
> >>>> +
> >>>> +/**
> >>>> + * dma_fence_driver_name - Access the driver name
> >>>> + * @fence: the fence to query
> >>>> + *
> >>>> + * Returns a driver name backing the dma-fence implementation.
> >>>> + *
> >>>> + * IMPORTANT CONSIDERATION:
> >>>> + * Dma-fence contract stipulates that access to driver provided data (data not
> >>>> + * directly embedded into the object itself), such as the &dma_fence.lock and
> >>>> + * memory potentially accessed by the &dma_fence.ops functions, is forbidden
> >>>> + * after the fence has been signalled. Drivers are allowed to free that data,
> >>>> + * and some do.
> >>>> + *
> >>>> + * To allow safe access drivers are mandated to guarantee a RCU grace period
> >>>> + * between signalling the fence and freeing said data.
> >>>> + *
> >>>> + * As such access to the driver name is only valid inside a RCU locked section.
> >>>> + * The pointer MUST be both queried and USED ONLY WITHIN a SINGLE block guarded
> >>>> + * by the &dma_fence_access_begin and &dma_fence_access_end pair.
> >>>> + */
> >>>> +const char *dma_fence_driver_name(struct dma_fence *fence)
> >>>> +{
> >>>> + RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> >>>> + "rcu_read_lock() required for safe access to returned string");
> >>>> +
> >>>> + if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> >>>> + return fence->ops->get_driver_name(fence);
> >>>> + else
> >>>> + return "detached-driver";
> >>>> +}
> >>>> +EXPORT_SYMBOL(dma_fence_driver_name);
> >>>> +
> >>>> +/**
> >>>> + * dma_fence_timeline_name - Access the timeline name
> >>>> + * @fence: the fence to query
> >>>> + *
> >>>> + * Returns a timeline name provided by the dma-fence implementation.
> >>>> + *
> >>>> + * IMPORTANT CONSIDERATION:
> >>>> + * Dma-fence contract stipulates that access to driver provided data (data not
> >>>> + * directly embedded into the object itself), such as the &dma_fence.lock and
> >>>> + * memory potentially accessed by the &dma_fence.ops functions, is forbidden
> >>>> + * after the fence has been signalled. Drivers are allowed to free that data,
> >>>> + * and some do.
> >>>> + *
> >>>> + * To allow safe access drivers are mandated to guarantee a RCU grace period
> >>>> + * between signalling the fence and freeing said data.
> >>>> + *
> >>>> + * As such access to the timeline name is only valid inside a RCU locked section.
> >>>> + * The pointer MUST be both queried and USED ONLY WITHIN a SINGLE block guarded
> >>>> + * by the &dma_fence_access_begin and &dma_fence_access_end pair.
> >>>> + */
> >>>> +const char *dma_fence_timeline_name(struct dma_fence *fence)
> >>>> +{
> >>>> + RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> >>>> + "rcu_read_lock() required for safe access to returned string");
> >>>> +
> >>>> + if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> >>>> + return fence->ops->get_timeline_name(fence);
> >>>> + else
> >>>> + return "signaled-timeline";
> >>>
> >>> This means that trace_dma_fence_signaled() will get the wrong
> >>> timeline/driver name, which probably screws up perfetto and maybe
> >>> other tools.
> >>
> >> Do you think context and seqno are not enough for those tools and they
> >> actually rely on the names? It would sound weird if they decided to
> >> index anything on the names which are non-standardised between drivers,
> >> but I guess anything is possible.
> >
> > At some point perfetto uses the timeline name to put up a named fence
> > timeline, I'm not sure if it is using the name or context # for
> > subsequent fence events (namely, signalled). I'd have to check the
> > code and get back to you.
>
> If you can it would be useful. Presumably it saves the names from the
> start edge of fence lifetime. But again, who knows.
Ok, it looks like perfetto is ok... mostly..
DrmTracker::GetFenceTimelineByContext() will try to lookup the
timeline by context #, and then if that fails, create a new timeline
with the name from the trace event, and add it to the hashmap.
It might be that "signaled-timeline" shows up if the first event seen
is the fence-signaled event.
> > There is also gpuvis, which I guess does something similar, but
> > haven't looked into it. Idk if there are others.
>
> I know GpuVis uses DRM sched tracepoints since Pierre-Eric was
> explaining me about those in the context of tracing rework he did there.
> I am not sure about dma-fence tracepoints.
>
> +Pierre-Eric on the off chance you know from the top of your head how
> much GpuVis depends on them (dma-fence tracepoints).
>
> >>> Maybe it would work well enough just to move the
> >>> trace_dma_fence_signaled() call ahead of the test_and_set_bit()? Idk
> >>> if some things will start getting confused if they see that trace
> >>> multiple times.
> >>
> >> Another alternative is to make this tracepoint access the names
> >> directly. It is under the lock so guaranteed not to get freed with
> >> drivers which will be made compliant with the documented rules.
> >
> > I guess it would have been better if, other than dma_fence_init
> > tracepoint, later tracepoints didn't include the driver/timeline
> > name.. that would have forced the use of the context. But I guess too
> > late for that. Perhaps the least bad thing to do is use the locking?
>
> You mean this last alternative I mentioned? I think that will work fine.
> I'll wait a little bit longer for more potential comments before
> re-spinning with that.
yes
> Were you able to test the series for your use case? Assuming it is not
> upstream msm since I don't immediately see a path in msm_fence which
> gets freed at runtime?
Not yet, but I think it should because it is the exact same problem
your igt test triggers.
This is with my VM_BIND series, which will dynamically create/teardown
sched entities
BR,
-R
On Wed, May 14, 2025 at 3:01 AM Tvrtko Ursulin
<tvrtko.ursulin(a)igalia.com> wrote:
>
>
> On 13/05/2025 15:16, Rob Clark wrote:
> > On Fri, May 9, 2025 at 8:34 AM Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com> wrote:
> >>
> >> Dma-fence objects currently suffer from a potential use after free problem
> >> where fences exported to userspace and other drivers can outlive the
> >> exporting driver, or the associated data structures.
> >>
> >> The discussion on how to address this concluded that adding reference
> >> counting to all the involved objects is not desirable, since it would need
> >> to be very wide reaching and could cause unloadable drivers if another
> >> entity would be holding onto a signaled fence reference potentially
> >> indefinitely.
> >>
> >> This patch enables the safe access by introducing and documenting a
> >> contract between fence exporters and users. It documents a set of
> >> constraints and adds helpers which a) drivers with potential to suffer from
> >> the use after free must use and b) users of the dma-fence API must use as
> >> well.
> >>
> >> Premise of the design has multiple sides:
> >>
> >> 1. Drivers (fence exporters) MUST ensure a RCU grace period between
> >> signalling a fence and freeing the driver private data associated with it.
> >>
> >> The grace period does not have to follow the signalling immediately but
> >> HAS to happen before data is freed.
> >>
> >> 2. Users of the dma-fence API marked with such requirement MUST contain
> >> the complete access to the data within a single code block guarded by the
> >> new dma_fence_access_begin() and dma_fence_access_end() helpers.
> >>
> >> The combination of the two ensures that whoever sees the
> >> DMA_FENCE_FLAG_SIGNALED_BIT not set is guaranteed to have access to a
> >> valid fence->lock and valid data potentially accessed by the fence->ops
> >> virtual functions, until the call to dma_fence_access_end().
> >>
> >> 3. Module unload (fence->ops) disappearing is for now explicitly not
> >> handled. That would require a more complex protection, possibly needing
> >> SRCU instead of RCU to handle callers such as dma_fence_wait_timeout(),
> >> where race between dma_fence_enable_sw_signaling, signalling, and
> >> dereference of fence->ops->wait() would need a sleeping SRCU context.
> >>
> >> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
> >> ---
> >> drivers/dma-buf/dma-fence.c | 69 +++++++++++++++++++++++++++++++++++++
> >> include/linux/dma-fence.h | 32 ++++++++++++-----
> >> 2 files changed, 93 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> >> index dc2456f68685..cfe1d7b79c22 100644
> >> --- a/drivers/dma-buf/dma-fence.c
> >> +++ b/drivers/dma-buf/dma-fence.c
> >> @@ -533,6 +533,7 @@ void dma_fence_release(struct kref *kref)
> >> struct dma_fence *fence =
> >> container_of(kref, struct dma_fence, refcount);
> >>
> >> + dma_fence_access_begin();
> >> trace_dma_fence_destroy(fence);
> >>
> >> if (WARN(!list_empty(&fence->cb_list) &&
> >> @@ -560,6 +561,8 @@ void dma_fence_release(struct kref *kref)
> >> fence->ops->release(fence);
> >> else
> >> dma_fence_free(fence);
> >> +
> >> + dma_fence_access_end();
> >> }
> >> EXPORT_SYMBOL(dma_fence_release);
> >>
> >> @@ -982,11 +985,13 @@ EXPORT_SYMBOL(dma_fence_set_deadline);
> >> */
> >> void dma_fence_describe(struct dma_fence *fence, struct seq_file *seq)
> >> {
> >> + dma_fence_access_begin();
> >> seq_printf(seq, "%s %s seq %llu %ssignalled\n",
> >> dma_fence_driver_name(fence),
> >> dma_fence_timeline_name(fence),
> >> fence->seqno,
> >> dma_fence_is_signaled(fence) ? "" : "un");
> >> + dma_fence_access_end();
> >> }
> >> EXPORT_SYMBOL(dma_fence_describe);
> >>
> >> @@ -1033,3 +1038,67 @@ dma_fence_init64(struct dma_fence *fence, const struct dma_fence_ops *ops,
> >> __set_bit(DMA_FENCE_FLAG_SEQNO64_BIT, &fence->flags);
> >> }
> >> EXPORT_SYMBOL(dma_fence_init64);
> >> +
> >> +/**
> >> + * dma_fence_driver_name - Access the driver name
> >> + * @fence: the fence to query
> >> + *
> >> + * Returns a driver name backing the dma-fence implementation.
> >> + *
> >> + * IMPORTANT CONSIDERATION:
> >> + * Dma-fence contract stipulates that access to driver provided data (data not
> >> + * directly embedded into the object itself), such as the &dma_fence.lock and
> >> + * memory potentially accessed by the &dma_fence.ops functions, is forbidden
> >> + * after the fence has been signalled. Drivers are allowed to free that data,
> >> + * and some do.
> >> + *
> >> + * To allow safe access drivers are mandated to guarantee a RCU grace period
> >> + * between signalling the fence and freeing said data.
> >> + *
> >> + * As such access to the driver name is only valid inside a RCU locked section.
> >> + * The pointer MUST be both queried and USED ONLY WITHIN a SINGLE block guarded
> >> + * by the &dma_fence_access_begin and &dma_fence_access_end pair.
> >> + */
> >> +const char *dma_fence_driver_name(struct dma_fence *fence)
> >> +{
> >> + RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> >> + "rcu_read_lock() required for safe access to returned string");
> >> +
> >> + if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> >> + return fence->ops->get_driver_name(fence);
> >> + else
> >> + return "detached-driver";
> >> +}
> >> +EXPORT_SYMBOL(dma_fence_driver_name);
> >> +
> >> +/**
> >> + * dma_fence_timeline_name - Access the timeline name
> >> + * @fence: the fence to query
> >> + *
> >> + * Returns a timeline name provided by the dma-fence implementation.
> >> + *
> >> + * IMPORTANT CONSIDERATION:
> >> + * Dma-fence contract stipulates that access to driver provided data (data not
> >> + * directly embedded into the object itself), such as the &dma_fence.lock and
> >> + * memory potentially accessed by the &dma_fence.ops functions, is forbidden
> >> + * after the fence has been signalled. Drivers are allowed to free that data,
> >> + * and some do.
> >> + *
> >> + * To allow safe access drivers are mandated to guarantee a RCU grace period
> >> + * between signalling the fence and freeing said data.
> >> + *
> >> + * As such access to the timeline name is only valid inside a RCU locked section.
> >> + * The pointer MUST be both queried and USED ONLY WITHIN a SINGLE block guarded
> >> + * by the &dma_fence_access_begin and &dma_fence_access_end pair.
> >> + */
> >> +const char *dma_fence_timeline_name(struct dma_fence *fence)
> >> +{
> >> + RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> >> + "rcu_read_lock() required for safe access to returned string");
> >> +
> >> + if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> >> + return fence->ops->get_timeline_name(fence);
> >> + else
> >> + return "signaled-timeline";
> >
> > This means that trace_dma_fence_signaled() will get the wrong
> > timeline/driver name, which probably screws up perfetto and maybe
> > other tools.
>
> Do you think context and seqno are not enough for those tools and they
> actually rely on the names? It would sound weird if they decided to
> index anything on the names which are non-standardised between drivers,
> but I guess anything is possible.
At some point perfetto uses the timeline name to put up a named fence
timeline, I'm not sure if it is using the name or context # for
subsequent fence events (namely, signalled). I'd have to check the
code and get back to you.
There is also gpuvis, which I guess does something similar, but
haven't looked into it. Idk if there are others.
> > Maybe it would work well enough just to move the
> > trace_dma_fence_signaled() call ahead of the test_and_set_bit()? Idk
> > if some things will start getting confused if they see that trace
> > multiple times.
>
> Another alternative is to make this tracepoint access the names
> directly. It is under the lock so guaranteed not to get freed with
> drivers which will be made compliant with the documented rules.
I guess it would have been better if, other than dma_fence_init
tracepoint, later tracepoints didn't include the driver/timeline
name.. that would have forced the use of the context. But I guess too
late for that. Perhaps the least bad thing to do is use the locking?
BR,
-R
I'm going to push patches #1-#6 to drm-misc-next.
They make sense as stand-alone cleanups anyway.
But that here needs a bit more documentation I think.
On 5/13/25 09:45, Tvrtko Ursulin wrote:
> Dma-fence objects currently suffer from a potential use after free problem
> where fences exported to userspace and other drivers can outlive the
> exporting driver, or the associated data structures.
>
> The discussion on how to address this concluded that adding reference
> counting to all the involved objects is not desirable, since it would need
> to be very wide reaching and could cause unloadable drivers if another
> entity would be holding onto a signaled fence reference potentially
> indefinitely.
>
> This patch enables the safe access by introducing and documenting a
> contract between fence exporters and users. It documents a set of
> constraints and adds helpers which a) drivers with potential to suffer from
> the use after free must use and b) users of the dma-fence API must use as
> well.
>
> Premise of the design has multiple sides:
>
> 1. Drivers (fence exporters) MUST ensure a RCU grace period between
> signalling a fence and freeing the driver private data associated with it.
That's a must have anyway, otherwise functions like dma_fence_get_rcu() won't work.
I hope that we have documented that somewhere, but I'm not 100% sure to be honest.
> The grace period does not have to follow the signalling immediately but
> HAS to happen before data is freed.
That is the new requirement we have to document somehow.
I'm not 100% sure but I think module unloading waits for an RCU grace period anyway.
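As a minimal exporter-side sketch of that requirement (my_timeline is a
made-up driver-private object whose name is handed out via
->get_timeline_name()):

  struct my_timeline {
          struct rcu_head rcu;
          char name[32];   /* returned by ->get_timeline_name() */
  };

  static void my_timeline_destroy(struct my_timeline *tl)
  {
          /* All fences on this timeline have signalled; defer the free
           * so readers inside dma_fence_access_begin()/end() (RCU)
           * never see the memory disappear under them. */
          kfree_rcu(tl, rcu);
  }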
> 2. Users of the dma-fence API marked with such requirement MUST contain
> the complete access to the data within a single code block guarded by the
> new dma_fence_access_begin() and dma_fence_access_end() helpers.
>
> The combination of the two ensures that whoever sees the
> DMA_FENCE_FLAG_SIGNALED_BIT not set is guaranteed to have access to a
> valid fence->lock and valid data potentially accessed by the fence->ops
> virtual functions, until the call to dma_fence_access_end().
Mhm, how about returning copies of the string?
This is only for debugging anyway and kstrdup_const() isn't that costly.
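Something along these lines, i.e. a copy-returning variant the caller
would kfree_const() (only a sketch of the alternative, not part of the
series):

  const char *dma_fence_driver_name_copy(struct dma_fence *fence)
  {
          const char *name;

          rcu_read_lock();
          if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
                  name = kstrdup_const(fence->ops->get_driver_name(fence),
                                       GFP_ATOMIC);
          else
                  name = "detached-driver";
          rcu_read_unlock();

          return name;   /* kfree_const() is a no-op for the literal */
  }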
Regards,
Christian.
>
> 3. Module unload (fence->ops) disappearing is for now explicitly not
> handled. That would require a more complex protection, possibly needing
> SRCU instead of RCU to handle callers such as dma_fence_wait_timeout(),
> where race between dma_fence_enable_sw_signaling, signalling, and
> dereference of fence->ops->wait() would need a sleeping SRCU context.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
> ---
> drivers/dma-buf/dma-fence.c | 69 +++++++++++++++++++++++++++++++++++++
> include/linux/dma-fence.h | 32 ++++++++++++-----
> 2 files changed, 93 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> index dc2456f68685..cfe1d7b79c22 100644
> --- a/drivers/dma-buf/dma-fence.c
> +++ b/drivers/dma-buf/dma-fence.c
> @@ -533,6 +533,7 @@ void dma_fence_release(struct kref *kref)
> struct dma_fence *fence =
> container_of(kref, struct dma_fence, refcount);
>
> + dma_fence_access_begin();
> trace_dma_fence_destroy(fence);
>
> if (WARN(!list_empty(&fence->cb_list) &&
> @@ -560,6 +561,8 @@ void dma_fence_release(struct kref *kref)
> fence->ops->release(fence);
> else
> dma_fence_free(fence);
> +
> + dma_fence_access_end();
> }
> EXPORT_SYMBOL(dma_fence_release);
>
> @@ -982,11 +985,13 @@ EXPORT_SYMBOL(dma_fence_set_deadline);
> */
> void dma_fence_describe(struct dma_fence *fence, struct seq_file *seq)
> {
> + dma_fence_access_begin();
> seq_printf(seq, "%s %s seq %llu %ssignalled\n",
> dma_fence_driver_name(fence),
> dma_fence_timeline_name(fence),
> fence->seqno,
> dma_fence_is_signaled(fence) ? "" : "un");
> + dma_fence_access_end();
> }
> EXPORT_SYMBOL(dma_fence_describe);
>
> @@ -1033,3 +1038,67 @@ dma_fence_init64(struct dma_fence *fence, const struct dma_fence_ops *ops,
> __set_bit(DMA_FENCE_FLAG_SEQNO64_BIT, &fence->flags);
> }
> EXPORT_SYMBOL(dma_fence_init64);
> +
> +/**
> + * dma_fence_driver_name - Access the driver name
> + * @fence: the fence to query
> + *
> + * Returns a driver name backing the dma-fence implementation.
> + *
> + * IMPORTANT CONSIDERATION:
> + * Dma-fence contract stipulates that access to driver provided data (data not
> + * directly embedded into the object itself), such as the &dma_fence.lock and
> + * memory potentially accessed by the &dma_fence.ops functions, is forbidden
> + * after the fence has been signalled. Drivers are allowed to free that data,
> + * and some do.
> + *
> + * To allow safe access drivers are mandated to guarantee a RCU grace period
> + * between signalling the fence and freeing said data.
> + *
> + * As such access to the driver name is only valid inside a RCU locked section.
> + * The pointer MUST be both queried and USED ONLY WITHIN a SINGLE block guarded
> + * by the &dma_fence_access_begin and &dma_fence_access_end pair.
> + */
> +const char *dma_fence_driver_name(struct dma_fence *fence)
> +{
> + RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> + "rcu_read_lock() required for safe access to returned string");
> +
> + if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> + return fence->ops->get_driver_name(fence);
> + else
> + return "detached-driver";
> +}
> +EXPORT_SYMBOL(dma_fence_driver_name);
> +
> +/**
> + * dma_fence_timeline_name - Access the timeline name
> + * @fence: the fence to query
> + *
> + * Returns a timeline name provided by the dma-fence implementation.
> + *
> + * IMPORTANT CONSIDERATION:
> + * Dma-fence contract stipulates that access to driver provided data (data not
> + * directly embedded into the object itself), such as the &dma_fence.lock and
> + * memory potentially accessed by the &dma_fence.ops functions, is forbidden
> + * after the fence has been signalled. Drivers are allowed to free that data,
> + * and some do.
> + *
> + * To allow safe access drivers are mandated to guarantee a RCU grace period
> + * between signalling the fence and freeing said data.
> + *
> + * As such access to the timeline name is only valid inside a RCU locked section.
> + * The pointer MUST be both queried and USED ONLY WITHIN a SINGLE block guarded
> + * by the &dma_fence_access_begin and &dma_fence_access_end pair.
> + */
> +const char *dma_fence_timeline_name(struct dma_fence *fence)
> +{
> + RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> + "rcu_read_lock() required for safe access to returned string");
> +
> + if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> + return fence->ops->get_timeline_name(fence);
> + else
> + return "signaled-timeline";
> +}
> +EXPORT_SYMBOL(dma_fence_timeline_name);
> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
> index c5ac37e10d85..b39e430142ea 100644
> --- a/include/linux/dma-fence.h
> +++ b/include/linux/dma-fence.h
> @@ -377,15 +377,31 @@ bool dma_fence_remove_callback(struct dma_fence *fence,
> struct dma_fence_cb *cb);
> void dma_fence_enable_sw_signaling(struct dma_fence *fence);
>
> -static inline const char *dma_fence_driver_name(struct dma_fence *fence)
> -{
> - return fence->ops->get_driver_name(fence);
> -}
> +/**
> + * DOC: Safe external access to driver provided object members
> + *
> + * All data not stored directly in the dma-fence object, such as the
> + * &dma_fence.lock and memory potentially accessed by functions in the
> + * &dma_fence.ops table, MUST NOT be accessed after the fence has been signalled
> + * because after that point drivers are allowed to free it.
> + *
> + * All code accessing that data via the dma-fence API (or directly, which is
> + * discouraged), MUST make sure to contain the complete access within a
> + * &dma_fence_access_begin and &dma_fence_access_end pair.
> + *
> + * Some dma-fence APIs handle this automatically, while others, for example
> + * &dma_fence_driver_name and &dma_fence_timeline_name, leave that
> + * responsibility to the caller.
> + *
> + * To enable this scheme to work, drivers MUST ensure an RCU grace period
> + * elapses between signalling the fence and freeing said data.
> + *
> + */
> +#define dma_fence_access_begin rcu_read_lock
> +#define dma_fence_access_end rcu_read_unlock
>
> -static inline const char *dma_fence_timeline_name(struct dma_fence *fence)
> -{
> - return fence->ops->get_timeline_name(fence);
> -}
> +const char *dma_fence_driver_name(struct dma_fence *fence);
> +const char *dma_fence_timeline_name(struct dma_fence *fence);
>
> /**
> * dma_fence_is_signaled_locked - Return an indication if the fence
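To make the new contract concrete, here is a minimal sketch of an external consumer, mirroring the dma_fence_describe() hunk above. log_fence() is a hypothetical caller, not part of the patch:

#include <linux/dma-fence.h>

/*
 * Both name lookups and every use of the returned pointers stay inside
 * a single dma_fence_access_begin()/dma_fence_access_end() (i.e. RCU)
 * section. The fence reference itself is assumed to be held by the
 * caller; the RCU section only protects the driver-provided data.
 */
static void log_fence(struct dma_fence *fence)
{
	dma_fence_access_begin();
	pr_info("fence %s:%s seq %llu\n",
		dma_fence_driver_name(fence),
		dma_fence_timeline_name(fence),
		fence->seqno);
	dma_fence_access_end();
}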
Dear All,
This patchset fixes the incorrect use of dma_sync_sg_*() calls in
media and related drivers. They are replaced with much safer
dma_sync_sgtable_*() variants, which take care of passing the proper
number of elements for the sync operation.
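For illustration, the pattern being fixed looks roughly like this; dev and sgt stand in for a driver's device and mapped sg_table, and the function is a sketch rather than code from any one driver:

#include <linux/dma-mapping.h>

static void sync_buffer_for_device(struct device *dev, struct sg_table *sgt)
{
	/*
	 * Buggy pattern: dma_sync_sg_*() must be given the element count
	 * that was originally passed to dma_map_sg() (sgt->orig_nents),
	 * but drivers often passed the mapped count instead:
	 *
	 *   dma_sync_sg_for_device(dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
	 *
	 * The sgtable wrapper cannot get this wrong, since it passes
	 * sgt->orig_nents itself:
	 */
	dma_sync_sgtable_for_device(dev, sgt, DMA_TO_DEVICE);
}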
Best regards
Marek Szyprowski, PhD
Samsung R&D Institute Poland
Change log:
v3: added cc: stable to tags
v2: fixed typos and added cc: stable
Patch summary:
Marek Szyprowski (3):
media: videobuf2: use sgtable-based scatterlist wrappers
udmabuf: use sgtable-based scatterlist wrappers
media: omap3isp: use sgtable-based scatterlist wrappers
drivers/dma-buf/udmabuf.c | 5 ++---
drivers/media/common/videobuf2/videobuf2-dma-sg.c | 4 ++--
drivers/media/platform/ti/omap3isp/ispccdc.c | 8 ++++----
drivers/media/platform/ti/omap3isp/ispstat.c | 6 ++----
4 files changed, 10 insertions(+), 13 deletions(-)
--
2.34.1
On 5/13/25 04:06, Hyejeong Choi wrote:
> smp_store_mb() inserts the memory barrier after the store operation,
> which is not the ordering the comment is asking for, so a NULL pointer
> dereference can happen if the memory updates are reordered.
>
> Signed-off-by: Hyejeong Choi <hjeong.choi(a)samsung.com>
I've reviewed it, added CC stable and Fixes tags, and pushed it to drm-misc-fixes.
Thanks,
Christian.
> ---
> drivers/dma-buf/dma-resv.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
> index 5f8d010516f0..b1ef4546346d 100644
> --- a/drivers/dma-buf/dma-resv.c
> +++ b/drivers/dma-buf/dma-resv.c
> @@ -320,8 +320,9 @@ void dma_resv_add_fence(struct dma_resv *obj, struct dma_fence *fence,
> count++;
>
> dma_resv_list_set(fobj, i, fence, usage);
> - /* pointer update must be visible before we extend the num_fences */
> - smp_store_mb(fobj->num_fences, count);
> + /* fence update must be visible before we extend the num_fences */
> + smp_wmb();
> + fobj->num_fences = count;
> }
> EXPORT_SYMBOL(dma_resv_add_fence);
>
>
>
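To spell out the ordering the fix restores: smp_store_mb() is a WRITE_ONCE() followed by smp_mb(), so the barrier lands after the num_fences store and nothing orders the fence-slot store before it; a reader can then observe the extended count while the slot is still stale. A hedged sketch of the publish side, with the structure simplified from the real dma_resv_list:

#include <linux/dma-fence.h>

/* Simplified stand-in for dma_resv_list; illustration only. */
struct fence_list {
	unsigned int num_fences;
	struct dma_fence *fences[];
};

static void publish_fence(struct fence_list *fobj, unsigned int i,
			  struct dma_fence *fence)
{
	WRITE_ONCE(fobj->fences[i], fence);	/* 1: store the entry   */
	smp_wmb();				/* order 1 before 2     */
	WRITE_ONCE(fobj->num_fences, i + 1);	/* 2: publish the count */
}

A lockless reader that observes the new count must pair this with smp_rmb() (or an address dependency) before dereferencing the newly published slot.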
Until CONFIG_DMABUF_SYSFS_STATS was added [1] it was only possible to
perform per-buffer accounting with debugfs which is not suitable for
production environments. Eventually we discovered the overhead with
per-buffer sysfs file creation/removal was significantly impacting
allocation and free times, and exacerbated kernfs lock contention. [2]
dma_buf_stats_setup() is responsible for 39% of single-page buffer
creation duration, or 74% of single-page dma_buf_export() duration when
stressing dmabuf allocations and frees.
I prototyped a change from per-buffer to per-exporter statistics with a
RCU protected list of exporter allocations that accommodates most (but
not all) of our use-cases and avoids almost all of the sysfs overhead.
While that adds less overhead than per-buffer sysfs, and less even than
the maintenance of the dmabuf debugfs_list, it's still *additional*
overhead on top of the debugfs_list and doesn't give us per-buffer info.
This series uses the existing dmabuf debugfs_list to implement a BPF
dmabuf iterator, which adds no overhead to buffer allocation/free and
provides per-buffer info. The list has been moved outside of
CONFIG_DEBUG_FS scope so that it is always populated. The BPF program
loaded by userspace that extracts per-buffer information gets to define
its own interface, which avoids debugfs's lack of ABI stability.
This will allow us to replace our use of CONFIG_DMABUF_SYSFS_STATS, and
the plan is to remove it from the kernel after the next longterm stable
release.
[1] https://lore.kernel.org/linux-media/20201210044400.1080308-1-hridya@google.…
[2] https://lore.kernel.org/all/20220516171315.2400578-1-tjmercier@google.com
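For orientation, a rough sketch of the kind of program this enables, assuming the iterator context is exposed as struct bpf_iter__dmabuf with meta and dmabuf members (as in this series' kernel/bpf/dmabuf_iter.c); the program below is illustrative, not the selftest itself:

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("iter/dmabuf")
int dmabuf_info(struct bpf_iter__dmabuf *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct dma_buf *buf = ctx->dmabuf;
	char exporter[32];

	/* A NULL dmabuf marks the end of the iteration. */
	if (!buf)
		return 0;

	/* Copy the exporter name out of kernel memory before printing. */
	bpf_probe_read_kernel_str(exporter, sizeof(exporter),
				  BPF_CORE_READ(buf, exp_name));
	BPF_SEQ_PRINTF(seq, "%s %lu\n", exporter, BPF_CORE_READ(buf, size));
	return 0;
}

Userspace then reads the per-buffer report with a plain read() on the iterator fd (or cat on a pinned iterator link), so the interface is defined by the BPF program rather than by the kernel.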
v1: https://lore.kernel.org/all/20250414225227.3642618-1-tjmercier@google.com
v1 -> v2:
Make the DMA buffer list independent of CONFIG_DEBUG_FS per Christian
König
Add CONFIG_DMA_SHARED_BUFFER check to kernel/bpf/Makefile per kernel
test robot
Use BTF_ID_LIST_SINGLE instead of BTF_ID_LIST_GLOBAL_SINGLE per Song Liu
Fixup comment style, mixing code/declarations, and use ASSERT_OK_FD in
selftest per Song Liu
Add BPF_ITER_RESCHED feature to bpf_dmabuf_reg_info per Alexei
Starovoitov
Add open-coded iterator and selftest per Alexei Starovoitov
Add a second test buffer from the system dmabuf heap to selftests
Use the BPF program we'll use in production for selftest per Alexei
Starovoitov
https://r.android.com/c/platform/system/bpfprogs/+/3616123/2/dmabufIter.c
https://r.android.com/c/platform/system/memory/libmeminfo/+/3614259/1/libdm…
v2: https://lore.kernel.org/all/20250504224149.1033867-1-tjmercier@google.com
v2 -> v3:
Rebase onto bpf-next/master
Move get_next_dmabuf() into drivers/dma-buf/dma-buf.c, along with the
new get_first_dmabuf(). This avoids having to expose the dmabuf list
and mutex to the rest of the kernel, and keeps the dmabuf mutex
operations near each other in the same file. (Christian König)
Add Christian's RB to dma-buf: Rename debugfs symbols
Drop RFC: dma-buf: Remove DMA-BUF statistics
v3: https://lore.kernel.org/all/20250507001036.2278781-1-tjmercier@google.com
v3 -> v4:
Fix selftest BPF program comment style (not kdoc) per Alexei Starovoitov
Fix dma-buf.c kdoc comment style per Alexei Starovoitov
Rename get_first_dmabuf / get_next_dmabuf to dma_buf_iter_begin /
dma_buf_iter_next per Christian König
Add Christian's RB to bpf: Add dmabuf iterator
v4: https://lore.kernel.org/all/20250508182025.2961555-1-tjmercier@google.com
v4 -> v5:
Add Christian's Acks to all patches
Add Song Liu's Acks
Move BTF_ID_LIST_SINGLE and DEFINE_BPF_ITER_FUNC closer to usage per
Song Liu
Fix open-coded iterator comment style per Song Liu
Move iterator termination check to its own subtest per Song Liu
Rework selftest buffer creation per Song Liu
Fix spacing in sanitize_string per BPF CI
v5: https://lore.kernel.org/all/20250512174036.266796-1-tjmercier@google.com
v5 -> v6:
Song Liu:
Init test buffer FDs to -1
Zero-init udmabuf_create for future proofing
Bail early for iterator fd/FILE creation failure
Dereference char ptr to check for NUL in sanitize_string()
Move map insertion from create_test_buffers() to test_dmabuf_iter()
Add ACK to selftests/bpf: Add test for open coded dmabuf_iter
T.J. Mercier (5):
dma-buf: Rename debugfs symbols
bpf: Add dmabuf iterator
bpf: Add open coded dmabuf iterator
selftests/bpf: Add test for dmabuf_iter
selftests/bpf: Add test for open coded dmabuf_iter
drivers/dma-buf/dma-buf.c | 98 ++++--
include/linux/dma-buf.h | 4 +-
kernel/bpf/Makefile | 3 +
kernel/bpf/dmabuf_iter.c | 150 +++++++++
kernel/bpf/helpers.c | 5 +
.../testing/selftests/bpf/bpf_experimental.h | 5 +
tools/testing/selftests/bpf/config | 3 +
.../selftests/bpf/prog_tests/dmabuf_iter.c | 285 ++++++++++++++++++
.../testing/selftests/bpf/progs/dmabuf_iter.c | 91 ++++++
9 files changed, 622 insertions(+), 22 deletions(-)
create mode 100644 kernel/bpf/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/progs/dmabuf_iter.c
base-commit: 43745d11bfd9683abdf08ad7a5cc403d6a9ffd15
--
2.49.0.1045.g170613ef41-goog
On Fri, May 9, 2025 at 8:34 AM Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com> wrote:
>
> Dma-fence objects currently suffer from a potential use after free problem
> where fences exported to userspace and other drivers can outlive the
> exporting driver, or the associated data structures.
>
> The discussion on how to address this concluded that adding reference
> counting to all the involved objects is not desirable, since it would need
> to be very wide reaching and could cause unloadable drivers if another
> entity would be holding onto a signaled fence reference potentially
> indefinitely.
>
> This patch enables the safe access by introducing and documenting a
> contract between fence exporters and users. It documents a set of
> constraints and adds helpers which a) drivers with potential to suffer from
> the use after free must use and b) users of the dma-fence API must use as
> well.
>
> Premise of the design has multiple sides:
>
> 1. Drivers (fence exporters) MUST ensure a RCU grace period between
> signalling a fence and freeing the driver private data associated with it.
>
> The grace period does not have to follow the signalling immediately but
> HAS to happen before data is freed.
>
> 2. Users of the dma-fence API marked with such requirement MUST contain
> the complete access to the data within a single code block guarded by the
> new dma_fence_access_begin() and dma_fence_access_end() helpers.
>
> The combination of the two ensures that whoever sees the
> DMA_FENCE_FLAG_SIGNALED_BIT not set is guaranteed to have access to a
> valid fence->lock and valid data potentially accessed by the fence->ops
> virtual functions, until the call to dma_fence_access_end().
>
> 3. Module unload (fence->ops) disappearing is for now explicitly not
> handled. That would required a more complex protection, possibly needing
> SRCU instead of RCU to handle callers such as dma_fence_wait_timeout(),
> where race between dma_fence_enable_sw_signaling, signalling, and
> dereference of fence->ops->wait() would need a sleeping SRCU context.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
> ---
> drivers/dma-buf/dma-fence.c | 69 +++++++++++++++++++++++++++++++++++++
> include/linux/dma-fence.h | 32 ++++++++++++-----
> 2 files changed, 93 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> index dc2456f68685..cfe1d7b79c22 100644
> --- a/drivers/dma-buf/dma-fence.c
> +++ b/drivers/dma-buf/dma-fence.c
> @@ -533,6 +533,7 @@ void dma_fence_release(struct kref *kref)
> struct dma_fence *fence =
> container_of(kref, struct dma_fence, refcount);
>
> + dma_fence_access_begin();
> trace_dma_fence_destroy(fence);
>
> if (WARN(!list_empty(&fence->cb_list) &&
> @@ -560,6 +561,8 @@ void dma_fence_release(struct kref *kref)
> fence->ops->release(fence);
> else
> dma_fence_free(fence);
> +
> + dma_fence_access_end();
> }
> EXPORT_SYMBOL(dma_fence_release);
>
> @@ -982,11 +985,13 @@ EXPORT_SYMBOL(dma_fence_set_deadline);
> */
> void dma_fence_describe(struct dma_fence *fence, struct seq_file *seq)
> {
> + dma_fence_access_begin();
> seq_printf(seq, "%s %s seq %llu %ssignalled\n",
> dma_fence_driver_name(fence),
> dma_fence_timeline_name(fence),
> fence->seqno,
> dma_fence_is_signaled(fence) ? "" : "un");
> + dma_fence_access_end();
> }
> EXPORT_SYMBOL(dma_fence_describe);
>
> @@ -1033,3 +1038,67 @@ dma_fence_init64(struct dma_fence *fence, const struct dma_fence_ops *ops,
> __set_bit(DMA_FENCE_FLAG_SEQNO64_BIT, &fence->flags);
> }
> EXPORT_SYMBOL(dma_fence_init64);
> +
> +/**
> + * dma_fence_driver_name - Access the driver name
> + * @fence: the fence to query
> + *
> + * Returns a driver name backing the dma-fence implementation.
> + *
> + * IMPORTANT CONSIDERATION:
> + * The dma-fence contract stipulates that access to driver provided data (data
> + * not directly embedded into the object itself), such as the &dma_fence.lock
> + * and memory potentially accessed by the &dma_fence.ops functions, is
> + * forbidden after the fence has been signalled. Drivers are allowed to free
> + * that data, and some do.
> + *
> + * To allow safe access, drivers are mandated to guarantee an RCU grace period
> + * between signalling the fence and freeing said data.
> + *
> + * As such, access to the driver name is only valid inside an RCU locked
> + * section. The pointer MUST be both queried and USED ONLY WITHIN a SINGLE
> + * block guarded by the &dma_fence_access_begin and &dma_fence_access_end pair.
> + */
> +const char *dma_fence_driver_name(struct dma_fence *fence)
> +{
> + RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> + "rcu_read_lock() required for safe access to returned string");
> +
> + if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> + return fence->ops->get_driver_name(fence);
> + else
> + return "detached-driver";
> +}
> +EXPORT_SYMBOL(dma_fence_driver_name);
> +
> +/**
> + * dma_fence_timeline_name - Access the timeline name
> + * @fence: the fence to query
> + *
> + * Returns a timeline name provided by the dma-fence implementation.
> + *
> + * IMPORTANT CONSIDERATION:
> + * The dma-fence contract stipulates that access to driver provided data (data
> + * not directly embedded into the object itself), such as the &dma_fence.lock
> + * and memory potentially accessed by the &dma_fence.ops functions, is
> + * forbidden after the fence has been signalled. Drivers are allowed to free
> + * that data, and some do.
> + *
> + * To allow safe access, drivers are mandated to guarantee an RCU grace period
> + * between signalling the fence and freeing said data.
> + *
> + * As such, access to the timeline name is only valid inside an RCU locked
> + * section. The pointer MUST be both queried and USED ONLY WITHIN a SINGLE
> + * block guarded by the &dma_fence_access_begin and &dma_fence_access_end pair.
> + */
> +const char *dma_fence_timeline_name(struct dma_fence *fence)
> +{
> + RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> + "rcu_read_lock() required for safe access to returned string");
> +
> + if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> + return fence->ops->get_timeline_name(fence);
> + else
> + return "signaled-timeline";
This means that trace_dma_fence_signaled() will get the wrong
timeline/driver name, which probably screws up perfetto and maybe
other tools.
Maybe it would work well enough just to move the
trace_dma_fence_signaled() call ahead of the test_and_set_bit()? Idk
if some things will start getting confused if they see that trace
multiple times.
Maybe a better solution would be to add a new bit for this purpose,
which is set after the tracepoint in
dma_fence_signal_timestamp_locked().
(trace_dma_fence_destroy() will, I guess, be messed up regardless... it
doesn't look like perfetto cares about this tracepoint, so maybe that
is ok. It doesn't seem so useful.)
BR,
-R
> +}
> +EXPORT_SYMBOL(dma_fence_timeline_name);
> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
> index c814a86087f8..c8a9915eb360 100644
> --- a/include/linux/dma-fence.h
> +++ b/include/linux/dma-fence.h
> @@ -387,15 +387,31 @@ bool dma_fence_remove_callback(struct dma_fence *fence,
> struct dma_fence_cb *cb);
> void dma_fence_enable_sw_signaling(struct dma_fence *fence);
>
> -static inline const char *dma_fence_driver_name(struct dma_fence *fence)
> -{
> - return fence->ops->get_driver_name(fence);
> -}
> +/**
> + * DOC: Safe external access to driver provided object members
> + *
> + * All data not stored directly in the dma-fence object, such as the
> + * &dma_fence.lock and memory potentially accessed by functions in the
> + * &dma_fence.ops table, MUST NOT be accessed after the fence has been signalled
> + * because after that point drivers are allowed to free it.
> + *
> + * All code accessing that data via the dma-fence API (or directly, which is
> + * discouraged), MUST make sure to contain the complete access within a
> + * &dma_fence_access_begin and &dma_fence_access_end pair.
> + *
> + * Some dma-fence APIs handle this automatically, while others, for example
> + * &dma_fence_driver_name and &dma_fence_timeline_name, leave that
> + * responsibility to the caller.
> + *
> + * To enable this scheme to work, drivers MUST ensure an RCU grace period
> + * elapses between signalling the fence and freeing said data.
> + *
> + */
> +#define dma_fence_access_begin rcu_read_lock
> +#define dma_fence_access_end rcu_read_unlock
>
> -static inline const char *dma_fence_timeline_name(struct dma_fence *fence)
> -{
> - return fence->ops->get_timeline_name(fence);
> -}
> +const char *dma_fence_driver_name(struct dma_fence *fence);
> +const char *dma_fence_timeline_name(struct dma_fence *fence);
>
> /**
> * dma_fence_is_signaled_locked - Return an indication if the fence
> --
> 2.48.0
>
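For completeness, a sketch of the "new bit" idea from the review above. DMA_FENCE_FLAG_NAME_DETACHED_BIT is invented here purely for illustration and exists in neither the patch nor mainline:

#include <linux/dma-fence.h>
#include <trace/events/dma_fence.h>

/* Hypothetical flag, claimed from the user bit space for this sketch. */
#define DMA_FENCE_FLAG_NAME_DETACHED_BIT (DMA_FENCE_FLAG_USER_BITS + 1)

/*
 * The signal path would flip the name-detached bit only after the
 * tracepoint fires, and dma_fence_driver_name()/_timeline_name() would
 * test it instead of SIGNALED, so trace_dma_fence_signaled() still
 * reports the real names.
 */
static void fence_signal_with_live_names(struct dma_fence *fence)
{
	if (test_and_set_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
		return;	/* already signalled */

	trace_dma_fence_signaled(fence);	/* names still valid here */
	set_bit(DMA_FENCE_FLAG_NAME_DETACHED_BIT, &fence->flags);
}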