On 5/16/25 09:40, wangtao wrote:
>
>
>> -----Original Message-----
>> From: Christian König <christian.koenig(a)amd.com>
>> Sent: Thursday, May 15, 2025 10:26 PM
>> To: wangtao <tao.wangtao(a)honor.com>; sumit.semwal(a)linaro.org;
>> benjamin.gaignard(a)collabora.com; Brian.Starkey(a)arm.com;
>> jstultz(a)google.com; tjmercier(a)google.com
>> Cc: linux-media(a)vger.kernel.org; dri-devel(a)lists.freedesktop.org; linaro-
>> mm-sig(a)lists.linaro.org; linux-kernel(a)vger.kernel.org;
>> wangbintian(BintianWang) <bintian.wang(a)honor.com>; yipengxiang
>> <yipengxiang(a)honor.com>; liulu 00013167 <liulu.liu(a)honor.com>; hanfeng
>> 00012985 <feng.han(a)honor.com>
>> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement
>> DMA_BUF_IOCTL_RW_FILE for system_heap
>>
>> On 5/15/25 16:03, wangtao wrote:
>>> [wangtao] My Test Configuration (CPU 1GHz, 5-test average):
>>> Allocation: 32x32MB buffer creation
>>> - dmabuf 53ms vs. udmabuf 694ms (10X slower)
>>> - Note: shmem shows excessive allocation time
>>
>> Yeah, that is something already noted by others as well. But that is
>> orthogonal.
>>
>>>
>>> Read 1024MB File:
>>> - dmabuf direct 326ms vs. udmabuf direct 461ms (40% slower)
>>> - Note: pin_user_pages_fast consumes majority CPU cycles
>>>
>>> Key function call timing: See details below.
>>
>> Those aren't valid, you are comparing different functionalities here.
>>
>> Please try using udmabuf with sendfile() as confirmed to be working by T.J.
> [wangtao] Using buffer IO with dmabuf file read/write requires one memory copy.
> Direct IO removes this copy to enable zero-copy. The sendfile system call
> reduces memory copies from two (read/write) to one. However, with udmabuf,
> sendfile still keeps at least one copy, failing zero-copy.
Then please work on fixing this.
Regards,
Christian.
>
> If udmabuf sendfile uses buffer IO (file page cache), read latency matches
> dmabuf buffer read, but allocation time is much longer.
> With Direct IO, the default 16-page pipe size makes it slower than buffer IO.
>
> Test data shows:
> udmabuf direct read is much faster than udmabuf sendfile.
> dmabuf direct read outperforms udmabuf direct read by a large margin.
>
> Issue: After udmabuf is mapped via map_dma_buf, apps using memfd or
> udmabuf for Direct IO might cause errors, but there are no safeguards to
> prevent this.
>
> Allocate 32x32MB buffer and read 1024 MB file Test:
> Metric | alloc (ms) | read (ms) | total (ms)
> -----------------------|------------|-----------|-----------
> udmabuf buffer read | 539 | 2017 | 2555
> udmabuf direct read | 522 | 658 | 1179
> udmabuf buffer sendfile| 505 | 1040 | 1546
> udmabuf direct sendfile| 510 | 2269 | 2780
> dmabuf buffer read | 51 | 1068 | 1118
> dmabuf direct read | 52 | 297 | 349
>
> udmabuf sendfile test steps:
> 1. Open data file(1024MB), get back_fd
> 2. Create memfd(32MB) # Loop steps 2-6
> 3. Allocate udmabuf with memfd
> 4. Call sendfile(memfd, back_fd)
> 5. Close memfd after sendfile
> 6. Close udmabuf
> 7. Close back_fd
>
>>
>> Regards,
>> Christian.
>
On Fri, May 16, 2025 at 02:19:45PM +0800, Xu Yilun wrote:
> > I don't know why you'd disable a viommu while the VM is running,
> > doesn't make sense.
>
> Here it means remove the CC setup for viommu, shared setup is still
> kept.
That might makes sense for the vPCI function, but not the vIOMMU. A
secure VIOMMU needs to be running at all times while the guest is
running. Perhaps it has no devices it can be used with, but it's
functionality has to be there because a driver in the VM will be
connected to it.
At most "bind" should only tell the already existing secure vIOMMU
that it is allowed to translate for a specific vPCI function.
Jason
On 5/16/25 11:49, wangtao wrote:
>>>> Please try using udmabuf with sendfile() as confirmed to be working by
>> T.J.
>>> [wangtao] Using buffer IO with dmabuf file read/write requires one
>> memory copy.
>>> Direct IO removes this copy to enable zero-copy. The sendfile system
>>> call reduces memory copies from two (read/write) to one. However, with
>>> udmabuf, sendfile still keeps at least one copy, failing zero-copy.
>>
>>
>> Then please work on fixing this.
> [wangtao] What needs fixing? Does sendfile achieve zero-copy?
> sendfile reduces memory copies (from 2 to 1) for network sockets,
> but still requires one copy and cannot achieve zero copies.
Well why not? See sendfile() is the designated Linux uAPI for moving data between two files, maybe splice() is also appropriate.
The memory file descriptor and your destination file are both a files. So those uAPIs apply.
Now what you suggest is to add a new IOCTL to do this in a very specific manner just for the system DMA-buf heap. And as far as I can see that is in general a complete no-go.
I mean I understand why you do this. Instead of improving the existing functionality you're just hacking something together because it is simple for you.
It might be possible to implement that generic for DMA-buf heaps if udmabuf allocation overhead can't be reduced, but that is then just the second step.
Regards,
Christian.
Hi!
I previously discussed this with Simona on IRC but would like to get
some feedback also from a wider audience:
We're planning to share dma-bufs using a fast interconnect in a way
similar to pcie-p2p:
The rough plan is to identify dma-bufs capable of sharing this way by
looking at the address of either the dma-buf ops and / or the
importer_ops to conclude it's a device using the same driver (or
possibly child driver) and then take special action when the dma-
addresses are obtained. Nothing visible outside of the xe driver or its
child driver.
Are there any absolute "DON'T"s or recommendations to keep in mind WRT
to this approach?
Thanks,
Thomas
On Fri, May 16, 2025 at 02:02:29AM +0800, Xu Yilun wrote:
> > IMHO, I think it might be helpful that you can picture out what are the
> > minimum requirements (function/life cycle) to the current IOMMUFD TSM
> > bind architecture:
> >
> > 1.host tsm_bind (preparation) is in IOMMUFD, triggered by QEMU handling
> > the TVM-HOST call.
> > 2. TDI acceptance is handled in guest_request() to accept the TDI after
> > the validation in the TVM)
>
> I'll try my best to brainstorm and make a flow in ASCII.
>
> (*) means new feature
>
>
> Guest Guest TSM QEMU VFIO IOMMUFD host TSM KVM
> ----- --------- ---- ---- ------- -------- ---
> 1. *Connect(IDE)
> 2. Init vdev
open /dev/vfio/XX as a VFIO action
Then VFIO attaches to IOMMUFD as an iommufd action creating the idev
> 3. *create dmabuf
> 4. *export dmabuf
> 5. create memslot
> 6. *import dmabuf
> 7. setup shared DMA
> 8. create hwpt
> 9. attach hwpt
> 10. kvm run
> 11.enum shared dev
> 12.*Connect(Bind)
> 13. *GHCI Bind
> 14. *Bind
> 15 CC viommu alloc
> 16. vdevice allloc
viommu and vdevice creation happen before KVM run. The vPCI function
is visible to the guest from the very start, even though it is in T=0
mode. If a platform does not require any special CC steps prior to KVM
run then it just has a NOP for these functions.
What you have here is some new BIND operation against the already
existing vdevice as we discussed earlier.
> 16. *attach vdev
> 17. *setup CC viommu
> 18 *tsm_bind
> 19. *bind
> 20.*Attest
> 21. *GHCI get CC info
> 22. *get CC info
> 23. *vdev guest req
> 24. *guest req
> 25.*Accept
> 26. *GHCI accept MMIO/DMA
> 27. *accept MMIO/DMA
> 28. *vdev guest req
> 29. *guest req
> 30. *map private MMIO
> 31. *GHCI start tdi
> 32. *start tdi
> 33. *vdev guest req
> 34. *guest req
This seems reasonable you want to have some generic RPC scheme to
carry messages fro mthe VM to the TSM tunneled through the iommufd
vdevice (because the vdevice has the vPCI ID, the KVM ID, the VIOMMU
id and so on)
> 35.Workload...
> 36.*disconnect(Unbind)
> 37. *GHCI unbind
> 38. *Unbind
> 39. *detach vdev
unbind vdev. vdev remains until kvm is stopped.
> 40. *tsm_unbind
> 41. *TDX stop tdi
> 42. *TDX disable mmio cb
> 43. *cb dmabuf revoke
> 44. *unmap private MMIO
> 45. *TDX disable dma cb
> 46. *cb disable CC viommu
I don't know why you'd disable a viommu while the VM is running,
doesn't make sense.
> 47. *TDX tdi free
> 48. *enable mmio
> 49. *cb dmabuf recover
> 50.workable shared dev
This is a nice chart, it would be good to see a comparable chart for
AMD and ARM
Jason
On Fri, May 16, 2025 at 12:04:04AM +0800, Xu Yilun wrote:
> > arches this was mostly invisible to the hypervisor?
>
> Attest & Accept can be invisible to hypervisor, or host just help pass
> data blobs between guest, firmware & device.
>
> Bind cannot be host agnostic, host should be aware not to touch device
> after Bind.
I'm not sure this is fully true, this could be a Intel thing. When the
vPCI is created the host can already know it shouldn't touch the PCI
device anymore and the secure world would enforce that when it gets a
bind command.
The fact it hasn't been locked out immediately at vPCI creation time
is sort of a detail that doesn't matter, IMHO.
> > It might be reasonable to have VFIO reach into iommufd to do that on
> > an already existing iommufd VDEVICE object. A little weird, but we
> > could probably make that work.
>
> Mm, Are you proposing an uAPI in VFIO, and a kAPI from VFIO -> IOMMUFD like:
>
> ioctl(vfio_fd, VFIO_DEVICE_ATTACH_VDEV, vdev_id)
> -> iommufd_device_attach_vdev()
> -> tsm_tdi_bind()
Not ATTACH, you wanted BIND. You could have a VFIO_DEVICE_BIND(iommufd
vdevice id)
> > sees VFIO remove the S-EPT and release the KVM, then have iommufd
> > destroy the VDEVICE object.
>
> Regarding VM destroy, TDX Connect has more enforcement, VM could only be
> destroyed after all assigned CC vPCI devices are destroyed.
And KVM destroys the VM?
> Nowadays, VFIO already holds KVM reference, so we need
>
> close(vfio_fd)
> -> iommufd_device_detach_vdev()
This doesn't happen though, it destroys the normal device (idev) which
the vdevice is stacked on top of. You'd have to make normal device
destruction trigger vdevice destruction
> -> tsm_tdi_unbind()
> -> tdi stop
> -> callback to VFIO, dmabuf_move_notify(revoke)
> -> KVM unmap MMIO
> -> tdi metadata remove
This omits the viommu. It won't get destroyed until the iommufd
closes, so iommufd will be holding the kvm and it will do the final
put.
Jason
On 5/15/25 16:03, wangtao wrote:
> [wangtao] My Test Configuration (CPU 1GHz, 5-test average):
> Allocation: 32x32MB buffer creation
> - dmabuf 53ms vs. udmabuf 694ms (10X slower)
> - Note: shmem shows excessive allocation time
Yeah, that is something already noted by others as well. But that is orthogonal.
>
> Read 1024MB File:
> - dmabuf direct 326ms vs. udmabuf direct 461ms (40% slower)
> - Note: pin_user_pages_fast consumes majority CPU cycles
>
> Key function call timing: See details below.
Those aren't valid, you are comparing different functionalities here.
Please try using udmabuf with sendfile() as confirmed to be working by T.J.
Regards,
Christian.
On Wed, May 14, 2025 at 2:00 PM Song Liu <song(a)kernel.org> wrote:
>
> On Tue, May 13, 2025 at 9:36 AM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > Use the same test buffers as the traditional iterator and a new BPF map
> > to verify the test buffers can be found with the open coded dmabuf
> > iterator.
> >
> > Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
> > Acked-by: Christian König <christian.koenig(a)amd.com>
> > Acked-by: Song Liu <song(a)kernel.org>
> > ---
> > .../testing/selftests/bpf/bpf_experimental.h | 5 +++
> > .../selftests/bpf/prog_tests/dmabuf_iter.c | 41 +++++++++++++++++++
> > .../testing/selftests/bpf/progs/dmabuf_iter.c | 38 +++++++++++++++++
> > 3 files changed, 84 insertions(+)
> >
> > diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
> > index 6535c8ae3c46..5e512a1d09d1 100644
> > --- a/tools/testing/selftests/bpf/bpf_experimental.h
> > +++ b/tools/testing/selftests/bpf/bpf_experimental.h
> > @@ -591,4 +591,9 @@ extern int bpf_iter_kmem_cache_new(struct bpf_iter_kmem_cache *it) __weak __ksym
> > extern struct kmem_cache *bpf_iter_kmem_cache_next(struct bpf_iter_kmem_cache *it) __weak __ksym;
> > extern void bpf_iter_kmem_cache_destroy(struct bpf_iter_kmem_cache *it) __weak __ksym;
> >
> > +struct bpf_iter_dmabuf;
> > +extern int bpf_iter_dmabuf_new(struct bpf_iter_dmabuf *it) __weak __ksym;
> > +extern struct dma_buf *bpf_iter_dmabuf_next(struct bpf_iter_dmabuf *it) __weak __ksym;
> > +extern void bpf_iter_dmabuf_destroy(struct bpf_iter_dmabuf *it) __weak __ksym;
> > +
> > #endif
> > diff --git a/tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c b/tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
> > index dc740bd0e2bd..6c2b0c3dbcd8 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
> > @@ -219,14 +219,52 @@ static void subtest_dmabuf_iter_check_default_iter(struct dmabuf_iter *skel)
> > close(iter_fd);
> > }
> >
> > +static void subtest_dmabuf_iter_check_open_coded(struct dmabuf_iter *skel, int map_fd)
> > +{
> > + LIBBPF_OPTS(bpf_test_run_opts, topts);
> > + char key[DMA_BUF_NAME_LEN];
> > + int err, fd;
> > + bool found;
> > +
> > + /* No need to attach it, just run it directly */
> > + fd = bpf_program__fd(skel->progs.iter_dmabuf_for_each);
> > +
> > + err = bpf_prog_test_run_opts(fd, &topts);
> > + if (!ASSERT_OK(err, "test_run_opts err"))
> > + return;
> > + if (!ASSERT_OK(topts.retval, "test_run_opts retval"))
> > + return;
> > +
> > + if (!ASSERT_OK(bpf_map_get_next_key(map_fd, NULL, key), "get next key"))
> > + return;
> > +
> > + do {
> > + ASSERT_OK(bpf_map_lookup_elem(map_fd, key, &found), "lookup");
> > + ASSERT_TRUE(found, "found test buffer");
>
> This check failed once in the CI, on s390:
>
> Error: #89/3 dmabuf_iter/open_coded
> 9309 subtest_dmabuf_iter_check_open_coded:PASS:test_run_opts err 0 nsec
> 9310 subtest_dmabuf_iter_check_open_coded:PASS:test_run_opts retval 0 nsec
> 9311 subtest_dmabuf_iter_check_open_coded:PASS:get next key 0 nsec
> 9312 subtest_dmabuf_iter_check_open_coded:PASS:lookup 0 nsec
> 9313 subtest_dmabuf_iter_check_open_coded:FAIL:found test buffer
> unexpected found test buffer: got FALSE
>
> But it passed in the rerun. It is probably a bit flakey. Maybe we need some
> barrier somewhere.
>
> Here is the failure:
>
> https://github.com/kernel-patches/bpf/actions/runs/15002058808/job/42234864…
>
> To see the log, you need to log in GitHub.
>
> Thanks,
> Song
Thanks, yeah I have been trying to run this locally today but still
working on setting up an environment for it. Daniel Xu thoughtfully
suggested I use a github PR to trigger CI, but I tried that last week
without success: https://github.com/kernel-patches/bpf/pull/8910
I'm not sure if this is the cause (doesn't show up on the runs that
pass) but I have no idea why that would be intermittently failing:
libbpf: Error in bpf_create_map_xattr(testbuf_hash): -EINVAL. Retrying
without BTF.
> > + } while (bpf_map_get_next_key(map_fd, key, key));
> > +}
>
> [...]
On Wed, May 14, 2025 at 1:53 PM Song Liu <song(a)kernel.org> wrote:
>
> On Tue, May 13, 2025 at 9:36 AM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > This test creates a udmabuf, and a dmabuf from the system dmabuf heap,
> > and uses a BPF program that prints dmabuf metadata with the new
> > dmabuf_iter to verify they can be found.
> >
> > Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
> > Acked-by: Christian König <christian.koenig(a)amd.com>
>
> Acked-by: Song Liu <song(a)kernel.org>
Thanks.
>
> With one more comment below.
>
> [...]
>
> > diff --git a/tools/testing/selftests/bpf/progs/dmabuf_iter.c b/tools/testing/selftests/bpf/progs/dmabuf_iter.c
> > new file mode 100644
> > index 000000000000..2a1b5397196d
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/dmabuf_iter.c
> > @@ -0,0 +1,53 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright (c) 2025 Google LLC */
> > +#include <vmlinux.h>
> > +#include <bpf/bpf_core_read.h>
> > +#include <bpf/bpf_helpers.h>
> > +
> > +/* From uapi/linux/dma-buf.h */
> > +#define DMA_BUF_NAME_LEN 32
> > +
> > +char _license[] SEC("license") = "GPL";
> > +
> > +/*
> > + * Fields output by this iterator are delimited by newlines. Convert any
> > + * newlines in user-provided printed strings to spaces.
> > + */
> > +static void sanitize_string(char *src, size_t size)
> > +{
> > + for (char *c = src; *c && (size_t)(c - src) < size; ++c)
>
> We should do the size check first, right? IOW:
>
> for (char *c = src; (size_t)(c - src) < size && *c; ++c)
Yeah if you call the function with size = 0, which is kinda
questionable and not possible with the non-zero array size that is
tied to immutable UAPI. Let's change it like you suggest.
>
> > + if (*c == '\n')
> > + *c = ' ';
> > +}
> > +
> [...]