On Mon, May 12, 2025 at 1:30 PM Song Liu <song(a)kernel.org> wrote:
>
> On Mon, May 12, 2025 at 10:41 AM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > Use the same test buffers as the traditional iterator and a new BPF map
> > to verify the test buffers can be found with the open coded dmabuf
> > iterator.
> >
> > Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
> > Acked-by: Christian König <christian.koenig(a)amd.com>
>
> Acked-by: Song Liu <song(a)kernel.org>
Thanks, I'll send v6 this afternoon or tomorrow morning with all changes.
> With a nitpick below.
>
> [...]
>
> >
> > -static int create_test_buffers(void)
> > +static int create_test_buffers(int map_fd)
> > {
> > + bool f = false;
> > +
> > udmabuf = create_udmabuf();
> > sysheap_dmabuf = create_sys_heap_dmabuf();
> >
> > if (udmabuf < 0 || sysheap_dmabuf < 0)
> > return -1;
> >
> > - return 0;
> > + return bpf_map_update_elem(map_fd, udmabuf_test_buffer_name, &f, BPF_ANY) ||
> > + bpf_map_update_elem(map_fd, sysheap_test_buffer_name, &f, BPF_ANY);
>
> nit: Instead of passing map_fd in here, we can just call
> bpf_map_update_elem() in test_dmabuf_iter()
>
> [...]
>
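For reference, a minimal sketch of the shape being suggested, with the map
update moved out of create_test_buffers() and into the caller. The identifiers
come from the quoted selftest; the surrounding test_dmabuf_iter() code is
assumed here for illustration, not copied from the series:

static int create_test_buffers(void)
{
	udmabuf = create_udmabuf();
	sysheap_dmabuf = create_sys_heap_dmabuf();

	return (udmabuf < 0 || sysheap_dmabuf < 0) ? -1 : 0;
}

/* ... and in test_dmabuf_iter(), once map_fd is available: */
	bool f = false;

	if (!ASSERT_OK(create_test_buffers(), "create_test_buffers"))
		goto destroy;
	if (!ASSERT_OK(bpf_map_update_elem(map_fd, udmabuf_test_buffer_name, &f, BPF_ANY), "udmabuf elem") ||
	    !ASSERT_OK(bpf_map_update_elem(map_fd, sysheap_test_buffer_name, &f, BPF_ANY), "sysheap elem"))
		goto destroy;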
On 5/12/25 11:14, Tvrtko Ursulin wrote:
>
> On 12/05/2025 09:19, Christian König wrote:
>> On 5/9/25 17:33, Tvrtko Ursulin wrote:
>>> With the goal of reducing the need for drivers to touch fence->ops, we
>>> add explicit flags for struct dma_fence_array and struct dma_fence_chain
>>> and make the respective helpers (dma_fence_is_array() and
>>> dma_fence_is_chain()) use them.
>>>
>>> This also allows us to remove the exported symbols for the respective
>>> operation tables.
>>
>> That looks like overkill to me. We don't de-reference the ops for the check; we just compare the pointer values.
>>
>> Since the array and chain are always built in, that should be completely unproblematic for driver unload.
>
> You are right this is not strictly needed. The idea was just to reduce access to ops as much as we can, and this fell under that scope.
>
> Another benefit one could argue is two fewer EXPORT_SYMBOLs, which is perhaps a slightly cleaner design (less exporting of implementation details to the outside), but it is not a super strong argument.
I would rather say that using the symbols improves things. Background is that otherwise every driver could set this flag, either accidentally or with malicious intent.
The symbol is not so easily changeable.
Regards,
Christian.
>
> If we will not be going for this one then I would be taking 1/13 via drm-intel-gt-next.
>
> Regards,
>
> Tvrtko
>
>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
>>> ---
>>> drivers/dma-buf/dma-fence-array.c | 2 +-
>>> drivers/dma-buf/dma-fence-chain.c | 2 +-
>>> include/linux/dma-fence.h | 9 ++++-----
>>> 3 files changed, 6 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-fence-array.c b/drivers/dma-buf/dma-fence-array.c
>>> index 6657d4b30af9..daf444f5d228 100644
>>> --- a/drivers/dma-buf/dma-fence-array.c
>>> +++ b/drivers/dma-buf/dma-fence-array.c
>>> @@ -167,7 +167,6 @@ const struct dma_fence_ops dma_fence_array_ops = {
>>> .release = dma_fence_array_release,
>>> .set_deadline = dma_fence_array_set_deadline,
>>> };
>>> -EXPORT_SYMBOL(dma_fence_array_ops);
>>> /**
>>> * dma_fence_array_alloc - Allocate a custom fence array
>>> @@ -207,6 +206,7 @@ void dma_fence_array_init(struct dma_fence_array *array,
>>> spin_lock_init(&array->lock);
>>> dma_fence_init(&array->base, &dma_fence_array_ops, &array->lock,
>>> context, seqno);
>>> + __set_bit(DMA_FENCE_FLAG_ARRAY_BIT, &array->base.flags);
>>> init_irq_work(&array->work, irq_dma_fence_array_work);
>>> atomic_set(&array->num_pending, signal_on_any ? 1 : num_fences);
>>> diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c
>>> index a8a90acf4f34..f4abe41fb092 100644
>>> --- a/drivers/dma-buf/dma-fence-chain.c
>>> +++ b/drivers/dma-buf/dma-fence-chain.c
>>> @@ -225,7 +225,6 @@ const struct dma_fence_ops dma_fence_chain_ops = {
>>> .release = dma_fence_chain_release,
>>> .set_deadline = dma_fence_chain_set_deadline,
>>> };
>>> -EXPORT_SYMBOL(dma_fence_chain_ops);
>>> /**
>>> * dma_fence_chain_init - initialize a fence chain
>>> @@ -263,6 +262,7 @@ void dma_fence_chain_init(struct dma_fence_chain *chain,
>>> dma_fence_init64(&chain->base, &dma_fence_chain_ops, &chain->lock,
>>> context, seqno);
>>> + __set_bit(DMA_FENCE_FLAG_CHAIN_BIT, &chain->base.flags);
>>> /*
>>> * Chaining dma_fence_chain container together is only allowed through
>>> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
>>> index ac6535716dbe..5bafd0a5f1f1 100644
>>> --- a/include/linux/dma-fence.h
>>> +++ b/include/linux/dma-fence.h
>>> @@ -98,6 +98,8 @@ struct dma_fence {
>>> enum dma_fence_flag_bits {
>>> DMA_FENCE_FLAG_SEQNO64_BIT,
>>> + DMA_FENCE_FLAG_ARRAY_BIT,
>>> + DMA_FENCE_FLAG_CHAIN_BIT,
>>> DMA_FENCE_FLAG_SIGNALED_BIT,
>>> DMA_FENCE_FLAG_TIMESTAMP_BIT,
>>> DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
>>> @@ -632,9 +634,6 @@ struct dma_fence *dma_fence_get_stub(void);
>>> struct dma_fence *dma_fence_allocate_private_stub(ktime_t timestamp);
>>> u64 dma_fence_context_alloc(unsigned num);
>>> -extern const struct dma_fence_ops dma_fence_array_ops;
>>> -extern const struct dma_fence_ops dma_fence_chain_ops;
>>> -
>>> /**
>>> * dma_fence_is_array - check if a fence is from the array subclass
>>> * @fence: the fence to test
>>> @@ -643,7 +642,7 @@ extern const struct dma_fence_ops dma_fence_chain_ops;
>>> */
>>> static inline bool dma_fence_is_array(struct dma_fence *fence)
>>> {
>>> - return fence->ops == &dma_fence_array_ops;
>>> + return test_bit(DMA_FENCE_FLAG_ARRAY_BIT, &fence->flags);
>>> }
>>> /**
>>> @@ -654,7 +653,7 @@ static inline bool dma_fence_is_array(struct dma_fence *fence)
>>> */
>>> static inline bool dma_fence_is_chain(struct dma_fence *fence)
>>> {
>>> - return fence->ops == &dma_fence_chain_ops;
>>> + return test_bit(DMA_FENCE_FLAG_CHAIN_BIT, &fence->flags);
>>> }
>>> /**
>>
>
Until CONFIG_DMABUF_SYSFS_STATS was added [1] it was only possible to
perform per-buffer accounting with debugfs, which is not suitable for
production environments. Eventually we discovered that the overhead of
per-buffer sysfs file creation/removal was significantly impacting
allocation and free times, and exacerbated kernfs lock contention. [2]
dma_buf_stats_setup() is responsible for 39% of single-page buffer
creation duration, or 74% of single-page dma_buf_export() duration when
stressing dmabuf allocations and frees.
I prototyped a change from per-buffer to per-exporter statistics with a
RCU protected list of exporter allocations that accommodates most (but
not all) of our use-cases and avoids almost all of the sysfs overhead.
While that adds less overhead than per-buffer sysfs, and less even than
the maintenance of the dmabuf debugfs_list, it's still *additional*
overhead on top of the debugfs_list and doesn't give us per-buffer info.
This series uses the existing dmabuf debugfs_list to implement a BPF
dmabuf iterator, which adds no overhead to buffer allocation/free and
provides per-buffer info. The list has been moved outside of
CONFIG_DEBUG_FS scope so that it is always populated. The BPF program
loaded by userspace that extracts per-buffer information gets to define
its own interface, which avoids debugfs's lack of ABI stability.
This will allow us to replace our use of CONFIG_DMABUF_SYSFS_STATS, and
the plan is to remove it from the kernel after the next longterm stable
release.
[1] https://lore.kernel.org/linux-media/20201210044400.1080308-1-hridya@google.…
[2] https://lore.kernel.org/all/20220516171315.2400578-1-tjmercier@google.com
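To make the approach concrete, below is a minimal sketch of the kind of
per-buffer BPF program described above. The section name and context struct
follow the usual BPF iterator conventions (SEC("iter/dmabuf") and a
struct bpf_iter__dmabuf generated into vmlinux.h once the kernel side is
present); treat the names and the printed fields as illustrative assumptions
rather than an excerpt from this series:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("iter/dmabuf")
int dmabuf_report(struct bpf_iter__dmabuf *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct dma_buf *d = ctx->dmabuf;

	if (!d)
		return 0;

	/* The loaded program, not the kernel, defines the output format. */
	BPF_SEQ_PRINTF(seq, "%lu %s\n", d->size, d->exp_name);
	return 0;
}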
v1: https://lore.kernel.org/all/20250414225227.3642618-1-tjmercier@google.com
v1 -> v2:
Make the DMA buffer list independent of CONFIG_DEBUG_FS per Christian
König
Add CONFIG_DMA_SHARED_BUFFER check to kernel/bpf/Makefile per kernel
test robot
Use BTF_ID_LIST_SINGLE instead of BTF_ID_LIST_GLOBAL_SINGLE per Song Liu
Fixup comment style, mixing code/declarations, and use ASSERT_OK_FD in
selftest per Song Liu
Add BPF_ITER_RESCHED feature to bpf_dmabuf_reg_info per Alexei
Starovoitov
Add open-coded iterator and selftest per Alexei Starovoitov
Add a second test buffer from the system dmabuf heap to selftests
Use the BPF program we'll use in production for selftest per Alexei
Starovoitov
https://r.android.com/c/platform/system/bpfprogs/+/3616123/2/dmabufIter.c
https://r.android.com/c/platform/system/memory/libmeminfo/+/3614259/1/libdm…
v2: https://lore.kernel.org/all/20250504224149.1033867-1-tjmercier@google.com
v2 -> v3:
Rebase onto bpf-next/master
Move get_next_dmabuf() into drivers/dma-buf/dma-buf.c, along with the
new get_first_dmabuf(). This avoids having to expose the dmabuf list
and mutex to the rest of the kernel, and keeps the dmabuf mutex
operations near each other in the same file. (Christian König)
Add Christian's RB to dma-buf: Rename debugfs symbols
Drop RFC: dma-buf: Remove DMA-BUF statistics
v3: https://lore.kernel.org/all/20250507001036.2278781-1-tjmercier@google.com
v3 -> v4:
Fix selftest BPF program comment style (not kdoc) per Alexei Starovoitov
Fix dma-buf.c kdoc comment style per Alexei Starovoitov
Rename get_first_dmabuf / get_next_dmabuf to dma_buf_iter_begin /
dma_buf_iter_next per Christian König
Add Christian's RB to bpf: Add dmabuf iterator
v4: https://lore.kernel.org/all/20250508182025.2961555-1-tjmercier@google.com
v4 -> v5:
Add Christian's Acks to all patches
Add Song Liu's Acks
Move BTF_ID_LIST_SINGLE and DEFINE_BPF_ITER_FUNC closer to usage per
Song Liu
Fix open-coded iterator comment style per Song Liu
Move iterator termination check to its own subtest per Song Liu
Rework selftest buffer creation per Song Liu
Fix spacing in sanitize_string per BPF CI
T.J. Mercier (5):
dma-buf: Rename debugfs symbols
bpf: Add dmabuf iterator
bpf: Add open coded dmabuf iterator
selftests/bpf: Add test for dmabuf_iter
selftests/bpf: Add test for open coded dmabuf_iter
drivers/dma-buf/dma-buf.c | 98 +++++--
include/linux/dma-buf.h | 4 +-
kernel/bpf/Makefile | 3 +
kernel/bpf/dmabuf_iter.c | 150 ++++++++++
kernel/bpf/helpers.c | 5 +
.../testing/selftests/bpf/bpf_experimental.h | 5 +
tools/testing/selftests/bpf/config | 3 +
.../selftests/bpf/prog_tests/dmabuf_iter.c | 276 ++++++++++++++++++
.../testing/selftests/bpf/progs/dmabuf_iter.c | 91 ++++++
9 files changed, 613 insertions(+), 22 deletions(-)
create mode 100644 kernel/bpf/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/progs/dmabuf_iter.c
base-commit: 43745d11bfd9683abdf08ad7a5cc403d6a9ffd15
--
2.49.0.1045.g170613ef41-goog
On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy wrote:
> > > I'm surprised by this.. iommufd shouldn't be doing PCI stuff, it is
> > > just about managing the translation control of the device.
> >
> > I have a little difficulty understanding this. Is TSM bind PCI stuff? To me
> > it is. Host sends PCI TDISP messages via PCI DOE to put the device in
> > TDISP LOCKED state, so that device behaves differently from before. Then
> > why put it in IOMMUFD?
>
>
> "TSM bind" sets up the CPU side of it, it binds a VM to a piece of
> IOMMU on the host CPU. The device does not know about the VM, it
> just enables/disables encryption by a request from the CPU (those
> start/stop interface commands). And IOMMUFD won't be doing DOE, the
> platform driver (such as AMD CCP) will. Nothing to do for VFIO here.
>
> We probably should notify VFIO about the state transition but I do
> not know what VFIO would want to do in response.
We have an awkward fit for what CCA people are doing to the various
Linux APIs. Looking somewhat maximally across all the arches a "bind"
for a CC vPCI device creation operation does:
- Setup the CPU page tables for the VM to have access to the MMIO
- Revoke hypervisor access to the MMIO
- Setup the vIOMMU to understand the vPCI device
- Take over control of some of the IOVA translation, at least for T=1,
and route it to the vIOMMU
- Register the vPCI with any attestation functions the VM might use
- Do some DOE stuff to manage/validate TDISP/etc
So we have interactions of things controlled by PCI, KVM, VFIO, and
iommufd all mushed together.
iommufd is the only area that already has a handle to all the required
objects:
- The physical PCI function
- The CC vIOMMU object
- The KVM FD
- The CC vPCI object
Which is why I have been thinking it is the right place to manage
this.
It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
stays in VFIO.
> > > So your issue is you need to shoot down the dmabuf during vPCI device
> > > destruction?
> >
> > I assume "vPCI device" refers to assigned device in both shared mode &
> > prvate mode. So no, I need to shoot down the dmabuf during TSM unbind,
> > a.k.a. when assigned device is converting from private to shared.
> > Then recover the dmabuf after TSM unbind. The device could still work
> > in VM in shared mode.
What are you trying to protect with this? Is there some Intel-ism where
you can't have references to encrypted MMIO pages?
> > What I really want is one SW component to manage the MMIO dmabuf, secure
> > iommu & TSM bind/unbind. That makes it easier to coordinate these 3 operations,
> > because these ops are interconnected according to the secure firmware's requirements.
>
> This SW component is QEMU. It knows about FLRs and other config
> space things, it can destroy all these IOMMUFD objects and talk to
> VFIO too. I've tried it, and so far it is looking easier to manage. Thanks,
Yes, qemu should be sequencing this. The kernel only needs to enforce
any rules required to keep the system from crashing.
Jason
On 5/12/25 13:12, Hyejeong Choi wrote:
> smp_store_mb() inserts the memory barrier after the store operation.
> That is different from what the comment originally intends, so a NULL
> pointer dereference can happen if the memory update is reordered.
>
> Signed-off-by: Hyejeong Choi <hjeong.choi(a)samsung.com>
> ---
> drivers/dma-buf/dma-resv.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
> index 5f8d010516f0..52af5c7430da 100644
> --- a/drivers/dma-buf/dma-resv.c
> +++ b/drivers/dma-buf/dma-resv.c
> @@ -320,8 +320,9 @@ void dma_resv_add_fence(struct dma_resv *obj, struct dma_fence *fence,
> count++;
>
> dma_resv_list_set(fobj, i, fence, usage);
> - /* pointer update must be visible before we extend the num_fences */
> - smp_store_mb(fobj->num_fences, count);
> + /* fence update must be visible before we extend the num_fences */
> + smp_wmb();
> + WRITE_ONCE(fobj->num_fences, count);
The WRITE_ONCE isn't necessary since smp_wmb() implies a compiler barrier, but apart from that really good catch.
Can you modify the patch and re-send? I will be pushing it to -fixes ASAP.
Regards,
Christian.
> }
> EXPORT_SYMBOL(dma_resv_add_fence);
>
>
>
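For readers following along, here is a generic illustration of the
publish/observe ordering the fix is about - not the dma-resv code itself, and
with made-up names. The new entry must be visible before the count that tells
readers to look at it, which is why the barrier has to sit between the two
stores rather than after the count update as smp_store_mb() would place it:

struct flex_list {
	unsigned int num;
	void *slots[];
};

static void publish(struct flex_list *l, unsigned int i, void *item)
{
	WRITE_ONCE(l->slots[i], item);
	/* entry must be visible before readers observe the new count */
	smp_wmb();
	WRITE_ONCE(l->num, i + 1);   /* smp_store_mb() would barrier after this store */
}

static void *read_newest(struct flex_list *l)
{
	unsigned int n = READ_ONCE(l->num);

	/* pairs with the smp_wmb() in publish() */
	smp_rmb();
	return n ? READ_ONCE(l->slots[n - 1]) : NULL;
}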
On 5/9/25 17:47, Matthew Brost wrote:
> On Fri, May 09, 2025 at 04:33:40PM +0100, Tvrtko Ursulin wrote:
>> Replace open-coded helper with the subsystem one.
>>
>
> You probably can just send this one by itself as it good cleanup and
> independent.
>
> Reviewed-by: Matthew Brost <matthew.brost(a)intel.com>
Reviewed-by: Christian König <christian.koenig(a)amd.com>
Any objections to me starting to push those patches to drm-misc-next, or do you want to take this one through the i915 branch?
Regards,
Christian.
>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
>> ---
>> drivers/gpu/drm/i915/gem/i915_gem_wait.c | 7 +------
>> 1 file changed, 1 insertion(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_wait.c b/drivers/gpu/drm/i915/gem/i915_gem_wait.c
>> index 7127e90c1a8f..991666fd9f85 100644
>> --- a/drivers/gpu/drm/i915/gem/i915_gem_wait.c
>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_wait.c
>> @@ -106,11 +106,6 @@ static void fence_set_priority(struct dma_fence *fence,
>> rcu_read_unlock();
>> }
>>
>> -static inline bool __dma_fence_is_chain(const struct dma_fence *fence)
>> -{
>> - return fence->ops == &dma_fence_chain_ops;
>> -}
>> -
>> void i915_gem_fence_wait_priority(struct dma_fence *fence,
>> const struct i915_sched_attr *attr)
>> {
>> @@ -126,7 +121,7 @@ void i915_gem_fence_wait_priority(struct dma_fence *fence,
>>
>> for (i = 0; i < array->num_fences; i++)
>> fence_set_priority(array->fences[i], attr);
>> - } else if (__dma_fence_is_chain(fence)) {
>> + } else if (dma_fence_is_chain(fence)) {
>> struct dma_fence *iter;
>>
>> /* The chain is ordered; if we boost the last, we boost all */
>> --
>> 2.48.0
>>
On Fri, May 9, 2025 at 2:58 PM Song Liu <song(a)kernel.org> wrote:
>
> On Fri, May 9, 2025 at 2:43 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> [...]
> > >
> > > Personally, I would prefer we just merge all the logic of
> > > create_udmabuf() and create_sys_heap_dmabuf()
> > > into create_test_buffers().
> >
> > That's a lot of different stuff to put in one place. How about
> > returning file descriptors from the buffer create functions while
> > having them clean up after themselves:
>
> I do like this version better. Some nitpicks though.
>
> >
> > -static int memfd, udmabuf;
> > +static int udmabuf;
>
> About this, and ...
>
> > static const char udmabuf_test_buffer_name[DMA_BUF_NAME_LEN] =
> > "udmabuf_test_buffer_for_iter";
> > static size_t udmabuf_test_buffer_size;
> > static int sysheap_dmabuf;
> > static const char sysheap_test_buffer_name[DMA_BUF_NAME_LEN] =
> > "sysheap_test_buffer_for_iter";
> > static size_t sysheap_test_buffer_size;
> >
> > -static int create_udmabuf(int map_fd)
> > +static int create_udmabuf(void)
> > {
> > struct udmabuf_create create;
> > - int dev_udmabuf;
> > - bool f = false;
> > + int dev_udmabuf, memfd, udmabuf;
> .. here.
>
> It is not ideal to have a global udmabuf and a local udmabuf.
> If we want the global version, let's rename the local one.
OK, let me rename the shadowing local variable to local_udmabuf.
> [...]
>
> >
> > static int create_test_buffers(int map_fd)
> > {
> > - int ret;
> > + bool f = false;
> > +
> > + udmabuf = create_udmabuf();
> > + sysheap_dmabuf = create_sys_heap_dmabuf();
> >
> > - ret = create_udmabuf(map_fd);
> > - if (ret)
> > - return ret;
> > + if (udmabuf < 0 || sysheap_dmabuf < 0)
> > + return -1;
>
> We also need destroy_test_buffers() on the error path here,
> or at the caller.
The caller currently checks the return value to decide whether it should bother
running the tests, and calls destroy_test_buffers() if not.
> > - return create_sys_heap_dmabuf(map_fd);
> > + return bpf_map_update_elem(map_fd, udmabuf_test_buffer_name,
> > &f, BPF_ANY) ||
> > + bpf_map_update_elem(map_fd, sysheap_test_buffer_name,
> > &f, BPF_ANY);
> > }
> >
> > static void destroy_test_buffers(void)
> > {
> > close(udmabuf);
> > - close(memfd);
> > close(sysheap_dmabuf);
>
> For the two global fds, let's reset them to -1 right after close().
>
> Thanks,
> Song
Will do, thanks.
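For completeness, a sketch of what that cleanup might look like with the fds
reset after close(), assuming the globals quoted above (not copied from the
eventual v6):

static void destroy_test_buffers(void)
{
	close(udmabuf);
	udmabuf = -1;

	close(sysheap_dmabuf);
	sysheap_dmabuf = -1;
}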
On Sat, May 10, 2025 at 12:28:48AM +0800, Xu Yilun wrote:
> On Fri, May 09, 2025 at 07:12:46PM +0800, Xu Yilun wrote:
> > On Fri, May 09, 2025 at 01:04:58PM +1000, Alexey Kardashevskiy wrote:
> > > Ping?
> >
> > Sorry for late reply from vacation.
> >
> > > Also, since there is pushback on 01/12 "dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI", what is the plan now? Thanks,
> >
> > As discussed in the thread, this kAPI is not well considered, but IIUC
> > the concept of "importer mapping" is still valid. We need more
> > investigation about all the needs - P2P, CC memory, private bus
> > channel, and work out a formal API.
> >
> > However in the last few months I've been focusing on the high level TIO flow - TSM
> > framework and IOMMUFD based bind/unbind, so not much progress here and I'm
> > still using this temporary kAPI. But as long as "importer mapping" is
> > alive, the dmabuf fd for KVM is still valid and we could enable TIO
> > based on that.
>
> Oh I forgot to mention I moved the dmabuf creation from VFIO to IOMMUFD
> recently, the IOCTL is against iommufd_device.
I'm surprised by this.. iommufd shouldn't be doing PCI stuff, it is
just about managing the translation control of the device.
> According to Jason's
> opinion [1], TSM bind/unbind should be called against iommufd_device,
> then I need to do the same for dmabuf. This is because Intel TDX
> Connect enforces a specific operation sequence between TSM unbind & MMIO
> unmap:
>
> 1. STOP TDI via TDISP message STOP_INTERFACE
> 2. Private MMIO unmap from Secure EPT
> 3. Trusted Device Context Table cleanup for the TDI
> 4. TDI ownership reclaim and metadata free
So your issue is you need to shoot down the dmabuf during vPCI device
destruction?
VFIO also needs to shoot down the MMIO during things like FLR
I don't think moving to iommufd really fixes it, it sounds like you
need more coordination between the two parts??
Jason
On Thu, May 8, 2025 at 5:36 PM Song Liu <song(a)kernel.org> wrote:
>
> On Thu, May 8, 2025 at 11:20 AM T.J. Mercier <tjmercier(a)google.com> wrote:
> [...]
> > diff --git a/tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c b/tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
> > new file mode 100644
> > index 000000000000..35745f4ce0f8
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
> > @@ -0,0 +1,224 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright (c) 2025 Google */
> > +
> > +#include <test_progs.h>
> > +#include <bpf/libbpf.h>
> > +#include <bpf/btf.h>
> > +#include "dmabuf_iter.skel.h"
> > +
> > +#include <fcntl.h>
> > +#include <stdbool.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/mman.h>
> > +#include <unistd.h>
> > +
> > +#include <linux/dma-buf.h>
> > +#include <linux/dma-heap.h>
> > +#include <linux/udmabuf.h>
> > +
> > +static int memfd, udmabuf;
>
> Global fds are weird. AFAICT, we don't really need them
> to be global? If we really need them to be global, please
> initialize them to -1, just in case we close(0) by accident.
Hmm, no we don't really need them to be global but I didn't really
want to pass all these variables around to all the setup and test
functions. The fd lifetimes are nearly the whole program lifetime
anyways, and just need to exist without actually being used for
anything. I'll add the -1 initialization as you suggest. If udmabuf
creation failed, we would have done a close(0) in
destroy_test_buffers() on the sysheap_dmabuf fd.
> > +static const char udmabuf_test_buffer_name[DMA_BUF_NAME_LEN] = "udmabuf_test_buffer_for_iter";
> > +static size_t udmabuf_test_buffer_size;
> > +static int sysheap_dmabuf;
> > +static const char sysheap_test_buffer_name[DMA_BUF_NAME_LEN] = "sysheap_test_buffer_for_iter";
> > +static size_t sysheap_test_buffer_size;
On Thu, May 8, 2025 at 5:28 PM Song Liu <song(a)kernel.org> wrote:
>
> On Thu, May 8, 2025 at 11:20 AM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > This open coded iterator allows for more flexibility when creating BPF
> > programs. It can support output in formats other than text. With an open
> > coded iterator, a single BPF program can traverse multiple kernel data
> > structures (now including dmabufs), allowing for more efficient analysis
> > of kernel data compared to multiple reads from procfs, sysfs, or
> > multiple traditional BPF iterator invocations.
> >
> > Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
>
> Acked-by: Song Liu <song(a)kernel.org>
>
> With one nitpick below:
>
> > ---
> > kernel/bpf/dmabuf_iter.c | 47 ++++++++++++++++++++++++++++++++++++++++
> > kernel/bpf/helpers.c | 5 +++++
> > 2 files changed, 52 insertions(+)
> >
> > diff --git a/kernel/bpf/dmabuf_iter.c b/kernel/bpf/dmabuf_iter.c
> > index 96b4ba7f0b2c..8049bdbc9efc 100644
> > --- a/kernel/bpf/dmabuf_iter.c
> > +++ b/kernel/bpf/dmabuf_iter.c
> > @@ -100,3 +100,50 @@ static int __init dmabuf_iter_init(void)
> > }
> >
> > late_initcall(dmabuf_iter_init);
> > +
> > +struct bpf_iter_dmabuf {
> > + /* opaque iterator state; having __u64 here allows to preserve correct
> > + * alignment requirements in vmlinux.h, generated from BTF
> > + */
>
> nit: comment style.
Added a leading /*
(This is copied from task_iter.c, which currently has the same style.)
> > + __u64 __opaque[1];
> > +} __aligned(8);
> > +
> > +/* Non-opaque version of bpf_iter_dmabuf */
> > +struct bpf_iter_dmabuf_kern {
> > + struct dma_buf *dmabuf;
> > +} __aligned(8);
> > +
> [...]
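For illustration, a minimal sketch of how a BPF program might consume this
open-coded iterator. The kfunc names follow the usual
bpf_iter_<type>_{new,next,destroy} convention declared in bpf_experimental.h
and are assumptions here, not an excerpt from the selftest:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"

char _license[] SEC("license") = "GPL";

SEC("syscall")
int dump_dmabufs(void *ctx)
{
	struct bpf_iter_dmabuf it;
	struct dma_buf *d;

	if (bpf_iter_dmabuf_new(&it))
		return 0;

	/* one program can walk dmabufs alongside other kernel objects */
	while ((d = bpf_iter_dmabuf_next(&it))) {
		/* inspect *d here, e.g. d->size or d->exp_name */
	}

	bpf_iter_dmabuf_destroy(&it);
	return 0;
}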
Until CONFIG_DMABUF_SYSFS_STATS was added [1] it was only possible to
perform per-buffer accounting with debugfs, which is not suitable for
production environments. Eventually we discovered that the overhead of
per-buffer sysfs file creation/removal was significantly impacting
allocation and free times, and exacerbated kernfs lock contention. [2]
dma_buf_stats_setup() is responsible for 39% of single-page buffer
creation duration, or 74% of single-page dma_buf_export() duration when
stressing dmabuf allocations and frees.
I prototyped a change from per-buffer to per-exporter statistics with a
RCU protected list of exporter allocations that accommodates most (but
not all) of our use-cases and avoids almost all of the sysfs overhead.
While that adds less overhead than per-buffer sysfs, and less even than
the maintenance of the dmabuf debugfs_list, it's still *additional*
overhead on top of the debugfs_list and doesn't give us per-buffer info.
This series uses the existing dmabuf debugfs_list to implement a BPF
dmabuf iterator, which adds no overhead to buffer allocation/free and
provides per-buffer info. The list has been moved outside of
CONFIG_DEBUG_FS scope so that it is always populated. The BPF program
loaded by userspace that extracts per-buffer information gets to define
its own interface, which avoids debugfs's lack of ABI stability.
This will allow us to replace our use of CONFIG_DMABUF_SYSFS_STATS, and
the plan is to remove it from the kernel after the next longterm stable
release.
[1] https://lore.kernel.org/linux-media/20201210044400.1080308-1-hridya@google.…
[2] https://lore.kernel.org/all/20220516171315.2400578-1-tjmercier@google.com
v1: https://lore.kernel.org/all/20250414225227.3642618-1-tjmercier@google.com
v1 -> v2:
Make the DMA buffer list independent of CONFIG_DEBUG_FS per Christian König
Add CONFIG_DMA_SHARED_BUFFER check to kernel/bpf/Makefile per kernel test robot
Use BTF_ID_LIST_SINGLE instead of BTF_ID_LIST_GLOBAL_SINGLE per Song Liu
Fixup comment style, mixing code/declarations, and use ASSERT_OK_FD in selftest per Song Liu
Add BPF_ITER_RESCHED feature to bpf_dmabuf_reg_info per Alexei Starovoitov
Add open-coded iterator and selftest per Alexei Starovoitov
Add a second test buffer from the system dmabuf heap to selftests
Use the BPF program we'll use in production for selftest per Alexei Starovoitov
https://r.android.com/c/platform/system/bpfprogs/+/3616123/2/dmabufIter.c
https://r.android.com/c/platform/system/memory/libmeminfo/+/3614259/1/libdm…
v2: https://lore.kernel.org/all/20250504224149.1033867-1-tjmercier@google.com
v2 -> v3:
Rebase onto bpf-next/master
Move get_next_dmabuf() into drivers/dma-buf/dma-buf.c, along with the
new get_first_dmabuf(). This avoids having to expose the dmabuf list
and mutex to the rest of the kernel, and keeps the dmabuf mutex
operations near each other in the same file. (Christian König)
Add Christian's RB to dma-buf: Rename debugfs symbols
Drop RFC: dma-buf: Remove DMA-BUF statistics
v3: https://lore.kernel.org/all/20250507001036.2278781-1-tjmercier@google.com
v3 -> v4:
Fix selftest BPF program comment style (not kdoc) per Alexei Starovoitov
Fix dma-buf.c kdoc comment style per Alexei Starovoitov
Rename get_first_dmabuf / get_next_dmabuf to dma_buf_iter_begin /
dma_buf_iter_next per Christian König
Add Christian's RB to bpf: Add dmabuf iterator
T.J. Mercier (5):
dma-buf: Rename debugfs symbols
bpf: Add dmabuf iterator
bpf: Add open coded dmabuf iterator
selftests/bpf: Add test for dmabuf_iter
selftests/bpf: Add test for open coded dmabuf_iter
drivers/dma-buf/dma-buf.c | 98 +++++--
include/linux/dma-buf.h | 4 +-
kernel/bpf/Makefile | 3 +
kernel/bpf/dmabuf_iter.c | 149 ++++++++++
kernel/bpf/helpers.c | 5 +
.../testing/selftests/bpf/bpf_experimental.h | 5 +
tools/testing/selftests/bpf/config | 3 +
.../selftests/bpf/prog_tests/dmabuf_iter.c | 258 ++++++++++++++++++
.../testing/selftests/bpf/progs/dmabuf_iter.c | 91 ++++++
9 files changed, 594 insertions(+), 22 deletions(-)
create mode 100644 kernel/bpf/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/progs/dmabuf_iter.c
base-commit: 43745d11bfd9683abdf08ad7a5cc403d6a9ffd15
--
2.49.0.1015.ga840276032-goog
On 5/8/25 11:13, Philipp Stanner wrote:
> On Mon, 2025-04-28 at 16:45 +0200, Christian König wrote:
>> On 4/24/25 15:02, Philipp Stanner wrote:
>>> In nouveau_fence_done(), a fence is checked for being signaled by
>>> manually evaluating the base fence's bits. This can be done in a
>>> canonical manner through dma_fence_is_signaled().
>>>
>>> Replace the bit-check with dma_fence_is_signaled().
>>>
>>> Signed-off-by: Philipp Stanner <phasta(a)kernel.org>
>>
>>
>> I think the bit check was used here as a fast path optimization because
>> we later call dma_fence_is_signaled() anyway.
>
> That fast path optimization effectively saves one JMP instruction to
> the function.
What I meant was that we might completely drop that optimization. It looks like overkill and potentially hides bugs.
Regards,
Christian.
>
> I'm increasingly of the opinion that we shall work towards all DRM
> users only ever using infrastructure through officially documented API
> functions, without touching internal data structures.
>
>> Feel free to add my acked-by, but honestly what nouveau does here
>> looks rather suspicious to me.
>
> :)
>
>
> P.
>
>>
>> Regards,
>> Christian.
>>
>>> ---
>>> drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c
>>> b/drivers/gpu/drm/nouveau/nouveau_fence.c
>>> index fb9811938c82..d5654e26d5bc 100644
>>> --- a/drivers/gpu/drm/nouveau/nouveau_fence.c
>>> +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
>>> @@ -253,7 +253,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
>>> struct nouveau_channel *chan;
>>> unsigned long flags;
>>>
>>> - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence-
>>>> base.flags))
>>> + if (dma_fence_is_signaled(&fence->base))
>>> return true;
>>>
>>> spin_lock_irqsave(&fctx->lock, flags);
>>
>
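As background on why the bit test in nouveau was only a fast path, here is a
condensed sketch of what dma_fence_is_signaled() does beyond the flag check
(simplified from memory of include/linux/dma-fence.h, not a verbatim copy):

static inline bool is_signaled_sketch(struct dma_fence *fence)
{
	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
		return true;			/* the fast path nouveau open-coded */

	/* slow path: poll the backend, which may signal the fence right now */
	if (fence->ops->signaled && fence->ops->signaled(fence)) {
		dma_fence_signal(fence);
		return true;
	}

	return false;
}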
Until CONFIG_DMABUF_SYSFS_STATS was added [1] it was only possible to
perform per-buffer accounting with debugfs, which is not suitable for
production environments. Eventually we discovered that the overhead of
per-buffer sysfs file creation/removal was significantly impacting
allocation and free times, and exacerbated kernfs lock contention. [2]
dma_buf_stats_setup() is responsible for 39% of single-page buffer
creation duration, or 74% of single-page dma_buf_export() duration when
stressing dmabuf allocations and frees.
I prototyped a change from per-buffer to per-exporter statistics with a
RCU protected list of exporter allocations that accommodates most (but
not all) of our use-cases and avoids almost all of the sysfs overhead.
While that adds less overhead than per-buffer sysfs, and less even than
the maintenance of the dmabuf debugfs_list, it's still *additional*
overhead on top of the debugfs_list and doesn't give us per-buffer info.
This series uses the existing dmabuf debugfs_list to implement a BPF
dmabuf iterator, which adds no overhead to buffer allocation/free and
provides per-buffer info. The list has been moved outside of
CONFIG_DEBUG_FS scope so that it is always populated. The BPF program
loaded by userspace that extracts per-buffer information gets to define
its own interface, which avoids debugfs's lack of ABI stability.
This will allow us to replace our use of CONFIG_DMABUF_SYSFS_STATS, and
the plan is to remove it from the kernel after the next longterm stable
release.
[1] https://lore.kernel.org/linux-media/20201210044400.1080308-1-hridya@google.…
[2] https://lore.kernel.org/all/20220516171315.2400578-1-tjmercier@google.com
v1: https://lore.kernel.org/all/20250414225227.3642618-1-tjmercier@google.com
v1 -> v2:
Make the DMA buffer list independent of CONFIG_DEBUG_FS per Christian König
Add CONFIG_DMA_SHARED_BUFFER check to kernel/bpf/Makefile per kernel test robot
Use BTF_ID_LIST_SINGLE instead of BTF_ID_LIST_GLOBAL_SINGLE per Song Liu
Fixup comment style, mixing code/declarations, and use ASSERT_OK_FD in selftest per Song Liu
Add BPF_ITER_RESCHED feature to bpf_dmabuf_reg_info per Alexei Starovoitov
Add open-coded iterator and selftest per Alexei Starovoitov
Add a second test buffer from the system dmabuf heap to selftests
Use the BPF program we'll use in production for selftest per Alexei Starovoitov
https://r.android.com/c/platform/system/bpfprogs/+/3616123/2/dmabufIter.c
https://r.android.com/c/platform/system/memory/libmeminfo/+/3614259/1/libdm…
v2: https://lore.kernel.org/all/20250504224149.1033867-1-tjmercier@google.com
v2 -> v3:
Rebase onto bpf-next/master
Move get_next_dmabuf() into drivers/dma-buf/dma-buf.c, along with the
new get_first_dmabuf(). This avoids having to expose the dmabuf list
and mutex to the rest of the kernel, and keeps the dmabuf mutex
operations near each other in the same file. (Christian König)
Add Christian's RB to dma-buf: Rename debugfs symbols
Drop RFC: dma-buf: Remove DMA-BUF statistics
T.J. Mercier (5):
dma-buf: Rename debugfs symbols
bpf: Add dmabuf iterator
bpf: Add open coded dmabuf iterator
selftests/bpf: Add test for dmabuf_iter
selftests/bpf: Add test for open coded dmabuf_iter
drivers/dma-buf/dma-buf.c | 94 +++++--
include/linux/dma-buf.h | 5 +-
kernel/bpf/Makefile | 3 +
kernel/bpf/dmabuf_iter.c | 149 ++++++++++
kernel/bpf/helpers.c | 5 +
.../testing/selftests/bpf/bpf_experimental.h | 5 +
tools/testing/selftests/bpf/config | 3 +
.../selftests/bpf/prog_tests/dmabuf_iter.c | 258 ++++++++++++++++++
.../testing/selftests/bpf/progs/dmabuf_iter.c | 91 ++++++
9 files changed, 591 insertions(+), 22 deletions(-)
create mode 100644 kernel/bpf/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/progs/dmabuf_iter.c
base-commit: 43745d11bfd9683abdf08ad7a5cc403d6a9ffd15
--
2.49.0.1045.g170613ef41-goog