On 4/29/25 08:39, Simona Vetter wrote:
> Catching up after spring break, hence the late reply ...
>
> On Fri, Apr 11, 2025 at 02:34:37PM -0400, Nicolas Dufresne wrote:
>> Le jeudi 10 avril 2025 à 16:53 +0200, Bastien Curutchet a écrit :
>>> There is no way to transmit the DMA address of a buffer to userspace.
>>> Some UIO users need this to handle DMA from userspace.
>>
>> To me this API is against all safe practice we've been pushing forward
>> and has no place in DMA_BUF API.
>>
>> If this is fine for the UIO subsystem to pass around physicial
>> addresses, then make this part of the UIO device ioctl.
>
> Yeah, this has no business in dma-buf since the entire point of dma-buf
> was to stop all the nasty "just pass raw dma addr in userspace" hacks that
> preceeded it.
>
> And over the years since dma-buf landed, we've removed a lot of these,
> like dri1 drivers. Or where that's not possible like with fbdev, hid the
> raw dma addr uapi behind a Kconfig.
>
> I concur with the overall sentiment that this should be done in
> vfio/iommufd interfaces, maybe with some support added to map dma-buf. I
> think patches for that have been floating around for a while, but I lost a
> bit the status of where exactly they are.
My take away is that we need to have a documented way for special driver specific interfaces in DMA-buf.
In other words DMA-buf has some standardized rules of doing things which every implementation should follow. The implementations might of course still have bugs (e.g. allocate memory for a dma_fence operation), but at least we have documented what should be done and what's forbidden.
What is still missing in the documentation is the use case when you have for example vfio which wants to talk to iommufd through a specialized interface. This doesn't necessarily needs to be part of DMA-buf, but we should still document "do it this way" because that has already worked in the last ten use cases and we don't want people to re-invent the wheel in a new funky way which then later turns out to not work.
Regards,
Christian.
>
> Cheers, Sima
>
>>
>> regards,
>> Nicolas
>>
>>>
>>> Add a new dma_buf_ops operation that returns the DMA address.
>>> Add a new ioctl to transmit this DMA address to userspace.
>>>
>>> Signed-off-by: Bastien Curutchet <bastien.curutchet(a)bootlin.com>
>>> ---
>>> drivers/dma-buf/dma-buf.c | 21 +++++++++++++++++++++
>>> include/linux/dma-buf.h | 1 +
>>> include/uapi/linux/dma-buf.h | 1 +
>>> 3 files changed, 23 insertions(+)
>>>
>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>> index
>>> 398418bd9731ad7a3a1f12eaea6a155fa77a22fe..cbbb518981e54e50f479c3d1fcf
>>> 6da6971f639c1 100644
>>> --- a/drivers/dma-buf/dma-buf.c
>>> +++ b/drivers/dma-buf/dma-buf.c
>>> @@ -454,6 +454,24 @@ static long dma_buf_import_sync_file(struct
>>> dma_buf *dmabuf,
>>> }
>>> #endif
>>>
>>> +static int dma_buf_get_dma_addr(struct dma_buf *dmabuf, u64 __user
>>> *arg)
>>> +{
>>> + u64 addr;
>>> + int ret;
>>> +
>>> + if (!dmabuf->ops->get_dma_addr)
>>> + return -EINVAL;
>>> +
>>> + ret = dmabuf->ops->get_dma_addr(dmabuf, &addr);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + if (copy_to_user(arg, &addr, sizeof(u64)))
>>> + return -EFAULT;
>>> +
>>> + return 0;
>>> +}
>>> +
>>> static long dma_buf_ioctl(struct file *file,
>>> unsigned int cmd, unsigned long arg)
>>> {
>>> @@ -504,6 +522,9 @@ static long dma_buf_ioctl(struct file *file,
>>> return dma_buf_import_sync_file(dmabuf, (const void
>>> __user *)arg);
>>> #endif
>>>
>>> + case DMA_BUF_IOCTL_GET_DMA_ADDR:
>>> + return dma_buf_get_dma_addr(dmabuf, (u64 __user
>>> *)arg);
>>> +
>>> default:
>>> return -ENOTTY;
>>> }
>>> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
>>> index
>>> 36216d28d8bdc01a9c9c47e27c392413f7f6c5fb..ed4bf15d3ce82e7a86323fff459
>>> 699a9bc8baa3b 100644
>>> --- a/include/linux/dma-buf.h
>>> +++ b/include/linux/dma-buf.h
>>> @@ -285,6 +285,7 @@ struct dma_buf_ops {
>>>
>>> int (*vmap)(struct dma_buf *dmabuf, struct iosys_map *map);
>>> void (*vunmap)(struct dma_buf *dmabuf, struct iosys_map
>>> *map);
>>> + int (*get_dma_addr)(struct dma_buf *dmabuf, u64 *addr);
>>> };
>>>
>>> /**
>>> diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-
>>> buf.h
>>> index
>>> 5a6fda66d9adf01438619e7e67fa69f0fec2d88d..f3aba46942042de6a2e3a4cca3e
>>> b3f87175e29c9 100644
>>> --- a/include/uapi/linux/dma-buf.h
>>> +++ b/include/uapi/linux/dma-buf.h
>>> @@ -178,5 +178,6 @@ struct dma_buf_import_sync_file {
>>> #define DMA_BUF_SET_NAME_B _IOW(DMA_BUF_BASE, 1, __u64)
>>> #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2,
>>> struct dma_buf_export_sync_file)
>>> #define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct
>>> dma_buf_import_sync_file)
>>> +#define DMA_BUF_IOCTL_GET_DMA_ADDR _IOR(DMA_BUF_BASE, 4, __u64
>>> *)
>>>
>>> #endif
>
From: Rob Clark <robdclark(a)chromium.org>
Conversion to DRM GPU VA Manager[1], and adding support for Vulkan Sparse
Memory[2] in the form of:
1. A new VM_BIND submitqueue type for executing VM MSM_SUBMIT_BO_OP_MAP/
MAP_NULL/UNMAP commands
2. A new VM_BIND ioctl to allow submitting batches of one or more
MAP/MAP_NULL/UNMAP commands to a VM_BIND submitqueue
I did not implement support for synchronous VM_BIND commands. Since
userspace could just immediately wait for the `SUBMIT` to complete, I don't
think we need this extra complexity in the kernel. Synchronous/immediate
VM_BIND operations could be implemented with a 2nd VM_BIND submitqueue.
The corresponding mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32533
Changes in v3:
- Switched to seperate VM_BIND ioctl. This makes the UABI a bit
cleaner, but OTOH the userspace code was cleaner when the end result
of either type of VkQueue lead to the same ioctl. So I'm a bit on
the fence.
- Switched to doing the gpuvm bookkeeping synchronously, and only
deferring the pgtable updates. This avoids needing to hold any resv
locks in the fence signaling path, resolving the last shrinker related
lockdep complaints. OTOH it means userspace can trigger invalid
pgtable updates with multiple VM_BIND queues. In this case, we ensure
that unmaps happen completely (to prevent userspace from using this to
access free'd pages), mark the context as unusable, and move on with
life.
- Link to v2: https://lore.kernel.org/all/20250319145425.51935-1-robdclark@gmail.com/
Changes in v2:
- Dropped Bibek Kumar Patro's arm-smmu patches[3], which have since been
merged.
- Pre-allocate all the things, and drop HACK patch which disabled shrinker.
This includes ensuring that vm_bo objects are allocated up front, pre-
allocating VMA objects, and pre-allocating pages used for pgtable updates.
The latter utilizes io_pgtable_cfg callbacks for pgtable alloc/free, that
were initially added for panthor.
- Add back support for BO dumping for devcoredump.
- Link to v1 (RFC): https://lore.kernel.org/dri-devel/20241207161651.410556-1-robdclark@gmail.c…
[1] https://www.kernel.org/doc/html/next/gpu/drm-mm.html#drm-gpuvm
[2] https://docs.vulkan.org/spec/latest/chapters/sparsemem.html
[3] https://patchwork.kernel.org/project/linux-arm-kernel/list/?series=909700
Rob Clark (33):
drm/gpuvm: Don't require obj lock in destructor path
drm/gpuvm: Allow VAs to hold soft reference to BOs
iommu/io-pgtable-arm: Add quirk to quiet WARN_ON()
drm/msm: Rename msm_file_private -> msm_context
drm/msm: Improve msm_context comments
drm/msm: Rename msm_gem_address_space -> msm_gem_vm
drm/msm: Remove vram carveout support
drm/msm: Collapse vma allocation and initialization
drm/msm: Collapse vma close and delete
drm/msm: Don't close VMAs on purge
drm/msm: drm_gpuvm conversion
drm/msm: Convert vm locking
drm/msm: Use drm_gpuvm types more
drm/msm: Split out helper to get iommu prot flags
drm/msm: Add mmu support for non-zero offset
drm/msm: Add PRR support
drm/msm: Rename msm_gem_vma_purge() -> _unmap()
drm/msm: Lazily create context VM
drm/msm: Add opt-in for VM_BIND
drm/msm: Mark VM as unusable on GPU hangs
drm/msm: Add _NO_SHARE flag
drm/msm: Crashdump prep for sparse mappings
drm/msm: rd dumping prep for sparse mappings
drm/msm: Crashdec support for sparse
drm/msm: rd dumping support for sparse
drm/msm: Extract out syncobj helpers
drm/msm: Use DMA_RESV_USAGE_BOOKKEEP/KERNEL
drm/msm: Add VM_BIND submitqueue
drm/msm: Support IO_PGTABLE_QUIRK_NO_WARN_ON
drm/msm: Support pgtable preallocation
drm/msm: Split out map/unmap ops
drm/msm: Add VM_BIND ioctl
drm/msm: Bump UAPI version
drivers/gpu/drm/drm_gpuvm.c | 15 +-
drivers/gpu/drm/msm/Kconfig | 1 +
drivers/gpu/drm/msm/Makefile | 1 +
drivers/gpu/drm/msm/adreno/a2xx_gpu.c | 25 +-
drivers/gpu/drm/msm/adreno/a2xx_gpummu.c | 5 +-
drivers/gpu/drm/msm/adreno/a3xx_gpu.c | 17 +-
drivers/gpu/drm/msm/adreno/a4xx_gpu.c | 17 +-
drivers/gpu/drm/msm/adreno/a5xx_debugfs.c | 4 +-
drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 22 +-
drivers/gpu/drm/msm/adreno/a5xx_power.c | 2 +-
drivers/gpu/drm/msm/adreno/a5xx_preempt.c | 10 +-
drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 32 +-
drivers/gpu/drm/msm/adreno/a6xx_gmu.h | 2 +-
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 49 +-
drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 6 +-
drivers/gpu/drm/msm/adreno/a6xx_preempt.c | 10 +-
drivers/gpu/drm/msm/adreno/adreno_device.c | 4 -
drivers/gpu/drm/msm/adreno/adreno_gpu.c | 88 +-
drivers/gpu/drm/msm/adreno/adreno_gpu.h | 23 +-
.../drm/msm/disp/dpu1/dpu_encoder_phys_wb.c | 14 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_formats.c | 18 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_formats.h | 2 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_kms.c | 18 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_plane.c | 14 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_plane.h | 4 +-
drivers/gpu/drm/msm/disp/mdp4/mdp4_crtc.c | 6 +-
drivers/gpu/drm/msm/disp/mdp4/mdp4_kms.c | 28 +-
drivers/gpu/drm/msm/disp/mdp4/mdp4_plane.c | 12 +-
drivers/gpu/drm/msm/disp/mdp5/mdp5_crtc.c | 4 +-
drivers/gpu/drm/msm/disp/mdp5/mdp5_kms.c | 19 +-
drivers/gpu/drm/msm/disp/mdp5/mdp5_plane.c | 12 +-
drivers/gpu/drm/msm/dsi/dsi_host.c | 14 +-
drivers/gpu/drm/msm/msm_drv.c | 183 +--
drivers/gpu/drm/msm/msm_drv.h | 35 +-
drivers/gpu/drm/msm/msm_fb.c | 18 +-
drivers/gpu/drm/msm/msm_fbdev.c | 2 +-
drivers/gpu/drm/msm/msm_gem.c | 489 +++----
drivers/gpu/drm/msm/msm_gem.h | 217 ++-
drivers/gpu/drm/msm/msm_gem_prime.c | 15 +
drivers/gpu/drm/msm/msm_gem_shrinker.c | 4 +-
drivers/gpu/drm/msm/msm_gem_submit.c | 295 ++--
drivers/gpu/drm/msm/msm_gem_vma.c | 1265 +++++++++++++++--
drivers/gpu/drm/msm/msm_gpu.c | 171 ++-
drivers/gpu/drm/msm/msm_gpu.h | 132 +-
drivers/gpu/drm/msm/msm_iommu.c | 298 +++-
drivers/gpu/drm/msm/msm_kms.c | 18 +-
drivers/gpu/drm/msm/msm_kms.h | 2 +-
drivers/gpu/drm/msm/msm_mmu.h | 38 +-
drivers/gpu/drm/msm/msm_rd.c | 62 +-
drivers/gpu/drm/msm/msm_ringbuffer.c | 4 +-
drivers/gpu/drm/msm/msm_submitqueue.c | 86 +-
drivers/gpu/drm/msm/msm_syncobj.c | 172 +++
drivers/gpu/drm/msm/msm_syncobj.h | 37 +
drivers/iommu/io-pgtable-arm.c | 18 +-
include/drm/drm_gpuvm.h | 12 +-
include/linux/io-pgtable.h | 8 +
include/uapi/drm/msm_drm.h | 149 +-
57 files changed, 3012 insertions(+), 1216 deletions(-)
create mode 100644 drivers/gpu/drm/msm/msm_syncobj.c
create mode 100644 drivers/gpu/drm/msm/msm_syncobj.h
--
2.49.0
On 4/24/25 15:02, Philipp Stanner wrote:
> In nouveau_fence_done(), a fence is checked for being signaled by
> manually evaluating the base fence's bits. This can be done in a
> canonical manner through dma_fence_is_signaled().
>
> Replace the bit-check with dma_fence_is_signaled().
>
> Signed-off-by: Philipp Stanner <phasta(a)kernel.org>
I think the bit check was used here as fast path optimization because we later call dma_fence_is_signaled() anyway.
Feel free to add my acked-by, but honestly what nouveau does here looks rather suspicious to me.
Regards,
Christian.
> ---
> drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
> index fb9811938c82..d5654e26d5bc 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_fence.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
> @@ -253,7 +253,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
> struct nouveau_channel *chan;
> unsigned long flags;
>
> - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags))
> + if (dma_fence_is_signaled(&fence->base))
> return true;
>
> spin_lock_irqsave(&fctx->lock, flags);
Hi,
This series is the follow-up of the discussion that John and I had some
time ago here:
https://lore.kernel.org/all/CANDhNCquJn6bH3KxKf65BWiTYLVqSd9892-xtFDHHqqyrr…
The initial problem we were discussing was that I'm currently working on
a platform which has a memory layout with ECC enabled. However, enabling
the ECC has a number of drawbacks on that platform: lower performance,
increased memory usage, etc. So for things like framebuffers, the
trade-off isn't great and thus there's a memory region with ECC disabled
to allocate from for such use cases.
After a suggestion from John, I chose to first start using heap
allocations flags to allow for userspace to ask for a particular ECC
setup. This is then backed by a new heap type that runs from reserved
memory chunks flagged as such, and the existing DT properties to specify
the ECC properties.
After further discussion, it was considered that flags were not the
right solution, and relying on the names of the heaps would be enough to
let userspace know the kind of buffer it deals with.
Thus, even though the uAPI part of it has been dropped in this second
version, we still need a driver to create heaps out of carved-out memory
regions. In addition to the original usecase, a similar driver can be
found in BSPs from most vendors, so I believe it would be a useful
addition to the kernel.
I submitted a draft PR to the DT schema for the bindings used in this
PR:
https://github.com/devicetree-org/dt-schema/pull/138
Let me know what you think,
Maxime
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
Changes in v3:
- Reworked global variable patch
- Link to v2: https://lore.kernel.org/r/20250401-dma-buf-ecc-heap-v2-0-043fd006a1af@kerne…
Changes in v2:
- Add vmap/vunmap operations
- Drop ECC flags uapi
- Rebase on top of 6.14
- Link to v1: https://lore.kernel.org/r/20240515-dma-buf-ecc-heap-v1-0-54cbbd049511@kerne…
---
Maxime Ripard (2):
dma-buf: heaps: system: Remove global variable
dma-buf: heaps: Introduce a new heap for reserved memory
drivers/dma-buf/heaps/Kconfig | 8 +
drivers/dma-buf/heaps/Makefile | 1 +
drivers/dma-buf/heaps/carveout_heap.c | 360 ++++++++++++++++++++++++++++++++++
drivers/dma-buf/heaps/system_heap.c | 3 +-
4 files changed, 370 insertions(+), 2 deletions(-)
---
base-commit: fcbf30774e82a441890b722bf0c26542fb82150f
change-id: 20240515-dma-buf-ecc-heap-28a311d2c94e
Best regards,
--
Maxime Ripard <mripard(a)kernel.org>
On Tue, Apr 22, 2025 at 4:01 PM Alexei Starovoitov
<alexei.starovoitov(a)gmail.com> wrote:
>
> On Tue, Apr 22, 2025 at 12:57 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > On Mon, Apr 21, 2025 at 4:39 PM Alexei Starovoitov
> > <alexei.starovoitov(a)gmail.com> wrote:
> > >
> > > On Mon, Apr 21, 2025 at 1:40 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> > > >
> > > > > > new file mode 100644
> > > > > > index 000000000000..b4b8be1d6aa4
> > > > > > --- /dev/null
> > > > > > +++ b/kernel/bpf/dmabuf_iter.c
> > > > >
> > > > > Maybe we should add this file to drivers/dma-buf. I would like to
> > > > > hear other folks thoughts on this.
> > > >
> > > > This is fine with me, and would save us the extra
> > > > CONFIG_DMA_SHARED_BUFFER check that's currently needed in
> > > > kernel/bpf/Makefile but would require checking CONFIG_BPF instead.
> > > > Sumit / Christian any objections to moving the dmabuf bpf iterator
> > > > implementation into drivers/dma-buf?
> > >
> > > The driver directory would need to 'depends on BPF_SYSCALL'.
> > > Are you sure you want this?
> > > imo kernel/bpf/ is fine for this.
> >
> > I don't have a strong preference so either way is fine with me. The
> > main difference I see is maintainership.
> >
> > > You also probably want
> > > .feature = BPF_ITER_RESCHED
> > > in bpf_dmabuf_reg_info.
> >
> > Thank you, this looks like a good idea.
> >
> > > Also have you considered open coded iterator for dmabufs?
> > > Would it help with the interface to user space?
> >
> > I read through the open coded iterator patches, and it looks like they
> > would be slightly more efficient by avoiding seq_file overhead. As far
> > as the interface to userspace, for the purpose of replacing what's
> > currently exposed by CONFIG_DMABUF_SYSFS_STATS I don't think there is
> > a difference. However it looks like if I were to try to replace all of
> > our userspace analysis of dmabufs with a single bpf program then an
> > open coded iterator would make that much easier. I had not considered
> > attempting that.
> >
> > One problem I see with open coded iterators is that support is much
> > more recent (2023 vs 2020). We support longterm stable kernels (back
> > to 5.4 currently but probably 5.10 by the time this would be used), so
> > it seems like it would be harder to backport the kernel support for an
> > open-coded iterator that far since it only goes back as far as 6.6
> > now. Actually it doesn't look like it is possible while also
> > maintaining the stable ABI we provide to device vendors. Which means
> > we couldn't get rid of the dmabuf sysfs stats userspace dependency
> > until 6.1 EOL in Dec. 2027. :\ So I'm in favor of a traditional bpf
> > iterator here for now.
>
> Fair enough, but please implement both and backport only
> the old style pinned iterator.
Ok, will do.
> The code will be mostly shared between them.
> bpf_iter_dmabuf_new/_next will be more flexible with more
> options to return data to user space. Like android can invent
> their own binary format. Pack into it in a bpf prog, send to
> bpf ringbuf and unmarshal efficiently in user space.
> Instead of being limited to text output that pinned iterators
> are supposed to do usually.
Also a neat idea!
> You can do binary with bpf_seq_write() too, but it's rare.
>
> Also please provide full bpf prog that you'll use in production
> in a selftest instead of trivial:
> +SEC("iter/dmabuf")
> +int dmabuf_collector(struct bpf_iter__dmabuf *ctx)
>
> just to make sure it's tested end to end and future changes
> won't break it.
The final bpf program should be something pretty close to that, but
I'll start working on the AOSP side as well so I can put up patches.
>
> pw-bot: cr
Hello Jared,
On Wed, 23 Apr 2025 at 00:49, Jared Kangas <jkangas(a)redhat.com> wrote:
>
> Hi all,
>
> This patch series is based on a previous discussion around CMA heap
> naming. [1] The heap's name depends on the device name, which is
> generally "reserved", "linux,cma", or "default-pool", but could be any
> arbitrary name given to the default CMA area in the devicetree. For a
> consistent userspace interface, the series introduces a constant name
> for the CMA heap, and for backwards compatibility, an additional Kconfig
> that controls the creation of a legacy-named heap with the same CMA
> backing.
>
> The ideas to handle backwards compatibility in [1] are to either use a
> symlink or add a heap node with a duplicate minor. However, I assume
> that we don't want to create symlinks in /dev from module initcalls, and
> attempting to duplicate minors would cause device_create() to fail.
> Because of these drawbacks, after brainstorming with Maxime Ripard, I
> went with creating a new node in devtmpfs with its own minor. This
> admittedly makes it a little unclear that the old and new nodes are
> backed by the same heap when both are present. The only approach that I
> think would provide total clarity on this in userspace is symlinking,
> which seemed like a fairly involved solution for devtmpfs, but if I'm
> wrong on this, please let me know.
Thanks indeed for this patch; just one minor nit: the link referred to
as [1] here seems to be missing. Could you please add it? This would
make it easier to follow the chain of discussion in posterity.
>
> Changelog:
> v2: Use tabs instead of spaces for large vertical alignment.
>
> Jared Kangas (2):
> dma-buf: heaps: Parameterize heap name in __add_cma_heap()
> dma-buf: heaps: Give default CMA heap a fixed name
>
> Documentation/userspace-api/dma-buf-heaps.rst | 11 ++++---
> drivers/dma-buf/heaps/Kconfig | 10 +++++++
> drivers/dma-buf/heaps/cma_heap.c | 30 ++++++++++++++-----
> 3 files changed, 40 insertions(+), 11 deletions(-)
>
> --
> 2.49.0
>
Best,
Sumit
On Tue, Apr 22, 2025 at 12:19 PM Jared Kangas <jkangas(a)redhat.com> wrote:
>
> The CMA heap's name in devtmpfs can vary depending on how the heap is
> defined. Its name defaults to "reserved", but if a CMA area is defined
> in the devicetree, the heap takes on the devicetree node's name, such as
> "default-pool" or "linux,cma". To simplify naming, just name it
> "default_cma", and keep a legacy node in place backed by the same
> underlying structure for backwards compatibility.
>
> Signed-off-by: Jared Kangas <jkangas(a)redhat.com>
Once again, thanks for working out how to improve the standard naming
while keeping compatibility.
I do still hope we can get to the point where other cma regions can be
registered as heaps with unique/purpose-specific names, but I can see
having a standard name for the default region is a nice improvement.
Acked-by: John Stultz <jstultz(a)google.com>
thanks
-john
On Tue, Apr 22, 2025 at 12:19 PM Jared Kangas <jkangas(a)redhat.com> wrote:
>
> Prepare for the introduction of a fixed-name CMA heap by replacing the
> unused void pointer parameter in __add_cma_heap() with the heap name.
>
> Signed-off-by: Jared Kangas <jkangas(a)redhat.com>
Thanks so much for taking this effort on. Looks good to me!
Acked-by: John Stultz <jstultz(a)google.com>
On Mon, Apr 21, 2025 at 4:39 PM Alexei Starovoitov
<alexei.starovoitov(a)gmail.com> wrote:
>
> On Mon, Apr 21, 2025 at 1:40 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > > > new file mode 100644
> > > > index 000000000000..b4b8be1d6aa4
> > > > --- /dev/null
> > > > +++ b/kernel/bpf/dmabuf_iter.c
> > >
> > > Maybe we should add this file to drivers/dma-buf. I would like to
> > > hear other folks thoughts on this.
> >
> > This is fine with me, and would save us the extra
> > CONFIG_DMA_SHARED_BUFFER check that's currently needed in
> > kernel/bpf/Makefile but would require checking CONFIG_BPF instead.
> > Sumit / Christian any objections to moving the dmabuf bpf iterator
> > implementation into drivers/dma-buf?
>
> The driver directory would need to 'depends on BPF_SYSCALL'.
> Are you sure you want this?
> imo kernel/bpf/ is fine for this.
I don't have a strong preference so either way is fine with me. The
main difference I see is maintainership.
> You also probably want
> .feature = BPF_ITER_RESCHED
> in bpf_dmabuf_reg_info.
Thank you, this looks like a good idea.
> Also have you considered open coded iterator for dmabufs?
> Would it help with the interface to user space?
I read through the open coded iterator patches, and it looks like they
would be slightly more efficient by avoiding seq_file overhead. As far
as the interface to userspace, for the purpose of replacing what's
currently exposed by CONFIG_DMABUF_SYSFS_STATS I don't think there is
a difference. However it looks like if I were to try to replace all of
our userspace analysis of dmabufs with a single bpf program then an
open coded iterator would make that much easier. I had not considered
attempting that.
One problem I see with open coded iterators is that support is much
more recent (2023 vs 2020). We support longterm stable kernels (back
to 5.4 currently but probably 5.10 by the time this would be used), so
it seems like it would be harder to backport the kernel support for an
open-coded iterator that far since it only goes back as far as 6.6
now. Actually it doesn't look like it is possible while also
maintaining the stable ABI we provide to device vendors. Which means
we couldn't get rid of the dmabuf sysfs stats userspace dependency
until 6.1 EOL in Dec. 2027. :\ So I'm in favor of a traditional bpf
iterator here for now.
On Mon, Apr 14, 2025 at 09:21:25PM +0200, Thomas Petazzoni wrote:
> > "UIO is a broken legacy mess, so let's add more broken things
> > to it as broken + broken => still broken, so no harm done", am I
> > getting that right?
>
> Who says UIO is a "broken legacy mess"? Only you says so. I don't see
> any indication anywhere in the kernel tree suggesting that UIO is
> considered a broken legacy mess.
Explain what the difference is between UIO and VFIO, especially VFIO
no-iommu mode?
I've always understood that UIO is for very simple devices that cannot
do DMA. So it's very simple operating model and simple security work
fine.
IMHO, if the can use DMA it should use VFIO. If you have no iommu then
you should use the VFIO unsafe no-iommu path. It still provides a
solid framework.
As to this series, I have seen a number of requests to improve the
VFIO no-iommu path to allow working with the existing IOAS scheme to
register memory but to allow the kernel the return the no-iommu
DMAable address of the IOAS pinned memory. This would replace the
hacky use of mlock and /proc/XX/pagemap that people use today.
If that were done, could you use VFIO no-iommu?
> Keep in mind that when you're running code as root, you can load a
> kernel module, which can do anything on the system security-wise. So
> letting UIO expose MMIO registers of devices to userspace applications
> running as root is not any worse than that.
That isn't fully true.. UIO isn't fitting into the security model by
allowing DMA capable devices to be exposed without checking for
CAP_SYS_RAW_IO first.
Jason
On Mon, Apr 21, 2025 at 10:58 AM Song Liu <song(a)kernel.org> wrote:
>
> On Mon, Apr 14, 2025 at 3:53 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > The dmabuf iterator traverses the list of all DMA buffers. The list is
> > maintained only when CONFIG_DEBUG_FS is enabled.
> >
> > DMA buffers are refcounted through their associated struct file. A
> > reference is taken on each buffer as the list is iterated to ensure each
> > buffer persists for the duration of the bpf program execution without
> > holding the list mutex.
> >
> > Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
> > ---
> > include/linux/btf_ids.h | 1 +
> > kernel/bpf/Makefile | 3 +
> > kernel/bpf/dmabuf_iter.c | 130 +++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 134 insertions(+)
> > create mode 100644 kernel/bpf/dmabuf_iter.c
> >
> > diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
> > index 139bdececdcf..899ead57d89d 100644
> > --- a/include/linux/btf_ids.h
> > +++ b/include/linux/btf_ids.h
> > @@ -284,5 +284,6 @@ extern u32 bpf_cgroup_btf_id[];
> > extern u32 bpf_local_storage_map_btf_id[];
> > extern u32 btf_bpf_map_id[];
> > extern u32 bpf_kmem_cache_btf_id[];
> > +extern u32 bpf_dmabuf_btf_id[];
>
> This is not necessary. See below.
>
> >
> > #endif
> > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > index 70502f038b92..5b30d37ef055 100644
> > --- a/kernel/bpf/Makefile
> > +++ b/kernel/bpf/Makefile
> > @@ -53,6 +53,9 @@ obj-$(CONFIG_BPF_SYSCALL) += relo_core.o
> > obj-$(CONFIG_BPF_SYSCALL) += btf_iter.o
> > obj-$(CONFIG_BPF_SYSCALL) += btf_relocate.o
> > obj-$(CONFIG_BPF_SYSCALL) += kmem_cache_iter.o
> > +ifeq ($(CONFIG_DEBUG_FS),y)
> > +obj-$(CONFIG_BPF_SYSCALL) += dmabuf_iter.o
> > +endif
> >
> > CFLAGS_REMOVE_percpu_freelist.o = $(CC_FLAGS_FTRACE)
> > CFLAGS_REMOVE_bpf_lru_list.o = $(CC_FLAGS_FTRACE)
> > diff --git a/kernel/bpf/dmabuf_iter.c b/kernel/bpf/dmabuf_iter.c
> > new file mode 100644
> > index 000000000000..b4b8be1d6aa4
> > --- /dev/null
> > +++ b/kernel/bpf/dmabuf_iter.c
>
> Maybe we should add this file to drivers/dma-buf. I would like to
> hear other folks thoughts on this.
This is fine with me, and would save us the extra
CONFIG_DMA_SHARED_BUFFER check that's currently needed in
kernel/bpf/Makefile but would require checking CONFIG_BPF instead.
Sumit / Christian any objections to moving the dmabuf bpf iterator
implementation into drivers/dma-buf?
> > @@ -0,0 +1,130 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright (c) 2025 Google LLC */
> > +#include <linux/bpf.h>
> > +#include <linux/btf_ids.h>
> > +#include <linux/dma-buf.h>
> > +#include <linux/kernel.h>
> > +#include <linux/seq_file.h>
> > +
> > +BTF_ID_LIST_GLOBAL_SINGLE(bpf_dmabuf_btf_id, struct, dma_buf)
>
> Use BTF_ID_LIST_SINGLE(), then we don't need this in btf_ids.h
>
> > +DEFINE_BPF_ITER_FUNC(dmabuf, struct bpf_iter_meta *meta, struct dma_buf *dmabuf)
> > +
> > +static void *dmabuf_iter_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > + struct dma_buf *dmabuf, *ret = NULL;
> > +
> > + if (*pos) {
> > + *pos = 0;
> > + return NULL;
> > + }
> > + /* Look for the first buffer we can obtain a reference to.
> > + * The list mutex does not protect a dmabuf's refcount, so it can be
> > + * zeroed while we are iterating. Therefore we cannot call get_dma_buf()
> > + * since the caller of this program may not already own a reference to
> > + * the buffer.
> > + */
>
> We should use kernel comment style for new code. IOW,
> /*
> * Look for ...
> */
>
>
> Thanks,
> Song
Thanks, I have incorporated all of your comments and retested. I will
give some time for more feedback before sending a v2.
>
> [...]
On Thu, Apr 17, 2025 at 1:26 PM Song Liu <song(a)kernel.org> wrote:
>
> On Thu, Apr 17, 2025 at 9:05 AM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > On Wed, Apr 16, 2025 at 9:56 PM Song Liu <song(a)kernel.org> wrote:
> > >
> > > On Wed, Apr 16, 2025 at 7:09 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> > > >
> > > > On Wed, Apr 16, 2025 at 6:26 PM Song Liu <song(a)kernel.org> wrote:
> > > [...]
> > > > >
> > > > > Here is another rookie question, it appears to me there is a file descriptor
> > > > > associated with each DMA buffer, can we achieve the same goal with
> > > > > a task-file iterator?
> > > >
> > > > That would find almost all of them, but not the kernel-only
> > > > allocations. (kernel_rss in the dmabuf_dump output I attached earlier.
> > > > If there's a leak, it's likely to show up in kernel_rss because some
> > > > driver forgot to release its reference(s).) Also wouldn't that be a
> > > > ton more iterations since we'd have to visit every FD to find the
> > > > small portion that are dmabufs? I'm not actually sure if buffers that
> > > > have been mapped, and then have had their file descriptors closed
> > > > would show up in task_struct->files; if not I think that would mean
> > > > scanning both files and vmas for each task.
> > >
> > > I don't think scanning all FDs to find a small portion of specific FDs
> > > is a real issue. We have a tool that scans all FDs in the system and
> > > only dump data for perf_event FDs. I think it should be easy to
> > > prototype a tool by scanning all files and all vmas. If that turns out
> > > to be very slow, which I highly doubt will be, we can try other
> > > approaches.
> >
> > But this will not find *all* the buffers, and that defeats the purpose
> > of having the iterator.
>
> Do you mean this approach cannot get kernel only allocations? If
> that's the case, we probably do need a separate iterator. I am
> interested in other folks thoughts on this.
Correct.
> > > OTOH, I am wondering whether we can build a more generic iterator
> > > for a list of objects. Adding a iterator for each important kernel lists
> > > seems not scalable in the long term.
> >
> > I think the wide variety of differences in locking for different
> > objects would make this difficult to do in a generic way.
>
> Agreed it is not easy to build a generic solution. But with the
> help from BTF, we can probably build something that covers
> a large number of use cases.
I'm curious what this would look like. I guess a good test would be to
see if anything would work for some subset of the already existing
iterators.
On Wed, Apr 16, 2025 at 9:56 PM Song Liu <song(a)kernel.org> wrote:
>
> On Wed, Apr 16, 2025 at 7:09 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > On Wed, Apr 16, 2025 at 6:26 PM Song Liu <song(a)kernel.org> wrote:
> [...]
> > >
> > > Here is another rookie question, it appears to me there is a file descriptor
> > > associated with each DMA buffer, can we achieve the same goal with
> > > a task-file iterator?
> >
> > That would find almost all of them, but not the kernel-only
> > allocations. (kernel_rss in the dmabuf_dump output I attached earlier.
> > If there's a leak, it's likely to show up in kernel_rss because some
> > driver forgot to release its reference(s).) Also wouldn't that be a
> > ton more iterations since we'd have to visit every FD to find the
> > small portion that are dmabufs? I'm not actually sure if buffers that
> > have been mapped, and then have had their file descriptors closed
> > would show up in task_struct->files; if not I think that would mean
> > scanning both files and vmas for each task.
>
> I don't think scanning all FDs to find a small portion of specific FDs
> is a real issue. We have a tool that scans all FDs in the system and
> only dump data for perf_event FDs. I think it should be easy to
> prototype a tool by scanning all files and all vmas. If that turns out
> to be very slow, which I highly doubt will be, we can try other
> approaches.
But this will not find *all* the buffers, and that defeats the purpose
of having the iterator.
> OTOH, I am wondering whether we can build a more generic iterator
> for a list of objects. Adding a iterator for each important kernel lists
> seems not scalable in the long term.
I think the wide variety of differences in locking for different
objects would make this difficult to do in a generic way.
Am 16.04.25 um 23:51 schrieb Juan Yescas:
> On Wed, Apr 16, 2025 at 4:34 AM Christian König
> <christian.koenig(a)amd.com> wrote:
>>
>>
>> Am 15.04.25 um 19:19 schrieb Juan Yescas:
>>> This change sets the allocation orders for the different page sizes
>>> (4k, 16k, 64k) based on PAGE_SHIFT. Before this change, the orders
>>> for large page sizes were calculated incorrectly, this caused system
>>> heap to allocate from 2% to 4% more memory on 16KiB page size kernels.
>>>
>>> This change was tested on 4k/16k page size kernels.
>>>
>>> Signed-off-by: Juan Yescas <jyescas(a)google.com>
>>> ---
>>> drivers/dma-buf/heaps/system_heap.c | 9 ++++++++-
>>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
>>> index 26d5dc89ea16..54674c02dcb4 100644
>>> --- a/drivers/dma-buf/heaps/system_heap.c
>>> +++ b/drivers/dma-buf/heaps/system_heap.c
>>> @@ -50,8 +50,15 @@ static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
>>> * to match with the sizes often found in IOMMUs. Using order 4 pages instead
>>> * of order 0 pages can significantly improve the performance of many IOMMUs
>>> * by reducing TLB pressure and time spent updating page tables.
>>> + *
>>> + * Note: When the order is 0, the minimum allocation is PAGE_SIZE. The possible
>>> + * page sizes for ARM devices could be 4K, 16K and 64K.
>>> */
>>> -static const unsigned int orders[] = {8, 4, 0};
>>> +#define ORDER_1M (20 - PAGE_SHIFT)
>>> +#define ORDER_64K (16 - PAGE_SHIFT)
>>> +#define ORDER_FOR_PAGE_SIZE (0)
>>> +static const unsigned int orders[] = {ORDER_1M, ORDER_64K, ORDER_FOR_PAGE_SIZE};
>>> +#
>> Good catch, but I think the defines are just overkill.
>>
>> What you should do instead is to subtract page shift when using the array.
>>
> There are several occurrences of orders in the file and I think it is
> better to do the calculations up front in the array. Would you be ok
> if we get rid of the defines as per your suggestion and make the
> calculations during the definition of the array. Something like this:
>
> static const unsigned int orders[] = {20 - PAGE_SHIFT, 16 - PAGE_SHIFT, 0};
Yeah that totally works for me as well. Just make sure that a code comment nearby explains why 20, 16 and 0.
>> Apart from that using 1M, 64K and then falling back to 4K just sounds random to me. We have especially pushed back on 64K more than once because it is actually not beneficial in almost all cases.
>>
> In the hardware where the driver is used, the 64K is beneficial for:
>
> Arm SMMUv3 (4KB granule size):
> 64KB (contiguous bit groups 16 4KB pages)
>
> SysMMU benefits from 64KB (“large” page) on 4k/16k page sizes.
Question could this HW also work with pages larger than 64K? (I strongly expect yes).
>> I suggest to fix the code in system_heap_allocate to not over allocate instead and just try the available orders like TTM does. This has proven to be working architecture independent.
>>
> Do you mean to have an implementation similar to __ttm_pool_alloc()?
Yeah something like that.
> https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree…
>
> If that is the case, we can try the change, run some benchmarks and
> submit the patch if we see positive results.
Could be that this doesn't matter for your platform, but it's minimal extra overhead asking for larger chunks first and it just avoids fragmenting everything into 64K chunks.
That is kind of important since the code in DMA-heaps should be platform neutral if possible.
Regards,
Christian.
>
> Thanks
> Juan
>
>> Regards,
>> Christian.
>>
>>> #define NUM_ORDERS ARRAY_SIZE(orders)
>>>
>>> static struct sg_table *dup_sg_table(struct sg_table *table)
On Wed, Apr 16, 2025 at 6:26 PM Song Liu <song(a)kernel.org> wrote:
>
> On Wed, Apr 16, 2025 at 4:40 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > On Wed, Apr 16, 2025 at 4:08 PM Song Liu <song(a)kernel.org> wrote:
> > >
> > > On Wed, Apr 16, 2025 at 3:51 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> > > [...]
> > > > >
> > > > > IIUC, the iterator simply traverses elements in a linked list. I feel it is
> > > > > an overkill to implement a new BPF iterator for it.
> > > >
> > > > Like other BPF iterators such as kmem_cache_iter or task_iter.
> > > > Cgroup_iter iterates trees instead of lists. This is iterating over
> > > > kernel objects just like the docs say, "A BPF iterator is a type of
> > > > BPF program that allows users to iterate over specific types of kernel
> > > > objects". More complicated iteration should not be a requirement here.
> > > >
> > > > > Maybe we simply
> > > > > use debugging tools like crash or drgn for this? The access with
> > > > > these tools will not be protected by the mutex. But from my personal
> > > > > experience, this is not a big issue for user space debugging tools.
> > > >
> > > > drgn is *way* too slow, and even if it weren't the dependencies for
> > > > running it aren't available. crash needs debug symbols which also
> > > > aren't available on user builds. This is not just for manual
> > > > debugging, it's for reporting memory use in production. Or anything
> > > > else someone might care to extract like attachment info or refcounts.
> > >
> > > Could you please share more information about the use cases and
> > > the time constraint here, and why drgn is too slow. Is most of the delay
> > > comes from parsing DWARF? This is mostly for my curiosity, because
> > > I have been thinking about using drgn to do some monitoring in
> > > production.
> > >
> > > Thanks,
> > > Song
> >
> > These RunCommands have 10 second timeouts for example. It's rare that
> > I see them get exceeded but it happens occasionally.:
> > https://cs.android.com/android/platform/superproject/main/+/main:frameworks…
>
> Thanks for sharing this information.
>
> > The last time I used drgn (admittedly back in 2023) it took over a
> > minute to iterate through less than 200 cgroups. I'm not sure what the
> > root cause of the slowness was, but I'd expect the DWARF processing to
> > be done up-front once and the slowness I experienced was not just at
> > startup. Eventually I switched over to tracefs for that issue, which
> > we still use for some telemetry.
>
> I haven't tried drgn on Android. On server side, iterating should 200
> cgroups should be fairly fast (< 5 seconds, where DWARF parsing is
> the most expensive part).
>
> > Other uses are by statsd for telemetry, memory reporting on app kills
> > or death, and for "dumpsys meminfo".
>
> Here is another rookie question, it appears to me there is a file descriptor
> associated with each DMA buffer, can we achieve the same goal with
> a task-file iterator?
That would find almost all of them, but not the kernel-only
allocations. (kernel_rss in the dmabuf_dump output I attached earlier.
If there's a leak, it's likely to show up in kernel_rss because some
driver forgot to release its reference(s).) Also wouldn't that be a
ton more iterations since we'd have to visit every FD to find the
small portion that are dmabufs? I'm not actually sure if buffers that
have been mapped, and then have had their file descriptors closed
would show up in task_struct->files; if not I think that would mean
scanning both files and vmas for each task.
On Wed, Apr 16, 2025 at 4:08 PM Song Liu <song(a)kernel.org> wrote:
>
> On Wed, Apr 16, 2025 at 3:51 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> [...]
> > >
> > > IIUC, the iterator simply traverses elements in a linked list. I feel it is
> > > an overkill to implement a new BPF iterator for it.
> >
> > Like other BPF iterators such as kmem_cache_iter or task_iter.
> > Cgroup_iter iterates trees instead of lists. This is iterating over
> > kernel objects just like the docs say, "A BPF iterator is a type of
> > BPF program that allows users to iterate over specific types of kernel
> > objects". More complicated iteration should not be a requirement here.
> >
> > > Maybe we simply
> > > use debugging tools like crash or drgn for this? The access with
> > > these tools will not be protected by the mutex. But from my personal
> > > experience, this is not a big issue for user space debugging tools.
> >
> > drgn is *way* too slow, and even if it weren't the dependencies for
> > running it aren't available. crash needs debug symbols which also
> > aren't available on user builds. This is not just for manual
> > debugging, it's for reporting memory use in production. Or anything
> > else someone might care to extract like attachment info or refcounts.
>
> Could you please share more information about the use cases and
> the time constraint here, and why drgn is too slow. Is most of the delay
> comes from parsing DWARF? This is mostly for my curiosity, because
> I have been thinking about using drgn to do some monitoring in
> production.
>
> Thanks,
> Song
These RunCommands have 10 second timeouts for example. It's rare that
I see them get exceeded but it happens occasionally.:
https://cs.android.com/android/platform/superproject/main/+/main:frameworks…
The last time I used drgn (admittedly back in 2023) it took over a
minute to iterate through less than 200 cgroups. I'm not sure what the
root cause of the slowness was, but I'd expect the DWARF processing to
be done up-front once and the slowness I experienced was not just at
startup. Eventually I switched over to tracefs for that issue, which
we still use for some telemetry.
Other uses are by statsd for telemetry, memory reporting on app kills
or death, and for "dumpsys meminfo".
Thanks,
T.J.
On Wed, Apr 16, 2025 at 3:02 PM Song Liu <song(a)kernel.org> wrote:
>
> On Mon, Apr 14, 2025 at 3:53 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> [...]
> > +
> > +BTF_ID_LIST_GLOBAL_SINGLE(bpf_dmabuf_btf_id, struct, dma_buf)
> > +DEFINE_BPF_ITER_FUNC(dmabuf, struct bpf_iter_meta *meta, struct dma_buf *dmabuf)
> > +
> > +static void *dmabuf_iter_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > + struct dma_buf *dmabuf, *ret = NULL;
> > +
> > + if (*pos) {
> > + *pos = 0;
> > + return NULL;
> > + }
> > + /* Look for the first buffer we can obtain a reference to.
> > + * The list mutex does not protect a dmabuf's refcount, so it can be
> > + * zeroed while we are iterating. Therefore we cannot call get_dma_buf()
> > + * since the caller of this program may not already own a reference to
> > + * the buffer.
> > + */
> > + mutex_lock(&dmabuf_debugfs_list_mutex);
> > + list_for_each_entry(dmabuf, &dmabuf_debugfs_list, list_node) {
> > + if (file_ref_get(&dmabuf->file->f_ref)) {
> > + ret = dmabuf;
> > + break;
> > + }
> > + }
> > + mutex_unlock(&dmabuf_debugfs_list_mutex);
>
> IIUC, the iterator simply traverses elements in a linked list. I feel it is
> an overkill to implement a new BPF iterator for it.
Like other BPF iterators such as kmem_cache_iter or task_iter.
Cgroup_iter iterates trees instead of lists. This is iterating over
kernel objects just like the docs say, "A BPF iterator is a type of
BPF program that allows users to iterate over specific types of kernel
objects". More complicated iteration should not be a requirement here.
> Maybe we simply
> use debugging tools like crash or drgn for this? The access with
> these tools will not be protected by the mutex. But from my personal
> experience, this is not a big issue for user space debugging tools.
drgn is *way* too slow, and even if it weren't the dependencies for
running it aren't available. crash needs debug symbols which also
aren't available on user builds. This is not just for manual
debugging, it's for reporting memory use in production. Or anything
else someone might care to extract like attachment info or refcounts.
> Thanks,
> Song
>
>
> > +
> > + return ret;
> > +}
Until CONFIG_DMABUF_SYSFS_STATS was added [1] it was only possible to
perform per-buffer accounting with debugfs which is not suitable for
production environments. Eventually we discovered the overhead with
per-buffer sysfs file creation/removal was significantly impacting
allocation and free times, and exacerbated kernfs lock contention. [2]
dma_buf_stats_setup() is responsible for 39% of single-page buffer
creation duration, or 74% of single-page dma_buf_export() duration when
stressing dmabuf allocations and frees.
I prototyped a change from per-buffer to per-exporter statistics with a
RCU protected list of exporter allocations that accommodates most (but
not all) of our use-cases and avoids almost all of the sysfs overhead.
While that adds less overhead than per-buffer sysfs, and less even than
the maintenance of the dmabuf debugfs_list, it's still *additional*
overhead on top of the debugfs_list and doesn't give us per-buffer info.
This series uses the existing dmabuf debugfs_list to implement a BPF
dmabuf iterator, which adds no overhead to buffer allocation/free and
provides per-buffer info. While the kernel must have CONFIG_DEBUG_FS for
the dmabuf_iter to be available, debugfs does not need to be mounted.
The BPF program loaded by userspace that extracts per-buffer information
gets to define its own interface which avoids the lack of ABI stability
with debugfs (even if it were mounted).
As this is a replacement for our use of CONFIG_DMABUF_SYSFS_STATS, the
last patch is a RFC for removing it from the kernel. Please see my
suggestion there regarding the timeline for that.
[1] https://lore.kernel.org/linux-media/20201210044400.1080308-1-hridya@google.…
[2] https://lore.kernel.org/all/20220516171315.2400578-1-tjmercier@google.com/
T.J. Mercier (4):
dma-buf: Rename and expose debugfs symbols
bpf: Add dmabuf iterator
selftests/bpf: Add test for dmabuf_iter
RFC: dma-buf: Remove DMA-BUF statistics
.../ABI/testing/sysfs-kernel-dmabuf-buffers | 24 ---
Documentation/driver-api/dma-buf.rst | 5 -
drivers/dma-buf/Kconfig | 15 --
drivers/dma-buf/Makefile | 1 -
drivers/dma-buf/dma-buf-sysfs-stats.c | 202 ------------------
drivers/dma-buf/dma-buf-sysfs-stats.h | 35 ---
drivers/dma-buf/dma-buf.c | 40 +---
include/linux/btf_ids.h | 1 +
include/linux/dma-buf.h | 6 +
kernel/bpf/Makefile | 3 +
kernel/bpf/dmabuf_iter.c | 130 +++++++++++
tools/testing/selftests/bpf/config | 1 +
.../selftests/bpf/prog_tests/dmabuf_iter.c | 116 ++++++++++
.../testing/selftests/bpf/progs/dmabuf_iter.c | 31 +++
14 files changed, 299 insertions(+), 311 deletions(-)
delete mode 100644 Documentation/ABI/testing/sysfs-kernel-dmabuf-buffers
delete mode 100644 drivers/dma-buf/dma-buf-sysfs-stats.c
delete mode 100644 drivers/dma-buf/dma-buf-sysfs-stats.h
create mode 100644 kernel/bpf/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/progs/dmabuf_iter.c
--
2.49.0.604.gff1f9ca942-goog