From: Rob Clark <robdclark(a)chromium.org>
Conversion to DRM GPU VA Manager[1], and adding support for Vulkan Sparse
Memory[2] in the form of:
1. A new VM_BIND submitqueue type for executing VM MSM_SUBMIT_BO_OP_MAP/
MAP_NULL/UNMAP commands
2. A new VM_BIND ioctl to allow submitting batches of one or more
MAP/MAP_NULL/UNMAP commands to a VM_BIND submitqueue
I did not implement support for synchronous VM_BIND commands. Since
userspace could just immediately wait for the `SUBMIT` to complete, I don't
think we need this extra complexity in the kernel. Synchronous/immediate
VM_BIND operations could be implemented with a 2nd VM_BIND submitqueue.
The corresponding mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32533
Changes in v4:
- Replace selftests_running flag with IO_PGTABLE_QUIRK_NO_WARN_ON [Robin
Murphy]
- Rework msm_gem_vm_sm_step_remap() for cases that orig_vma is evicted
to solve some crashes
- Block when drm_file is closed until pending VM_BIND ops complete, before
tearing down the VM's scheduler, to solve some memory leaks.
- Link to v3: https://lore.kernel.org/all/20250428205619.227835-1-robdclark@gmail.com/
Changes in v3:
- Switched to separate VM_BIND ioctl. This makes the UABI a bit
cleaner, but OTOH the userspace code was cleaner when the end result
of either type of VkQueue lead to the same ioctl. So I'm a bit on
the fence.
- Switched to doing the gpuvm bookkeeping synchronously, and only
deferring the pgtable updates. This avoids needing to hold any resv
locks in the fence signaling path, resolving the last shrinker related
lockdep complaints. OTOH it means userspace can trigger invalid
pgtable updates with multiple VM_BIND queues. In this case, we ensure
that unmaps happen completely (to prevent userspace from using this to
access free'd pages), mark the context as unusable, and move on with
life.
- Link to v2: https://lore.kernel.org/all/20250319145425.51935-1-robdclark@gmail.com/
Changes in v2:
- Dropped Bibek Kumar Patro's arm-smmu patches[3], which have since been
merged.
- Pre-allocate all the things, and drop HACK patch which disabled shrinker.
This includes ensuring that vm_bo objects are allocated up front, pre-
allocating VMA objects, and pre-allocating pages used for pgtable updates.
The latter utilizes io_pgtable_cfg callbacks for pgtable alloc/free, that
were initially added for panthor.
- Add back support for BO dumping for devcoredump.
- Link to v1 (RFC): https://lore.kernel.org/dri-devel/20241207161651.410556-1-robdclark@gmail.c…
[1] https://www.kernel.org/doc/html/next/gpu/drm-mm.html#drm-gpuvm
[2] https://docs.vulkan.org/spec/latest/chapters/sparsemem.html
[3] https://patchwork.kernel.org/project/linux-arm-kernel/list/?series=909700
Rob Clark (33):
drm/gpuvm: Don't require obj lock in destructor path
drm/gpuvm: Allow VAs to hold soft reference to BOs
iommu/io-pgtable-arm: Add quirk to quiet WARN_ON()
drm/msm: Rename msm_file_private -> msm_context
drm/msm: Improve msm_context comments
drm/msm: Rename msm_gem_address_space -> msm_gem_vm
drm/msm: Remove vram carveout support
drm/msm: Collapse vma allocation and initialization
drm/msm: Collapse vma close and delete
drm/msm: Don't close VMAs on purge
drm/msm: drm_gpuvm conversion
drm/msm: Convert vm locking
drm/msm: Use drm_gpuvm types more
drm/msm: Split out helper to get iommu prot flags
drm/msm: Add mmu support for non-zero offset
drm/msm: Add PRR support
drm/msm: Rename msm_gem_vma_purge() -> _unmap()
drm/msm: Lazily create context VM
drm/msm: Add opt-in for VM_BIND
drm/msm: Mark VM as unusable on GPU hangs
drm/msm: Add _NO_SHARE flag
drm/msm: Crashdump prep for sparse mappings
drm/msm: rd dumping prep for sparse mappings
drm/msm: Crashdec support for sparse
drm/msm: rd dumping support for sparse
drm/msm: Extract out syncobj helpers
drm/msm: Use DMA_RESV_USAGE_BOOKKEEP/KERNEL
drm/msm: Add VM_BIND submitqueue
drm/msm: Support IO_PGTABLE_QUIRK_NO_WARN_ON
drm/msm: Support pgtable preallocation
drm/msm: Split out map/unmap ops
drm/msm: Add VM_BIND ioctl
drm/msm: Bump UAPI version
drivers/gpu/drm/drm_gpuvm.c | 15 +-
drivers/gpu/drm/msm/Kconfig | 1 +
drivers/gpu/drm/msm/Makefile | 1 +
drivers/gpu/drm/msm/adreno/a2xx_gpu.c | 25 +-
drivers/gpu/drm/msm/adreno/a2xx_gpummu.c | 5 +-
drivers/gpu/drm/msm/adreno/a3xx_gpu.c | 17 +-
drivers/gpu/drm/msm/adreno/a4xx_gpu.c | 17 +-
drivers/gpu/drm/msm/adreno/a5xx_debugfs.c | 4 +-
drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 22 +-
drivers/gpu/drm/msm/adreno/a5xx_power.c | 2 +-
drivers/gpu/drm/msm/adreno/a5xx_preempt.c | 10 +-
drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 32 +-
drivers/gpu/drm/msm/adreno/a6xx_gmu.h | 2 +-
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 49 +-
drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 6 +-
drivers/gpu/drm/msm/adreno/a6xx_preempt.c | 10 +-
drivers/gpu/drm/msm/adreno/adreno_device.c | 4 -
drivers/gpu/drm/msm/adreno/adreno_gpu.c | 88 +-
drivers/gpu/drm/msm/adreno/adreno_gpu.h | 23 +-
.../drm/msm/disp/dpu1/dpu_encoder_phys_wb.c | 14 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_formats.c | 18 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_formats.h | 2 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_kms.c | 18 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_plane.c | 14 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_plane.h | 4 +-
drivers/gpu/drm/msm/disp/mdp4/mdp4_crtc.c | 6 +-
drivers/gpu/drm/msm/disp/mdp4/mdp4_kms.c | 28 +-
drivers/gpu/drm/msm/disp/mdp4/mdp4_plane.c | 12 +-
drivers/gpu/drm/msm/disp/mdp5/mdp5_crtc.c | 4 +-
drivers/gpu/drm/msm/disp/mdp5/mdp5_kms.c | 19 +-
drivers/gpu/drm/msm/disp/mdp5/mdp5_plane.c | 12 +-
drivers/gpu/drm/msm/dsi/dsi_host.c | 14 +-
drivers/gpu/drm/msm/msm_drv.c | 183 +--
drivers/gpu/drm/msm/msm_drv.h | 35 +-
drivers/gpu/drm/msm/msm_fb.c | 18 +-
drivers/gpu/drm/msm/msm_fbdev.c | 2 +-
drivers/gpu/drm/msm/msm_gem.c | 489 +++----
drivers/gpu/drm/msm/msm_gem.h | 217 ++-
drivers/gpu/drm/msm/msm_gem_prime.c | 15 +
drivers/gpu/drm/msm/msm_gem_shrinker.c | 4 +-
drivers/gpu/drm/msm/msm_gem_submit.c | 295 ++--
drivers/gpu/drm/msm/msm_gem_vma.c | 1288 +++++++++++++++--
drivers/gpu/drm/msm/msm_gpu.c | 171 ++-
drivers/gpu/drm/msm/msm_gpu.h | 132 +-
drivers/gpu/drm/msm/msm_iommu.c | 298 +++-
drivers/gpu/drm/msm/msm_kms.c | 18 +-
drivers/gpu/drm/msm/msm_kms.h | 2 +-
drivers/gpu/drm/msm/msm_mmu.h | 38 +-
drivers/gpu/drm/msm/msm_rd.c | 62 +-
drivers/gpu/drm/msm/msm_ringbuffer.c | 4 +-
drivers/gpu/drm/msm/msm_submitqueue.c | 96 +-
drivers/gpu/drm/msm/msm_syncobj.c | 172 +++
drivers/gpu/drm/msm/msm_syncobj.h | 37 +
drivers/iommu/io-pgtable-arm.c | 27 +-
include/drm/drm_gpuvm.h | 12 +-
include/linux/io-pgtable.h | 8 +
include/uapi/drm/msm_drm.h | 149 +-
57 files changed, 3043 insertions(+), 1227 deletions(-)
create mode 100644 drivers/gpu/drm/msm/msm_syncobj.c
create mode 100644 drivers/gpu/drm/msm/msm_syncobj.h
--
2.49.0
Hi,
kernel test robot noticed the following build warnings:
[auto build test WARNING on jic23-iio/togreg]
[also build test WARNING on char-misc/char-misc-testing char-misc/char-misc-next char-misc/char-misc-linus usb/usb-testing usb/usb-next usb/usb-linus xen-tip/linux-next linus/master v6.15-rc4]
[cannot apply to tegra/for-next drm-xe/drm-xe-next rmk-arm/drm-armada-devel rmk-arm/drm-armada-fixes next-20250501]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/oushixiong1025-163-com/drm-p…
base: https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio.git togreg
patch link: https://lore.kernel.org/r/20250430085658.540746-2-oushixiong1025%40163.com
patch subject: [PATCH 2/3] drm/prime: Support importing DMA-BUF without sg_table
config: arc-randconfig-002-20250501 (https://download.01.org/0day-ci/archive/20250502/202505022224.FCDQ8TCB-lkp@…)
compiler: arc-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250502/202505022224.FCDQ8TCB-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505022224.FCDQ8TCB-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/gpu/drm/drm_prime.c:925:24: warning: no previous prototype for 'drm_gem_prime_import_dev_skip_map' [-Wmissing-prototypes]
925 | struct drm_gem_object *drm_gem_prime_import_dev_skip_map(struct drm_device *dev,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
vim +/drm_gem_prime_import_dev_skip_map +925 drivers/gpu/drm/drm_prime.c
913
914 /**
915 * drm_gem_prime_import_dev_skip_map - core implementation of the import callback
916 * @dev: drm_device to import into
917 * @dma_buf: dma-buf object to import
918 * @attach_dev: struct device to dma_buf attach
919 *
920 * This function exports a dma-buf without get it's scatter/gather table.
921 *
922 * Drivers who need to get an scatter/gather table for objects need to call
923 * drm_gem_prime_import_dev() instead.
924 */
> 925 struct drm_gem_object *drm_gem_prime_import_dev_skip_map(struct drm_device *dev,
926 struct dma_buf *dma_buf,
927 struct device *attach_dev)
928 {
929 struct dma_buf_attachment *attach;
930 struct drm_gem_object *obj;
931 int ret;
932
933 if (dma_buf->ops == &drm_gem_prime_dmabuf_ops) {
934 obj = dma_buf->priv;
935 if (obj->dev == dev) {
936 /*
937 * Importing dmabuf exported from our own gem increases
938 * refcount on gem itself instead of f_count of dmabuf.
939 */
940 drm_gem_object_get(obj);
941 return obj;
942 }
943 }
944
945 attach = dma_buf_attach(dma_buf, attach_dev, true);
946 if (IS_ERR(attach))
947 return ERR_CAST(attach);
948
949 get_dma_buf(dma_buf);
950
951 obj = dev->driver->gem_prime_import_attachment(dev, attach);
952 if (IS_ERR(obj)) {
953 ret = PTR_ERR(obj);
954 goto fail_detach;
955 }
956
957 obj->import_attach = attach;
958 obj->resv = dma_buf->resv;
959
960 return obj;
961
962 fail_detach:
963 dma_buf_detach(dma_buf, attach);
964 dma_buf_put(dma_buf);
965
966 return ERR_PTR(ret);
967 }
968 EXPORT_SYMBOL(drm_gem_prime_import_dev_skip_map);
969
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi,
kernel test robot noticed the following build warnings:
[auto build test WARNING on jic23-iio/togreg]
[also build test WARNING on char-misc/char-misc-testing char-misc/char-misc-next char-misc/char-misc-linus usb/usb-testing usb/usb-next usb/usb-linus xen-tip/linux-next linus/master v6.15-rc4]
[cannot apply to tegra/for-next drm-xe/drm-xe-next rmk-arm/drm-armada-devel rmk-arm/drm-armada-fixes next-20250430]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/oushixiong1025-163-com/drm-p…
base: https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio.git togreg
patch link: https://lore.kernel.org/r/20250430085658.540746-2-oushixiong1025%40163.com
patch subject: [PATCH 2/3] drm/prime: Support importing DMA-BUF without sg_table
config: arm64-randconfig-003-20250501 (https://download.01.org/0day-ci/archive/20250501/202505011655.qTmh4UA7-lkp@…)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project f819f46284f2a79790038e1f6649172789734ae8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250501/202505011655.qTmh4UA7-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505011655.qTmh4UA7-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/gpu/drm/drm_prime.c:925:24: warning: no previous prototype for function 'drm_gem_prime_import_dev_skip_map' [-Wmissing-prototypes]
925 | struct drm_gem_object *drm_gem_prime_import_dev_skip_map(struct drm_device *dev,
| ^
drivers/gpu/drm/drm_prime.c:925:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
925 | struct drm_gem_object *drm_gem_prime_import_dev_skip_map(struct drm_device *dev,
| ^
| static
1 warning generated.
vim +/drm_gem_prime_import_dev_skip_map +925 drivers/gpu/drm/drm_prime.c
913
914 /**
915 * drm_gem_prime_import_dev_skip_map - core implementation of the import callback
916 * @dev: drm_device to import into
917 * @dma_buf: dma-buf object to import
918 * @attach_dev: struct device to dma_buf attach
919 *
920 * This function exports a dma-buf without get it's scatter/gather table.
921 *
922 * Drivers who need to get an scatter/gather table for objects need to call
923 * drm_gem_prime_import_dev() instead.
924 */
> 925 struct drm_gem_object *drm_gem_prime_import_dev_skip_map(struct drm_device *dev,
926 struct dma_buf *dma_buf,
927 struct device *attach_dev)
928 {
929 struct dma_buf_attachment *attach;
930 struct drm_gem_object *obj;
931 int ret;
932
933 if (dma_buf->ops == &drm_gem_prime_dmabuf_ops) {
934 obj = dma_buf->priv;
935 if (obj->dev == dev) {
936 /*
937 * Importing dmabuf exported from our own gem increases
938 * refcount on gem itself instead of f_count of dmabuf.
939 */
940 drm_gem_object_get(obj);
941 return obj;
942 }
943 }
944
945 attach = dma_buf_attach(dma_buf, attach_dev, true);
946 if (IS_ERR(attach))
947 return ERR_CAST(attach);
948
949 get_dma_buf(dma_buf);
950
951 obj = dev->driver->gem_prime_import_attachment(dev, attach);
952 if (IS_ERR(obj)) {
953 ret = PTR_ERR(obj);
954 goto fail_detach;
955 }
956
957 obj->import_attach = attach;
958 obj->resv = dma_buf->resv;
959
960 return obj;
961
962 fail_detach:
963 dma_buf_detach(dma_buf, attach);
964 dma_buf_put(dma_buf);
965
966 return ERR_PTR(ret);
967 }
968 EXPORT_SYMBOL(drm_gem_prime_import_dev_skip_map);
969
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On 4/29/25 08:39, Simona Vetter wrote:
> Catching up after spring break, hence the late reply ...
>
> On Fri, Apr 11, 2025 at 02:34:37PM -0400, Nicolas Dufresne wrote:
>> Le jeudi 10 avril 2025 à 16:53 +0200, Bastien Curutchet a écrit :
>>> There is no way to transmit the DMA address of a buffer to userspace.
>>> Some UIO users need this to handle DMA from userspace.
>>
>> To me this API is against all safe practice we've been pushing forward
>> and has no place in DMA_BUF API.
>>
>> If this is fine for the UIO subsystem to pass around physicial
>> addresses, then make this part of the UIO device ioctl.
>
> Yeah, this has no business in dma-buf since the entire point of dma-buf
> was to stop all the nasty "just pass raw dma addr in userspace" hacks that
> preceeded it.
>
> And over the years since dma-buf landed, we've removed a lot of these,
> like dri1 drivers. Or where that's not possible like with fbdev, hid the
> raw dma addr uapi behind a Kconfig.
>
> I concur with the overall sentiment that this should be done in
> vfio/iommufd interfaces, maybe with some support added to map dma-buf. I
> think patches for that have been floating around for a while, but I lost a
> bit the status of where exactly they are.
My take away is that we need to have a documented way for special driver specific interfaces in DMA-buf.
In other words DMA-buf has some standardized rules of doing things which every implementation should follow. The implementations might of course still have bugs (e.g. allocate memory for a dma_fence operation), but at least we have documented what should be done and what's forbidden.
What is still missing in the documentation is the use case when you have for example vfio which wants to talk to iommufd through a specialized interface. This doesn't necessarily needs to be part of DMA-buf, but we should still document "do it this way" because that has already worked in the last ten use cases and we don't want people to re-invent the wheel in a new funky way which then later turns out to not work.
Regards,
Christian.
>
> Cheers, Sima
>
>>
>> regards,
>> Nicolas
>>
>>>
>>> Add a new dma_buf_ops operation that returns the DMA address.
>>> Add a new ioctl to transmit this DMA address to userspace.
>>>
>>> Signed-off-by: Bastien Curutchet <bastien.curutchet(a)bootlin.com>
>>> ---
>>> drivers/dma-buf/dma-buf.c | 21 +++++++++++++++++++++
>>> include/linux/dma-buf.h | 1 +
>>> include/uapi/linux/dma-buf.h | 1 +
>>> 3 files changed, 23 insertions(+)
>>>
>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>> index
>>> 398418bd9731ad7a3a1f12eaea6a155fa77a22fe..cbbb518981e54e50f479c3d1fcf
>>> 6da6971f639c1 100644
>>> --- a/drivers/dma-buf/dma-buf.c
>>> +++ b/drivers/dma-buf/dma-buf.c
>>> @@ -454,6 +454,24 @@ static long dma_buf_import_sync_file(struct
>>> dma_buf *dmabuf,
>>> }
>>> #endif
>>>
>>> +static int dma_buf_get_dma_addr(struct dma_buf *dmabuf, u64 __user
>>> *arg)
>>> +{
>>> + u64 addr;
>>> + int ret;
>>> +
>>> + if (!dmabuf->ops->get_dma_addr)
>>> + return -EINVAL;
>>> +
>>> + ret = dmabuf->ops->get_dma_addr(dmabuf, &addr);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + if (copy_to_user(arg, &addr, sizeof(u64)))
>>> + return -EFAULT;
>>> +
>>> + return 0;
>>> +}
>>> +
>>> static long dma_buf_ioctl(struct file *file,
>>> unsigned int cmd, unsigned long arg)
>>> {
>>> @@ -504,6 +522,9 @@ static long dma_buf_ioctl(struct file *file,
>>> return dma_buf_import_sync_file(dmabuf, (const void
>>> __user *)arg);
>>> #endif
>>>
>>> + case DMA_BUF_IOCTL_GET_DMA_ADDR:
>>> + return dma_buf_get_dma_addr(dmabuf, (u64 __user
>>> *)arg);
>>> +
>>> default:
>>> return -ENOTTY;
>>> }
>>> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
>>> index
>>> 36216d28d8bdc01a9c9c47e27c392413f7f6c5fb..ed4bf15d3ce82e7a86323fff459
>>> 699a9bc8baa3b 100644
>>> --- a/include/linux/dma-buf.h
>>> +++ b/include/linux/dma-buf.h
>>> @@ -285,6 +285,7 @@ struct dma_buf_ops {
>>>
>>> int (*vmap)(struct dma_buf *dmabuf, struct iosys_map *map);
>>> void (*vunmap)(struct dma_buf *dmabuf, struct iosys_map
>>> *map);
>>> + int (*get_dma_addr)(struct dma_buf *dmabuf, u64 *addr);
>>> };
>>>
>>> /**
>>> diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-
>>> buf.h
>>> index
>>> 5a6fda66d9adf01438619e7e67fa69f0fec2d88d..f3aba46942042de6a2e3a4cca3e
>>> b3f87175e29c9 100644
>>> --- a/include/uapi/linux/dma-buf.h
>>> +++ b/include/uapi/linux/dma-buf.h
>>> @@ -178,5 +178,6 @@ struct dma_buf_import_sync_file {
>>> #define DMA_BUF_SET_NAME_B _IOW(DMA_BUF_BASE, 1, __u64)
>>> #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2,
>>> struct dma_buf_export_sync_file)
>>> #define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct
>>> dma_buf_import_sync_file)
>>> +#define DMA_BUF_IOCTL_GET_DMA_ADDR _IOR(DMA_BUF_BASE, 4, __u64
>>> *)
>>>
>>> #endif
>
From: Rob Clark <robdclark(a)chromium.org>
Conversion to DRM GPU VA Manager[1], and adding support for Vulkan Sparse
Memory[2] in the form of:
1. A new VM_BIND submitqueue type for executing VM MSM_SUBMIT_BO_OP_MAP/
MAP_NULL/UNMAP commands
2. A new VM_BIND ioctl to allow submitting batches of one or more
MAP/MAP_NULL/UNMAP commands to a VM_BIND submitqueue
I did not implement support for synchronous VM_BIND commands. Since
userspace could just immediately wait for the `SUBMIT` to complete, I don't
think we need this extra complexity in the kernel. Synchronous/immediate
VM_BIND operations could be implemented with a 2nd VM_BIND submitqueue.
The corresponding mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32533
Changes in v3:
- Switched to seperate VM_BIND ioctl. This makes the UABI a bit
cleaner, but OTOH the userspace code was cleaner when the end result
of either type of VkQueue lead to the same ioctl. So I'm a bit on
the fence.
- Switched to doing the gpuvm bookkeeping synchronously, and only
deferring the pgtable updates. This avoids needing to hold any resv
locks in the fence signaling path, resolving the last shrinker related
lockdep complaints. OTOH it means userspace can trigger invalid
pgtable updates with multiple VM_BIND queues. In this case, we ensure
that unmaps happen completely (to prevent userspace from using this to
access free'd pages), mark the context as unusable, and move on with
life.
- Link to v2: https://lore.kernel.org/all/20250319145425.51935-1-robdclark@gmail.com/
Changes in v2:
- Dropped Bibek Kumar Patro's arm-smmu patches[3], which have since been
merged.
- Pre-allocate all the things, and drop HACK patch which disabled shrinker.
This includes ensuring that vm_bo objects are allocated up front, pre-
allocating VMA objects, and pre-allocating pages used for pgtable updates.
The latter utilizes io_pgtable_cfg callbacks for pgtable alloc/free, that
were initially added for panthor.
- Add back support for BO dumping for devcoredump.
- Link to v1 (RFC): https://lore.kernel.org/dri-devel/20241207161651.410556-1-robdclark@gmail.c…
[1] https://www.kernel.org/doc/html/next/gpu/drm-mm.html#drm-gpuvm
[2] https://docs.vulkan.org/spec/latest/chapters/sparsemem.html
[3] https://patchwork.kernel.org/project/linux-arm-kernel/list/?series=909700
Rob Clark (33):
drm/gpuvm: Don't require obj lock in destructor path
drm/gpuvm: Allow VAs to hold soft reference to BOs
iommu/io-pgtable-arm: Add quirk to quiet WARN_ON()
drm/msm: Rename msm_file_private -> msm_context
drm/msm: Improve msm_context comments
drm/msm: Rename msm_gem_address_space -> msm_gem_vm
drm/msm: Remove vram carveout support
drm/msm: Collapse vma allocation and initialization
drm/msm: Collapse vma close and delete
drm/msm: Don't close VMAs on purge
drm/msm: drm_gpuvm conversion
drm/msm: Convert vm locking
drm/msm: Use drm_gpuvm types more
drm/msm: Split out helper to get iommu prot flags
drm/msm: Add mmu support for non-zero offset
drm/msm: Add PRR support
drm/msm: Rename msm_gem_vma_purge() -> _unmap()
drm/msm: Lazily create context VM
drm/msm: Add opt-in for VM_BIND
drm/msm: Mark VM as unusable on GPU hangs
drm/msm: Add _NO_SHARE flag
drm/msm: Crashdump prep for sparse mappings
drm/msm: rd dumping prep for sparse mappings
drm/msm: Crashdec support for sparse
drm/msm: rd dumping support for sparse
drm/msm: Extract out syncobj helpers
drm/msm: Use DMA_RESV_USAGE_BOOKKEEP/KERNEL
drm/msm: Add VM_BIND submitqueue
drm/msm: Support IO_PGTABLE_QUIRK_NO_WARN_ON
drm/msm: Support pgtable preallocation
drm/msm: Split out map/unmap ops
drm/msm: Add VM_BIND ioctl
drm/msm: Bump UAPI version
drivers/gpu/drm/drm_gpuvm.c | 15 +-
drivers/gpu/drm/msm/Kconfig | 1 +
drivers/gpu/drm/msm/Makefile | 1 +
drivers/gpu/drm/msm/adreno/a2xx_gpu.c | 25 +-
drivers/gpu/drm/msm/adreno/a2xx_gpummu.c | 5 +-
drivers/gpu/drm/msm/adreno/a3xx_gpu.c | 17 +-
drivers/gpu/drm/msm/adreno/a4xx_gpu.c | 17 +-
drivers/gpu/drm/msm/adreno/a5xx_debugfs.c | 4 +-
drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 22 +-
drivers/gpu/drm/msm/adreno/a5xx_power.c | 2 +-
drivers/gpu/drm/msm/adreno/a5xx_preempt.c | 10 +-
drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 32 +-
drivers/gpu/drm/msm/adreno/a6xx_gmu.h | 2 +-
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 49 +-
drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 6 +-
drivers/gpu/drm/msm/adreno/a6xx_preempt.c | 10 +-
drivers/gpu/drm/msm/adreno/adreno_device.c | 4 -
drivers/gpu/drm/msm/adreno/adreno_gpu.c | 88 +-
drivers/gpu/drm/msm/adreno/adreno_gpu.h | 23 +-
.../drm/msm/disp/dpu1/dpu_encoder_phys_wb.c | 14 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_formats.c | 18 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_formats.h | 2 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_kms.c | 18 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_plane.c | 14 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_plane.h | 4 +-
drivers/gpu/drm/msm/disp/mdp4/mdp4_crtc.c | 6 +-
drivers/gpu/drm/msm/disp/mdp4/mdp4_kms.c | 28 +-
drivers/gpu/drm/msm/disp/mdp4/mdp4_plane.c | 12 +-
drivers/gpu/drm/msm/disp/mdp5/mdp5_crtc.c | 4 +-
drivers/gpu/drm/msm/disp/mdp5/mdp5_kms.c | 19 +-
drivers/gpu/drm/msm/disp/mdp5/mdp5_plane.c | 12 +-
drivers/gpu/drm/msm/dsi/dsi_host.c | 14 +-
drivers/gpu/drm/msm/msm_drv.c | 183 +--
drivers/gpu/drm/msm/msm_drv.h | 35 +-
drivers/gpu/drm/msm/msm_fb.c | 18 +-
drivers/gpu/drm/msm/msm_fbdev.c | 2 +-
drivers/gpu/drm/msm/msm_gem.c | 489 +++----
drivers/gpu/drm/msm/msm_gem.h | 217 ++-
drivers/gpu/drm/msm/msm_gem_prime.c | 15 +
drivers/gpu/drm/msm/msm_gem_shrinker.c | 4 +-
drivers/gpu/drm/msm/msm_gem_submit.c | 295 ++--
drivers/gpu/drm/msm/msm_gem_vma.c | 1265 +++++++++++++++--
drivers/gpu/drm/msm/msm_gpu.c | 171 ++-
drivers/gpu/drm/msm/msm_gpu.h | 132 +-
drivers/gpu/drm/msm/msm_iommu.c | 298 +++-
drivers/gpu/drm/msm/msm_kms.c | 18 +-
drivers/gpu/drm/msm/msm_kms.h | 2 +-
drivers/gpu/drm/msm/msm_mmu.h | 38 +-
drivers/gpu/drm/msm/msm_rd.c | 62 +-
drivers/gpu/drm/msm/msm_ringbuffer.c | 4 +-
drivers/gpu/drm/msm/msm_submitqueue.c | 86 +-
drivers/gpu/drm/msm/msm_syncobj.c | 172 +++
drivers/gpu/drm/msm/msm_syncobj.h | 37 +
drivers/iommu/io-pgtable-arm.c | 18 +-
include/drm/drm_gpuvm.h | 12 +-
include/linux/io-pgtable.h | 8 +
include/uapi/drm/msm_drm.h | 149 +-
57 files changed, 3012 insertions(+), 1216 deletions(-)
create mode 100644 drivers/gpu/drm/msm/msm_syncobj.c
create mode 100644 drivers/gpu/drm/msm/msm_syncobj.h
--
2.49.0
On 4/24/25 15:02, Philipp Stanner wrote:
> In nouveau_fence_done(), a fence is checked for being signaled by
> manually evaluating the base fence's bits. This can be done in a
> canonical manner through dma_fence_is_signaled().
>
> Replace the bit-check with dma_fence_is_signaled().
>
> Signed-off-by: Philipp Stanner <phasta(a)kernel.org>
I think the bit check was used here as fast path optimization because we later call dma_fence_is_signaled() anyway.
Feel free to add my acked-by, but honestly what nouveau does here looks rather suspicious to me.
Regards,
Christian.
> ---
> drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
> index fb9811938c82..d5654e26d5bc 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_fence.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
> @@ -253,7 +253,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
> struct nouveau_channel *chan;
> unsigned long flags;
>
> - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags))
> + if (dma_fence_is_signaled(&fence->base))
> return true;
>
> spin_lock_irqsave(&fctx->lock, flags);
Hi,
This series is the follow-up of the discussion that John and I had some
time ago here:
https://lore.kernel.org/all/CANDhNCquJn6bH3KxKf65BWiTYLVqSd9892-xtFDHHqqyrr…
The initial problem we were discussing was that I'm currently working on
a platform which has a memory layout with ECC enabled. However, enabling
the ECC has a number of drawbacks on that platform: lower performance,
increased memory usage, etc. So for things like framebuffers, the
trade-off isn't great and thus there's a memory region with ECC disabled
to allocate from for such use cases.
After a suggestion from John, I chose to first start using heap
allocations flags to allow for userspace to ask for a particular ECC
setup. This is then backed by a new heap type that runs from reserved
memory chunks flagged as such, and the existing DT properties to specify
the ECC properties.
After further discussion, it was considered that flags were not the
right solution, and relying on the names of the heaps would be enough to
let userspace know the kind of buffer it deals with.
Thus, even though the uAPI part of it has been dropped in this second
version, we still need a driver to create heaps out of carved-out memory
regions. In addition to the original usecase, a similar driver can be
found in BSPs from most vendors, so I believe it would be a useful
addition to the kernel.
I submitted a draft PR to the DT schema for the bindings used in this
PR:
https://github.com/devicetree-org/dt-schema/pull/138
Let me know what you think,
Maxime
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
Changes in v3:
- Reworked global variable patch
- Link to v2: https://lore.kernel.org/r/20250401-dma-buf-ecc-heap-v2-0-043fd006a1af@kerne…
Changes in v2:
- Add vmap/vunmap operations
- Drop ECC flags uapi
- Rebase on top of 6.14
- Link to v1: https://lore.kernel.org/r/20240515-dma-buf-ecc-heap-v1-0-54cbbd049511@kerne…
---
Maxime Ripard (2):
dma-buf: heaps: system: Remove global variable
dma-buf: heaps: Introduce a new heap for reserved memory
drivers/dma-buf/heaps/Kconfig | 8 +
drivers/dma-buf/heaps/Makefile | 1 +
drivers/dma-buf/heaps/carveout_heap.c | 360 ++++++++++++++++++++++++++++++++++
drivers/dma-buf/heaps/system_heap.c | 3 +-
4 files changed, 370 insertions(+), 2 deletions(-)
---
base-commit: fcbf30774e82a441890b722bf0c26542fb82150f
change-id: 20240515-dma-buf-ecc-heap-28a311d2c94e
Best regards,
--
Maxime Ripard <mripard(a)kernel.org>
On Tue, Apr 22, 2025 at 4:01 PM Alexei Starovoitov
<alexei.starovoitov(a)gmail.com> wrote:
>
> On Tue, Apr 22, 2025 at 12:57 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > On Mon, Apr 21, 2025 at 4:39 PM Alexei Starovoitov
> > <alexei.starovoitov(a)gmail.com> wrote:
> > >
> > > On Mon, Apr 21, 2025 at 1:40 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> > > >
> > > > > > new file mode 100644
> > > > > > index 000000000000..b4b8be1d6aa4
> > > > > > --- /dev/null
> > > > > > +++ b/kernel/bpf/dmabuf_iter.c
> > > > >
> > > > > Maybe we should add this file to drivers/dma-buf. I would like to
> > > > > hear other folks thoughts on this.
> > > >
> > > > This is fine with me, and would save us the extra
> > > > CONFIG_DMA_SHARED_BUFFER check that's currently needed in
> > > > kernel/bpf/Makefile but would require checking CONFIG_BPF instead.
> > > > Sumit / Christian any objections to moving the dmabuf bpf iterator
> > > > implementation into drivers/dma-buf?
> > >
> > > The driver directory would need to 'depends on BPF_SYSCALL'.
> > > Are you sure you want this?
> > > imo kernel/bpf/ is fine for this.
> >
> > I don't have a strong preference so either way is fine with me. The
> > main difference I see is maintainership.
> >
> > > You also probably want
> > > .feature = BPF_ITER_RESCHED
> > > in bpf_dmabuf_reg_info.
> >
> > Thank you, this looks like a good idea.
> >
> > > Also have you considered open coded iterator for dmabufs?
> > > Would it help with the interface to user space?
> >
> > I read through the open coded iterator patches, and it looks like they
> > would be slightly more efficient by avoiding seq_file overhead. As far
> > as the interface to userspace, for the purpose of replacing what's
> > currently exposed by CONFIG_DMABUF_SYSFS_STATS I don't think there is
> > a difference. However it looks like if I were to try to replace all of
> > our userspace analysis of dmabufs with a single bpf program then an
> > open coded iterator would make that much easier. I had not considered
> > attempting that.
> >
> > One problem I see with open coded iterators is that support is much
> > more recent (2023 vs 2020). We support longterm stable kernels (back
> > to 5.4 currently but probably 5.10 by the time this would be used), so
> > it seems like it would be harder to backport the kernel support for an
> > open-coded iterator that far since it only goes back as far as 6.6
> > now. Actually it doesn't look like it is possible while also
> > maintaining the stable ABI we provide to device vendors. Which means
> > we couldn't get rid of the dmabuf sysfs stats userspace dependency
> > until 6.1 EOL in Dec. 2027. :\ So I'm in favor of a traditional bpf
> > iterator here for now.
>
> Fair enough, but please implement both and backport only
> the old style pinned iterator.
Ok, will do.
> The code will be mostly shared between them.
> bpf_iter_dmabuf_new/_next will be more flexible with more
> options to return data to user space. Like android can invent
> their own binary format. Pack into it in a bpf prog, send to
> bpf ringbuf and unmarshal efficiently in user space.
> Instead of being limited to text output that pinned iterators
> are supposed to do usually.
Also a neat idea!
> You can do binary with bpf_seq_write() too, but it's rare.
>
> Also please provide full bpf prog that you'll use in production
> in a selftest instead of trivial:
> +SEC("iter/dmabuf")
> +int dmabuf_collector(struct bpf_iter__dmabuf *ctx)
>
> just to make sure it's tested end to end and future changes
> won't break it.
The final bpf program should be something pretty close to that, but
I'll start working on the AOSP side as well so I can put up patches.
>
> pw-bot: cr
Hello Jared,
On Wed, 23 Apr 2025 at 00:49, Jared Kangas <jkangas(a)redhat.com> wrote:
>
> Hi all,
>
> This patch series is based on a previous discussion around CMA heap
> naming. [1] The heap's name depends on the device name, which is
> generally "reserved", "linux,cma", or "default-pool", but could be any
> arbitrary name given to the default CMA area in the devicetree. For a
> consistent userspace interface, the series introduces a constant name
> for the CMA heap, and for backwards compatibility, an additional Kconfig
> that controls the creation of a legacy-named heap with the same CMA
> backing.
>
> The ideas to handle backwards compatibility in [1] are to either use a
> symlink or add a heap node with a duplicate minor. However, I assume
> that we don't want to create symlinks in /dev from module initcalls, and
> attempting to duplicate minors would cause device_create() to fail.
> Because of these drawbacks, after brainstorming with Maxime Ripard, I
> went with creating a new node in devtmpfs with its own minor. This
> admittedly makes it a little unclear that the old and new nodes are
> backed by the same heap when both are present. The only approach that I
> think would provide total clarity on this in userspace is symlinking,
> which seemed like a fairly involved solution for devtmpfs, but if I'm
> wrong on this, please let me know.
Thanks indeed for this patch; just one minor nit: the link referred to
as [1] here seems to be missing. Could you please add it? This would
make it easier to follow the chain of discussion in posterity.
>
> Changelog:
> v2: Use tabs instead of spaces for large vertical alignment.
>
> Jared Kangas (2):
> dma-buf: heaps: Parameterize heap name in __add_cma_heap()
> dma-buf: heaps: Give default CMA heap a fixed name
>
> Documentation/userspace-api/dma-buf-heaps.rst | 11 ++++---
> drivers/dma-buf/heaps/Kconfig | 10 +++++++
> drivers/dma-buf/heaps/cma_heap.c | 30 ++++++++++++++-----
> 3 files changed, 40 insertions(+), 11 deletions(-)
>
> --
> 2.49.0
>
Best,
Sumit
On Tue, Apr 22, 2025 at 12:19 PM Jared Kangas <jkangas(a)redhat.com> wrote:
>
> The CMA heap's name in devtmpfs can vary depending on how the heap is
> defined. Its name defaults to "reserved", but if a CMA area is defined
> in the devicetree, the heap takes on the devicetree node's name, such as
> "default-pool" or "linux,cma". To simplify naming, just name it
> "default_cma", and keep a legacy node in place backed by the same
> underlying structure for backwards compatibility.
>
> Signed-off-by: Jared Kangas <jkangas(a)redhat.com>
Once again, thanks for working out how to improve the standard naming
while keeping compatibility.
I do still hope we can get to the point where other cma regions can be
registered as heaps with unique/purpose-specific names, but I can see
having a standard name for the default region is a nice improvement.
Acked-by: John Stultz <jstultz(a)google.com>
thanks
-john
On Tue, Apr 22, 2025 at 12:19 PM Jared Kangas <jkangas(a)redhat.com> wrote:
>
> Prepare for the introduction of a fixed-name CMA heap by replacing the
> unused void pointer parameter in __add_cma_heap() with the heap name.
>
> Signed-off-by: Jared Kangas <jkangas(a)redhat.com>
Thanks so much for taking this effort on. Looks good to me!
Acked-by: John Stultz <jstultz(a)google.com>
On Mon, Apr 21, 2025 at 4:39 PM Alexei Starovoitov
<alexei.starovoitov(a)gmail.com> wrote:
>
> On Mon, Apr 21, 2025 at 1:40 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > > > new file mode 100644
> > > > index 000000000000..b4b8be1d6aa4
> > > > --- /dev/null
> > > > +++ b/kernel/bpf/dmabuf_iter.c
> > >
> > > Maybe we should add this file to drivers/dma-buf. I would like to
> > > hear other folks thoughts on this.
> >
> > This is fine with me, and would save us the extra
> > CONFIG_DMA_SHARED_BUFFER check that's currently needed in
> > kernel/bpf/Makefile but would require checking CONFIG_BPF instead.
> > Sumit / Christian any objections to moving the dmabuf bpf iterator
> > implementation into drivers/dma-buf?
>
> The driver directory would need to 'depends on BPF_SYSCALL'.
> Are you sure you want this?
> imo kernel/bpf/ is fine for this.
I don't have a strong preference so either way is fine with me. The
main difference I see is maintainership.
> You also probably want
> .feature = BPF_ITER_RESCHED
> in bpf_dmabuf_reg_info.
Thank you, this looks like a good idea.
> Also have you considered open coded iterator for dmabufs?
> Would it help with the interface to user space?
I read through the open coded iterator patches, and it looks like they
would be slightly more efficient by avoiding seq_file overhead. As far
as the interface to userspace, for the purpose of replacing what's
currently exposed by CONFIG_DMABUF_SYSFS_STATS I don't think there is
a difference. However it looks like if I were to try to replace all of
our userspace analysis of dmabufs with a single bpf program then an
open coded iterator would make that much easier. I had not considered
attempting that.
One problem I see with open coded iterators is that support is much
more recent (2023 vs 2020). We support longterm stable kernels (back
to 5.4 currently but probably 5.10 by the time this would be used), so
it seems like it would be harder to backport the kernel support for an
open-coded iterator that far since it only goes back as far as 6.6
now. Actually it doesn't look like it is possible while also
maintaining the stable ABI we provide to device vendors. Which means
we couldn't get rid of the dmabuf sysfs stats userspace dependency
until 6.1 EOL in Dec. 2027. :\ So I'm in favor of a traditional bpf
iterator here for now.
On Mon, Apr 14, 2025 at 09:21:25PM +0200, Thomas Petazzoni wrote:
> > "UIO is a broken legacy mess, so let's add more broken things
> > to it as broken + broken => still broken, so no harm done", am I
> > getting that right?
>
> Who says UIO is a "broken legacy mess"? Only you says so. I don't see
> any indication anywhere in the kernel tree suggesting that UIO is
> considered a broken legacy mess.
Explain what the difference is between UIO and VFIO, especially VFIO
no-iommu mode?
I've always understood that UIO is for very simple devices that cannot
do DMA. So it's very simple operating model and simple security work
fine.
IMHO, if the can use DMA it should use VFIO. If you have no iommu then
you should use the VFIO unsafe no-iommu path. It still provides a
solid framework.
As to this series, I have seen a number of requests to improve the
VFIO no-iommu path to allow working with the existing IOAS scheme to
register memory but to allow the kernel the return the no-iommu
DMAable address of the IOAS pinned memory. This would replace the
hacky use of mlock and /proc/XX/pagemap that people use today.
If that were done, could you use VFIO no-iommu?
> Keep in mind that when you're running code as root, you can load a
> kernel module, which can do anything on the system security-wise. So
> letting UIO expose MMIO registers of devices to userspace applications
> running as root is not any worse than that.
That isn't fully true.. UIO isn't fitting into the security model by
allowing DMA capable devices to be exposed without checking for
CAP_SYS_RAW_IO first.
Jason
On Mon, Apr 21, 2025 at 10:58 AM Song Liu <song(a)kernel.org> wrote:
>
> On Mon, Apr 14, 2025 at 3:53 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > The dmabuf iterator traverses the list of all DMA buffers. The list is
> > maintained only when CONFIG_DEBUG_FS is enabled.
> >
> > DMA buffers are refcounted through their associated struct file. A
> > reference is taken on each buffer as the list is iterated to ensure each
> > buffer persists for the duration of the bpf program execution without
> > holding the list mutex.
> >
> > Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
> > ---
> > include/linux/btf_ids.h | 1 +
> > kernel/bpf/Makefile | 3 +
> > kernel/bpf/dmabuf_iter.c | 130 +++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 134 insertions(+)
> > create mode 100644 kernel/bpf/dmabuf_iter.c
> >
> > diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
> > index 139bdececdcf..899ead57d89d 100644
> > --- a/include/linux/btf_ids.h
> > +++ b/include/linux/btf_ids.h
> > @@ -284,5 +284,6 @@ extern u32 bpf_cgroup_btf_id[];
> > extern u32 bpf_local_storage_map_btf_id[];
> > extern u32 btf_bpf_map_id[];
> > extern u32 bpf_kmem_cache_btf_id[];
> > +extern u32 bpf_dmabuf_btf_id[];
>
> This is not necessary. See below.
>
> >
> > #endif
> > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > index 70502f038b92..5b30d37ef055 100644
> > --- a/kernel/bpf/Makefile
> > +++ b/kernel/bpf/Makefile
> > @@ -53,6 +53,9 @@ obj-$(CONFIG_BPF_SYSCALL) += relo_core.o
> > obj-$(CONFIG_BPF_SYSCALL) += btf_iter.o
> > obj-$(CONFIG_BPF_SYSCALL) += btf_relocate.o
> > obj-$(CONFIG_BPF_SYSCALL) += kmem_cache_iter.o
> > +ifeq ($(CONFIG_DEBUG_FS),y)
> > +obj-$(CONFIG_BPF_SYSCALL) += dmabuf_iter.o
> > +endif
> >
> > CFLAGS_REMOVE_percpu_freelist.o = $(CC_FLAGS_FTRACE)
> > CFLAGS_REMOVE_bpf_lru_list.o = $(CC_FLAGS_FTRACE)
> > diff --git a/kernel/bpf/dmabuf_iter.c b/kernel/bpf/dmabuf_iter.c
> > new file mode 100644
> > index 000000000000..b4b8be1d6aa4
> > --- /dev/null
> > +++ b/kernel/bpf/dmabuf_iter.c
>
> Maybe we should add this file to drivers/dma-buf. I would like to
> hear other folks thoughts on this.
This is fine with me, and would save us the extra
CONFIG_DMA_SHARED_BUFFER check that's currently needed in
kernel/bpf/Makefile but would require checking CONFIG_BPF instead.
Sumit / Christian any objections to moving the dmabuf bpf iterator
implementation into drivers/dma-buf?
> > @@ -0,0 +1,130 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright (c) 2025 Google LLC */
> > +#include <linux/bpf.h>
> > +#include <linux/btf_ids.h>
> > +#include <linux/dma-buf.h>
> > +#include <linux/kernel.h>
> > +#include <linux/seq_file.h>
> > +
> > +BTF_ID_LIST_GLOBAL_SINGLE(bpf_dmabuf_btf_id, struct, dma_buf)
>
> Use BTF_ID_LIST_SINGLE(), then we don't need this in btf_ids.h
>
> > +DEFINE_BPF_ITER_FUNC(dmabuf, struct bpf_iter_meta *meta, struct dma_buf *dmabuf)
> > +
> > +static void *dmabuf_iter_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > + struct dma_buf *dmabuf, *ret = NULL;
> > +
> > + if (*pos) {
> > + *pos = 0;
> > + return NULL;
> > + }
> > + /* Look for the first buffer we can obtain a reference to.
> > + * The list mutex does not protect a dmabuf's refcount, so it can be
> > + * zeroed while we are iterating. Therefore we cannot call get_dma_buf()
> > + * since the caller of this program may not already own a reference to
> > + * the buffer.
> > + */
>
> We should use kernel comment style for new code. IOW,
> /*
> * Look for ...
> */
>
>
> Thanks,
> Song
Thanks, I have incorporated all of your comments and retested. I will
give some time for more feedback before sending a v2.
>
> [...]
On Thu, Apr 17, 2025 at 1:26 PM Song Liu <song(a)kernel.org> wrote:
>
> On Thu, Apr 17, 2025 at 9:05 AM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > On Wed, Apr 16, 2025 at 9:56 PM Song Liu <song(a)kernel.org> wrote:
> > >
> > > On Wed, Apr 16, 2025 at 7:09 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> > > >
> > > > On Wed, Apr 16, 2025 at 6:26 PM Song Liu <song(a)kernel.org> wrote:
> > > [...]
> > > > >
> > > > > Here is another rookie question, it appears to me there is a file descriptor
> > > > > associated with each DMA buffer, can we achieve the same goal with
> > > > > a task-file iterator?
> > > >
> > > > That would find almost all of them, but not the kernel-only
> > > > allocations. (kernel_rss in the dmabuf_dump output I attached earlier.
> > > > If there's a leak, it's likely to show up in kernel_rss because some
> > > > driver forgot to release its reference(s).) Also wouldn't that be a
> > > > ton more iterations since we'd have to visit every FD to find the
> > > > small portion that are dmabufs? I'm not actually sure if buffers that
> > > > have been mapped, and then have had their file descriptors closed
> > > > would show up in task_struct->files; if not I think that would mean
> > > > scanning both files and vmas for each task.
> > >
> > > I don't think scanning all FDs to find a small portion of specific FDs
> > > is a real issue. We have a tool that scans all FDs in the system and
> > > only dump data for perf_event FDs. I think it should be easy to
> > > prototype a tool by scanning all files and all vmas. If that turns out
> > > to be very slow, which I highly doubt will be, we can try other
> > > approaches.
> >
> > But this will not find *all* the buffers, and that defeats the purpose
> > of having the iterator.
>
> Do you mean this approach cannot get kernel only allocations? If
> that's the case, we probably do need a separate iterator. I am
> interested in other folks thoughts on this.
Correct.
> > > OTOH, I am wondering whether we can build a more generic iterator
> > > for a list of objects. Adding a iterator for each important kernel lists
> > > seems not scalable in the long term.
> >
> > I think the wide variety of differences in locking for different
> > objects would make this difficult to do in a generic way.
>
> Agreed it is not easy to build a generic solution. But with the
> help from BTF, we can probably build something that covers
> a large number of use cases.
I'm curious what this would look like. I guess a good test would be to
see if anything would work for some subset of the already existing
iterators.
On Wed, Apr 16, 2025 at 9:56 PM Song Liu <song(a)kernel.org> wrote:
>
> On Wed, Apr 16, 2025 at 7:09 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > On Wed, Apr 16, 2025 at 6:26 PM Song Liu <song(a)kernel.org> wrote:
> [...]
> > >
> > > Here is another rookie question, it appears to me there is a file descriptor
> > > associated with each DMA buffer, can we achieve the same goal with
> > > a task-file iterator?
> >
> > That would find almost all of them, but not the kernel-only
> > allocations. (kernel_rss in the dmabuf_dump output I attached earlier.
> > If there's a leak, it's likely to show up in kernel_rss because some
> > driver forgot to release its reference(s).) Also wouldn't that be a
> > ton more iterations since we'd have to visit every FD to find the
> > small portion that are dmabufs? I'm not actually sure if buffers that
> > have been mapped, and then have had their file descriptors closed
> > would show up in task_struct->files; if not I think that would mean
> > scanning both files and vmas for each task.
>
> I don't think scanning all FDs to find a small portion of specific FDs
> is a real issue. We have a tool that scans all FDs in the system and
> only dump data for perf_event FDs. I think it should be easy to
> prototype a tool by scanning all files and all vmas. If that turns out
> to be very slow, which I highly doubt will be, we can try other
> approaches.
But this will not find *all* the buffers, and that defeats the purpose
of having the iterator.
> OTOH, I am wondering whether we can build a more generic iterator
> for a list of objects. Adding a iterator for each important kernel lists
> seems not scalable in the long term.
I think the wide variety of differences in locking for different
objects would make this difficult to do in a generic way.