On Friday, 6 June 2025 at 08:28:23 Central European Summer Time, Tomeu Vizoso wrote:
> This uses the SHMEM DRM helpers, and we map buffers right away on the
> CPU and NPU sides, as all buffers are expected to be accessed from both.
>
> v2:
> - Sync the IOMMUs for the other cores when mapping and unmapping.
>
> v3:
> - Make use of GPL-2.0-only for the copyright notice (Jeff Hugo)
>
> v6:
> - Use mutex guards (Markus Elfring)
>
> v7:
> - Assign its own IOMMU domain to each client, for isolation (Daniel
> Stone and Robin Murphy)
>
> Reviewed-by: Jeffrey Hugo <quic_jhugo(a)quicinc.com>
> Signed-off-by: Tomeu Vizoso <tomeu(a)tomeuvizoso.net>
> ---
> diff --git a/drivers/accel/rocket/rocket_gem.c b/drivers/accel/rocket/rocket_gem.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..61b7f970a6885aa13784daa1222611a02aa10dee
> --- /dev/null
> +++ b/drivers/accel/rocket/rocket_gem.c
> @@ -0,0 +1,115 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright 2024-2025 Tomeu Vizoso <tomeu(a)tomeuvizoso.net> */
> +
> +#include <drm/drm_device.h>
> +#include <drm/drm_utils.h>
> +#include <drm/rocket_accel.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/iommu.h>
> +
> +#include "rocket_device.h"
> +#include "rocket_drv.h"
> +#include "rocket_gem.h"
> +
> +static void rocket_gem_bo_free(struct drm_gem_object *obj)
> +{
> + struct rocket_device *rdev = to_rocket_device(obj->dev);
> + struct rocket_gem_object *bo = to_rocket_bo(obj);
> + size_t unmapped;
> +
> + drm_WARN_ON(obj->dev, bo->base.pages_use_count > 1);
This should probably be
drm_WARN_ON(obj->dev, refcount_read(&bo->base.pages_use_count) > 1);
as pages_use_count is of type refcount_t since
commit 051b6646d36d ("drm/shmem-helper: Use refcount_t for pages_use_count")
Heiko
On Fri, Jun 6, 2025 at 1:29 AM Tomeu Vizoso <tomeu(a)tomeuvizoso.net> wrote:
>
> Using the DRM GPU scheduler infrastructure, with a scheduler for each
> core.
>
> Userspace can request that a series of tasks be executed sequentially
> on the same core, so SRAM locality can be taken advantage of.
>
> The job submission code was initially based on Panfrost.
>
> v2:
> - Remove hardcoded number of cores
> - Misc. style fixes (Jeffrey Hugo)
> - Repack IOCTL struct (Jeffrey Hugo)
>
> v3:
> - Adapt to a split of the register block in the DT bindings (Nicolas
> Frattaroli)
> - Make use of GPL-2.0-only for the copyright notice (Jeff Hugo)
> - Use drm_* logging functions (Thomas Zimmermann)
> - Rename reg i/o macros (Thomas Zimmermann)
> - Add padding to ioctls and check for zero (Jeff Hugo)
> - Improve error handling (Nicolas Frattaroli)
>
> v6:
> - Use mutex guards (Markus Elfring)
> - Use u64_to_user_ptr (Jeff Hugo)
> - Drop rocket_fence (Rob Herring)
>
> v7:
> - Assign its own IOMMU domain to each client, for isolation (Daniel
> Stone and Robin Murphy)
>
> Signed-off-by: Tomeu Vizoso <tomeu(a)tomeuvizoso.net>
> ---
[...]
> --- a/include/uapi/drm/rocket_accel.h
> +++ b/include/uapi/drm/rocket_accel.h
> @@ -12,8 +12,10 @@ extern "C" {
> #endif
>
> #define DRM_ROCKET_CREATE_BO 0x00
> +#define DRM_ROCKET_SUBMIT 0x01
>
> #define DRM_IOCTL_ROCKET_CREATE_BO DRM_IOWR(DRM_COMMAND_BASE + DRM_ROCKET_CREATE_BO, struct drm_rocket_create_bo)
> +#define DRM_IOCTL_ROCKET_SUBMIT DRM_IOW(DRM_COMMAND_BASE + DRM_ROCKET_SUBMIT, struct drm_rocket_submit)
>
> /**
> * struct drm_rocket_create_bo - ioctl argument for creating Rocket BOs.
> @@ -37,6 +39,68 @@ struct drm_rocket_create_bo {
> __u64 offset;
> };
>
> +/**
> + * struct drm_rocket_task - A task to be run on the NPU
> + *
> + * A task is the smallest unit of work that can be run on the NPU.
> + */
> +struct drm_rocket_task {
> + /** Input: DMA address to NPU mapping of register command buffer */
> + __u64 regcmd;
> +
> + /** Input: Number of commands in the register command buffer */
> + __u32 regcmd_count;
> +
> + /** Reserved, must be zero. */
> + __u32 reserved;
> +};
> +
> +/**
> + * struct drm_rocket_job - A job to be run on the NPU
> + *
> + * The kernel will schedule the execution of this job taking into account its
> + * dependencies with other jobs. All tasks in the same job will be executed
> + * sequentially on the same core, to benefit from memory residency in SRAM.
> + */
> +struct drm_rocket_job {
> + /** Input: Pointer to an array of struct drm_rocket_task. */
> + __u64 tasks;
> +
> + /** Input: Pointer to a u32 array of the BOs that are read by the job. */
> + __u64 in_bo_handles;
> +
> + /** Input: Pointer to a u32 array of the BOs that are written to by the job. */
> + __u64 out_bo_handles;
> +
> + /** Input: Number of tasks passed in. */
> + __u32 task_count;
> +
> + /** Input: Number of input BO handles passed in (size is that times 4). */
> + __u32 in_bo_handle_count;
> +
> + /** Input: Number of output BO handles passed in (size is that times 4). */
> + __u32 out_bo_handle_count;
> +
> + /** Reserved, must be zero. */
> + __u32 reserved;
> +};
> +
> +/**
> + * struct drm_rocket_submit - ioctl argument for submitting commands to the NPU.
> + *
> + * The kernel will schedule the execution of these jobs in dependency order.
> + */
> +struct drm_rocket_submit {
> + /** Input: Pointer to an array of struct drm_rocket_job. */
> + __u64 jobs;
> +
> + /** Input: Number of jobs passed in. */
> + __u32 job_count;
Isn't there a problem if you need to expand drm_rocket_job beyond
using the 1 reserved field? You can't add to the struct because then
you don't know the size here. So you have to modify drm_rocket_submit
to modify drm_rocket_job. Maybe better if you plan for that now rather
than later by making the size explicit.
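Something like the below, perhaps (an untested sketch; the field name
is made up, just to illustrate the shape):

	struct drm_rocket_submit {
		/** Input: Pointer to an array of struct drm_rocket_job. */
		__u64 jobs;

		/** Input: Number of jobs passed in. */
		__u32 job_count;

		/** Input: Size in bytes of each struct drm_rocket_job, so
		 *  the struct can grow later without breaking old userspace.
		 */
		__u32 job_struct_size;
	};

The kernel can then copy min(job_struct_size, sizeof(struct
drm_rocket_job)) per element and check that any trailing bytes are
zero, similar to what copy_struct_from_user() does for growing structs.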
Though etnaviv at least has similar issues.
Rob
> +
> + /** Reserved, must be zero. */
> + __u32 reserved;
> +};
On Mon, 16 Jun 2025 11:21:01 +0200, Clément Le Goffic wrote:
> This series aims to improve the STM32 SPI driver in different areas.
> It adds SPI_READY mode, fixes an issue raised by a kernel bot,
> adds the ability to use DMA-MDMA chaining for RX, and deprecates an ST
> bindings vendor property.
>
>
Applied to
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next
Thanks!
[1/6] spi: stm32: Add SPI_READY mode to spi controller
commit: e4feefa5c71912ebfcb97a3dbe2b021fd1cea9d1
[2/6] spi: stm32: Check for cfg availability in stm32_spi_probe
commit: 21f1c800f6620e43f31dfd76709dbac8ebaa5a16
[3/6] dt-bindings: spi: stm32: update bindings with SPI Rx DMA-MDMA chaining
commit: bd60f94a3eb4f80cb66c9687d640554fd0c579d0
[4/6] spi: stm32: use STM32 DMA with STM32 MDMA to enhance DDR use
commit: d17dd2f1d8a1d919e39c6302b024f135a2f90773
[5/6] spi: stm32: deprecate `st,spi-midi-ns` property
commit: 4956bf44524394211ca80aa04d0c9e1e9bb0219d
[6/6] dt-bindings: spi: stm32: deprecate `st,spi-midi-ns` property
commit: 9a944494c299fabf3cc781798eb7c02a0bece364
All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.
You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.
If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.
Please add any relevant lists and maintainers to the CCs when replying
to this mail.
Thanks,
Mark
On 2025-06-06 7:28 am, Tomeu Vizoso wrote:
[...]
> diff --git a/drivers/accel/rocket/rocket_device.h b/drivers/accel/rocket/rocket_device.h
> index 10acfe8534f00a7985d40a93f4b2f7f69d43caee..50e46f0516bd1615b5f826c5002a6c0ecbf9aed4 100644
> --- a/drivers/accel/rocket/rocket_device.h
> +++ b/drivers/accel/rocket/rocket_device.h
> @@ -13,6 +13,8 @@
> struct rocket_device {
> struct drm_device ddev;
>
> + struct mutex sched_lock;
> +
> struct mutex iommu_lock;
Just realised I missed this in the last patch, but iommu_lock appears to
be completely unnecessary now.
> struct rocket_core *cores;
[...]
> +static void rocket_job_hw_submit(struct rocket_core *core, struct rocket_job *job)
> +{
> + struct rocket_task *task;
> + bool task_pp_en = 1;
> + bool task_count = 1;
> +
> + /* GO ! */
> +
> + /* Don't queue the job if a reset is in progress */
> + if (atomic_read(&core->reset.pending))
> + return;
> +
> + task = &job->tasks[job->next_task_idx];
> + job->next_task_idx++;
> +
> + rocket_pc_writel(core, BASE_ADDRESS, 0x1);
> +
> + rocket_cna_writel(core, S_POINTER, 0xe + 0x10000000 * core->index);
> + rocket_core_writel(core, S_POINTER, 0xe + 0x10000000 * core->index);
Those really look like bitfield operations rather than actual arithmetic
to me.
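(0x10000000 is 1 << 28, so presumably the intent is closer to

	rocket_cna_writel(core, S_POINTER,
			  S_POINTER_MAGIC | FIELD_PREP(S_POINTER_CORE_MASK, core->index));

with FIELD_PREP from <linux/bitfield.h>; S_POINTER_MAGIC and
S_POINTER_CORE_MASK are names I just made up to illustrate the shape,
since I don't know what these fields actually mean.)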
> +
> + rocket_pc_writel(core, BASE_ADDRESS, task->regcmd);
I don't see how regcmd is created (I guess that's in userspace?), but
given that it's explicitly u64 all the way through - and especially
since you claim to support 40-bit DMA addresses - it definitely seems
suspicious that the upper 32 bits never seem to be consumed anywhere :/
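If the hardware really does take a 40-bit address here, one would
expect the write to be split, along the lines of

	rocket_pc_writel(core, BASE_ADDRESS, lower_32_bits(task->regcmd));
	rocket_pc_writel(core, BASE_ADDRESS_HIGH, upper_32_bits(task->regcmd));

where BASE_ADDRESS_HIGH stands in for whichever register, if any, takes
the upper bits; a guess on my part, as I don't have the register
documentation.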
> + rocket_pc_writel(core, REGISTER_AMOUNTS, (task->regcmd_count + 1) / 2 - 1);
> +
> + rocket_pc_writel(core, INTERRUPT_MASK, PC_INTERRUPT_MASK_DPU_0 | PC_INTERRUPT_MASK_DPU_1);
> + rocket_pc_writel(core, INTERRUPT_CLEAR, PC_INTERRUPT_CLEAR_DPU_0 | PC_INTERRUPT_CLEAR_DPU_1);
> +
> + rocket_pc_writel(core, TASK_CON, ((0x6 | task_pp_en) << 12) | task_count);
> +
> + rocket_pc_writel(core, TASK_DMA_BASE_ADDR, 0x0);
> +
> + rocket_pc_writel(core, OPERATION_ENABLE, 0x1);
> +
> + dev_dbg(core->dev, "Submitted regcmd at 0x%llx to core %d", task->regcmd, core->index);
> +}
[...]
> +static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
> +{
> + struct rocket_job *job = to_rocket_job(sched_job);
> + struct rocket_device *rdev = job->rdev;
> + struct rocket_core *core = sched_to_core(rdev, sched_job->sched);
> + struct dma_fence *fence = NULL;
> + int ret;
> +
> + if (unlikely(job->base.s_fence->finished.error))
> + return NULL;
> +
> + /*
> + * Nothing to execute: can happen if the job has finished while
> + * we were resetting the GPU.
GPU? (Similarly in various other comments/prints)
> + */
> + if (job->next_task_idx == job->task_count)
> + return NULL;
> +
> + fence = rocket_fence_create(core);
> + if (IS_ERR(fence))
> + return fence;
> +
> + if (job->done_fence)
> + dma_fence_put(job->done_fence);
> + job->done_fence = dma_fence_get(fence);
> +
> + ret = pm_runtime_get_sync(core->dev);
> + if (ret < 0)
> + return fence;
> +
> + ret = iommu_attach_group(job->domain, iommu_group_get(core->dev));
I don't see iommu_group_put() anywhere, so you're leaking refcounts all
over.
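Every iommu_group_get() needs a matching iommu_group_put() once the
group pointer is no longer needed, e.g. (sketch):

	struct iommu_group *group = iommu_group_get(core->dev);

	ret = iommu_attach_group(job->domain, group);
	iommu_group_put(group);
	if (ret < 0)
		return fence;

(and the same goes for the detach path below, which currently leaks a
second reference the same way).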
> + if (ret < 0)
> + return fence;
> +
> + scoped_guard(spinlock, &core->job_lock) {
> + core->in_flight_job = job;
> + rocket_job_hw_submit(core, job);
> + }
> +
> + return fence;
> +}
[...]
> +static void rocket_job_handle_irq(struct rocket_core *core)
> +{
> + u32 status, raw_status;
> +
> + pm_runtime_mark_last_busy(core->dev);
> +
> + status = rocket_pc_readl(core, INTERRUPT_STATUS);
> + raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
> +
> + rocket_pc_writel(core, OPERATION_ENABLE, 0x0);
> + rocket_pc_writel(core, INTERRUPT_CLEAR, 0x1ffff);
What was the point of reading the status registers if we're just going
to blindly clear every possible condition anyway?
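If the intent is to acknowledge only what was actually observed,
something like

	rocket_pc_writel(core, INTERRUPT_CLEAR, raw_status);

would at least make the reads meaningful (just a sketch of the idea).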
> + scoped_guard(spinlock, &core->job_lock)
> + if (core->in_flight_job)
> + rocket_job_handle_done(core, core->in_flight_job);
But then is it really OK to just start the next task regardless of
whether the current task was reporting successful completion or an error?
> +}
> +
> +static void
> +rocket_reset(struct rocket_core *core, struct drm_sched_job *bad)
> +{
> + bool cookie;
> +
> + if (!atomic_read(&core->reset.pending))
> + return;
> +
> + /*
> + * Stop the scheduler.
> + *
> + * FIXME: We temporarily get out of the dma_fence_signalling section
> + * because the cleanup path generates lockdep splats when taking locks
> + * to release job resources. We should rework the code to follow this
> + * pattern:
> + *
> + * try_lock
> + * if (locked)
> + * release
> + * else
> + * schedule_work_to_release_later
> + */
> + drm_sched_stop(&core->sched, bad);
> +
> + cookie = dma_fence_begin_signalling();
> +
> + if (bad)
> + drm_sched_increase_karma(bad);
> +
> + /*
> + * Mask job interrupts and synchronize to make sure we won't be
> + * interrupted during our reset.
> + */
> + rocket_pc_writel(core, INTERRUPT_MASK, 0x0);
> + synchronize_irq(core->irq);
...except it's a shared IRQ, so it can still merrily fire at any time.
> +
> + /* Handle the remaining interrupts before we reset. */
> + rocket_job_handle_irq(core);
> +
> + /*
> + * Remaining interrupts have been handled, but we might still have
> + * stuck jobs. Let's make sure the PM counters stay balanced by
> + * manually calling pm_runtime_put_noidle() and
> + * rocket_devfreq_record_idle() for each stuck job.
> + * Let's also make sure the cycle counting register's refcnt is
> + * kept balanced to prevent it from running forever
Comments that don't match the code are more confusing than helpful :/
> + */
> + scoped_guard(spinlock, &core->job_lock) {
> + if (core->in_flight_job)
> + pm_runtime_put_noidle(core->dev);
> +
> + core->in_flight_job = NULL;
> + }
> +
> + /* Proceed with reset now. */
> + pm_runtime_force_suspend(core->dev);
> + pm_runtime_force_resume(core->dev);
Can you guarantee that actually resets the hardware if something else is
holding the power domain open or RPM is disabled? I'm not familiar with
the details of drm_sched, but if there are other jobs queued behind the
stuck one would it even pass the rocket_job_is_idle() check for suspend
to succeed anyway?
Not to mention that you have an actual reset control in the DT binding,
which isn't even optional... :/
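With that reset line this could be an actual reset, something along the
lines of (a sketch, assuming a core->resets handle obtained at probe
time, e.g. via devm_reset_control_array_get_exclusive()):

	reset_control_assert(core->resets);
	usleep_range(10, 20);
	reset_control_deassert(core->resets);

which wouldn't depend on the runtime PM state at all.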
> + /* GPU has been reset, we can clear the reset pending bit. */
> + atomic_set(&core->reset.pending, 0);
> +
> + /*
> + * Now resubmit jobs that were previously queued but didn't have a
> + * chance to finish.
> + * FIXME: We temporarily get out of the DMA fence signalling section
> + * while resubmitting jobs because the job submission logic will
> + * allocate memory with the GFP_KERNEL flag which can trigger memory
> + * reclaim and exposes a lock ordering issue.
> + */
> + dma_fence_end_signalling(cookie);
> + drm_sched_resubmit_jobs(&core->sched);
Since I happened to look, this says it's deprecated?
> + cookie = dma_fence_begin_signalling();
> +
> + /* Restart the scheduler */
> + drm_sched_start(&core->sched, 0);
> +
> + dma_fence_end_signalling(cookie);
> +}
> +
> +static enum drm_gpu_sched_stat rocket_job_timedout(struct drm_sched_job *sched_job)
> +{
> + struct rocket_job *job = to_rocket_job(sched_job);
> + struct rocket_device *rdev = job->rdev;
> + struct rocket_core *core = sched_to_core(rdev, sched_job->sched);
> +
> + /*
> + * If the GPU managed to complete this job's fence, the timeout is
> + * spurious. Bail out.
> + */
> + if (dma_fence_is_signaled(job->done_fence))
> + return DRM_GPU_SCHED_STAT_NOMINAL;
Do we really need the same return condition twice? What if the IRQ fires
immediately after we've made this check, and is handled without delay
such that synchronize_irq() effectively still does nothing? Either way
we've taken longer than the timeout value to observe the job completing
successfully, and either that's significant and worth warning about or
it's not - I don't see any point in trying to (inaccurately) nitpick
*why* it might have happened.
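I.e. a single check after quiescing the interrupt would cover both
cases (sketch):

	synchronize_irq(core->irq);

	if (dma_fence_is_signaled(job->done_fence)) {
		dev_warn(core->dev, "job completed after timeout\n");
		return DRM_GPU_SCHED_STAT_NOMINAL;
	}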
> + /*
> + * Rocket IRQ handler may take a long time to process an interrupt
> + * if there is another IRQ handler hogging the processing.
> + * For example, the HDMI encoder driver might be stuck in the IRQ
> + * handler for a significant time in a case of bad cable connection.
What have HDMI cables got to do with anything here? Yes, in general IRQ
latency can be high, since CPUs can have IRQs masked and/or be taking
higher-priority interrupts for any number of reasons. I don't see how an
oddly-specific example (of apparently poor driver design, to boot) is
useful.
> + * In order to catch such cases and not report spurious rocket
> + * job timeouts, synchronize the IRQ handler and re-check the fence
> + * status.
> + */
> + synchronize_irq(core->irq);
> +
> + if (dma_fence_is_signaled(job->done_fence)) {
> + dev_warn(core->dev, "unexpectedly high interrupt latency\n");
> + return DRM_GPU_SCHED_STAT_NOMINAL;
> + }
> +
> + dev_err(core->dev, "gpu sched timeout");
> +
> + atomic_set(&core->reset.pending, 1);
> + rocket_reset(core, sched_job);
> + iommu_detach_group(NULL, iommu_group_get(core->dev));
> +
> + return DRM_GPU_SCHED_STAT_NOMINAL;
> +}
> +
> +static void rocket_reset_work(struct work_struct *work)
> +{
> + struct rocket_core *core;
> +
> + core = container_of(work, struct rocket_core, reset.work);
> + rocket_reset(core, NULL);
> +}
> +
> +static const struct drm_sched_backend_ops rocket_sched_ops = {
> + .run_job = rocket_job_run,
> + .timedout_job = rocket_job_timedout,
> + .free_job = rocket_job_free
> +};
> +
> +static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
> +{
> + struct rocket_core *core = data;
> +
> + rocket_job_handle_irq(core);
> +
> + return IRQ_HANDLED;
> +}
> +
> +static irqreturn_t rocket_job_irq_handler(int irq, void *data)
> +{
> + struct rocket_core *core = data;
> + u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
Given that this can be a shared IRQ as above, it would be a good idea to
take care to avoid register accesses while suspended. Especially if
you're trying to utilise suspend to reset a failing job that may well be
throwing IOMMU faults.
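A common way to handle that is to bail out early when the device isn't
powered, e.g. (sketch):

	if (!pm_runtime_get_if_in_use(core->dev))
		return IRQ_NONE;

	raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
	...

with a matching pm_runtime_put() once the handling (including the
threaded part) is done.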
> +
> + WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);
> + WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);
> +
> + if (!(raw_status & PC_INTERRUPT_RAW_STATUS_DPU_0 ||
> + raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1))
> + return IRQ_NONE;
> +
> + rocket_pc_writel(core, INTERRUPT_MASK, 0x0);
> +
> + return IRQ_WAKE_THREAD;
> +}
> +
> +int rocket_job_init(struct rocket_core *core)
> +{
> + struct drm_sched_init_args args = {
> + .ops = &rocket_sched_ops,
> + .num_rqs = DRM_SCHED_PRIORITY_COUNT,
> + .credit_limit = 1,
Ah, does this mean that all the stuff about queued jobs was in fact all
nonsense anyway?
> + .timeout = msecs_to_jiffies(JOB_TIMEOUT_MS),
> + .name = dev_name(core->dev),
> + .dev = core->dev,
> + };
> + int ret;
> +
> + INIT_WORK(&core->reset.work, rocket_reset_work);
> + spin_lock_init(&core->job_lock);
> +
> + core->irq = platform_get_irq(to_platform_device(core->dev), 0);
> + if (core->irq < 0)
> + return core->irq;
> +
> + ret = devm_request_threaded_irq(core->dev, core->irq,
> + rocket_job_irq_handler,
> + rocket_job_irq_handler_thread,
> + IRQF_SHARED, KBUILD_MODNAME "-job",
Is it really a "job" interrupt though? The binding and the register
definitions suggest it's just a general status interrupt for the core.
Furthermore since we expect to have multiple cores, being able to more
easily identify and attribute per-core IRQ activity seems more useful
for debugging than copy-pasting from something really rather different
which also expects to be the only one of its kind on the system.
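E.g. something like (sketch):

	ret = devm_request_threaded_irq(core->dev, core->irq,
					rocket_job_irq_handler,
					rocket_job_irq_handler_thread,
					IRQF_SHARED,
					devm_kasprintf(core->dev, GFP_KERNEL,
						       "rocket-core%d",
						       core->index),
					core);

so that /proc/interrupts tells you which core is which (modulo a NULL
check on the kasprintf result in real code).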
Thanks,
Robin.
> + core);
> + if (ret) {
> + dev_err(core->dev, "failed to request job irq");
> + return ret;
> + }
On Friday, 6 June 2025 at 08:28:20 Central European Summer Time, Tomeu Vizoso wrote:
> This series adds a new driver for the NPU that Rockchip includes in its
> newer SoCs, developed by them on the NVDLA base.
>
> In its current form, it supports the specific NPU in the RK3588 SoC.
>
> The userspace driver is part of Mesa and an initial draft can be found at:
>
> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29698
>
> Signed-off-by: Tomeu Vizoso <tomeu(a)tomeuvizoso.net>
> ---
> Nicolas Frattaroli (2):
> arm64: dts: rockchip: add pd_npu label for RK3588 power domains
> arm64: dts: rockchip: enable NPU on ROCK 5B
>
> Tomeu Vizoso (8):
> accel/rocket: Add registers header
> accel/rocket: Add a new driver for Rockchip's NPU
> accel/rocket: Add IOCTL for BO creation
> accel/rocket: Add job submission IOCTL
> accel/rocket: Add IOCTLs for synchronizing memory accesses
> dt-bindings: npu: rockchip,rknn: Add bindings
> arm64: dts: rockchip: Add nodes for NPU and its MMU to rk3588-base
> arm64: dts: rockchip: Enable the NPU on quartzpro64
From a handling point of view, I would expect patches 1-6
(driver code + dt-binding patch) to go through some driver tree,
but I have no clue which one that is.
Afterwards, I would pick up the arm64 devicetree addition
patches 7-10.
Heiko
On Friday, 6 June 2025 at 08:28:20 Central European Summer Time, Tomeu Vizoso wrote:
> This series adds a new driver for the NPU that Rockchip includes in its
> newer SoCs, developed by them on the NVDLA base.
>
> In its current form, it supports the specific NPU in the RK3588 SoC.
>
> The userspace driver is part of Mesa and an initial draft can be found at:
>
> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29698
>
> Signed-off-by: Tomeu Vizoso <tomeu(a)tomeuvizoso.net>
> ---
> Changes in v7:
> - Actually enable process isolation by allocating its own IOMMU domain
> to each DRM client.
> - Link to v6: https://lore.kernel.org/r/20250604-6-10-rocket-v6-0-237ac75ddb5e@tomeuvizos…
I was able to successfully run the SSDLite MobileDet model on a
rk3588-tiger board, detecting elements correctly on that "Sounds of
New York" YouTube video all the demos seem to be using ;-)
The NPU needed around 30 ms per frame and detected the expected
things, so
Tested-by: Heiko Stuebner <heiko(a)sntech.de>
We've discussed a number of times how some heap names are bad, but
not really what makes a good heap name.
Let's document what we expect the heap names to look like.
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
Documentation/userspace-api/dma-buf-heaps.rst | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/Documentation/userspace-api/dma-buf-heaps.rst b/Documentation/userspace-api/dma-buf-heaps.rst
index 535f49047ce6450796bf4380c989e109355efc05..b24618e360a9a9ba0bd85135d8c1760776f1a37f 100644
--- a/Documentation/userspace-api/dma-buf-heaps.rst
+++ b/Documentation/userspace-api/dma-buf-heaps.rst
@@ -21,5 +21,24 @@ following heaps:
usually created either through the kernel commandline through the
`cma` parameter, a memory region Device-Tree node with the
`linux,cma-default` property set, or through the `CMA_SIZE_MBYTES` or
`CMA_SIZE_PERCENTAGE` Kconfig options. Depending on the platform, it
might be called ``reserved``, ``linux,cma``, or ``default-pool``.
+
+Naming Convention
+=================
+
+A good heap name is a name that:
+
+- Is stable, and won't change from one version to the other;
+
+- Describes the memory region the heap will allocate from, and will
+ uniquely identify it in a given platform;
+
+- Doesn't use implementation details, such as the allocator;
+
+- Can describe intended usage.
+
+For example, consider a platform with a reserved memory region located
+at the RAM address 0x42000000, intended for allocating video framebuffers,
+and backed by the CMA kernel allocator. Good names would be
+`memory@42000000` or `video@42000000`, but `cma-video` wouldn't be.
---
base-commit: 92a09c47464d040866cf2b4cd052bc60555185fb
change-id: 20250520-dma-buf-heap-names-doc-31261aa0cfe6
Best regards,
--
Maxime Ripard <mripard(a)kernel.org>
On Fri, Jun 13, 2025 at 09:33:55AM +0000, wangtao wrote:
>
> >
> > On Mon, Jun 09, 2025 at 09:32:20AM +0000, wangtao wrote:
> > > Are you suggesting adding an ITER_DMABUF type to iov_iter,
> >
> > Yes.
>
> May I clarify: Do all disk operations require data to pass through
> memory (reading into memory or writing from memory)? In the block layer,
> the bio structure uses bio_iov_iter_get_pages to convert iter_type
> objects into memory-backed bio_vec representations.
> However, some dmabufs are not memory-based, making page-to-bio_vec
> conversion impossible. This suggests adding a callback function in
> dma_buf_ops to handle dmabuf-to-bio_vec conversion.
bios do support PCI P2P transfers. This could be fairly easily extended
to other peer to peer transfers if we manage to come up with a coherent
model for them. No need for a callback.
On Fri, Jun 13, 2025 at 09:43:08AM +0000, wangtao wrote:
> Here's my analysis on Linux 6.6 with F2FS/iomap.
Linux 6.6 is almost two years old and completely irrelevant. Please
provide numbers on 6.16 or current Linus' tree.
On 6/10/25 18:42, Tvrtko Ursulin wrote:
> Dma-fence objects currently suffer from a potential use after free problem
> where fences exported to userspace and other drivers can outlive the
> exporting driver, or the associated data structures.
>
> The discussion on how to address this concluded that adding reference
> counting to all the involved objects is not desirable, since it would need
> to be very wide reaching and could cause unloadable drivers if another
> entity would be holding onto a signaled fence reference potentially
> indefinitely.
>
> This patch enables the safe access by introducing and documenting a
> contract between fence exporters and users. It documents a set of
> constraints and adds helpers which a) drivers with potential to suffer from
> the use after free must use and b) users of the dma-fence API must use as
> well.
>
> Premise of the design has multiple sides:
>
> 1. Drivers (fence exporters) MUST ensure a RCU grace period between
> signalling a fence and freeing the driver private data associated with it.
>
> The grace period does not have to follow the signalling immediately but
> HAS to happen before data is freed.
>
> 2. Users of the dma-fence API marked with such requirement MUST contain
> the complete access to the data within a single code block guarded by
> rcu_read_lock() and rcu_read_unlock().
>
> The combination of the two ensures that whoever sees the
> DMA_FENCE_FLAG_SIGNALED_BIT not set is guaranteed to have access to a
> valid fence->lock and valid data potentially accessed by the fence->ops
> virtual functions, until the call to rcu_read_unlock().
>
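(So a user following this contract looks roughly like:

	rcu_read_lock();
	if (!dma_fence_is_signaled(fence)) {
		const char __rcu *name = dma_fence_driver_name(fence);

		pr_info("fence from %s\n", rcu_dereference(name));
	}
	rcu_read_unlock();

a sketch of the documented pattern, mirroring what dma_fence_describe()
in the patch below does.)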
> 3. Module unload (fence->ops) disappearing is for now explicitly not
> handled. That would require a more complex protection, possibly needing
> SRCU instead of RCU to handle callers such as dma_fence_release() and
> dma_fence_wait_timeout(), where race between
> dma_fence_enable_sw_signaling, signalling, and dereference of
> fence->ops->wait() would need a sleeping SRCU context.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
Reviewed-by: Christian König <christian.koenig(a)amd.com>
> ---
> drivers/dma-buf/dma-fence.c | 111 ++++++++++++++++++++++++++++---
> include/linux/dma-fence.h | 31 ++++++---
> include/trace/events/dma_fence.h | 38 +++++++++--
> 3 files changed, 157 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> index 74f9e4b665e3..3f78c56b58dc 100644
> --- a/drivers/dma-buf/dma-fence.c
> +++ b/drivers/dma-buf/dma-fence.c
> @@ -511,12 +511,20 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
>
> dma_fence_enable_sw_signaling(fence);
>
> - trace_dma_fence_wait_start(fence);
> + if (trace_dma_fence_wait_start_enabled()) {
> + rcu_read_lock();
> + trace_dma_fence_wait_start(fence);
> + rcu_read_unlock();
> + }
> if (fence->ops->wait)
> ret = fence->ops->wait(fence, intr, timeout);
> else
> ret = dma_fence_default_wait(fence, intr, timeout);
> - trace_dma_fence_wait_end(fence);
> + if (trace_dma_fence_wait_end_enabled()) {
> + rcu_read_lock();
> + trace_dma_fence_wait_end(fence);
> + rcu_read_unlock();
> + }
> return ret;
> }
> EXPORT_SYMBOL(dma_fence_wait_timeout);
> @@ -533,16 +541,23 @@ void dma_fence_release(struct kref *kref)
> struct dma_fence *fence =
> container_of(kref, struct dma_fence, refcount);
>
> + rcu_read_lock();
> trace_dma_fence_destroy(fence);
>
> - if (WARN(!list_empty(&fence->cb_list) &&
> - !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags),
> - "Fence %s:%s:%llx:%llx released with pending signals!\n",
> - dma_fence_driver_name(fence),
> - dma_fence_timeline_name(fence),
> - fence->context, fence->seqno)) {
> + if (!list_empty(&fence->cb_list) &&
> + !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
> + const char __rcu *timeline;
> + const char __rcu *driver;
> unsigned long flags;
>
> + driver = dma_fence_driver_name(fence);
> + timeline = dma_fence_timeline_name(fence);
> +
> + WARN(1,
> + "Fence %s:%s:%llx:%llx released with pending signals!\n",
> + rcu_dereference(driver), rcu_dereference(timeline),
> + fence->context, fence->seqno);
> +
> /*
> * Failed to signal before release, likely a refcounting issue.
> *
> @@ -556,6 +571,8 @@ void dma_fence_release(struct kref *kref)
> spin_unlock_irqrestore(fence->lock, flags);
> }
>
> + rcu_read_unlock();
> +
> if (fence->ops->release)
> fence->ops->release(fence);
> else
> @@ -982,11 +999,21 @@ EXPORT_SYMBOL(dma_fence_set_deadline);
> */
> void dma_fence_describe(struct dma_fence *fence, struct seq_file *seq)
> {
> + const char __rcu *timeline;
> + const char __rcu *driver;
> +
> + rcu_read_lock();
> +
> + timeline = dma_fence_timeline_name(fence);
> + driver = dma_fence_driver_name(fence);
> +
> seq_printf(seq, "%s %s seq %llu %ssignalled\n",
> - dma_fence_driver_name(fence),
> - dma_fence_timeline_name(fence),
> + rcu_dereference(driver),
> + rcu_dereference(timeline),
> fence->seqno,
> dma_fence_is_signaled(fence) ? "" : "un");
> +
> + rcu_read_unlock();
> }
> EXPORT_SYMBOL(dma_fence_describe);
>
> @@ -1055,3 +1082,67 @@ dma_fence_init64(struct dma_fence *fence, const struct dma_fence_ops *ops,
> BIT(DMA_FENCE_FLAG_SEQNO64_BIT));
> }
> EXPORT_SYMBOL(dma_fence_init64);
> +
> +/**
> + * dma_fence_driver_name - Access the driver name
> + * @fence: the fence to query
> + *
> + * Returns a driver name backing the dma-fence implementation.
> + *
> + * IMPORTANT CONSIDERATION:
> + * Dma-fence contract stipulates that access to driver provided data (data not
> + * directly embedded into the object itself), such as the &dma_fence.lock and
> + * memory potentially accessed by the &dma_fence.ops functions, is forbidden
> + * after the fence has been signalled. Drivers are allowed to free that data,
> + * and some do.
> + *
> + * To allow safe access drivers are mandated to guarantee a RCU grace period
> + * between signalling the fence and freeing said data.
> + *
> + * As such access to the driver name is only valid inside a RCU locked section.
> + * The pointer MUST be both queried and USED ONLY WITHIN a SINGLE block guarded
> + * by the &rcu_read_lock and &rcu_read_unlock pair.
> + */
> +const char __rcu *dma_fence_driver_name(struct dma_fence *fence)
> +{
> + RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> + "RCU protection is required for safe access to returned string");
> +
> + if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> + return fence->ops->get_driver_name(fence);
> + else
> + return "detached-driver";
> +}
> +EXPORT_SYMBOL(dma_fence_driver_name);
> +
> +/**
> + * dma_fence_timeline_name - Access the timeline name
> + * @fence: the fence to query
> + *
> + * Returns a timeline name provided by the dma-fence implementation.
> + *
> + * IMPORTANT CONSIDERATION:
> + * Dma-fence contract stipulates that access to driver provided data (data not
> + * directly embedded into the object itself), such as the &dma_fence.lock and
> + * memory potentially accessed by the &dma_fence.ops functions, is forbidden
> + * after the fence has been signalled. Drivers are allowed to free that data,
> + * and some do.
> + *
> + * To allow safe access drivers are mandated to guarantee a RCU grace period
> + * between signalling the fence and freeing said data.
> + *
> + * As such access to the timeline name is only valid inside a RCU locked section.
> + * The pointer MUST be both queried and USED ONLY WITHIN a SINGLE block guarded
> + * by the &rcu_read_lock and &rcu_read_unlock pair.
> + */
> +const char __rcu *dma_fence_timeline_name(struct dma_fence *fence)
> +{
> + RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> + "RCU protection is required for safe access to returned string");
> +
> + if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> + return fence->ops->get_timeline_name(fence);
> + else
> + return "signaled-timeline";
> +}
> +EXPORT_SYMBOL(dma_fence_timeline_name);
> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
> index 10a849cb2d3f..64639e104110 100644
> --- a/include/linux/dma-fence.h
> +++ b/include/linux/dma-fence.h
> @@ -378,15 +378,28 @@ bool dma_fence_remove_callback(struct dma_fence *fence,
> struct dma_fence_cb *cb);
> void dma_fence_enable_sw_signaling(struct dma_fence *fence);
>
> -static inline const char *dma_fence_driver_name(struct dma_fence *fence)
> -{
> - return fence->ops->get_driver_name(fence);
> -}
> -
> -static inline const char *dma_fence_timeline_name(struct dma_fence *fence)
> -{
> - return fence->ops->get_timeline_name(fence);
> -}
> +/**
> + * DOC: Safe external access to driver provided object members
> + *
> + * All data not stored directly in the dma-fence object, such as the
> + * &dma_fence.lock and memory potentially accessed by functions in the
> + * &dma_fence.ops table, MUST NOT be accessed after the fence has been signalled
> + * because after that point drivers are allowed to free it.
> + *
> + * All code accessing that data via the dma-fence API (or directly, which is
> + * discouraged), MUST make sure to contain the complete access within a
> + * &rcu_read_lock and &rcu_read_unlock pair.
> + *
> + * Some dma-fence API handles this automatically, while others, as for example
> + * &dma_fence_driver_name and &dma_fence_timeline_name, leave that
> + * responsibility to the caller.
> + *
> + * To enable this scheme to work drivers MUST ensure a RCU grace period elapses
> + * between signalling the fence and freeing the said data.
> + *
> + */
> +const char __rcu *dma_fence_driver_name(struct dma_fence *fence);
> +const char __rcu *dma_fence_timeline_name(struct dma_fence *fence);
>
> /**
> * dma_fence_is_signaled_locked - Return an indication if the fence
> diff --git a/include/trace/events/dma_fence.h b/include/trace/events/dma_fence.h
> index 84c83074ee81..4814a65b68dc 100644
> --- a/include/trace/events/dma_fence.h
> +++ b/include/trace/events/dma_fence.h
> @@ -34,14 +34,44 @@ DECLARE_EVENT_CLASS(dma_fence,
> __entry->seqno)
> );
>
> -DEFINE_EVENT(dma_fence, dma_fence_emit,
> +/*
> + * Safe only for call sites which are guaranteed to not race with fence
> + * signaling, holding the fence->lock and having checked for not signaled, or the
> + * signaling path itself.
> + */
> +DECLARE_EVENT_CLASS(dma_fence_unsignaled,
> +
> + TP_PROTO(struct dma_fence *fence),
> +
> + TP_ARGS(fence),
> +
> + TP_STRUCT__entry(
> + __string(driver, fence->ops->get_driver_name(fence))
> + __string(timeline, fence->ops->get_timeline_name(fence))
> + __field(unsigned int, context)
> + __field(unsigned int, seqno)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(driver);
> + __assign_str(timeline);
> + __entry->context = fence->context;
> + __entry->seqno = fence->seqno;
> + ),
> +
> + TP_printk("driver=%s timeline=%s context=%u seqno=%u",
> + __get_str(driver), __get_str(timeline), __entry->context,
> + __entry->seqno)
> +);
> +
> +DEFINE_EVENT(dma_fence_unsignaled, dma_fence_emit,
>
> TP_PROTO(struct dma_fence *fence),
>
> TP_ARGS(fence)
> );
>
> -DEFINE_EVENT(dma_fence, dma_fence_init,
> +DEFINE_EVENT(dma_fence_unsignaled, dma_fence_init,
>
> TP_PROTO(struct dma_fence *fence),
>
> @@ -55,14 +85,14 @@ DEFINE_EVENT(dma_fence, dma_fence_destroy,
> TP_ARGS(fence)
> );
>
> -DEFINE_EVENT(dma_fence, dma_fence_enable_signal,
> +DEFINE_EVENT(dma_fence_unsignaled, dma_fence_enable_signal,
>
> TP_PROTO(struct dma_fence *fence),
>
> TP_ARGS(fence)
> );
>
> -DEFINE_EVENT(dma_fence, dma_fence_signaled,
> +DEFINE_EVENT(dma_fence_unsignaled, dma_fence_signaled,
>
> TP_PROTO(struct dma_fence *fence),
>
Dear All,
This patchset fixes the incorrect use of dma_sync_sg_*() calls in
media and related drivers. They are replaced with much safer
dma_sync_sgtable_*() variants, which take care of passing the proper
number of elements for the sync operation.
Best regards
Marek Szyprowski, PhD
Samsung R&D Institute Poland
Change log:
v3: added cc: stable to tags
v2: fixed typos and added cc: stable
Patch summary:
Marek Szyprowski (3):
media: videobuf2: use sgtable-based scatterlist wrappers
udmabuf: use sgtable-based scatterlist wrappers
media: omap3isp: use sgtable-based scatterlist wrappers
drivers/dma-buf/udmabuf.c | 5 ++---
drivers/media/common/videobuf2/videobuf2-dma-sg.c | 4 ++--
drivers/media/platform/ti/omap3isp/ispccdc.c | 8 ++++----
drivers/media/platform/ti/omap3isp/ispstat.c | 6 ++----
4 files changed, 10 insertions(+), 13 deletions(-)
--
2.34.1