On 1/14/26 12:46 AM, Tomeu Vizoso wrote:
> Using the DRM GPU scheduler infrastructure, with a scheduler for each
> core.
>
> Contexts are created in all cores, and buffers mapped to all of them as
> well, so all cores are ready to execute any job.
>
> The job submission code was initially based on Panfrost.
>
> v2:
> - Add thames_accel.h UAPI header (Robert Nelson).
>
> Signed-off-by: Tomeu Vizoso <tomeu(a)tomeuvizoso.net>
> ---
> drivers/accel/thames/Makefile | 1 +
> drivers/accel/thames/thames_core.c | 6 +
> drivers/accel/thames/thames_drv.c | 19 ++
> drivers/accel/thames/thames_job.c | 463 ++++++++++++++++++++++++++++++++++++
> drivers/accel/thames/thames_job.h | 51 ++++
> drivers/accel/thames/thames_rpmsg.c | 52 ++++
> include/uapi/drm/thames_accel.h | 54 +++++
> 7 files changed, 646 insertions(+)
>
> diff --git a/include/uapi/drm/thames_accel.h b/include/uapi/drm/thames_accel.h
> index 0a5a5e5f6637ab474e9effbb6db29c1dd95e56b5..5b35e50826ed95bfcc3709bef33416d2b6d11c70 100644
> --- a/include/uapi/drm/thames_accel.h
> +++ b/include/uapi/drm/thames_accel.h
> @@ -75,6 +78,55 @@ struct drm_thames_bo_mmap_offset {
> __u64 offset;
> };
>
> +/**
> + * struct drm_thames_job - A job to be run on the NPU
> + *
> + * The kernel will schedule the execution of this job taking into account its
> + * dependencies with other jobs. All tasks in the same job will be executed
> + * sequentially on the same core, to benefit from memory residency in SRAM.
> + */
Please make these comments full-fledged kernel-doc comments.
E.g.:
> +struct drm_thames_job {
> + /** Input: BO handle for kernel. */
/** @kernel: input: BO handle for kernel. */
> + __u32 kernel;
> +
> + /** Input: Size in bytes of the compiled kernel. */
> + __u32 kernel_size;
> +
> + /** Input: BO handle for params BO. */
> + __u32 params;
> +
> + /** Input: Size in bytes of the params BO. */
> + __u32 params_size;
> +
> + /** Input: Pointer to a u32 array of the BOs that are read by the job. */
> + __u64 in_bo_handles;
> +
> + /** Input: Pointer to a u32 array of the BOs that are written to by the job. */
> + __u64 out_bo_handles;
> +
> + /** Input: Number of input BO handles passed in (size is that times 4). */
> + __u32 in_bo_handle_count;
> +
> + /** Input: Number of output BO handles passed in (size is that times 4). */
> + __u32 out_bo_handle_count;
> +};
> +
> +/**
> + * struct drm_thames_submit - ioctl argument for submitting commands to the NPU.
> + *
> + * The kernel will schedule the execution of these jobs in dependency order.
> + */
Same here.
> +struct drm_thames_submit {
> + /** Input: Pointer to an array of struct drm_thames_job. */
> + __u64 jobs;
> +
> + /** Input: Number of jobs passed in. */
> + __u32 job_count;
> +
> + /** Reserved, must be zero. */
> + __u32 pad;
> +};
> +
--
~Randy
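For reference, here is a rough sketch of how both structs might look once every member uses the inline kernel-doc form shown above. The member descriptions are copied from the patch. The two typedefs are stand-ins (an assumption) so the fragment builds outside the kernel tree, and the size checks are an addition of this sketch, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in typedefs (assumption) so the sketch builds outside the kernel. */
typedef uint32_t __u32;
typedef uint64_t __u64;

/**
 * struct drm_thames_job - A job to be run on the NPU
 *
 * The kernel will schedule the execution of this job taking into account its
 * dependencies with other jobs. All tasks in the same job will be executed
 * sequentially on the same core, to benefit from memory residency in SRAM.
 */
struct drm_thames_job {
	/** @kernel: Input: BO handle for kernel. */
	__u32 kernel;

	/** @kernel_size: Input: Size in bytes of the compiled kernel. */
	__u32 kernel_size;

	/** @params: Input: BO handle for params BO. */
	__u32 params;

	/** @params_size: Input: Size in bytes of the params BO. */
	__u32 params_size;

	/** @in_bo_handles: Input: Pointer to a u32 array of the BOs that are read by the job. */
	__u64 in_bo_handles;

	/** @out_bo_handles: Input: Pointer to a u32 array of the BOs that are written to by the job. */
	__u64 out_bo_handles;

	/** @in_bo_handle_count: Input: Number of input BO handles passed in (size is that times 4). */
	__u32 in_bo_handle_count;

	/** @out_bo_handle_count: Input: Number of output BO handles passed in (size is that times 4). */
	__u32 out_bo_handle_count;
};

/**
 * struct drm_thames_submit - ioctl argument for submitting commands to the NPU.
 *
 * The kernel will schedule the execution of these jobs in dependency order.
 */
struct drm_thames_submit {
	/** @jobs: Input: Pointer to an array of struct drm_thames_job. */
	__u64 jobs;

	/** @job_count: Input: Number of jobs passed in. */
	__u32 job_count;

	/** @pad: Reserved, must be zero. */
	__u32 pad;
};

/* Extra check added by this sketch: the layouts carry no implicit padding. */
static_assert(sizeof(struct drm_thames_job) == 40, "unexpected padding");
static_assert(sizeof(struct drm_thames_submit) == 16, "unexpected padding");
```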
On 1/14/26 2:46 AM, Tomeu Vizoso wrote:
> This memory region is used by the DRM/accel driver to allocate addresses
> for buffers that are used for communication with the DSP cores and for
> their intermediate results.
>
> Signed-off-by: Tomeu Vizoso <tomeu(a)tomeuvizoso.net>
> ---
> arch/arm64/boot/dts/ti/k3-j722s-ti-ipc-firmware.dtsi | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/boot/dts/ti/k3-j722s-ti-ipc-firmware.dtsi b/arch/arm64/boot/dts/ti/k3-j722s-ti-ipc-firmware.dtsi
> index 3fbff927c4c08bce741555aa2753a394b751144f..b80d2a5a157ad59eaed8e57b22f1f4bce4765a85 100644
> --- a/arch/arm64/boot/dts/ti/k3-j722s-ti-ipc-firmware.dtsi
> +++ b/arch/arm64/boot/dts/ti/k3-j722s-ti-ipc-firmware.dtsi
> @@ -42,6 +42,11 @@ c7x_0_memory_region: memory@a3100000 {
> no-map;
> };
>
> + c7x_iova_pool: iommu-pool@a7000000 {
> + reg = <0x00 0xa7000000 0x00 0x18200000>;
> + no-map;
Could you expand on why this carveout is needed? The C7 NPU has a full
MMU and should be able to work with any buffer Linux allocates from any
address, even non-contiguous buffers too.
Communication should already happen over the existing RPMSG channels
without needing extra buffers. And space for intermediate results
should be provided dynamically by the drivers (I believe that would
match how GPUs without dedicated memory get intermediate buffer space
from system memory these days, but do correct me if I'm wrong about
that one).
Andrew
> + };
> +
> c7x_1_dma_memory_region: memory@a4000000 {
> compatible = "shared-dma-pool";
> reg = <0x00 0xa4000000 0x00 0x100000>;
> @@ -151,13 +156,15 @@ &main_r5fss0_core0 {
> &c7x_0 {
> mboxes = <&mailbox0_cluster2 &mbox_c7x_0>;
> memory-region = <&c7x_0_dma_memory_region>,
> - <&c7x_0_memory_region>;
> + <&c7x_0_memory_region>,
> + <&c7x_iova_pool>;
> status = "okay";
> };
>
> &c7x_1 {
> mboxes = <&mailbox0_cluster3 &mbox_c7x_1>;
> memory-region = <&c7x_1_dma_memory_region>,
> - <&c7x_1_memory_region>;
> + <&c7x_1_memory_region>,
> + <&c7x_iova_pool>;
> status = "okay";
> };
>
On Tue, 13 Jan 2026, Tomeu Vizoso <tomeu(a)tomeuvizoso.net> wrote:
> +#include "linux/dev_printk.h"
Random drive-by comment, please use <> instead of "" for include/
headers.
> +#include <drm/drm_file.h>
> +#include <drm/drm_gem.h>
> +#include <drm/drm_print.h>
> +#include <drm/thames_accel.h>
> +#include <linux/platform_device.h>
In general, I think it will make everyone's life easier in the long run
if the include directives are grouped and sorted.
BR,
Jani.
--
Jani Nikula, Intel
On 1/14/26 10:53, Tvrtko Ursulin wrote:
>
> On 13/01/2026 15:16, Christian König wrote:
>> Some driver use fence->ops to test if a fence was initialized or not.
>> The problem is that this utilizes internal behavior of the dma_fence
>> implementation.
>>
>> So better abstract that into a function.
>>
>> Signed-off-by: Christian König <christian.koenig(a)amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 13 +++++++------
>> drivers/gpu/drm/qxl/qxl_release.c | 2 +-
>> include/linux/dma-fence.h | 12 ++++++++++++
>> 3 files changed, 20 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 0a0dcbf0798d..b97f90bbe8b9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -278,9 +278,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>> unsigned i;
>> /* Check if any fences were initialized */
>> - if (job->base.s_fence && job->base.s_fence->finished.ops)
>> + if (job->base.s_fence &&
>> + dma_fence_is_initialized(&job->base.s_fence->finished))
>> f = &job->base.s_fence->finished;
>> - else if (job->hw_fence && job->hw_fence->base.ops)
>> + else if (dma_fence_is_initialized(&job->hw_fence->base))
>> f = &job->hw_fence->base;
>> else
>> f = NULL;
>> @@ -297,11 +298,11 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
>> amdgpu_sync_free(&job->explicit_sync);
>> - if (job->hw_fence->base.ops)
>> + if (dma_fence_is_initialized(&job->hw_fence->base))
>> dma_fence_put(&job->hw_fence->base);
>> else
>> kfree(job->hw_fence);
>> - if (job->hw_vm_fence->base.ops)
>> + if (dma_fence_is_initialized(&job->hw_vm_fence->base))
>> dma_fence_put(&job->hw_vm_fence->base);
>> else
>> kfree(job->hw_vm_fence);
>> @@ -335,11 +336,11 @@ void amdgpu_job_free(struct amdgpu_job *job)
>> if (job->gang_submit != &job->base.s_fence->scheduled)
>> dma_fence_put(job->gang_submit);
>> - if (job->hw_fence->base.ops)
>> + if (dma_fence_is_initialized(&job->hw_fence->base))
>> dma_fence_put(&job->hw_fence->base);
>> else
>> kfree(job->hw_fence);
>> - if (job->hw_vm_fence->base.ops)
>> + if (dma_fence_is_initialized(&job->hw_vm_fence->base))
>> dma_fence_put(&job->hw_vm_fence->base);
>> else
>> kfree(job->hw_vm_fence);
>> diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
>> index 7b3c9a6016db..b38ae0b25f3c 100644
>> --- a/drivers/gpu/drm/qxl/qxl_release.c
>> +++ b/drivers/gpu/drm/qxl/qxl_release.c
>> @@ -146,7 +146,7 @@ qxl_release_free(struct qxl_device *qdev,
>> idr_remove(&qdev->release_idr, release->id);
>> spin_unlock(&qdev->release_idr_lock);
>> - if (release->base.ops) {
>> + if (dma_fence_is_initialized(&release->base)) {
>> WARN_ON(list_empty(&release->bos));
>> qxl_release_free_list(release);
>> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
>> index eea674acdfa6..371aa8ecf18e 100644
>> --- a/include/linux/dma-fence.h
>> +++ b/include/linux/dma-fence.h
>> @@ -274,6 +274,18 @@ void dma_fence_release(struct kref *kref);
>> void dma_fence_free(struct dma_fence *fence);
>> void dma_fence_describe(struct dma_fence *fence, struct seq_file *seq);
>> +/**
>> + * dma_fence_is_initialized - test if fence was initialized
>> + * @fence: fence to test
>> + *
>> + * Return: True if fence was initialized, false otherwise. Works correctly only
>> + * when memory backing the fence structure is zero initialized on allocation.
>> + */
>> +static inline bool dma_fence_is_initialized(struct dma_fence *fence)
>> +{
>> + return fence && !!fence->ops;
>
> This patch should precede the one adding RCU protection to fence->ops. And that one then needs to add a rcu_dereference() here.
Good point.
> At which point however it would start exploding?
When we start setting the ops pointer to NULL in the next patch.
> Which also means the new API is racy by definition and can give false positives if fence would be to be signaled as someone is checking.
Oh, that is a really really good point. I haven't thought about that because all current users would check the fence only after it is signaled.
> Hmm.. is the new API too weak, being able to only be called under very limited circumstances?
Yes, exactly that. All callers use this only to decide on the correct cleanup path.
So the fence is either fully signaled or was never initialized in the first place.
> Would it be better to solve it in the drivers by tracking state?
The alternative I had in mind was to use another DMA_FENCE_FLAG_... for that.
I will probably use that approach instead, just to make it extra defensive.
Thanks,
Christian.
>
> Regards,
>
> Tvrtko
>
>> +}
>> +
>> /**
>> * dma_fence_put - decreases refcount of the fence
>> * @fence: fence to reduce refcount of
>
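To illustrate the flag-based alternative mentioned above, here is a minimal single-threaded userspace model. All names (mock_fence, MOCK_FENCE_FLAG_INITIALIZED and so on) are hypothetical and not existing kernel API. It only shows why a dedicated flag keeps answering "was this fence ever initialized?" after the ops pointer has been cleared on signaling, while a test on the ops pointer does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are real kernel API. */
struct mock_fence_ops {
	const char *name;
};

enum {
	MOCK_FENCE_FLAG_INITIALIZED = 1u << 0,
	MOCK_FENCE_FLAG_SIGNALED    = 1u << 1,
};

struct mock_fence {
	const struct mock_fence_ops *ops;
	unsigned int flags;
};

/* Assumes the backing memory was zeroed on allocation, as in the patch. */
static void mock_fence_init(struct mock_fence *f,
			    const struct mock_fence_ops *ops)
{
	assert(f && ops);
	f->ops = ops;
	f->flags = MOCK_FENCE_FLAG_INITIALIZED;
}

/* Models the follow-up patch: ops is dropped once the fence has signaled. */
static void mock_fence_signal(struct mock_fence *f)
{
	f->flags |= MOCK_FENCE_FLAG_SIGNALED;
	f->ops = NULL;
}

/* Flag-based test: stays true after signaling. */
static bool mock_fence_is_initialized(const struct mock_fence *f)
{
	return f && (f->flags & MOCK_FENCE_FLAG_INITIALIZED);
}

/* ops-based test from the patch: turns false once ops is cleared. */
static bool mock_fence_has_ops(const struct mock_fence *f)
{
	return f && f->ops;
}
```

The model deliberately ignores concurrency; a real implementation would need atomic flag accessors.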
On Wed, Oct 29, 2025 at 07:07:42PM +0100, Neil Armstrong wrote:
> The I2C Hub controller is a simpler GENI I2C variant that doesn't
> support DMA at all, add a no_dma flag to make sure it nevers selects
> the SE DMA mode with mappable 32bytes long transfers.
>
> Fixes: cacd9643eca7 ("i2c: qcom-geni: add support for I2C Master Hub variant")
> Signed-off-by: Neil Armstrong <neil.armstrong(a)linaro.org>
> Reviewed-by: Konrad Dybcio <konrad.dybcio(a)oss.qualcomm.com>
> Reviewed-by: Mukesh Kumar Savaliya <mukesh.savaliya(a)oss.qualcomm.com>
Applied to for-current, thanks!
On 1/13/26 17:12, Philipp Stanner wrote:
> On Tue, 2026-01-13 at 16:16 +0100, Christian König wrote:
>> Using the inline lock is now the recommended way for dma_fence implementations.
>>
>> For the scheduler fence use the inline lock for the scheduled fence part
>> and then the lock from the scheduled fence as external lock for the finished fence.
>>
>> This way there is no functional difference, except for saving the space
>> for the separate lock.
>>
>> v2: re-work the patch to avoid any functional difference
>
> *cough cough*
>
>>
>> Signed-off-by: Christian König <christian.koenig(a)amd.com>
>> ---
>> drivers/gpu/drm/scheduler/sched_fence.c | 6 +++---
>> include/drm/gpu_scheduler.h | 4 ----
>> 2 files changed, 3 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
>> index 724d77694246..112677231f9a 100644
>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>> @@ -217,7 +217,6 @@ struct drm_sched_fence *drm_sched_fence_alloc(struct drm_sched_entity *entity,
>>
>> fence->owner = owner;
>> fence->drm_client_id = drm_client_id;
>> - spin_lock_init(&fence->lock);
>>
>> return fence;
>> }
>> @@ -230,9 +229,10 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
>> fence->sched = entity->rq->sched;
>> seq = atomic_inc_return(&entity->fence_seq);
>> dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
>> - &fence->lock, entity->fence_context, seq);
>> + NULL, entity->fence_context, seq);
>> dma_fence_init(&fence->finished, &drm_sched_fence_ops_finished,
>> - &fence->lock, entity->fence_context + 1, seq);
>> + dma_fence_spinlock(&fence->scheduled),
>
> I think while you are correct that this is no functional difference, it
> is still a bad idea which violates the entire idea of your series:
>
> All fences are now independent from each other and the fence context –
> except for those two.
>
> Some fences are more equal than others ;)
Yeah, I was going back and forth once more if I should keep this patch at all or just drop it.
> By implementing this, you would also show to people browsing the code
> that it can be a good idea or can be done to have fences share locks.
> Do you want that?
Good question. For almost all cases we don't want this, but once more the scheduler is special.
In the scheduler we have two fences in one, the scheduled one and the finished one.
So here it technically makes sense to have this construct to be defensive.
But on the other hand it has no practical value because it still doesn't allow us to unload the scheduler module. We would need a much wider rework for being able to do that.
So maybe I should really just drop this patch, or at least hold it back until we have had time to figure out what the next steps are.
> As far as I have learned from you and our discussions, that would be a
> very bombastic violation of the sacred "dma-fence-rules".
Well, using the inline lock is "only" a strong recommendation. It's not as heavy as the signaling rules, because when you mess up those you can easily kill the whole system.
> I believe it's definitely worth sacrificing some bytes so that those
> two fences get fully decoupled. Who will have it on their radar that
> they are special? Think about future reworks.
This doesn't even save any bytes; my thinking was more that this is the more defensive approach, should anybody use the spinlock pointer from the scheduler fence to do some locking.
> Besides that, no objections from my side.
Thanks,
Christian.
>
>
> P.
>
>> + entity->fence_context + 1, seq);
>> }
>>
>> module_init(drm_sched_fence_slab_init);
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 78e07c2507c7..ad3704685163 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -297,10 +297,6 @@ struct drm_sched_fence {
>> * belongs to.
>> */
>> struct drm_gpu_scheduler *sched;
>> - /**
>> - * @lock: the lock used by the scheduled and the finished fences.
>> - */
>> - spinlock_t lock;
>> /**
>> * @owner: job owner for debugging
>> */
>
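The two options discussed above can be compared in a small userspace model. This is only a sketch: model_lock stands in for spinlock_t, and the convention that a NULL external lock means "use the inline lock" mirrors what the quoted patch does with dma_fence_init(), but every name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for spinlock_t. */
struct model_lock {
	int locked;
};

struct model_fence {
	struct model_lock inline_lock;   /* embedded in the fence itself */
	struct model_lock *extern_lock;  /* NULL means "use inline_lock" */
};

static void model_fence_init(struct model_fence *f, struct model_lock *lock)
{
	assert(f);
	f->inline_lock.locked = 0;
	f->extern_lock = lock;
}

/* Returns the lock that actually protects this fence's state. */
static struct model_lock *model_fence_lock(struct model_fence *f)
{
	return f->extern_lock ? f->extern_lock : &f->inline_lock;
}

struct model_sched_fence {
	struct model_fence scheduled;
	struct model_fence finished;
};

/* Variant from the patch: finished borrows the lock of scheduled. */
static void model_sched_fence_init_shared(struct model_sched_fence *sf)
{
	model_fence_init(&sf->scheduled, NULL);
	model_fence_init(&sf->finished, model_fence_lock(&sf->scheduled));
}

/* Variant argued for in the review: both fences fully decoupled. */
static void model_sched_fence_init_decoupled(struct model_sched_fence *sf)
{
	model_fence_init(&sf->scheduled, NULL);
	model_fence_init(&sf->finished, NULL);
}
```

In the shared variant both fences serialize on one lock, as they did with the old fence->lock member; in the decoupled variant each fence is self-contained.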
On 1/13/26 22:32, Eric Chanudet wrote:
> The system dma-buf heap lets userspace allocate buffers from the page
> allocator. However, these allocations are not accounted for in memcg,
> allowing processes to escape limits that may be configured.
>
> Pass __GFP_ACCOUNT for system heap allocations, based on the
> dma_heap.mem_accounting parameter, to use memcg and account for them.
>
> Signed-off-by: Eric Chanudet <echanude(a)redhat.com>
> ---
> drivers/dma-buf/heaps/system_heap.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> index 4c782fe33fd497a74eb5065797259576f9b651b6..139b50df64ed4c4a6fdd69f25fe48324fbe2c481 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -52,6 +52,8 @@ static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
> static const unsigned int orders[] = {8, 4, 0};
> #define NUM_ORDERS ARRAY_SIZE(orders)
>
> +extern bool mem_accounting;
Please define that in some header. Apart from that looks good technically.
But after the discussion it sounds more and more like we don't want to account device driver allocated memory in memcg at all.
Regards,
Christian.
> +
> static int dup_sg_table(struct sg_table *from, struct sg_table *to)
> {
> struct scatterlist *sg, *new_sg;
> @@ -320,14 +322,17 @@ static struct page *alloc_largest_available(unsigned long size,
> {
> struct page *page;
> int i;
> + gfp_t flags;
>
> for (i = 0; i < NUM_ORDERS; i++) {
> if (size < (PAGE_SIZE << orders[i]))
> continue;
> if (max_order < orders[i])
> continue;
> -
> - page = alloc_pages(order_flags[i], orders[i]);
> + flags = order_flags[i];
> + if (mem_accounting)
> + flags |= __GFP_ACCOUNT;
> + page = alloc_pages(flags, orders[i]);
> if (!page)
> continue;
> return page;
>
Hi everyone,
dma_fences have always lived under the tyranny dictated by the module
lifetime of their issuer, leading to crashes should anybody still hold
a reference to a dma_fence when the module of the issuer is unloaded.
The basic problem is that when buffers are shared between drivers,
dma_fence objects can leak into external drivers and stay there even
after they are signaled. The dma_resv object, for example, only lazily
releases dma_fences.
So what happens is that when the module which originally created the
dma_fence unloads, the dma_fence_ops function table becomes unavailable
as well, and so any attempt to release the fence crashes the system.
Previously various approaches have been discussed, including changing the
locking semantics of the dma_fence callbacks (by me) as well as using the
drm scheduler as an intermediate layer (by Sima) to disconnect dma_fences
from their actual users, but none of them actually solves all problems.
Tvrtko did some really nice prerequisite work by protecting the returned
strings of the dma_fence_ops by RCU. This way dma_fence creators were
able to just wait for an RCU grace period after fence signaling before
it was safe to free those data structures.
Now this patch set goes a step further and protects the whole
dma_fence_ops structure by RCU, so that after the fence signals the
pointer to the dma_fence_ops is set to NULL when neither a wait nor a
release callback is given. All functionality which uses the dma_fence_ops
reference is put inside an RCU critical section, except for the
deprecated issuer-specific wait and of course the optional release
callback.
In addition to the RCU changes, the lock protecting the dma_fence state
previously had to be allocated externally. This set now changes the
functionality to make that external lock optional and allows dma_fences
to use an inline lock and be self-contained.
v4:
Rebase the whole set on upstream changes, especially the cleanup
from Philipp in patch "drm/amdgpu: independence for the amdkfd_fence!".
Add two patches which bring the DMA-fence selftests up to date.
The first selftest change removes the mock_wait and so actually starts
testing the default behavior instead of some hacky implementation in the
test. This one got upstreamed independently of this set.
The second drops the mock_fence as well and tests the new RCU and inline
spinlock functionality.
v5:
Rebase on top of drm-misc-next instead of drm-tip, leave out all driver
changes for now since those should go through the driver specific paths
anyway.
Address a few more review comments, especially some rebase mess and
typos. And finally fix one more bug found by AMD's CI system.
Especially the first patch still needs a Reviewed-by, apart from that I
think I've addressed all review comments and problems.
Please review and comment,
Christian.
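As a rough illustration of the ops handling described in this cover letter, here is a userspace model using C11 atomics. It is a sketch under loud assumptions: atomic_load() merely stands in for rcu_dereference(), the model does not reproduce RCU grace periods, and all names (model_fence, detached_driver_name, ...) are hypothetical and not taken from the series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_fence;

struct model_fence_ops {
	const char *(*get_driver_name)(struct model_fence *f);
	long (*wait)(struct model_fence *f);     /* optional, deprecated */
	void (*release)(struct model_fence *f);  /* optional */
};

struct model_fence {
	_Atomic(const struct model_fence_ops *) ops;
	atomic_bool signaled;
};

/* Placeholder strings; the real names used by the kernel may differ. */
static const char demo_driver_name[] = "demo";
static const char detached_driver_name[] = "detached";

static const char *demo_get_driver_name(struct model_fence *f)
{
	(void)f;
	return demo_driver_name;
}

static void demo_release(struct model_fence *f)
{
	(void)f;
}

static const struct model_fence_ops demo_ops = {
	.get_driver_name = demo_get_driver_name,
};

static const struct model_fence_ops demo_ops_with_release = {
	.get_driver_name = demo_get_driver_name,
	.release = demo_release,
};

static void model_fence_init(struct model_fence *f,
			     const struct model_fence_ops *ops)
{
	assert(f && ops);
	atomic_init(&f->ops, ops);
	atomic_init(&f->signaled, false);
}

static void model_fence_signal(struct model_fence *f)
{
	const struct model_fence_ops *ops = atomic_load(&f->ops);

	atomic_store(&f->signaled, true);
	/* Drop ops unless the issuer still needs a callback afterwards. */
	if (ops && !ops->wait && !ops->release)
		atomic_store(&f->ops, NULL);
}

static const char *model_fence_driver_name(struct model_fence *f)
{
	/* Load the pointer once; it may become NULL concurrently. */
	const struct model_fence_ops *ops = atomic_load(&f->ops);

	return ops ? ops->get_driver_name(f) : detached_driver_name;
}
```

The important part is that every reader loads the ops pointer once and handles NULL, so a signaled fence no longer depends on the issuer's function table.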
On 1/13/26 18:44, Tomeu Vizoso wrote:
> This series adds a new DRM/Accel driver that supports the C7x DSPs
> inside some Texas Instruments SoCs such as the J722S. These can be used
> as accelerators for various workloads, including machine learning
> inference.
>
> This driver controls the power state of the hardware via remoteproc and
> communicates with the firmware running on the DSP via rpmsg_virtio. The
> kernel driver itself allocates buffers, manages contexts, and submits
> jobs to the DSP firmware. Buffers are mapped by the DSP itself using its
> MMU, providing memory isolation among different clients.
>
> The source code for the firmware running on the DSP is available at:
> https://gitlab.freedesktop.org/tomeu/thames_firmware/.
>
> Everything else is done in userspace, as a Gallium driver (also called
> thames) that is part of the Mesa3D project: https://docs.mesa3d.org/teflon.html
>
> If there is more than one core that advertises the same rpmsg_virtio
> service name, the driver will load balance jobs between them with
> drm-gpu-scheduler.
I only took 5 minutes to skim over it, so no full review.
You have the classic mistake of allocating memory in the run_job callback of the scheduler, but that is trivial to fix.
Apart from that looks pretty solid to me.
Regards,
Christian.
>
> Userspace portion of the driver: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39298
>
> Signed-off-by: Tomeu Vizoso <tomeu(a)tomeuvizoso.net>
> ---
> Tomeu Vizoso (5):
> arm64: dts: ti: k3-j722s-ti-ipc-firmware: Add memory pool for DSP i/o buffers
> accel/thames: Add driver for the C7x DSPs in TI SoCs
> accel/thames: Add IOCTLs for BO creation and mapping
> accel/thames: Add IOCTL for job submission
> accel/thames: Add IOCTL for memory synchronization
>
> Documentation/accel/thames/index.rst | 28 ++
> MAINTAINERS | 9 +
> .../boot/dts/ti/k3-j722s-ti-ipc-firmware.dtsi | 11 +-
> drivers/accel/Kconfig | 1 +
> drivers/accel/Makefile | 3 +-
> drivers/accel/thames/Kconfig | 26 ++
> drivers/accel/thames/Makefile | 11 +
> drivers/accel/thames/thames_core.c | 161 +++++++
> drivers/accel/thames/thames_core.h | 53 +++
> drivers/accel/thames/thames_device.c | 93 +++++
> drivers/accel/thames/thames_device.h | 46 ++
> drivers/accel/thames/thames_drv.c | 180 ++++++++
> drivers/accel/thames/thames_drv.h | 21 +
> drivers/accel/thames/thames_gem.c | 407 ++++++++++++++++++
> drivers/accel/thames/thames_gem.h | 45 ++
> drivers/accel/thames/thames_ipc.h | 204 +++++++++
> drivers/accel/thames/thames_job.c | 463 +++++++++++++++++++++
> drivers/accel/thames/thames_job.h | 51 +++
> drivers/accel/thames/thames_rpmsg.c | 276 ++++++++++++
> drivers/accel/thames/thames_rpmsg.h | 27 ++
> 20 files changed, 2113 insertions(+), 3 deletions(-)
> ---
> base-commit: 27927a79b3c6aebd18f38507a8160294243763dc
> change-id: 20260113-thames-334127a2d91d
>
> Best regards,
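On the run_job remark above: the usual reasoning, as far as I understand it, is that run_job sits in the fence signaling path, so an allocation there can end up in memory reclaim, which may itself wait on fences. The common fix is to allocate everything at submit time. Below is a minimal userspace sketch of that pattern; the types and function names are hypothetical and not taken from the thames driver.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical message a driver might send to firmware for each job. */
struct model_msg {
	unsigned int job_id;
	unsigned int opcode;
};

struct model_job {
	unsigned int id;
	struct model_msg *msg;  /* preallocated at submit time */
};

/* Submit path (ioctl context): allocating, and failing, is fine here. */
static struct model_job *model_job_create(unsigned int id)
{
	struct model_job *job = calloc(1, sizeof(*job));

	if (!job)
		return NULL;

	job->msg = calloc(1, sizeof(*job->msg));
	if (!job->msg) {
		free(job);
		return NULL;
	}

	job->id = id;
	return job;
}

/* Models run_job: only fills in and hands off preallocated memory. */
static const struct model_msg *model_run_job(struct model_job *job)
{
	assert(job && job->msg);
	job->msg->job_id = job->id;
	job->msg->opcode = 1;  /* arbitrary "start" opcode for the sketch */
	return job->msg;
}

static void model_job_free(struct model_job *job)
{
	if (!job)
		return;
	free(job->msg);
	free(job);
}
```

With this split, an allocation failure is reported to userspace by the submit ioctl instead of surfacing inside the scheduler.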
On Tue, Jan 13, 2026 at 11:45 AM Tomeu Vizoso <tomeu(a)tomeuvizoso.net> wrote:
>
> Some SoCs from Texas Instruments contain DSPs that can be used for
> general compute tasks.
>
> This driver provides a drm/accel UABI to userspace for submitting jobs
> to the DSP cores and managing the input, output and intermediate memory.
>
> Signed-off-by: Tomeu Vizoso <tomeu(a)tomeuvizoso.net>
> ---
> Documentation/accel/thames/index.rst | 28 +++++
> MAINTAINERS | 9 ++
> drivers/accel/Kconfig | 1 +
> drivers/accel/Makefile | 3 +-
> drivers/accel/thames/Kconfig | 26 +++++
> drivers/accel/thames/Makefile | 9 ++
> drivers/accel/thames/thames_core.c | 155 ++++++++++++++++++++++++++
> drivers/accel/thames/thames_core.h | 53 +++++++++
> drivers/accel/thames/thames_device.c | 93 ++++++++++++++++
> drivers/accel/thames/thames_device.h | 46 ++++++++
> drivers/accel/thames/thames_drv.c | 156 +++++++++++++++++++++++++++
> drivers/accel/thames/thames_drv.h | 21 ++++
> drivers/accel/thames/thames_ipc.h | 204 +++++++++++++++++++++++++++++++++++
> drivers/accel/thames/thames_rpmsg.c | 155 ++++++++++++++++++++++++++
> drivers/accel/thames/thames_rpmsg.h | 27 +++++
> 15 files changed, 985 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/accel/thames/index.rst b/Documentation/accel/thames/index.rst
> new file mode 100644
> index 0000000000000000000000000000000000000000..ca8391031f226f7ef1dc210a356c86acbe126c6f
> --- /dev/null
> +++ b/Documentation/accel/thames/index.rst
> @@ -0,0 +1,28 @@
> +.. SPDX-License-Identifier: GPL-2.0-only
> +
> +============================================================
> + accel/thames Driver for the C7x DSPs from Texas Instruments
> +============================================================
> +
> +The accel/thames driver supports the C7x DSPs inside some Texas Instruments SoCs
> +such as the J722S. These can be used as accelerators for various workloads,
> +including machine learning inference.
> +
> +This driver controls the power state of the hardware via :doc:`remoteproc </staging/remoteproc>`
> +and communicates with the firmware running on the DSP via :doc:`rpmsg_virtio </staging/rpmsg_virtio>`.
> +The kernel driver itself allocates buffers, manages contexts, and submits jobs
> +to the DSP firmware. Buffers are mapped by the DSP itself using its MMU,
> +providing memory isolation among different clients.
> +
> +The source code for the firmware running on the DSP is available at:
> +https://gitlab.freedesktop.org/tomeu/thames_firmware/.
> +
> +Everything else is done in userspace, as a Gallium driver (also called thames)
> +that is part of the Mesa3D project: https://docs.mesa3d.org/teflon.html
> +
> +If there is more than one core that advertises the same rpmsg_virtio service
> +name, the driver will load balance jobs between them with drm-gpu-scheduler.
> +
> +Hardware currently supported:
> +
> +* J722S
> diff --git a/MAINTAINERS b/MAINTAINERS
> index dc731d37c8feeff25613c59fe9c929927dadaa7e..a3fc809c797269d0792dfe5202cc1b49f6ff57e9 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7731,6 +7731,15 @@ F: Documentation/devicetree/bindings/npu/rockchip,rk3588-rknn-core.yaml
> F: drivers/accel/rocket/
> F: include/uapi/drm/rocket_accel.h
>
> +DRM ACCEL DRIVER FOR TI C7x DSPS
> +M: Tomeu Vizoso <tomeu(a)tomeuvizoso.net>
> +L: dri-devel(a)lists.freedesktop.org
> +S: Supported
> +T: git https://gitlab.freedesktop.org/drm/misc/kernel.git
> +F: Documentation/accel/thames/
> +F: drivers/accel/thames/
> +F: include/uapi/drm/thames_accel.h
Oh where is this "thames_accel.h" ? ;)
2026-01-13T18:16:11.881084Z 01E drivers/accel/thames/thames_drv.c:8:10: fatal error: drm/thames_accel.h: No such file or directory
2026-01-13T18:16:11.881086Z 01E 8 | #include <drm/thames_accel.h>
2026-01-13T18:16:11.881087Z 01E | ^~~~~~~~~~~~~~~~~~~~
2026-01-13T18:16:11.881115Z 01E compilation terminated.
2026-01-13T18:16:11.884552Z 01E make[8]: *** [scripts/Makefile.build:287: drivers/accel/thames/thames_drv.o] Error 1
2026-01-13T18:16:11.884694Z 01E make[7]: *** [scripts/Makefile.build:544: drivers/accel/thames] Error 2
2026-01-13T18:16:11.884926Z 01E make[6]: *** [scripts/Makefile.build:544: drivers/accel] Error 2
2026-01-13T18:16:11.884976Z 01E make[6]: *** Waiting for unfinished jobs....
$ find . | grep thames_accel.h
$ grep -R "thames_accel.h" ./*
./drivers/accel/thames/Kconfig: include/uapi/drm/thames_accel.h and is used by the Thames userspace
./drivers/accel/thames/thames_job.c:#include <drm/thames_accel.h>
./drivers/accel/thames/thames_drv.c:#include <drm/thames_accel.h>
./drivers/accel/thames/thames_gem.c:#include <drm/thames_accel.h>
./MAINTAINERS:F: include/uapi/drm/thames_accel.h
Regards,
--
Robert Nelson
https://rcn-ee.com/
This series implements a dma-buf “revoke” mechanism that allows a dma-buf
exporter to explicitly invalidate (“kill”) a shared buffer after it has
been distributed to importers, so that further CPU and device access is
prevented and importers reliably observe failure.
Today, dma-buf effectively provides “if you have the fd, you can keep using
the memory indefinitely.” That assumption breaks down when an exporter must
reclaim, reset, evict, or otherwise retire backing memory after it has been
shared. Concrete cases include GPU reset and recovery where old allocations
become unsafe to access, memory eviction/overcommit where backing storage
must be withdrawn, and security or isolation situations where continued access
must be prevented. While drivers can sometimes approximate this with
exporter-specific fencing and policy, there is no core dma-buf state transition
that communicates “this buffer is no longer valid; fail access” across all
access paths.
The change in this series is to introduce a core “revoked” state on the dma-buf
object and a corresponding exporter-triggered revoke operation. Once a dma-buf
is revoked, new access paths are blocked so that attempts to DMA-map, vmap, or
mmap the buffer fail in a consistent way.
In addition, the series aims to invalidate existing access as much as the kernel
allows: device mappings are torn down where possible so devices and IOMMUs cannot
continue DMA.
The semantics are intentionally simple: revoke is a one-way, permanent transition
for the lifetime of that dma-buf instance.
From a compatibility perspective, users that never invoke revoke are unaffected,
and exporters that adopt it gain a core-supported enforcement mechanism rather
than relying on ad hoc driver behavior. The intent is to keep the interface
minimal and avoid imposing policy; the series provides the mechanism to terminate
access, with policy remaining in the exporter and higher-level components.
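The one-way semantics described above can be pictured with a small single-threaded userspace model. This is only a sketch: the names, the -ENODEV error code and the mapping counter are assumptions for illustration and are not taken from the series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a dma-buf with a core "revoked" state. */
struct model_dmabuf {
	bool revoked;   /* one-way: never cleared once set */
	int mappings;   /* number of live device mappings */
};

static void model_dmabuf_init(struct model_dmabuf *buf)
{
	assert(buf);
	buf->revoked = false;
	buf->mappings = 0;
}

/*
 * Exporter-triggered revoke. Returns true if this call performed the
 * transition, false if the buffer was already revoked.
 */
static bool model_dmabuf_revoke(struct model_dmabuf *buf)
{
	if (buf->revoked)
		return false;

	buf->revoked = true;
	buf->mappings = 0;  /* models tearing down existing mappings */
	return true;
}

/*
 * New access path (DMA-map, vmap, mmap): fails once revoked. A real
 * implementation has to serialize this check against revoke.
 */
static int model_dmabuf_map(struct model_dmabuf *buf)
{
	if (buf->revoked)
		return -ENODEV;  /* error code is an assumption */

	buf->mappings++;
	return 0;
}
```

There is intentionally no "un-revoke" operation in the model, matching the permanent transition described above.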
BTW, see this megathread [1] for additional context.
Ironically, it was posted exactly one year ago.
[1] https://lore.kernel.org/all/20250107142719.179636-2-yilun.xu@linux.intel.co…
Thanks
Cc: linux-rdma(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-media(a)vger.kernel.org
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: kvm(a)vger.kernel.org
Cc: iommu(a)lists.linux.dev
To: Jason Gunthorpe <jgg(a)ziepe.ca>
To: Leon Romanovsky <leon(a)kernel.org>
To: Sumit Semwal <sumit.semwal(a)linaro.org>
To: Christian König <christian.koenig(a)amd.com>
To: Alex Williamson <alex(a)shazbot.org>
To: Kevin Tian <kevin.tian(a)intel.com>
To: Joerg Roedel <joro(a)8bytes.org>
To: Will Deacon <will(a)kernel.org>
To: Robin Murphy <robin.murphy(a)arm.com>
Signed-off-by: Leon Romanovsky <leonro(a)nvidia.com>
---
Leon Romanovsky (4):
dma-buf: Introduce revoke semantics
vfio: Use dma-buf revoke semantics
iommufd: Require DMABUF revoke semantics
iommufd/selftest: Reuse dma-buf revoke semantics
drivers/dma-buf/dma-buf.c | 36 ++++++++++++++++++++++++++++++++----
drivers/iommu/iommufd/pages.c | 2 +-
drivers/iommu/iommufd/selftest.c | 12 ++++--------
drivers/vfio/pci/vfio_pci_dmabuf.c | 27 ++++++---------------------
include/linux/dma-buf.h | 31 +++++++++++++++++++++++++++++++
5 files changed, 74 insertions(+), 34 deletions(-)
---
base-commit: 9ace4753a5202b02191d54e9fdf7f9e3d02b85eb
change-id: 20251221-dmabuf-revoke-b90ef16e4236
Best regards,
--
Leon Romanovsky <leonro(a)nvidia.com>
On Fri, Jan 09, 2026 at 10:10:57AM +0800, Ming Lei wrote:
> On Thu, Jan 08, 2026 at 11:17:03AM +0100, Christoph Hellwig wrote:
> > On Thu, Jan 08, 2026 at 10:19:18AM +0800, Ming Lei wrote:
> > > > The feature is in no way nvme specific. nvme is just the initial
> > > > underlying driver. It makes total sense to support this for any high
> > > > performance block device, and to pass it through file systems.
> > >
> > > But why does FS care the dma buffer attachment? Since high performance
> > > host controller is exactly the dma buffer attachment point.
> >
> > I can't parse what you're trying to say here.
>
> dma buffer attachment is simply none of FS's business.
The file system should indeed never do a dma buffer attachment itself,
but that's not the point.
> > But even when not stacking, the registration still needs to go
> > through the file system even for a single device, never mind multiple
> > controlled by the file system.
>
> dma_buf can have multiple importers, so why does it have to go through FS for
> single device only?
>
> If the registered buffer is attached to single device before going
> through FS, it can not support stacking block device, and it can't or not
> easily to use for multiple block device, no matter if they are behind same
> host controller or multiple.
Because the file system, or the file_operations instance to be more
specific, is the only entity that knows which block device(s) or other
DMA-capable device(s), like an (R)NIC, a file maps to.
On Thu, Jan 08, 2026 at 10:19:18AM +0800, Ming Lei wrote:
> > The feature is in no way nvme specific. nvme is just the initial
> > underlying driver. It makes total sense to support this for any high
> > performance block device, and to pass it through file systems.
>
> But why does FS care the dma buffer attachment? Since high performance
> host controller is exactly the dma buffer attachment point.
I can't parse what you're trying to say here.
> If the callback is added in `struct file_operations` for wiring dma buffer
> and the importer (host controller), you will see it is hard to let it cross device
> mapper/raid or other stackable block devices.
Why?
But even when not stacking, the registration still needs to go
through the file system even for a single device, never mind multiple
controlled by the file system.
On 1/4/26 02:42, Ming Lei wrote:
> On Thu, Dec 04, 2025 at 02:10:25PM +0100, Christoph Hellwig wrote:
>> On Thu, Dec 04, 2025 at 12:09:46PM +0100, Christian König wrote:
>>>> I find the naming pretty confusing as well. But what this does is to
>>>> tell the file system/driver that it should expect a future
>>>> read_iter/write_iter operation that takes data from / puts data into
>>>> the dmabuf passed to this operation.
>>>
>>> That explanation makes much more sense.
>>>
>>> The remaining question is why does the underlying file system / driver
>>> need to know that it will get addresses from a DMA-buf?
>>
>> This eventually ends up calling dma_buf_dynamic_attach and provides
>> a way to find the dma_buf_attachment later in the I/O path.
>
> Maybe it can be named as ->dma_buf_attach()? For wiring dma-buf and the
> importer side(nvme).
Yeah, that would make it much cleaner.
Also some higher level documentation would certainly help.
> But I am wondering why not make it as one subsystem interface, such as nvme
> ioctl, then the whole implementation can be simplified a lot. It is reasonable
> because subsystem is exactly the side for consuming/importing the dma-buf.
Yeah, the thought that it might be better to make this more nvme specific occurred to me as well.
Regards,
Christian.
>
>
> Thanks,
> Ming
>
On 12/19/25 16:58, Maxime Ripard wrote:
> On Fri, Dec 19, 2025 at 02:50:50PM +0100, Christian König wrote:
>> On 12/19/25 11:25, Maxime Ripard wrote:
>>> On Mon, Dec 15, 2025 at 03:53:22PM +0100, Christian König wrote:
>>>> On 12/15/25 14:59, Maxime Ripard wrote:
>> ...
>>>>>>> The shared ownership is indeed broken, but it's not more or less broken
>>>>>>> than, say, memfd + udmabuf, and I'm sure plenty of others.
>>>>>>>
>>>>>>> So we really improve the common case, but only make the "advanced"
>>>>>>> slightly more broken than it already is.
>>>>>>>
>>>>>>> Would you disagree?
>>>>>>
>>>>>> I strongly disagree. As far as I can see there is a huge chance we
>>>>>> break existing use cases with that.
>>>>>
>>>>> Which ones? And what about the ones that are already broken?
>>>>
>>>> Well everybody that expects that driver resources are *not* accounted to memcg.
>>>
>>> Which is a thing only because these buffers have never been accounted
>>> for in the first place.
>>
>> Yeah, completely agree. By not accounting it for such a long time we
>> ended up with people depending on this behavior.
>>
>> Not nice, but that's what it is.
>>
>>> So I guess the conclusion is that we shouldn't
>>> even try to do memory accounting, because someone somewhere might not
>>> expect that one of its applications would take too much RAM in the
>>> system?
>>
>> Well we do need some kind of solution to the problem. Either having
>> some setting where you say "This memcg limit is inclusive/exclusive
>> device driver allocated memory" or have a completely separate limit
>> for device driver allocated memory.
>
> A device driver memory specific limit sounds like a good idea because it
> would make it easier to bridge the gap with dmem.
Completely agree, but that approach was rejected by the cgroups people.
I mean we can already use udmabuf to allocate memcg accounted system memory which then can be imported into device drivers.
So I don't see much reason why we should account dma-buf heaps and driver interfaces to memcg as well, we just need some way to limit them.
Regards,
Christian.
>
> Happy holidays,
> Maxime
On Tue, Jan 06, 2026 at 07:51:12PM +0000, Pavel Begunkov wrote:
>> But I am wondering why not make it as one subsystem interface, such as nvme
>> ioctl, then the whole implementation can be simplified a lot. It is reasonable
>> because subsystem is exactly the side for consuming/importing the dma-buf.
>
> It's not an nvme specific interface, and so a file op was much more
> convenient.
It is the much better abstraction. Also the nvme subsystem is not
an actor, and registering things to the subsystem does not work.
The nvme controller is the entity that does the dma mapping, and this
interface works very well for that.
On Fri, Dec 19, 2025 at 7:19 PM Maxime Ripard <mripard(a)redhat.com> wrote:
>
> Hi,
>
> On Tue, Dec 16, 2025 at 11:06:59AM +0900, T.J. Mercier wrote:
> > On Mon, Dec 15, 2025 at 7:51 PM Maxime Ripard <mripard(a)redhat.com> wrote:
> > > On Fri, Dec 12, 2025 at 08:25:19AM +0900, T.J. Mercier wrote:
> > > > On Fri, Dec 12, 2025 at 4:31 AM Eric Chanudet <echanude(a)redhat.com> wrote:
> > > > >
> > > > > The system dma-buf heap lets userspace allocate buffers from the page
> > > > > allocator. However, these allocations are not accounted for in memcg,
> > > > > allowing processes to escape limits that may be configured.
> > > > >
> > > > > Pass the __GFP_ACCOUNT for our allocations to account them into memcg.
> > > >
> > > > We had a discussion just last night in the MM track at LPC about how
> > > > shared memory accounted in memcg is pretty broken. Without a way to
> > > > identify (and possibly transfer) ownership of a shared buffer, this
> > > > makes the accounting of shared memory, and zombie memcg problems
> > > > worse. :\
> > >
> > > Are there notes or a report from that discussion anywhere?
> >
> > The LPC vids haven't been clipped yet, and actually I can't even find
> > the recorded full live stream from Hall A2 on the first day. So I
> > don't think there's anything to look at, but I bet there's probably
> > nothing there you don't already know.
>
> Ack, thanks for looking at it still :)
>
> > > The way I see it, the dma-buf heaps *trivial* case is non-existent at
> > > the moment and that's definitely broken. Any application can bypass its
> > > cgroups limits trivially, and that's a pretty big hole in the system.
> >
> > Agree, but if we only charge the first allocator then limits can still
> > easily be bypassed assuming an app can cause an allocation outside of
> > its cgroup tree.
> >
> > I'm not sure using static memcg limits where a significant portion of
> > the memory can be shared is really feasible. Even with just pagecache
> > being charged to memcgs, we're having trouble defining a static memcg
> > limit that is really useful since it has to be high enough to
> > accommodate occasional spikes due to shared memory that might or might
> > not be charged (since it can only be charged to one memcg - it may be
> > spread around or it may all get charged to one memcg). So excessive
> > anonymous use has to get really bad before it gets punished.
> >
> > What I've been hearing lately is that folks are polling memory.stat or
> > PSI or other metrics and using that to take actions (memory.reclaim /
> > killing / adjust memory.high) at runtime rather than relying on
> > memory.high/max behavior with a static limit.
>
> But that's only side effects of a buffer being shared, right? (which,
> for a buffer sharing mechanism is still pretty important, but still)
>
> > > The shared ownership is indeed broken, but it's not more or less broken
> > > than, say, memfd + udmabuf, and I'm sure plenty of others.
> >
> > One thing that's worse about system heap buffers is that unlike memfd
> > the memory isn't reclaimable. So without killing all users there's
> > currently no way to deal with the zombie issue. Harry's proposing
> > reparenting, but I don't think our current interfaces support that
> > because we'd have to mess with the page structs behind system heap
> > dmabufs to change the memcg during reparenting.
> >
> > Ah... but udmabuf pins the memfd pages, so you're right that memfd +
> > udmabuf isn't worse.
> >
> > > So we really improve the common case, but only make the "advanced"
> > > slightly more broken than it already is.
> > >
> > > Would you disagree?
> >
> > I think memcg limits in this case just wouldn't be usable because of
> > what I mentioned above. In our common case the allocator is in a
> > different cgroup tree than the real users of the buffer.
>
> So, my issue with this is that we want to fix not only dma-buf itself,
> but every device buffer allocation mechanism, so also v4l2, drm, etc.
>
> So we'll need a lot of infrastructure and rework outside of dma-buf to
> get there, and figuring out how to solve the shared buffer accounting is
> indeed one of them, but was so far considered kind of the thing to do last,
> the last time we discussed it.
>
> What I get from that discussion is that we now consider it a
> prerequisite, and given how that topic has been advancing so far, one
> that would take a couple of years at best to materialize into something
> useful and upstream.
>
> Thus, it blocks all the work around it for years.
>
> Would you be open to merging patches that work on it but only enabled
> through a kernel parameter for example (and possibly taint the kernel?)?
> That would allow to work towards that goal while not being blocked by
> the shared buffer accounting, and not affecting the general case either.
>
> Maxime
Hi Maxime,
A kernel param or a CONFIG sounds like a good compromise to allow work
to progress. I'd be happy to add my R-B to that.
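As a rough illustration of that compromise (not taken from any posted patch), the opt-in could boil down to adding the accounting flag only when a switch is set. The sketch below is a userspace model: the MODEL_GFP_* values and model_heap_account are hypothetical stand-ins for GFP_KERNEL, __GFP_ACCOUNT and whatever kernel parameter or CONFIG option would gate the behaviour.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the values are illustrative, not the kernel's. */
#define MODEL_GFP_KERNEL  0x1u
#define MODEL_GFP_ACCOUNT 0x2u

/* Hypothetical opt-in switch (think kernel parameter or CONFIG); off by default. */
static bool model_heap_account;

/* Pick the allocation flags for a heap buffer. */
static unsigned int model_heap_gfp(void)
{
	unsigned int gfp = MODEL_GFP_KERNEL;

	/* Only charge the allocation to the memcg when explicitly enabled. */
	if (model_heap_account)
		gfp |= MODEL_GFP_ACCOUNT;

	return gfp;
}
```

With the switch left at its default, the allocation flags are unchanged, so the general case discussed above is not affected.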
Hi Alain,
kernel test robot noticed the following build warnings:
[auto build test WARNING on atorgue-stm32/stm32-next]
[also build test WARNING on robh/for-next linus/master v6.19-rc1 next-20251219]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Alain-Volmat/media-stm32-dcm…
base: https://git.kernel.org/pub/scm/linux/kernel/git/atorgue/stm32.git stm32-next
patch link: https://lore.kernel.org/r/20251218-stm32-dcmi-dma-chaining-v1-1-39948ca6cbf…
patch subject: [PATCH 01/12] media: stm32: dcmi: Switch from __maybe_unused to pm_sleep_ptr()
config: arc-allyesconfig (https://download.01.org/0day-ci/archive/20251221/202512210044.xNNW6QJZ-lkp@…)
compiler: arc-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251221/202512210044.xNNW6QJZ-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512210044.xNNW6QJZ-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/media/platform/st/stm32/stm32-dcmi.c:2127:12: warning: 'dcmi_resume' defined but not used [-Wunused-function]
2127 | static int dcmi_resume(struct device *dev)
| ^~~~~~~~~~~
>> drivers/media/platform/st/stm32/stm32-dcmi.c:2116:12: warning: 'dcmi_suspend' defined but not used [-Wunused-function]
2116 | static int dcmi_suspend(struct device *dev)
| ^~~~~~~~~~~~
vim +/dcmi_resume +2127 drivers/media/platform/st/stm32/stm32-dcmi.c
2115
> 2116 static int dcmi_suspend(struct device *dev)
2117 {
2118 /* disable clock */
2119 pm_runtime_force_suspend(dev);
2120
2121 /* change pinctrl state */
2122 pinctrl_pm_select_sleep_state(dev);
2123
2124 return 0;
2125 }
2126
> 2127 static int dcmi_resume(struct device *dev)
2128 {
2129 /* restore pinctl default state */
2130 pinctrl_pm_select_default_state(dev);
2131
2132 /* clock enable */
2133 pm_runtime_force_resume(dev);
2134
2135 return 0;
2136 }
2137
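For readers unfamiliar with the pattern, here is a minimal userspace sketch of how this kind of warning is usually avoided without __maybe_unused. It assumes (the report above does not confirm this) that the callbacks are only referenced from an initializer that is compiled out when CONFIG_PM_SLEEP is disabled. Everything prefixed model_ or MODEL_ is a hypothetical stand-in; only the PTR_IF() shape mirrors the kernel macro.

```c
#include <stddef.h>

/* MODEL_PM_SLEEP stands in for IS_ENABLED(CONFIG_PM_SLEEP); hypothetical. */
#ifndef MODEL_PM_SLEEP
#define MODEL_PM_SLEEP 0
#endif

/* Same shape as the kernel's PTR_IF(): the pointer is always referenced. */
#define PTR_IF(cond, ptr) ((cond) ? (ptr) : NULL)
#define model_pm_sleep_ptr(ptr) PTR_IF(MODEL_PM_SLEEP, (ptr))

struct model_dev_pm_ops {
	int (*suspend)(void *dev);
	int (*resume)(void *dev);
};

static int model_suspend(void *dev) { (void)dev; return 0; }
static int model_resume(void *dev) { (void)dev; return 0; }

/*
 * The callbacks are named unconditionally here, so the compiler sees a
 * reference to them and does not emit -Wunused-function, whatever the
 * value of MODEL_PM_SLEEP.
 */
static const struct model_dev_pm_ops model_pm_ops = {
	.suspend = model_suspend,
	.resume = model_resume,
};

/* The ops are dropped at the use site instead: NULL when sleep is disabled. */
static const struct model_dev_pm_ops *model_driver_pm(void)
{
	return model_pm_sleep_ptr(&model_pm_ops);
}
```

If the initializer were instead wrapped in a preprocessor conditional, the two static functions would have no remaining references when the option is off, which is the situation the warnings above describe.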
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki