On Wed, Oct 15, 2025 at 2:39 PM Frank Li <Frank.li(a)nxp.com> wrote:
>
> On Wed, Oct 15, 2025 at 12:47:40PM -0500, Rob Herring (Arm) wrote:
> > Add a driver for Arm Ethos-U65/U85 NPUs. The Ethos-U NPU has a
> > relatively simple interface with single command stream to describe
> > buffers, operation settings, and network operations. It supports up to 8
> > memory regions (though there are no h/w bounds on a region). The Ethos NPUs
> > are designed to use an SRAM for scratch memory. Region 2 is reserved
> > for SRAM (like the downstream driver stack and compiler). Userspace
> > doesn't need access to the SRAM.
> >
> > The h/w has neither an MMU nor an external IOMMU and is a DMA engine which can
> > read and write anywhere in memory without h/w bounds checks. The user
> > submitted command streams must be validated against the bounds of the
> > GEM BOs. This is similar to the VC4 design which validates shaders.
> >
> > The job submit is based on the rocket driver for the Rockchip NPU
> > utilizing the GPU scheduler. It is simpler as there's only 1 core rather
> > than 3.
> >
> > Tested on i.MX93 platform (U65) and FVP (U85) with WIP Mesa Teflon
> > support.
> >
> > Acked-by: Thomas Zimmermann <tzimmermann(a)suse.de>
> > Signed-off-by: Rob Herring (Arm) <robh(a)kernel.org>
> > ---
>
> How do I test this driver?
You need to add the DT node to i.MX93 .dts like the example, build the
mesa ethosu branch, and then run tflite with it pointed to the mesa
delegate.
I can send an i.MX93 dts patch after this is merged.
> > v4:
> > - Use bulk clk API
> > - Various whitespace fixes mostly due to ethos->ethosu rename
> > - Drop error check on dma_set_mask_and_coherent()
> > - Drop unnecessary pm_runtime_mark_last_busy() call
> > - Move variable declarations out of switch (a riscv/clang build failure)
> > - Use lowercase hex in all defines
> > - Drop unused ethosu_device.coherent member
> > - Add comments on all locks
> >
> ...
> > diff --git a/drivers/accel/ethosu/ethosu_device.h b/drivers/accel/ethosu/ethosu_device.h
> > new file mode 100644
> > index 000000000000..69d610c5c2d7
> > --- /dev/null
> > +++ b/drivers/accel/ethosu/ethosu_device.h
> > @@ -0,0 +1,190 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only or MIT */
> > +/* Copyright 2025 Arm, Ltd. */
> > +
> > +#ifndef __ETHOSU_DEVICE_H__
> > +#define __ETHOSU_DEVICE_H__
> > +
> > +#include <linux/types.h>
> > +
> > +#include <drm/drm_device.h>
> > +#include <drm/gpu_scheduler.h>
> > +
> > +#include <drm/ethosu_accel.h>
> > +
> > +struct clk;
> > +struct gen_pool;
>
> Shouldn't this include clk.h instead of forward declaring the struct?
Headers should only use a forward declaration if that's all they need.
It keeps the struct opaque for starters.
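To illustrate (a sketch with made-up names, not the actual ethosu structs):

/* example_device.h: only a pointer is stored, so a forward declaration
 * is enough and keeps struct clk opaque; no <linux/clk.h> needed here. */
struct clk;

struct example_device {
	struct clk *core_clk;	/* pointer only; layout not needed */
};

/* example_drv.c: this file actually calls the clk API, so it includes it. */
#include <linux/clk.h>

static int example_power_on(struct example_device *edev)
{
	return clk_prepare_enable(edev->core_clk);
}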
> ...
> > +
> > +static int ethosu_open(struct drm_device *ddev, struct drm_file *file)
> > +{
> > + int ret = 0;
> > + struct ethosu_file_priv *priv;
> > +
> > + if (!try_module_get(THIS_MODULE))
> > + return -EINVAL;
> > +
> > + priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> > + if (!priv) {
> > + ret = -ENOMEM;
> > + goto err_put_mod;
> > + }
> > + priv->edev = to_ethosu_device(ddev);
> > +
> > + ret = ethosu_job_open(priv);
> > + if (ret)
> > + goto err_free;
> > +
> > + file->driver_priv = priv;
>
> Slightly simpler:
>
> struct ethosu_file_priv __free(kfree) *priv = NULL;
> ...
> priv = kzalloc(sizeof(*priv), GFP_KERNEL);
Linus has voiced his opinion that the above should not be done: the
declaration and the kzalloc() should be on one line *only*. But now that we
allow C99-style declarations, we can move it down below the try_module_get().
We can't get rid of the goto for module_put(), so it only marginally helps
here (rough sketch below the quoted hunk).
> ...
>
> file->driver_priv = no_free_ptr(priv);
>
>
> > + return 0;
> > +
> > +err_free:
> > + kfree(priv);
> > +err_put_mod:
> > + module_put(THIS_MODULE);
> > + return ret;
> > +}
> > +
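For completeness, roughly how that would look (untested sketch, assuming the
scope-based cleanup helpers from <linux/cleanup.h>; the declaration and the
kzalloc() stay on one line, and the goto for module_put() remains):

static int ethosu_open(struct drm_device *ddev, struct drm_file *file)
{
	int ret;

	if (!try_module_get(THIS_MODULE))
		return -EINVAL;

	/* one line, moved down; automatically kfree()d on the error paths */
	struct ethosu_file_priv *priv __free(kfree) = kzalloc(sizeof(*priv), GFP_KERNEL);
	if (!priv) {
		ret = -ENOMEM;
		goto err_put_mod;
	}
	priv->edev = to_ethosu_device(ddev);

	ret = ethosu_job_open(priv);
	if (ret)
		goto err_put_mod;

	/* ownership moves to the file; disarm the auto-free */
	file->driver_priv = no_free_ptr(priv);
	return 0;

err_put_mod:
	module_put(THIS_MODULE);
	return ret;
}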
> ...
> > +
> > +
> > +static int ethosu_init(struct ethosu_device *ethosudev)
> > +{
> > + int ret;
> > + u32 id, config;
> > +
> > + ret = devm_pm_runtime_enable(ethosudev->base.dev);
> > + if (ret)
> > + return ret;
> > +
> > + ret = pm_runtime_resume_and_get(ethosudev->base.dev);
> > + if (ret)
> > + return ret;
> > +
> > + pm_runtime_set_autosuspend_delay(ethosudev->base.dev, 50);
> > + pm_runtime_use_autosuspend(ethosudev->base.dev);
> > +
> > + /* If PM is disabled, we need to call ethosu_device_resume() manually. */
> > + if (!IS_ENABLED(CONFIG_PM)) {
> > + ret = ethosu_device_resume(ethosudev->base.dev);
> > + if (ret)
> > + return ret;
> > + }
>
> I think it should call ethosu_device_resume() unconditionally before
> devm_pm_runtime_enable();
>
> ethosu_device_resume();
> pm_runtime_set_active();
> pm_runtime_set_autosuspend_delay(ethosudev->base.dev, 50);
> devm_pm_runtime_enable();
Why do you think this? Does this do a get?
I don't think it is good to call the resume hook on our own, but we
have no choice with !CONFIG_PM. With CONFIG_PM, we should only use the
pm_runtime API.
Rob
The Arm Ethos-U65/85 NPUs are designed for edge AI inference
applications[0].
The driver works with Mesa Teflon. A merge request for Ethos support is
here[1]. The UAPI should also be compatible with the downstream (open
source) driver stack[2] and the Vela compiler, though that has not been
implemented.
Testing so far has been on i.MX93 boards with Ethos-U65 and an FVP model
with Ethos-U85. More work is needed in mesa for handling U85 command
stream differences, but that doesn't affect the UABI.
A git tree is here[3].
Rob
[0] https://www.arm.com/products/silicon-ip-cpu?families=ethos%20npus
[1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36699/
[2] https://gitlab.arm.com/artificial-intelligence/ethos-u/
[3] git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git ethos-v4
Signed-off-by: Rob Herring (Arm) <robh(a)kernel.org>
---
Changes in v4:
- Use bulk clk API
- Various whitespace fixes mostly due to ethos->ethosu rename
- Drop error check on dma_set_mask_and_coherent()
- Drop unnecessary pm_runtime_mark_last_busy() call
- Move variable declarations out of switch (a riscv/clang build failure)
- Use lowercase hex in all defines
- Drop unused ethosu_device.coherent member
- Add comments on all locks
- Link to v3: https://lore.kernel.org/r/20250926-ethos-v3-0-6bd24373e4f5@kernel.org
Changes in v3:
- Rework and improve job submit validation
- Rename ethos to ethosu. There was an Ethos-Nxx that's unrelated.
- Add missing init for sched_lock mutex
- Drop some prints to debug level
- Fix i.MX93 SRAM accesses (AXI config)
- Add U85 AXI configuration and test on FVP with U85
- Print the current cmd value on timeout
- Link to v2: https://lore.kernel.org/r/20250811-ethos-v2-0-a219fc52a95b@kernel.org
Changes in v2:
- Rebase on v6.17-rc1 adapting to scheduler changes
- scheduler: Drop the reset workqueue. According to the scheduler docs,
we don't need it since we have a single h/w queue.
- scheduler: Rework the timeout handling to continue running if we are
making progress. Fixes timeouts on larger jobs.
- Reset the NPU on resume so it's in a known state
- Add error handling on clk_get() calls
- Fix drm_mm splat on module unload. We were missing a put on the
cmdstream BO in the scheduler clean-up.
- Fix 0-day report needing explicit bitfield.h include
- Link to v1: https://lore.kernel.org/r/20250722-ethos-v1-0-cc1c5a0cbbfb@kernel.org
---
Rob Herring (Arm) (2):
dt-bindings: npu: Add Arm Ethos-U65/U85
accel: Add Arm Ethos-U NPU driver
.../devicetree/bindings/npu/arm,ethos.yaml | 79 +++
MAINTAINERS | 9 +
drivers/accel/Kconfig | 1 +
drivers/accel/Makefile | 1 +
drivers/accel/ethosu/Kconfig | 10 +
drivers/accel/ethosu/Makefile | 4 +
drivers/accel/ethosu/ethosu_device.h | 190 ++++++
drivers/accel/ethosu/ethosu_drv.c | 418 ++++++++++++
drivers/accel/ethosu/ethosu_drv.h | 15 +
drivers/accel/ethosu/ethosu_gem.c | 710 +++++++++++++++++++++
drivers/accel/ethosu/ethosu_gem.h | 46 ++
drivers/accel/ethosu/ethosu_job.c | 539 ++++++++++++++++
drivers/accel/ethosu/ethosu_job.h | 41 ++
include/uapi/drm/ethosu_accel.h | 261 ++++++++
14 files changed, 2324 insertions(+)
---
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
change-id: 20250715-ethos-3fdd39ef6f19
Best regards,
--
Rob Herring (Arm) <robh(a)kernel.org>
On 14.10.25 10:32, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
Probably the subject should be "mm: reintroduce alloc_pages_bulk_list()"
>
> commit c8b979530f27 ("mm: alloc_pages_bulk_noprof: drop page_list
> argument") removed alloc_pages_bulk_list(). This commit would like to bring
> it back, since it has proved helpful for drivers that allocate a bulk of
> pages (see patch 2 of this series).
"Let's reintroduce it so we can us for bulk allocation in the context of
XXX next."
> I do note Matthew's comment about the time cost of iterating a list.
> However, I also observed in our tests that allocating the extra page array
> can be more expensive than the CPU iteration when direct reclaim kicks in
> because RAM is low[1]. IMHO, could we keep the API here so that users can
> choose between the array and the list according to their scenario?
I'd prefer if we avoid reintroducing this interface.
How many pages are you intending to allocate? Wouldn't a smaller array
on the stack be sufficient?
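For illustration, a rough sketch of what I mean (assuming the array-based
alloc_pages_bulk(gfp, nr_pages, page_array) that remains after c8b979530f27,
which only fills NULL slots and returns the number of pages it placed in the
array; the helper name, the batch size and the reuse of LOW_ORDER_GFP from
system_heap.c are made up):

static int example_bulk_alloc_order0(unsigned long nr_pages,
				     struct list_head *pages)
{
	struct page *batch[32];	/* small, fixed batch kept on the stack */

	while (nr_pages) {
		unsigned long want = min_t(unsigned long, nr_pages,
					   ARRAY_SIZE(batch));
		unsigned long got, i;

		memset(batch, 0, sizeof(batch));	/* bulk alloc only fills NULL slots */
		got = alloc_pages_bulk(LOW_ORDER_GFP, want, batch);
		if (!got)
			return -ENOMEM;	/* caller frees whatever it already has */

		for (i = 0; i < got; i++)
			list_add_tail(&batch[i]->lru, pages);

		nr_pages -= got;
	}
	return 0;
}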
--
Cheers
David / dhildenb
On Wed, Oct 15, 2025 at 09:12:07AM +0800, Zhaoyang Huang wrote:
> > Could be that we need to make this behavior conditional, but somebody would need to come up with some really good arguments to justify the complexity.
> ok, should we use CONFIG_DMA_BUF_BULK_ALLOCATION or a variable
> controlled by a sysfs interface?
No. Explain what you're trying to solve, because you haven't yet.
On 14.10.25 17:10, Petr Tesarik wrote:
> On Tue, 14 Oct 2025 15:04:14 +0200
> Christian König <christian.koenig(a)amd.com> wrote:
>
>> On 14.10.25 14:44, Zhaoyang Huang wrote:
>>> On Tue, Oct 14, 2025 at 7:59 PM Christian König
>>> <christian.koenig(a)amd.com> wrote:
>>>>
>>>> On 14.10.25 10:32, zhaoyang.huang wrote:
>>>>> From: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
>>>>>
>>>>> A single dma-buf allocation can be dozens of MB or more, which
>>>>> introduces a loop allocating several thousand order-0 pages.
>>>>> Furthermore, concurrent allocations can push the dma-buf allocation into
>>>>> direct reclaim during that loop. This commit would like to eliminate both
>>>>> effects by introducing alloc_pages_bulk_list in dma-buf's order-0
>>>>> allocation. The patch proved conditionally helpful for an 18MB
>>>>> allocation, decreasing the time from 24604us to 6555us, and does no harm
>>>>> when bulk allocation can't be done (it falls back to single-page
>>>>> allocation)
>>>>
>>>> Well that sounds like an absolutely horrible idea.
>>>>
>>>> See, the handling of allocating only specific orders is *exactly* there to avoid the behavior of bulk allocation.
>>>>
>>>> What you seem to do with this patch here is to add, on top of the behavior that avoids allocating large chunks from the buddy, the behavior of allocating large chunks from the buddy because that is faster.
>>> emm, this patch doesn't change the order-8 and order-4 allocation
>>> behaviour; it just replaces the loop of order-0 allocations with a single
>>> bulk allocation in the fallback path. What is your concern about
>>> this?
>>
>> As far as I know the bulk allocation favors splitting large pages into smaller ones instead of allocating smaller pages first. That's where the performance benefit comes from.
>>
>> But that is exactly what we try to avoid here by allocating only certain orders of pages.
>
> This is a good question, actually. Yes, bulk alloc will split large
> pages if there are insufficient pages on the pcp free list. But is
> dma-buf indeed trying to avoid it, or is it merely using an inefficient
> API? And does it need the extra speed? Even if it leads to increased
> fragmentation?
DMA-buf-heaps is, entirely intentionally, trying rather hard to avoid splitting large pages. That's why you have the distinction between HIGH_ORDER_GFP and LOW_ORDER_GFP as well.
Keep in mind that this is mostly used on embedded systems with only small amounts of memory.
Not entering direct reclaim and instead preferring to split large pages until they are used up is an absolute no-go for most use cases as far as I can see.
Could be that we need to make this behavior conditional, but somebody would need to come up with some really good arguments to justify the complexity.
Regards,
Christian.
>
> Petr T
On Tue, Oct 14, 2025 at 04:32:28PM +0800, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
>
> This series of patches would like to introduce alloc_pages_bulk_list in
> dma-buf, which requires bringing the API back for page allocation.
Start with the problem you're trying to solve.
On 14.10.25 14:44, Zhaoyang Huang wrote:
> On Tue, Oct 14, 2025 at 7:59 PM Christian König
> <christian.koenig(a)amd.com> wrote:
>>
>> On 14.10.25 10:32, zhaoyang.huang wrote:
>>> From: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
>>>
>>> A single dma-buf allocation can be dozens of MB or more, which
>>> introduces a loop allocating several thousand order-0 pages.
>>> Furthermore, concurrent allocations can push the dma-buf allocation into
>>> direct reclaim during that loop. This commit would like to eliminate both
>>> effects by introducing alloc_pages_bulk_list in dma-buf's order-0
>>> allocation. The patch proved conditionally helpful for an 18MB
>>> allocation, decreasing the time from 24604us to 6555us, and does no harm
>>> when bulk allocation can't be done (it falls back to single-page
>>> allocation)
>>
>> Well that sounds like an absolutely horrible idea.
>>
>> See, the handling of allocating only specific orders is *exactly* there to avoid the behavior of bulk allocation.
>>
>> What you seem to do with this patch here is to add, on top of the behavior that avoids allocating large chunks from the buddy, the behavior of allocating large chunks from the buddy because that is faster.
> emm, this patch doesn't change the order-8 and order-4 allocation
> behaviour; it just replaces the loop of order-0 allocations with a single
> bulk allocation in the fallback path. What is your concern about
> this?
As far as I know the bulk allocation favors splitting large pages into smaller ones instead of allocating smaller pages first. That's where the performance benefit comes from.
But that is exactly what we try to avoid here by allocating only certain orders of pages.
Regards,
Christian.
>>
>> So this change here doesn't look like it will fly very high. Please explain what you're actually trying to do, just optimize allocation time?
>>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
>>> ---
>>> drivers/dma-buf/heaps/system_heap.c | 36 +++++++++++++++++++----------
>>> 1 file changed, 24 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
>>> index bbe7881f1360..71b028c63bd8 100644
>>> --- a/drivers/dma-buf/heaps/system_heap.c
>>> +++ b/drivers/dma-buf/heaps/system_heap.c
>>> @@ -300,8 +300,8 @@ static const struct dma_buf_ops system_heap_buf_ops = {
>>> .release = system_heap_dma_buf_release,
>>> };
>>>
>>> -static struct page *alloc_largest_available(unsigned long size,
>>> - unsigned int max_order)
>>> +static void alloc_largest_available(unsigned long size,
>>> + unsigned int max_order, unsigned int *num_pages, struct list_head *list)
>>> {
>>> struct page *page;
>>> int i;
>>> @@ -312,12 +312,19 @@ static struct page *alloc_largest_available(unsigned long size,
>>> if (max_order < orders[i])
>>> continue;
>>>
>>> - page = alloc_pages(order_flags[i], orders[i]);
>>> - if (!page)
>>> + if (orders[i]) {
>>> + page = alloc_pages(order_flags[i], orders[i]);
>>> + if (page) {
>>> + list_add(&page->lru, list);
>>> + *num_pages = 1;
>>> + }
>>> + } else
>>> + *num_pages = alloc_pages_bulk_list(LOW_ORDER_GFP, size / PAGE_SIZE, list);
>>> +
>>> + if (list_empty(list))
>>> continue;
>>> - return page;
>>> + return;
>>> }
>>> - return NULL;
>>> }
>>>
>>> static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>>> @@ -335,6 +342,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>>> struct list_head pages;
>>> struct page *page, *tmp_page;
>>> int i, ret = -ENOMEM;
>>> + unsigned int num_pages;
>>> + LIST_HEAD(head);
>>>
>>> buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
>>> if (!buffer)
>>> @@ -348,6 +357,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>>> INIT_LIST_HEAD(&pages);
>>> i = 0;
>>> while (size_remaining > 0) {
>>> + num_pages = 0;
>>> + INIT_LIST_HEAD(&head);
>>> /*
>>> * Avoid trying to allocate memory if the process
>>> * has been killed by SIGKILL
>>> @@ -357,14 +368,15 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>>> goto free_buffer;
>>> }
>>>
>>> - page = alloc_largest_available(size_remaining, max_order);
>>> - if (!page)
>>> + alloc_largest_available(size_remaining, max_order, &num_pages, &head);
>>> + if (!num_pages)
>>> goto free_buffer;
>>>
>>> - list_add_tail(&page->lru, &pages);
>>> - size_remaining -= page_size(page);
>>> - max_order = compound_order(page);
>>> - i++;
>>> + list_splice_tail(&head, &pages);
>>> + max_order = folio_order(lru_to_folio(&head));
>>> + size_remaining -= PAGE_SIZE * (num_pages << max_order);
>>> + i += num_pages;
>>> +
>>> }
>>>
>>> table = &buffer->sg_table;
>>
On 14.10.25 10:32, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
>
> A single dma-buf allocation can be dozens of MB or more, which
> introduces a loop allocating several thousand order-0 pages.
> Furthermore, concurrent allocations can push the dma-buf allocation into
> direct reclaim during that loop. This commit would like to eliminate both
> effects by introducing alloc_pages_bulk_list in dma-buf's order-0
> allocation. The patch proved conditionally helpful for an 18MB
> allocation, decreasing the time from 24604us to 6555us, and does no harm
> when bulk allocation can't be done (it falls back to single-page
> allocation)
Well that sounds like an absolutely horrible idea.
See, the handling of allocating only specific orders is *exactly* there to avoid the behavior of bulk allocation.
What you seem to do with this patch here is to add, on top of the behavior that avoids allocating large chunks from the buddy, the behavior of allocating large chunks from the buddy because that is faster.
So this change here doesn't look like it will fly very high. Please explain what you're actually trying to do, just optimize allocation time?
Regards,
Christian.
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
> ---
> drivers/dma-buf/heaps/system_heap.c | 36 +++++++++++++++++++----------
> 1 file changed, 24 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> index bbe7881f1360..71b028c63bd8 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -300,8 +300,8 @@ static const struct dma_buf_ops system_heap_buf_ops = {
> .release = system_heap_dma_buf_release,
> };
>
> -static struct page *alloc_largest_available(unsigned long size,
> - unsigned int max_order)
> +static void alloc_largest_available(unsigned long size,
> + unsigned int max_order, unsigned int *num_pages, struct list_head *list)
> {
> struct page *page;
> int i;
> @@ -312,12 +312,19 @@ static struct page *alloc_largest_available(unsigned long size,
> if (max_order < orders[i])
> continue;
>
> - page = alloc_pages(order_flags[i], orders[i]);
> - if (!page)
> + if (orders[i]) {
> + page = alloc_pages(order_flags[i], orders[i]);
> + if (page) {
> + list_add(&page->lru, list);
> + *num_pages = 1;
> + }
> + } else
> + *num_pages = alloc_pages_bulk_list(LOW_ORDER_GFP, size / PAGE_SIZE, list);
> +
> + if (list_empty(list))
> continue;
> - return page;
> + return;
> }
> - return NULL;
> }
>
> static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
> @@ -335,6 +342,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
> struct list_head pages;
> struct page *page, *tmp_page;
> int i, ret = -ENOMEM;
> + unsigned int num_pages;
> + LIST_HEAD(head);
>
> buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
> if (!buffer)
> @@ -348,6 +357,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
> INIT_LIST_HEAD(&pages);
> i = 0;
> while (size_remaining > 0) {
> + num_pages = 0;
> + INIT_LIST_HEAD(&head);
> /*
> * Avoid trying to allocate memory if the process
> * has been killed by SIGKILL
> @@ -357,14 +368,15 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
> goto free_buffer;
> }
>
> - page = alloc_largest_available(size_remaining, max_order);
> - if (!page)
> + alloc_largest_available(size_remaining, max_order, &num_pages, &head);
> + if (!num_pages)
> goto free_buffer;
>
> - list_add_tail(&page->lru, &pages);
> - size_remaining -= page_size(page);
> - max_order = compound_order(page);
> - i++;
> + list_splice_tail(&head, &pages);
> + max_order = folio_order(lru_to_folio(&head));
> + size_remaining -= PAGE_SIZE * (num_pages << max_order);
> + i += num_pages;
> +
> }
>
> table = &buffer->sg_table;