This series is the start of adding full DMABUF support to iommufd.
Currently it is limited to working only with VFIO's DMABUF exporter.
It sits on top of Leon's series to add a DMABUF exporter to VFIO:
https://lore.kernel.org/all/cover.1760368250.git.leon@kernel.org/
The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF FDs, but
otherwise works the same as it does today for a memfd. The user can select
a slice of the FD to map into the IOAS, and if the underlying alignment
requirements are met it will be placed in the iommu_domain.
Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR
memory from VFIO to an iommu_domain controlled by iommufd. This is used
for PCI Peer to Peer support in VMs, and is the last feature that the VFIO
type 1 container has that iommufd couldn't do.
The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime
control and is a use-after-free security problem.
Instead iommufd relies on revocable DMABUFs. Whenever VFIO thinks there
should be no access to the MMIO it can shoot down the mapping in iommufd,
which will unmap it from the iommu_domain. There is no automatic remap;
this is a safety protocol so the kernel doesn't get stuck. Userspace is
expected to know when it is doing something that will revoke the DMABUF and
unmap/remap it around that activity. E.g. when QEMU goes to issue FLR it
should do the unmap/remap against iommufd.
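
For illustration, a rough QEMU-style sketch of that unmap/reset/remap
sequence. The struct and flag names follow the iommufd uAPI as I recall it;
verify against <linux/iommufd.h> before relying on this:

#include <linux/iommufd.h>
#include <linux/vfio.h>
#include <sys/ioctl.h>

static int remap_bar_around_reset(int iommufd, int vfio_dev_fd, __u32 ioas_id,
				  int bar_dmabuf_fd, __u64 iova, __u64 length)
{
	struct iommu_ioas_unmap unmap = {
		.size = sizeof(unmap),
		.ioas_id = ioas_id,
		.iova = iova,
		.length = length,
	};
	struct iommu_ioas_map_file map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_READABLE |
			 IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = ioas_id,
		.fd = bar_dmabuf_fd,
		.start = 0,			/* slice offset into the DMABUF */
		.length = length,
		.iova = iova,
	};

	if (ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap))
		return -1;
	if (ioctl(vfio_dev_fd, VFIO_DEVICE_RESET))	/* revokes the DMABUF */
		return -1;
	/* No automatic remap: userspace re-establishes the mapping itself. */
	return ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map);
}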
Since DMABUF is missing some key general features for this use case it
relies on a "private interconnect" between VFIO and iommufd via the
vfio_pci_dma_buf_iommufd_map() call.
The call confirms the DMABUF has revoke semantics and delivers a phys_addr
for the memory suitable for use with iommu_map().
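
A rough sketch of what the iommufd side does with that result; the
vfio_pci_dma_buf_iommufd_map() prototype below is assumed for illustration
and is not the one from the series:

#include <linux/dma-buf.h>
#include <linux/iommu.h>

/* Assumed prototype, for illustration only. */
int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attach,
				 phys_addr_t *phys, size_t *len);

static int sketch_map_vfio_dmabuf(struct iommu_domain *domain,
				  struct dma_buf_attachment *attach,
				  unsigned long iova)
{
	phys_addr_t phys;
	size_t len;
	int rc;

	/* Fails unless the exporter is a revocable VFIO DMABUF. */
	rc = vfio_pci_dma_buf_iommufd_map(attach, &phys, &len);
	if (rc)
		return rc;

	/* MMIO is installed in the domain with the IOMMU_MMIO attribute. */
	return iommu_map(domain, iova, phys, len,
			 IOMMU_READ | IOMMU_WRITE | IOMMU_MMIO, GFP_KERNEL);
}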
Medium term there is a desire to expand the supported DMABUFs to include
GPU drivers, to support DPDK/SPDK-type use cases, so future series will work
to add a general concept of revoke and a general negotiation of
interconnect to remove vfio_pci_dma_buf_iommufd_map().
I also plan another series to modify iommufd's vfio_compat to
transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI
of type1.
The latest series for interconnect negotiation to exchange a phys_addr is:
https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com
And the discussion for design of revoke is here:
https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/
This is on github: https://github.com/jgunthorpe/linux/commits/iommufd_dmabuf
The branch has various modifications to Leon's series I've suggested.
Jason Gunthorpe (8):
iommufd: Add DMABUF to iopt_pages
iommufd: Do not map/unmap revoked DMABUFs
iommufd: Allow a DMABUF to be revoked
iommufd: Allow MMIO pages in a batch
iommufd: Have pfn_reader process DMABUF iopt_pages
iommufd: Have iopt_map_file_pages convert the fd to a file
iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
iommufd/selftest: Add some tests for the dmabuf flow
drivers/iommu/iommufd/io_pagetable.c | 74 +++-
drivers/iommu/iommufd/io_pagetable.h | 53 ++-
drivers/iommu/iommufd/ioas.c | 8 +-
drivers/iommu/iommufd/iommufd_private.h | 13 +-
drivers/iommu/iommufd/iommufd_test.h | 10 +
drivers/iommu/iommufd/main.c | 10 +
drivers/iommu/iommufd/pages.c | 407 ++++++++++++++++--
drivers/iommu/iommufd/selftest.c | 142 ++++++
tools/testing/selftests/iommu/iommufd.c | 43 ++
tools/testing/selftests/iommu/iommufd_utils.h | 44 ++
10 files changed, 741 insertions(+), 63 deletions(-)
base-commit: fc882154e421f82677925d33577226e776bb07a4
--
2.43.0
On Sun, Oct 26, 2025 at 09:44:14PM -0700, Vivek Kasireddy wrote:
> +/**
> + * dma_buf_match_interconnects - determine if there is a specific interconnect
> + * that is supported by both exporter and importer.
> + * @attach: [in] attachment to populate ic_match field
> + * @exp: [in] array of interconnects supported by exporter
> + * @exp_ics: [in] number of interconnects supported by exporter
> + * @imp: [in] array of interconnects supported by importer
> + * @imp_ics: [in] number of interconnects supported by importer
> + *
> + * This helper function iterates through the list interconnects supported by
> + * both exporter and importer to find a match. A successful match means that
> + * a common interconnect type is supported by both parties and the exporter's
> + * match_interconnect() callback also confirms that the importer is compatible
> + * with the exporter for that interconnect type.
Document which of the exporter/importer is supposed to call this
> + *
> + * If a match is found, the attach->ic_match field is populated with a copy
> + * of the exporter's match data.
> + * Return: true if a match is found, false otherwise.
> + */
> +bool dma_buf_match_interconnects(struct dma_buf_attachment *attach,
> + const struct dma_buf_interconnect_match *exp,
> + unsigned int exp_ics,
> + const struct dma_buf_interconnect_match *imp,
> + unsigned int imp_ics)
> +{
> + const struct dma_buf_interconnect_ops *ic_ops;
> + struct dma_buf_interconnect_match *ic_match;
> + struct dma_buf *dmabuf = attach->dmabuf;
> + unsigned int i, j;
> +
> + if (!exp || !imp)
> + return false;
> +
> + if (!attach->allow_ic)
> + return false;
Seems redundant with this check for ic_ops == NULL:
> + ic_ops = dmabuf->ops->interconnect_ops;
> + if (!ic_ops || !ic_ops->match_interconnect)
> + return false;
This seems like too much of a maze to me..
I think you should structure it like this. First declare an interconnect:
struct dma_buf_interconnect iov_interconnect = {
	.name = "IOV interconnect",
	.match = ..
};
Then the exporters "subclass"
struct dma_buf_interconnect_ops vfio_iov_interconnect = {
	.interconnect = &iov_interconnect,
	.map = vfio_map,
};
I guess no container_of technique..
Then in VFIO's attach trigger the new code:
const struct dma_buf_interconnect_match vfio_exp_ics[] = {
	{ &vfio_iov_interconnect },
};

dma_buf_match_interconnects(attach, vfio_exp_ics, ARRAY_SIZE(vfio_exp_ics));
Which will callback to the importer:
static const struct dma_buf_attach_ops xe_dma_buf_attach_ops = {
	.get_importer_interconnects = ..
};
dma_buf_match_interconnects() would call
aops->get_importer_interconnects and match first on .interconnect, then
call the interconnect->match function with the exp/imp match structs if
not NULL.
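
Roughly, the core helper could then look like this. All of the ops and
fields here are the ones proposed above, so this is only a sketch of the
flow, not an existing API:

bool dma_buf_match_interconnects(struct dma_buf_attachment *attach,
				 const struct dma_buf_interconnect_match *exp,
				 unsigned int exp_ics)
{
	const struct dma_buf_interconnect_match *imp;
	unsigned int imp_ics, i, j;

	if (!attach->importer_ops ||
	    !attach->importer_ops->get_importer_interconnects)
		return false;

	imp = attach->importer_ops->get_importer_interconnects(attach, &imp_ics);
	if (!imp)
		return false;

	for (i = 0; i != exp_ics; i++) {
		for (j = 0; j != imp_ics; j++) {
			const struct dma_buf_interconnect *ic = exp[i].ic;

			if (ic != imp[j].ic)
				continue;
			if (ic->match && !ic->match(attach, &exp[i], &imp[j]))
				continue;
			/* copy the exporter's match data for later map calls */
			attach->ic_match = exp[i];
			return true;
		}
	}
	return false;
}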
> +struct dma_buf_interconnect_match {
> + const struct dma_buf_interconnect *type;
> + struct device *dev;
> + unsigned int bar;
> +};
This should be more general; dev and bar are unique to the IOV
importer. Maybe just simple:

struct dma_buf_interconnect_match {
	struct dma_buf_interconnect *ic; // no need for type
	const struct dma_buf_interconnect_ops *exporter_ic_ops;
	u64 match_data[2]; // dev and bar are IOV specific, generalize
};
Then some helper:

const struct dma_buf_interconnect_match supports_ics[] = {
	IOV_INTERCONNECT(&vfio_iov_interconnect, dev, bar),
};
And it would be nice if interconnect-aware drivers could more easily
interwork with non-interconnect importers.
So I'd add an exporter type of 'p2p dma mapped scatterlist' that just
matches the legacy importer.
Jason
On Sun, Oct 26, 2025 at 09:44:13PM -0700, Vivek Kasireddy wrote:
> For the map operation, the dma-buf core will create an xarray but
> the exporter needs to populate it with the interconnect specific
> addresses. And, similarly for unmap, the exporter is expected to
> cleanup the individual entries of the xarray.
I don't think we should limit this to xarrays, nor do I think it is a
great data structure for what is usually needed here..
I just posted the patches showing what iommufd needs, and it wants
something like
struct mapping {
struct p2p_provider *provider;
size_t nelms;
struct phys_vec *phys;
};
Which is not something that makes sense as an xarray.
I think the interconnect should have its own functions for map/unmap,
i.e. instead of trying to have them in a common
dma_buf_interconnect_ops, do something like
struct dma_buf_interconnect_ops {
const char *name;
bool (*supports_interconnects)(struct dma_buf_attachment *attach,
const struct dma_buf_interconnect_match *,
unsigned int num_ics);
};
struct dma_buf_iov_interconnect_ops {
struct dma_buf_interconnect_ops ic_ops;
struct xx *(*map)(struct dma_buf_attachment *attach,
unsigned int *bar_number,
size_t *nelms);
// No unmap for iov
};
static inline struct xx *dma_buf_iov_map(struct dma_buf_attachment *attach,
					 unsigned int *bar_number,
					 size_t *nelms)
{
	return container_of(attach->ic_ops, struct dma_buf_iov_interconnect_ops,
			    ic_ops)->map(attach, bar_number, nelms);
}
> +/**
> + * dma_buf_attachment_is_dynamic - check if the importer can handle move_notify.
> + * @attach: the attachment to check
> + *
> + * Returns true if a DMA-buf importer has indicated that it can handle dmabuf
> + * location changes through the move_notify callback.
> + */
> +static inline bool
> +dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
> +{
> + return !!attach->importer_ops;
> +}
Why is this in this patch?
I also think this patch should be second in the series; it makes more
sense to figure out how to attach with an interconnect and then show how
to map/unmap with that interconnect.
Like I'm not sure why this introduces allow_ic?
Jason
On Sun, Oct 26, 2025 at 03:55:04PM +0800, Shuai Xue wrote:
>
>
> > On 2025/10/22 20:50, Jason Gunthorpe wrote:
> > On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro(a)nvidia.com>
> > >
> > > Add support for exporting PCI device MMIO regions through dma-buf,
> > > enabling safe sharing of non-struct page memory with controlled
> > > lifetime management. This allows RDMA and other subsystems to import
> > > dma-buf FDs and build them into memory regions for PCI P2P operations.
> > >
> > > The implementation provides a revocable attachment mechanism using
> > > dma-buf move operations. MMIO regions are normally pinned as BARs
> > > don't change physical addresses, but access is revoked when the VFIO
> > > device is closed or a PCI reset is issued. This ensures kernel
> > > self-defense against potentially hostile userspace.
> >
> > Let's enhance this:
> >
> > Currently VFIO can take MMIO regions from the device's BAR and map
> > them into a PFNMAP VMA with special PTEs. This mapping type ensures
> > the memory cannot be used with things like pin_user_pages(), hmm, and
> > so on. In practice only the user process CPU and KVM can safely make
> > use of these VMA. When VFIO shuts down these VMAs are cleaned by
> > unmap_mapping_range() to prevent any UAF of the MMIO beyond driver
> > unbind.
> >
> > However, VFIO type 1 has an insecure behavior where it uses
> > follow_pfnmap_*() to fish a MMIO PFN out of a VMA and program it back
> > into the IOMMU. This has a long history of enabling P2P DMA inside
> > VMs, but has serious lifetime problems by allowing a UAF of the MMIO
> > after the VFIO driver has been unbound.
>
> Hi, Jason,
>
> Can you elaborate on this more?
>
> From my understanding of the VFIO type 1 implementation:
>
> - When a device is opened through VFIO type 1, it increments the
> device->refcount
> - During unbind, the driver waits for this refcount to drop to zero via
> wait_for_completion(&device->comp)
> - This should prevent the unbind() from completing while the device is
> still in use
>
> Given this refcount mechanism, I do not figure out how the UAF can
> occur.
A second vfio device can be opened and then use follow_pfnmap_*() to
read the first vfio device's PTEs. There is no relationship between
the first and second VFIO devices, so once the first is unbound it
sails through the device->comp while the second device retains the PFN
in its type1 iommu_domain.
Jason
On 10/20/25 13:18, Matthew Brost wrote:
> On Mon, Oct 20, 2025 at 10:16:23AM +0200, Philipp Stanner wrote:
>> On Fri, 2025-10-17 at 14:28 -0700, Matthew Brost wrote:
>>> On Fri, Oct 17, 2025 at 11:31:47AM +0200, Philipp Stanner wrote:
>>>> It seems that DMA_FENCE_FLAG_SEQNO64_BIT has no real effects anymore,
>>>> since seqno is a u64 everywhere.
>>>>
>>>> Remove the unneeded flag.
>>>>
>>>> Signed-off-by: Philipp Stanner <phasta(a)kernel.org>
>>>> ---
>>>> Seems to me that this flag doesn't really do anything anymore?
>>>>
>>>> I *suspect* that it could be that some drivers pass a u32 to
>>>> dma_fence_init()? I guess they could be ported, couldn't they.
>>>>
>>>
>>> Xe uses 32-bit hardware fence sequence numbers—see [1] and [2]. We could
>>> switch to 64-bit hardware fence sequence numbers, but that would require
>>> changes on the driver side. If you sent this to our CI, I’m fairly
>>> certain we’d see a bunch of failures. I suspect this would also break
>>> several other drivers.
>>
>> What exactly breaks? Help me out here; if you pass a u32 for a u64,
>
> Seqno wraps.
>
>> doesn't the C standard guarantee that the higher, unused 32 bits will
>> be 0?
>
> return (int)(lower_32_bits(f1) - lower_32_bits(f2)) > 0;
>
> Look at the above logic.
>
> f1 = 0x0;
> f2 = 0xffffffff; /* -1 */
>
> The above statement will correctly return true.
>
> Compared to the below statement which returns false.
>
> return f1 > f2;
>
> We test seqno wraps in Xe by setting our initial seqno to -127; again, if
> you send this patch to our CI, any test which sends more than 127 jobs on a
> queue will likely fail.
Yeah, exactly, that's why this flag is needed for quite a lot of things.
The question is: what is missing in the documentation to make that clear?
Regards,
Christian.
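
For reference, a standalone illustration of the wrap case discussed above
(plain C, nothing driver-specific):

#include <stdint.h>
#include <stdio.h>

/* Same comparison __dma_fence_is_later() uses for 32-bit seqnos. */
static int is_later_32(uint64_t f1, uint64_t f2)
{
	return (int32_t)((uint32_t)f1 - (uint32_t)f2) > 0;
}

int main(void)
{
	uint64_t f1 = 0x0, f2 = 0xffffffff;	/* f1 issued after f2 wrapped */

	printf("32-bit aware: %d\n", is_later_32(f1, f2));	/* 1: f1 is later */
	printf("plain compare: %d\n", f1 > f2);			/* 0: wrong */
	return 0;
}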
>
> Matt
>
>>
>> Because the only thing the flag still does is do this lower_32 check in
>> fence_is_later.
>>
>> P.
>>
>>>
>>> As I mentioned, all Xe-supported platforms could be updated since their
>>> rings support 64-bit store instructions. However, I suspect that very
>>> old i915 platforms don’t support such instructions in the ring. I agree
>>> this is a legacy issue, and we should probably use 64-bit sequence
>>> numbers in Xe. But again, platforms and drivers that are decades old
>>> might break as a result.
>>>
>>> Matt
>>>
>>> [1] https://elixir.bootlin.com/linux/v6.17.1/source/drivers/gpu/drm/xe/xe_hw_fe…
>>> [2] https://elixir.bootlin.com/linux/v6.17.1/source/drivers/gpu/drm/xe/xe_hw_fe…
>>>
>>>> P.
>>>> ---
>>>> drivers/dma-buf/dma-fence.c | 3 +--
>>>> include/linux/dma-fence.h | 10 +---------
>>>> 2 files changed, 2 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
>>>> index 3f78c56b58dc..24794c027813 100644
>>>> --- a/drivers/dma-buf/dma-fence.c
>>>> +++ b/drivers/dma-buf/dma-fence.c
>>>> @@ -1078,8 +1078,7 @@ void
>>>> dma_fence_init64(struct dma_fence *fence, const struct dma_fence_ops *ops,
>>>> spinlock_t *lock, u64 context, u64 seqno)
>>>> {
>>>> - __dma_fence_init(fence, ops, lock, context, seqno,
>>>> - BIT(DMA_FENCE_FLAG_SEQNO64_BIT));
>>>> + __dma_fence_init(fence, ops, lock, context, seqno, 0);
>>>> }
>>>> EXPORT_SYMBOL(dma_fence_init64);
>>>>
>>>> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
>>>> index 64639e104110..4eca2db28625 100644
>>>> --- a/include/linux/dma-fence.h
>>>> +++ b/include/linux/dma-fence.h
>>>> @@ -98,7 +98,6 @@ struct dma_fence {
>>>> };
>>>>
>>>> enum dma_fence_flag_bits {
>>>> - DMA_FENCE_FLAG_SEQNO64_BIT,
>>>> DMA_FENCE_FLAG_SIGNALED_BIT,
>>>> DMA_FENCE_FLAG_TIMESTAMP_BIT,
>>>> DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
>>>> @@ -470,14 +469,7 @@ dma_fence_is_signaled(struct dma_fence *fence)
>>>> */
>>>> static inline bool __dma_fence_is_later(struct dma_fence *fence, u64 f1, u64 f2)
>>>> {
>>>> - /* This is for backward compatibility with drivers which can only handle
>>>> - * 32bit sequence numbers. Use a 64bit compare when the driver says to
>>>> - * do so.
>>>> - */
>>>> - if (test_bit(DMA_FENCE_FLAG_SEQNO64_BIT, &fence->flags))
>>>> - return f1 > f2;
>>>> -
>>>> - return (int)(lower_32_bits(f1) - lower_32_bits(f2)) > 0;
>>>> + return f1 > f2;
>>>> }
>>>>
>>>> /**
>>>> --
>>>> 2.49.0
>>>>
>>
From: Matthew Auld <matthew.auld(a)intel.com>
[ Upstream commit edb1745fc618ba8ef63a45ce3ae60de1bdf29231 ]
Since the dma-resv is shared we don't need to reserve a fence slot and
add the fence twice, plus there is no need to loop through the dependencies.
Signed-off-by: Matthew Auld <matthew.auld(a)intel.com>
Cc: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Cc: Matthew Brost <matthew.brost(a)intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt(a)intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Link: https://lore.kernel.org/r/20250829164715.720735-2-matthew.auld@intel.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
Explanation
- What it fixes
- Removes redundant dma-resv operations when a backup BO shares the
same reservation object as the original BO, preventing the same
fence from being reserved/added twice to the same `dma_resv`.
- Avoids scanning the same dependency set twice when source and
destination BOs share the same `dma_resv`.
- Why the change is correct
- The backup object is created to share the parent’s reservation
object, so a single reserve/add is sufficient:
- The backup BO is initialized with the parent’s resv:
`drivers/gpu/drm/xe/xe_bo.c:1309` (`xe_bo_init_locked(...,
bo->ttm.base.resv, ...)`), ensuring `bo->ttm.base.resv ==
backup->ttm.base.resv`.
- The patch adds an explicit invariant check to document and enforce
this: `drivers/gpu/drm/xe/xe_bo.c:1225` (`xe_assert(xe,
bo->ttm.base.resv == backup->ttm.base.resv)`).
- With shared `dma_resv`, adding the same fence twice is at best
redundant (wasting fence slots and memory) and at worst error-prone.
Reserving fence slots only once and adding the fence once is the
correct behavior.
- Specific code changes and effects
- Evict path (GPU migration copy case):
- Before: reserves and adds fence on both `bo->ttm.base.resv` and
`backup->ttm.base.resv`.
- After: reserves and adds exactly once, guarded by the shared-resv
assertion.
- See single reserve and add: `drivers/gpu/drm/xe/xe_bo.c:1226`
(reserve) and `drivers/gpu/drm/xe/xe_bo.c:1237` (add fence). This
is the core fix; the removed second reserve/add on the backup is
the redundant part eliminated.
- Restore path (migration copy back):
- Same simplification: reserve once, add once on the shared
`dma_resv`.
- See single reserve and add: `drivers/gpu/drm/xe/xe_bo.c:1375`
(reserve) and `drivers/gpu/drm/xe/xe_bo.c:1387` (add fence).
- Dependency handling in migrate:
- Before: added deps for both src and dst based only on `src_bo !=
dst_bo`.
- After: only add dst deps if the resv objects differ, avoiding
double-walking the same `dma_resv`.
- See updated condition: `drivers/gpu/drm/xe/xe_migrate.c:932`
(`src_bo->ttm.base.resv != dst_bo->ttm.base.resv`).
- User-visible impact without the patch
- Duplicate `dma_resv_add_fence()` calls on the same reservation
object can:
- Consume extra shared-fence slots and memory.
- Inflate dependency lists, causing unnecessary scheduler waits and
overhead.
- Increase failure likelihood of `dma_resv_reserve_fences()` under
memory pressure.
- These paths are exercised during suspend/resume flows of pinned VRAM
BOs (evict/restore), so reliability and performance in power
transitions can be affected.
- Scope and risk
- Small, focused changes localized to the Intel Xe driver
migration/evict/restore paths:
- Files: `drivers/gpu/drm/xe/xe_bo.c`,
`drivers/gpu/drm/xe/xe_migrate.c`.
- No API changes or architectural refactors; logic strictly reduces
redundant operations.
- The `xe_assert` acts as a safety net to catch unexpected non-shared
`resv` usage; normal runtime behavior is unchanged when the
invariant holds.
- The CPU copy fallback paths are untouched.
- Stable backport considerations
- This is a clear correctness and robustness fix, not a feature.
- Low regression risk if the stable branch also creates the backup BO
with the parent’s `dma_resv` (as shown by the use of
`xe_bo_init_locked(..., bo->ttm.base.resv, ...)` in
`drivers/gpu/drm/xe/xe_bo.c:1309`).
- If a stable branch diverges and the backup BO does not share the
resv, this patch would need adjustment (i.e., keep dual reserve/add
in that case). The added `xe_assert` helps surface such mismatches
during testing.
Conclusion: This commit fixes a real bug (duplicate fence reserve/add
and duplicate dependency scanning on a shared `dma_resv`) with a
minimal, well-scoped change. It aligns with stable rules (important
bugfix, low risk, contained), so it should be backported.
drivers/gpu/drm/xe/xe_bo.c | 13 +------------
drivers/gpu/drm/xe/xe_migrate.c | 2 +-
2 files changed, 2 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index d07e23eb1a54d..5a61441d68af5 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -1242,14 +1242,11 @@ int xe_bo_evict_pinned(struct xe_bo *bo)
else
migrate = mem_type_to_migrate(xe, bo->ttm.resource->mem_type);
+ xe_assert(xe, bo->ttm.base.resv == backup->ttm.base.resv);
ret = dma_resv_reserve_fences(bo->ttm.base.resv, 1);
if (ret)
goto out_backup;
- ret = dma_resv_reserve_fences(backup->ttm.base.resv, 1);
- if (ret)
- goto out_backup;
-
fence = xe_migrate_copy(migrate, bo, backup, bo->ttm.resource,
backup->ttm.resource, false);
if (IS_ERR(fence)) {
@@ -1259,8 +1256,6 @@ int xe_bo_evict_pinned(struct xe_bo *bo)
dma_resv_add_fence(bo->ttm.base.resv, fence,
DMA_RESV_USAGE_KERNEL);
- dma_resv_add_fence(backup->ttm.base.resv, fence,
- DMA_RESV_USAGE_KERNEL);
dma_fence_put(fence);
} else {
ret = xe_bo_vmap(backup);
@@ -1338,10 +1333,6 @@ int xe_bo_restore_pinned(struct xe_bo *bo)
if (ret)
goto out_unlock_bo;
- ret = dma_resv_reserve_fences(backup->ttm.base.resv, 1);
- if (ret)
- goto out_unlock_bo;
-
fence = xe_migrate_copy(migrate, backup, bo,
backup->ttm.resource, bo->ttm.resource,
false);
@@ -1352,8 +1343,6 @@ int xe_bo_restore_pinned(struct xe_bo *bo)
dma_resv_add_fence(bo->ttm.base.resv, fence,
DMA_RESV_USAGE_KERNEL);
- dma_resv_add_fence(backup->ttm.base.resv, fence,
- DMA_RESV_USAGE_KERNEL);
dma_fence_put(fence);
} else {
ret = xe_bo_vmap(backup);
diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 2a627ed64b8f8..ba9b8590eccb2 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -901,7 +901,7 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
if (!fence) {
err = xe_sched_job_add_deps(job, src_bo->ttm.base.resv,
DMA_RESV_USAGE_BOOKKEEP);
- if (!err && src_bo != dst_bo)
+ if (!err && src_bo->ttm.base.resv != dst_bo->ttm.base.resv)
err = xe_sched_job_add_deps(job, dst_bo->ttm.base.resv,
DMA_RESV_USAGE_BOOKKEEP);
if (err)
--
2.51.0
On Tue, 21 Oct 2025 17:20:22 +1300, Barry Song wrote:
> From: Barry Song <v-songbaohua(a)oppo.com>
>
> We can allocate high-order pages, but mapping them one by
> one is inefficient. This patch changes the code to map
> as large a chunk as possible. The code looks somewhat
>
> [ ... ]
Reviewed-by: Maxime Ripard <mripard(a)kernel.org>
Thanks!
Maxime
For retrieving a pointer to the struct dma_resv for a given GEM object. We
also introduce it in a new trait, BaseObjectPrivate, which we automatically
implement for all gem objects and don't expose to users outside of the
crate.
Signed-off-by: Lyude Paul <lyude(a)redhat.com>
---
rust/kernel/drm/gem/mod.rs | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/rust/kernel/drm/gem/mod.rs b/rust/kernel/drm/gem/mod.rs
index 32bff2e8463f4..67813cfb0db42 100644
--- a/rust/kernel/drm/gem/mod.rs
+++ b/rust/kernel/drm/gem/mod.rs
@@ -200,6 +200,18 @@ fn create_mmap_offset(&self) -> Result<u64> {
impl<T: IntoGEMObject> BaseObject for T {}
+/// Crate-private base operations shared by all GEM object classes.
+#[expect(unused)]
+pub(crate) trait BaseObjectPrivate: IntoGEMObject {
+ /// Return a pointer to this object's dma_resv.
+ fn raw_dma_resv(&self) -> *mut bindings::dma_resv {
+ // SAFETY: `as_gem_obj()` always returns a valid pointer to the base DRM gem object
+ unsafe { (*self.as_raw()).resv }
+ }
+}
+
+impl<T: IntoGEMObject> BaseObjectPrivate for T {}
+
/// A base GEM object.
///
/// Invariants
--
2.51.0
On Tue, Oct 21, 2025 at 4:43 PM Matthew Brost <matthew.brost(a)intel.com> wrote:
>
> On Sat, Oct 18, 2025 at 12:42:30AM -0700, Matthew Brost wrote:
> > On Fri, Oct 17, 2025 at 11:43:51PM -0700, Matthew Brost wrote:
> > > On Fri, Oct 17, 2025 at 10:37:46AM -0500, Rob Herring wrote:
> > > > On Thu, Oct 16, 2025 at 11:25:34PM -0700, Matthew Brost wrote:
> > > > > On Thu, Oct 16, 2025 at 04:06:05PM -0500, Rob Herring (Arm) wrote:
> > > > > > Add a driver for Arm Ethos-U65/U85 NPUs. The Ethos-U NPU has a
> > > > > > relatively simple interface with single command stream to describe
> > > > > > buffers, operation settings, and network operations. It supports up to 8
> > > > > > memory regions (though no h/w bounds on a region). The Ethos NPUs
> > > > > > are designed to use an SRAM for scratch memory. Region 2 is reserved
> > > > > > for SRAM (like the downstream driver stack and compiler). Userspace
> > > > > > doesn't need access to the SRAM.
> > > >
> > > > Thanks for the review.
> > > >
> > > > [...]
> > > >
> > > > > > +static struct dma_fence *ethosu_job_run(struct drm_sched_job *sched_job)
> > > > > > +{
> > > > > > + struct ethosu_job *job = to_ethosu_job(sched_job);
> > > > > > + struct ethosu_device *dev = job->dev;
> > > > > > + struct dma_fence *fence = NULL;
> > > > > > + int ret;
> > > > > > +
> > > > > > + if (unlikely(job->base.s_fence->finished.error))
> > > > > > + return NULL;
> > > > > > +
> > > > > > + fence = ethosu_fence_create(dev);
> > > > >
> > > > > Another reclaim issue: ethosu_fence_create allocates memory using
> > > > > GFP_KERNEL. Since we're already in the DMA fence signaling path
> > > > > (reclaim), this can lead to a deadlock.
> > > > >
> > > > > Without too much thought, you likely want to move this allocation to
> > > > > ethosu_job_do_push, but before taking dev->sched_lock or calling
> > > > > drm_sched_job_arm.
> > > > >
> > > > > We really should fix the DRM scheduler work queue to be tainted with
> > > > > reclaim. If I recall correctly, we'd need to update the work queue
> > > > > layer. Let me look into that—I've seen this type of bug several times,
> > > > > and lockdep should be able to catch it.
> > > >
> > > > Likely the rocket driver suffers from the same issues...
> > > >
> > >
> > > I am not surprised by this statement.
> > >
> > > > >
> > > > > > + if (IS_ERR(fence))
> > > > > > + return fence;
> > > > > > +
> > > > > > + if (job->done_fence)
> > > > > > + dma_fence_put(job->done_fence);
> > > > > > + job->done_fence = dma_fence_get(fence);
> > > > > > +
> > > > > > + ret = pm_runtime_get_sync(dev->base.dev);
> > > > >
> > > > > I haven't looked at your PM design, but this generally looks quite
> > > > > dangerous with respect to reclaim. For example, if your PM resume paths
> > > > > allocate memory or take locks that allocate memory underneath, you're
> > > > > likely to run into issues.
> > > > >
> > > > > A better approach would be to attach a PM reference to your job upon
> > > > > creation and release it upon job destruction. That would be safer and
> > > > > save you headaches in the long run.
> > > >
> > > > Our PM is nothing more than clock enable/disable and register init.
> > > >
> > > > If the runtime PM API doesn't work and needs special driver wrappers,
> > > > then I'm inclined to just not use it and manage clocks directly (as
> > > > that's all it is doing).
> > > >
> > >
> > > Yes, then you’re probably fine. More complex drivers can do all sorts of
> > > things during a PM wake, which is why PM wakes should generally be the
> > > outermost layer. I still suggest, to future-proof your code, that you
> > > move the PM reference to an outer layer.
> > >
> >
> > Also, taking a PM reference in a function call — as opposed to tying it
> > to an object's lifetime — is risky. It can quickly lead to imbalances in
> > PM references if things go sideways or function calls become unbalanced.
> > Depending on how your driver uses the DRM scheduler, this seems like a
> > real possibility.
> >
> > Matt
> >
> > > > >
> > > > > This is what we do in Xe [1] [2].
> > > > >
> > > > > Also, in general, this driver has been reviewed (RB’d), but it's not
> > > > > great that I spotted numerous issues within just five minutes. I suggest
> > > > > taking a step back and thoroughly evaluating everything this driver is
> > > > > doing.
> > > >
> > > > Well, if it is hard to get simple drivers right, then it's a problem
> > > > with the subsystem APIs IMO.
> > > >
> > >
> > > Yes, agreed. We should have assertions and lockdep annotations in place
> > > to catch driver-side misuses. This is the second driver I’ve randomly
> > > looked at over the past year that has broken DMA fencing and reclaim
> > > rules. I’ll take an action item to fix this in the DRM scheduler, but
> > > I’m afraid I’ll likely break multiple drivers in the process as misuess
> > > / lockdep will complain.
>
> I've posted a series [1] for the DRM scheduler which will complain about the
> things I've pointed out here.
Thanks. I ran v6 with them and no lockdep splats.
Rob
Changelog:
v5:
* Rebased on top of v6.18-rc1.
* Added more validation logic to make sure that DMA-BUF length doesn't
overflow in various scenarios.
* Hide kernel config from the users.
* Fixed type conversion issue. DMA ranges are exposed with u64 length,
but DMA-BUF uses "unsigned int" as a length for SG entries.
* Added a check so that VFIO drivers which report a BAR size different from
PCI do not use the DMA-BUF functionality.
v4: https://lore.kernel.org/all/cover.1759070796.git.leon@kernel.org
* Split pcim_p2pdma_provider() to two functions, one that initializes
array of providers and another to return right provider pointer.
v3: https://lore.kernel.org/all/cover.1758804980.git.leon@kernel.org
* Changed pcim_p2pdma_enable() to be pcim_p2pdma_provider().
* Cache provider in vfio_pci_dma_buf struct instead of BAR index.
* Removed misleading comment from pcim_p2pdma_provider().
* Moved MMIO check to be in pcim_p2pdma_provider().
v2: https://lore.kernel.org/all/cover.1757589589.git.leon@kernel.org/
* Added an extra patch which adds a new CONFIG, so the next patches can
reuse it.
* Squashed "PCI/P2PDMA: Remove redundant bus_offset from map state"
into the other patch.
* Fixed revoke calls to be aligned with true->false semantics.
* Extended p2pdma_providers to be per-BAR and not global to the whole
device.
* Fixed possible race between dmabuf states and revoke.
* Moved revoke to PCI BAR zap block.
v1: https://lore.kernel.org/all/cover.1754311439.git.leon@kernel.org
* Changed commit messages.
* Reused DMA_ATTR_MMIO attribute.
* Returned support for multiple DMA ranges per-dMABUF.
v0: https://lore.kernel.org/all/cover.1753274085.git.leonro@nvidia.com
---------------------------------------------------------------------------
Based on "[PATCH v6 00/16] dma-mapping: migrate to physical address-based API"
https://lore.kernel.org/all/cover.1757423202.git.leonro@nvidia.com/ series.
---------------------------------------------------------------------------
This series extends the VFIO PCI subsystem to support exporting MMIO
regions from PCI device BARs as dma-buf objects, enabling safe sharing of
non-struct page memory with controlled lifetime management. This allows RDMA
and other subsystems to import dma-buf FDs and build them into memory regions
for PCI P2P operations.
The series supports a use case for SPDK where an NVMe device will be
owned by SPDK through VFIO while interacting with an RDMA device. The RDMA
device may directly access the NVMe CMB or directly manipulate the NVMe
device's doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with
VFIO. This dmabuf approach can be usable by iommufd as well for generic
and safe P2P mappings.
In addition to the SPDK use-case mentioned above, the capability added
in this patch series can also be useful when a buffer (located in device
memory such as VRAM) needs to be shared between any two dGPU devices or
instances (assuming one of them is bound to VFIO PCI) as long as they
are P2P DMA compatible.
The implementation provides a revocable attachment mechanism using dma-buf
move operations. MMIO regions are normally pinned as BARs don't change
physical addresses, but access is revoked when the VFIO device is closed
or a PCI reset is issued. This ensures kernel self-defense against
potentially hostile userspace.
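
As a rough sketch of the revoke side (the vfio_pci_dma_buf structure and
the revoked flag below are named only for illustration; the actual series
may differ), the exporter takes the reservation lock, marks the buffer
revoked, and uses move_notify to make importers drop their mappings:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

struct vfio_pci_dma_buf {		/* illustrative only */
	struct dma_buf *dmabuf;
	bool revoked;
};

static void vfio_pci_dma_buf_revoke(struct vfio_pci_dma_buf *priv)
{
	dma_resv_lock(priv->dmabuf->resv, NULL);
	priv->revoked = true;
	/* Importers get move_notify() and must drop their mappings. */
	dma_buf_move_notify(priv->dmabuf);
	dma_resv_unlock(priv->dmabuf->resv);
}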
The series includes significant refactoring of the PCI P2PDMA subsystem
to separate core P2P functionality from memory allocation features,
making it more modular and suitable for VFIO use cases that don't need
struct page support.
-----------------------------------------------------------------------
The series is based originally on
https://lore.kernel.org/all/20250307052248.405803-1-vivek.kasireddy@intel.c…
but heavily rewritten to be based on DMA physical API.
-----------------------------------------------------------------------
The WIP branch can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=…
Thanks
Leon Romanovsky (7):
PCI/P2PDMA: Separate the mmap() support from the core logic
PCI/P2PDMA: Simplify bus address mapping API
PCI/P2PDMA: Refactor to separate core P2P functionality from memory
allocation
PCI/P2PDMA: Export pci_p2pdma_map_type() function
types: move phys_vec definition to common header
vfio/pci: Enable peer-to-peer DMA transactions by default
vfio/pci: Add dma-buf export support for MMIO regions
Vivek Kasireddy (2):
vfio: Export vfio device get and put registration helpers
vfio/pci: Share the core device pointer while invoking feature
functions
block/blk-mq-dma.c | 7 +-
drivers/iommu/dma-iommu.c | 4 +-
drivers/pci/p2pdma.c | 175 ++++++++---
drivers/vfio/pci/Kconfig | 3 +
drivers/vfio/pci/Makefile | 2 +
drivers/vfio/pci/vfio_pci_config.c | 22 +-
drivers/vfio/pci/vfio_pci_core.c | 63 ++--
drivers/vfio/pci/vfio_pci_dmabuf.c | 446 +++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_priv.h | 23 ++
drivers/vfio/vfio_main.c | 2 +
include/linux/pci-p2pdma.h | 120 +++++---
include/linux/types.h | 5 +
include/linux/vfio.h | 2 +
include/linux/vfio_pci_core.h | 1 +
include/uapi/linux/vfio.h | 25 ++
kernel/dma/direct.c | 4 +-
mm/hmm.c | 2 +-
17 files changed, 785 insertions(+), 121 deletions(-)
create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
--
2.51.0
On Wed, 22 Oct 2025, Biancaa Ramesh <biancaa2210329(a)ssn.edu.in> wrote:
> --
> ::DISCLAIMER::
> ---------------------------------------------------------------------
> The contents of this e-mail and any attachment(s) are confidential and
> intended for the named recipient(s) only. Views or opinions, if any,
> presented in this email are solely those of the author and may not
> necessarily reflect the views or opinions of SSN Institutions (SSN) or its
> affiliates. Any form of reproduction, dissemination, copying, disclosure,
> modification, distribution and / or publication of this message without the
> prior written consent of authorized representative of SSN is strictly
> prohibited. If you have received this email in error please delete it and
> notify the sender immediately.
There are some obvious issues in the patch itself, but please do figure
out how to send patches and generally list email without disclaimers
like this first. Or use the b4 web submission endpoint [1].
BR,
Jani.
[1] https://b4.docs.kernel.org/en/latest/contributor/send.html
--
Jani Nikula, Intel
On Tue, 14 Oct 2025 16:26:06 +0530 Meghana Malladi wrote:
> This series adds AF_XDP zero copy support to the icssg driver.
>
> Tests were performed on AM64x-EVM with xdpsock application [1].
>
> A clear improvement is seen in transmit (txonly) and receive (rxdrop)
> for 64-byte packets. The 1500-byte test seems to be limited by line
> rate (1G link), so no improvement is seen there in packet rate.
>
> Having some issues with l2fwd as the benchmarking numbers show 0
> for 64-byte packets after forwarding the first batch of packets, and I am
> currently looking into it.
This series stopped applying, could you please respin?
--
pw-bot: cr
The Arm Ethos-U65/85 NPUs are designed for edge AI inference
applications[0].
The driver works with Mesa Teflon. The Ethos support was merged on
10/15. The UAPI should also be compatible with the downstream (open
source) driver stack[2] and Vela compiler though that has not been
implemented.
Testing so far has been on i.MX93 boards with Ethos-U65 and a FVP model
with Ethos-U85. More work is needed in mesa for handling U85 command
stream differences, but that doesn't affect the UAPI.
A git tree is here[3].
Rob
[0] https://www.arm.com/products/silicon-ip-cpu?families=ethos%20npus
[2] https://gitlab.arm.com/artificial-intelligence/ethos-u/
[3] git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git ethos-v6
Signed-off-by: Rob Herring (Arm) <robh(a)kernel.org>
---
Changes in v6:
- Rework job submit to avoid potential deadlocks with allocations/reclaim
in the fence signaling paths. ethosu_acquire_object_fences() and the job
done_fence allocation are moved earlier. The runtime-PM resume now happens
before the job is pushed, and autosuspend is done when the job is freed.
- Drop unused ethosu_job_is_idle()
- Link to v5: https://lore.kernel.org/r/20251016-ethos-v5-0-ba0aece0a006@kernel.org
Changes in v5:
- Rework Runtime PM init in probe
- Use __free() cleanups where possible
- Use devm_mutex_init()
- Handle U85 NPU_SET_WEIGHT2_BASE and NPU_SET_WEIGHT2_LENGTH
- Link to v4: https://lore.kernel.org/r/20251015-ethos-v4-0-81025a3dcbf3@kernel.org
Changes in v4:
- Use bulk clk API
- Various whitespace fixes mostly due to ethos->ethosu rename
- Drop error check on dma_set_mask_and_coherent()
- Drop unnecessary pm_runtime_mark_last_busy() call
- Move variable declarations out of switch (a riscv/clang build failure)
- Use lowercase hex in all defines
- Drop unused ethosu_device.coherent member
- Add comments on all locks
- Link to v3: https://lore.kernel.org/r/20250926-ethos-v3-0-6bd24373e4f5@kernel.org
Changes in v3:
- Rework and improve job submit validation
- Rename ethos to ethosu. There was an Ethos-Nxx that's unrelated.
- Add missing init for sched_lock mutex
- Drop some prints to debug level
- Fix i.MX93 SRAM accesses (AXI config)
- Add U85 AXI configuration and test on FVP with U85
- Print the current cmd value on timeout
- Link to v2: https://lore.kernel.org/r/20250811-ethos-v2-0-a219fc52a95b@kernel.org
Changes in v2:
- Rebase on v6.17-rc1 adapting to scheduler changes
- scheduler: Drop the reset workqueue. According to the scheduler docs,
we don't need it since we have a single h/w queue.
- scheduler: Rework the timeout handling to continue running if we are
making progress. Fixes timeouts on larger jobs.
- Reset the NPU on resume so it's in a known state
- Add error handling on clk_get() calls
- Fix drm_mm splat on module unload. We were missing a put on the
cmdstream BO in the scheduler clean-up.
- Fix 0-day report needing explicit bitfield.h include
- Link to v1: https://lore.kernel.org/r/20250722-ethos-v1-0-cc1c5a0cbbfb@kernel.org
---
Rob Herring (Arm) (2):
dt-bindings: npu: Add Arm Ethos-U65/U85
accel: Add Arm Ethos-U NPU driver
.../devicetree/bindings/npu/arm,ethos.yaml | 79 +++
MAINTAINERS | 9 +
drivers/accel/Kconfig | 1 +
drivers/accel/Makefile | 1 +
drivers/accel/ethosu/Kconfig | 10 +
drivers/accel/ethosu/Makefile | 4 +
drivers/accel/ethosu/ethosu_device.h | 195 ++++++
drivers/accel/ethosu/ethosu_drv.c | 403 ++++++++++++
drivers/accel/ethosu/ethosu_drv.h | 15 +
drivers/accel/ethosu/ethosu_gem.c | 704 +++++++++++++++++++++
drivers/accel/ethosu/ethosu_gem.h | 46 ++
drivers/accel/ethosu/ethosu_job.c | 496 +++++++++++++++
drivers/accel/ethosu/ethosu_job.h | 40 ++
include/uapi/drm/ethosu_accel.h | 261 ++++++++
14 files changed, 2264 insertions(+)
---
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
change-id: 20250715-ethos-3fdd39ef6f19
Best regards,
--
Rob Herring (Arm) <robh(a)kernel.org>
Hi,
Here's another attempt at supporting user-space allocations from a
specific carved-out reserved memory region.
The initial problem we were discussing was that I'm currently working on
a platform which has a memory layout with ECC enabled. However, enabling
the ECC has a number of drawbacks on that platform: lower performance,
increased memory usage, etc. So for things like framebuffers, the
trade-off isn't great and thus there's a memory region with ECC disabled
to allocate from for such use cases.
After a suggestion from John, I chose to first start using heap
allocations flags to allow for userspace to ask for a particular ECC
setup. This is then backed by a new heap type that runs from reserved
memory chunks flagged as such, and the existing DT properties to specify
the ECC properties.
After further discussion, it was considered that flags were not the
right solution, and relying on the names of the heaps would be enough to
let userspace know the kind of buffer it deals with.
Thus, even though the uAPI part of it had been dropped in this second
version, we still needed a driver to create heaps out of carved-out memory
regions. In addition to the original usecase, a similar driver can be
found in BSPs from most vendors, so I believe it would be a useful
addition to the kernel.
Some extra discussion with Rob Herring [1] came to the conclusion that a
specific compatible for this is not great either, and as such a new
driver probably isn't called for.
Some other discussions we had with John [2] also dropped some hints that
multiple CMA heaps might be a good idea, and some vendors seem to do
that too.
So here's another attempt that doesn't affect the device tree at all and
will just create a heap for every CMA reserved memory region.
It also falls nicely into the current plan we have to support cgroups in
DRM/KMS and v4l2, which is an additional benefit.
Let me know what you think,
Maxime
1: https://lore.kernel.org/all/20250707-cobalt-dingo-of-serenity-dbf92c@houat/
2: https://lore.kernel.org/all/CANDhNCroe6ZBtN_o=c71kzFFaWK-fF5rCdnr9P5h1sgPOW…
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
Changes in v8:
- Rebased on top of 6.18-rc1
- Added TJ R-b
- Link to v7: https://lore.kernel.org/r/20250721-dma-buf-ecc-heap-v7-0-031836e1a942@kerne…
Changes in v7:
- Invert the logic and register CMA heap from the reserved memory /
dma contiguous code, instead of iterating over them from the CMA heap.
- Link to v6: https://lore.kernel.org/r/20250709-dma-buf-ecc-heap-v6-0-dac9bf80f35d@kerne…
Changes in v6:
- Drop the new driver and allocate a CMA heap for each region now
- Dropped the binding
- Rebased on 6.16-rc5
- Link to v5: https://lore.kernel.org/r/20250617-dma-buf-ecc-heap-v5-0-0abdc5863a4f@kerne…
Changes in v5:
- Rebased on 6.16-rc2
- Switch from property to dedicated binding
- Link to v4: https://lore.kernel.org/r/20250520-dma-buf-ecc-heap-v4-1-bd2e1f1bb42c@kerne…
Changes in v4:
- Rebased on 6.15-rc7
- Map buffers only when map is actually called, not at allocation time
- Deal with restricted-dma-pool and shared-dma-pool
- Reword Kconfig options
- Properly report dma_map_sgtable failures
- Link to v3: https://lore.kernel.org/r/20250407-dma-buf-ecc-heap-v3-0-97cdd36a5f29@kerne…
Changes in v3:
- Reworked global variable patch
- Link to v2: https://lore.kernel.org/r/20250401-dma-buf-ecc-heap-v2-0-043fd006a1af@kerne…
Changes in v2:
- Add vmap/vunmap operations
- Drop ECC flags uapi
- Rebase on top of 6.14
- Link to v1: https://lore.kernel.org/r/20240515-dma-buf-ecc-heap-v1-0-54cbbd049511@kerne…
---
Maxime Ripard (5):
doc: dma-buf: List the heaps by name
dma-buf: heaps: cma: Register list of CMA regions at boot
dma: contiguous: Register reusable CMA regions at boot
dma: contiguous: Reserve default CMA heap
dma-buf: heaps: cma: Create CMA heap for each CMA reserved region
Documentation/userspace-api/dma-buf-heaps.rst | 24 ++++++++------
MAINTAINERS | 1 +
drivers/dma-buf/heaps/Kconfig | 10 ------
drivers/dma-buf/heaps/cma_heap.c | 47 +++++++++++++++++----------
include/linux/dma-buf/heaps/cma.h | 16 +++++++++
kernel/dma/contiguous.c | 11 +++++++
6 files changed, 72 insertions(+), 37 deletions(-)
---
base-commit: 47633099a672fc7bfe604ef454e4f116e2c954b1
change-id: 20240515-dma-buf-ecc-heap-28a311d2c94e
prerequisite-message-id: <20250610131231.1724627-1-jkangas(a)redhat.com>
prerequisite-patch-id: bc44be5968feb187f2bc1b8074af7209462b18e7
prerequisite-patch-id: f02a91b723e5ec01fbfedf3c3905218b43d432da
prerequisite-patch-id: e944d0a3e22f2cdf4d3b3906e5603af934696deb
Best regards,
--
Maxime Ripard <mripard(a)kernel.org>
On Thu, Oct 16, 2025 at 11:25:34PM -0700, Matthew Brost wrote:
> On Thu, Oct 16, 2025 at 04:06:05PM -0500, Rob Herring (Arm) wrote:
> > Add a driver for Arm Ethos-U65/U85 NPUs. The Ethos-U NPU has a
> > relatively simple interface with single command stream to describe
> > buffers, operation settings, and network operations. It supports up to 8
> > memory regions (though no h/w bounds on a region). The Ethos NPUs
> > are designed to use an SRAM for scratch memory. Region 2 is reserved
> > for SRAM (like the downstream driver stack and compiler). Userspace
> > doesn't need access to the SRAM.
Thanks for the review.
[...]
> > +static struct dma_fence *ethosu_job_run(struct drm_sched_job *sched_job)
> > +{
> > + struct ethosu_job *job = to_ethosu_job(sched_job);
> > + struct ethosu_device *dev = job->dev;
> > + struct dma_fence *fence = NULL;
> > + int ret;
> > +
> > + if (unlikely(job->base.s_fence->finished.error))
> > + return NULL;
> > +
> > + fence = ethosu_fence_create(dev);
>
> Another reclaim issue: ethosu_fence_create allocates memory using
> GFP_KERNEL. Since we're already in the DMA fence signaling path
> (reclaim), this can lead to a deadlock.
>
> Without too much thought, you likely want to move this allocation to
> ethosu_job_do_push, but before taking dev->sched_lock or calling
> drm_sched_job_arm.
>
> We really should fix the DRM scheduler work queue to be tainted with
> reclaim. If I recall correctly, we'd need to update the work queue
> layer. Let me look into that—I've seen this type of bug several times,
> and lockdep should be able to catch it.
Likely the rocket driver suffers from the same issues...
>
> > + if (IS_ERR(fence))
> > + return fence;
> > +
> > + if (job->done_fence)
> > + dma_fence_put(job->done_fence);
> > + job->done_fence = dma_fence_get(fence);
> > +
> > + ret = pm_runtime_get_sync(dev->base.dev);
>
> I haven't looked at your PM design, but this generally looks quite
> dangerous with respect to reclaim. For example, if your PM resume paths
> allocate memory or take locks that allocate memory underneath, you're
> likely to run into issues.
>
> A better approach would be to attach a PM reference to your job upon
> creation and release it upon job destruction. That would be safer and
> save you headaches in the long run.
Our PM is nothing more than clock enable/disable and register init.
If the runtime PM API doesn't work and needs special driver wrappers,
then I'm inclined to just not use it and manage clocks directly (as
that's all it is doing).
>
> This is what we do in Xe [1] [2].
>
> Also, in general, this driver has been reviewed (RB’d), but it's not
> great that I spotted numerous issues within just five minutes. I suggest
> taking a step back and thoroughly evaluating everything this driver is
> doing.
Well, if it is hard to get simple drivers right, then it's a problem
with the subsystem APIs IMO.
Rob
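
To make the suggested fix concrete, a minimal sketch of the pattern Matthew
describes: allocate the fence at push time, before the signalling-critical
section, so run_job() never allocates under reclaim. The ethosu_* names
follow the driver under review and their real signatures may differ:

static int ethosu_job_do_push(struct ethosu_job *job)
{
	/* Allocate while GFP_KERNEL is still safe, before arming the job. */
	job->done_fence = ethosu_fence_create(job->dev);
	if (IS_ERR(job->done_fence))
		return PTR_ERR(job->done_fence);

	mutex_lock(&job->dev->sched_lock);
	drm_sched_job_arm(&job->base);		/* fence signalling section begins */
	drm_sched_entity_push_job(&job->base);
	mutex_unlock(&job->dev->sched_lock);
	return 0;
}

static struct dma_fence *ethosu_job_run(struct drm_sched_job *sched_job)
{
	struct ethosu_job *job = to_ethosu_job(sched_job);

	/* No allocations here: this runs under reclaim, just return the fence. */
	return dma_fence_get(job->done_fence);
}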
For retrieving a pointer to the struct dma_resv for a given GEM object. We
also introduce it in a new trait, BaseObjectPrivate, which we automatically
implement for all gem objects and don't expose to users outside of the
crate.
Signed-off-by: Lyude Paul <lyude(a)redhat.com>
---
rust/kernel/drm/gem/mod.rs | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/rust/kernel/drm/gem/mod.rs b/rust/kernel/drm/gem/mod.rs
index 981fbb931e952..760fcd61da0b7 100644
--- a/rust/kernel/drm/gem/mod.rs
+++ b/rust/kernel/drm/gem/mod.rs
@@ -199,6 +199,18 @@ fn create_mmap_offset(&self) -> Result<u64> {
impl<T: IntoGEMObject> BaseObject for T {}
+/// Crate-private base operations shared by all GEM object classes.
+#[expect(unused)]
+pub(crate) trait BaseObjectPrivate: IntoGEMObject {
+ /// Return a pointer to this object's dma_resv.
+ fn raw_dma_resv(&self) -> *mut bindings::dma_resv {
+ // SAFETY: `as_gem_obj()` always returns a valid pointer to the base DRM gem object
+ unsafe { (*self.as_raw()).resv }
+ }
+}
+
+impl<T: IntoGEMObject> BaseObjectPrivate for T {}
+
/// A base GEM object.
///
/// Invariants
--
2.51.0
The Arm Ethos-U65/85 NPUs are designed for edge AI inference
applications[0].
The driver works with Mesa Teflon. The Ethos support was merged on
10/15. The UAPI should also be compatible with the downstream (open
source) driver stack[2] and Vela compiler though that has not been
implemented.
Testing so far has been on i.MX93 boards with Ethos-U65 and a FVP model
with Ethos-U85. More work is needed in mesa for handling U85 command
stream differences, but that doesn't affect the UAPI.
A git tree is here[3].
Rob
[0] https://www.arm.com/products/silicon-ip-cpu?families=ethos%20npus
[2] https://gitlab.arm.com/artificial-intelligence/ethos-u/
[3] git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git ethos-v5
Signed-off-by: Rob Herring (Arm) <robh(a)kernel.org>
---
Changes in v5:
- Rework Runtime PM init in probe
- Use __free() cleanups where possible
- Use devm_mutex_init()
- Handle U85 NPU_SET_WEIGHT2_BASE and NPU_SET_WEIGHT2_LENGTH
- Link to v4: https://lore.kernel.org/r/20251015-ethos-v4-0-81025a3dcbf3@kernel.org
Changes in v4:
- Use bulk clk API
- Various whitespace fixes mostly due to ethos->ethosu rename
- Drop error check on dma_set_mask_and_coherent()
- Drop unnecessary pm_runtime_mark_last_busy() call
- Move variable declarations out of switch (a riscv/clang build failure)
- Use lowercase hex in all defines
- Drop unused ethosu_device.coherent member
- Add comments on all locks
- Link to v3: https://lore.kernel.org/r/20250926-ethos-v3-0-6bd24373e4f5@kernel.org
Changes in v3:
- Rework and improve job submit validation
- Rename ethos to ethosu. There was an Ethos-Nxx that's unrelated.
- Add missing init for sched_lock mutex
- Drop some prints to debug level
- Fix i.MX93 SRAM accesses (AXI config)
- Add U85 AXI configuration and test on FVP with U85
- Print the current cmd value on timeout
- Link to v2: https://lore.kernel.org/r/20250811-ethos-v2-0-a219fc52a95b@kernel.org
Changes in v2:
- Rebase on v6.17-rc1 adapting to scheduler changes
- scheduler: Drop the reset workqueue. According to the scheduler docs,
we don't need it since we have a single h/w queue.
- scheduler: Rework the timeout handling to continue running if we are
making progress. Fixes timeouts on larger jobs.
- Reset the NPU on resume so it's in a known state
- Add error handling on clk_get() calls
- Fix drm_mm splat on module unload. We were missing a put on the
cmdstream BO in the scheduler clean-up.
- Fix 0-day report needing explicit bitfield.h include
- Link to v1: https://lore.kernel.org/r/20250722-ethos-v1-0-cc1c5a0cbbfb@kernel.org
---
Rob Herring (Arm) (2):
dt-bindings: npu: Add Arm Ethos-U65/U85
accel: Add Arm Ethos-U NPU driver
.../devicetree/bindings/npu/arm,ethos.yaml | 79 +++
MAINTAINERS | 9 +
drivers/accel/Kconfig | 1 +
drivers/accel/Makefile | 1 +
drivers/accel/ethosu/Kconfig | 10 +
drivers/accel/ethosu/Makefile | 4 +
drivers/accel/ethosu/ethosu_device.h | 195 ++++++
drivers/accel/ethosu/ethosu_drv.c | 403 ++++++++++++
drivers/accel/ethosu/ethosu_drv.h | 15 +
drivers/accel/ethosu/ethosu_gem.c | 704 +++++++++++++++++++++
drivers/accel/ethosu/ethosu_gem.h | 46 ++
drivers/accel/ethosu/ethosu_job.c | 540 ++++++++++++++++
drivers/accel/ethosu/ethosu_job.h | 41 ++
include/uapi/drm/ethosu_accel.h | 261 ++++++++
14 files changed, 2309 insertions(+)
---
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
change-id: 20250715-ethos-3fdd39ef6f19
Best regards,
--
Rob Herring (Arm) <robh(a)kernel.org>
On Thu, 25 Sep 2025 17:30:33 +0530, Jyothi Kumar Seerapu wrote:
> The I2C driver gets an interrupt upon transfer completion.
> When handling multiple messages in a single transfer, this
> results in N interrupts for N messages, leading to significant
> software interrupt latency.
>
> To mitigate this latency, utilize Block Event Interrupt (BEI)
> mechanism. Enabling BEI instructs the hardware to prevent interrupt
> generation and BEI is disabled when an interrupt is necessary.
>
> [...]
Applied, thanks!
[1/2] dmaengine: qcom: gpi: Add GPI Block event interrupt support
commit: 4e8331317e73902e8b2663352c8766227e633901
[2/2] i2c: i2c-qcom-geni: Add Block event interrupt support
commit: 398035178503bf662281bbffb4bebce1460a4bc5
Best regards,
--
~Vinod
On Wed, Oct 15, 2025 at 4:02 PM Frank Li <Frank.li(a)nxp.com> wrote:
>
> On Wed, Oct 15, 2025 at 03:36:05PM -0500, Rob Herring wrote:
> > On Wed, Oct 15, 2025 at 2:39 PM Frank Li <Frank.li(a)nxp.com> wrote:
> > >
> > > On Wed, Oct 15, 2025 at 12:47:40PM -0500, Rob Herring (Arm) wrote:
> > > > Add a driver for Arm Ethos-U65/U85 NPUs. The Ethos-U NPU has a
> > > > relatively simple interface with single command stream to describe
> > > > buffers, operation settings, and network operations. It supports up to 8
> > > > memory regions (though no h/w bounds on a region). The Ethos NPUs
> > > > are designed to use an SRAM for scratch memory. Region 2 is reserved
> > > > for SRAM (like the downstream driver stack and compiler). Userspace
> > > > doesn't need access to the SRAM.
> > > > +static int ethosu_init(struct ethosu_device *ethosudev)
> > > > +{
> > > > + int ret;
> > > > + u32 id, config;
> > > > +
> > > > + ret = devm_pm_runtime_enable(ethosudev->base.dev);
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > > > + ret = pm_runtime_resume_and_get(ethosudev->base.dev);
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > > > + pm_runtime_set_autosuspend_delay(ethosudev->base.dev, 50);
> > > > + pm_runtime_use_autosuspend(ethosudev->base.dev);
> > > > +
> > > > + /* If PM is disabled, we need to call ethosu_device_resume() manually. */
> > > > + if (!IS_ENABLED(CONFIG_PM)) {
> > > > + ret = ethosu_device_resume(ethosudev->base.dev);
> > > > + if (ret)
> > > > + return ret;
> > > > + }
> > >
> > > I think it should call ethosu_device_resume() unconditional before
> > > devm_pm_runtime_enable();
> > >
> > > ethosu_device_resume();
> > > pm_runtime_set_active();
> > > pm_runtime_set_autosuspend_delay(ethosudev->base.dev, 50);
> > > devm_pm_runtime_enable();
> >
> > Why do you think this? Does this do a get?
> >
> > I don't think it is good to call the resume hook on our own, but we
> > have no choice with !CONFIG_PM. With CONFIG_PM, we should only use the
> > pm_runtime API.
>
> Enabling clocks and doing some init work in probe() is quite common. But I have
> never seen an IS_ENABLED(CONFIG_PM) check. It is quite weird and not necessary to
> check the CONFIG_PM flag. CONFIG_PM is enabled in most configurations, so the
> !CONFIG_PM branch is almost never tested.
Okay, I get what you meant.
>
> probe()
> {
> devm_clk_bulk_get_all_enabled();
>
> ... did some init work
>
> pm_runtime_set_active();
> devm_pm_runtime_enable();
>
> ...
> pm_runtime_put_autosuspend(ethosudev->base.dev);
> }
I think we still need a pm_runtime_get_noresume() in here since we do
a put later on. Here's what I have now:
ret = ethosu_device_resume(ethosudev->base.dev);
if (ret)
return ret;
pm_runtime_set_autosuspend_delay(ethosudev->base.dev, 50);
pm_runtime_use_autosuspend(ethosudev->base.dev);
ret = devm_pm_runtime_set_active_enabled(ethosudev->base.dev);
if (ret)
return ret;
pm_runtime_get_noresume(ethosudev->base.dev);
Rob
On 03-10-25, 20:50, Andi Shyti wrote:
> On Thu, Sep 25, 2025 at 05:30:35PM +0530, Jyothi Kumar Seerapu wrote:
> > From: Jyothi Kumar Seerapu <quic_jseerapu(a)quicinc.com>
> >
> > The I2C driver gets an interrupt upon transfer completion.
> > When handling multiple messages in a single transfer, this
> > results in N interrupts for N messages, leading to significant
> > software interrupt latency.
> >
> > To mitigate this latency, utilize Block Event Interrupt (BEI)
> > mechanism. Enabling BEI instructs the hardware to prevent interrupt
> > generation and BEI is disabled when an interrupt is necessary.
> >
> > Large I2C transfer can be divided into chunks of messages internally.
> > Interrupts are not expected for the messages for which BEI bit set,
> > only the last message triggers an interrupt, indicating the completion of
> > N messages. This BEI mechanism enhances overall transfer efficiency.
> >
> > BEI optimizations are currently implemented for I2C write transfers only,
> > as there is no use case for multiple I2C read messages in a single transfer
> > at this time.
> >
> > Signed-off-by: Jyothi Kumar Seerapu <quic_jseerapu(a)quicinc.com>
>
> Because this series is touching multiple subsystems, I'm going to
> ack it:
>
> Acked-by: Andi Shyti <andi.shyti(a)kernel.org>
>
> We are waiting for someone from DMA to ack it (Vinod or Sinan).
Thanks, I will pick it with your ack
--
~Vinod
On Thu, Oct 16, 2025 at 02:09:12AM +0000, Kriish Sharma wrote:
> diff --git a/Documentation/userspace-api/dma-buf-heaps.rst b/Documentation/userspace-api/dma-buf-heaps.rst
> index a0979440d2a4..c0035dc257e0 100644
> --- a/Documentation/userspace-api/dma-buf-heaps.rst
> +++ b/Documentation/userspace-api/dma-buf-heaps.rst
> @@ -26,6 +26,7 @@ following heaps:
> ``DMABUF_HEAPS_CMA_LEGACY`` Kconfig option is set, a duplicate node is
> created following legacy naming conventions; the legacy name might be
> ``reserved``, ``linux,cma``, or ``default-pool``.
> +
> Naming Convention
> =================
>
LGTM, thanks!
Reviewed-by: Bagas Sanjaya <bagasdotme(a)gmail.com>
--
An old man doll... just what I always wanted! - Clara