The vIOMMU object is designed to represent a slice of an IOMMU HW whose virtualization features are shared with or passed to user space (mostly a VM) in a HW-accelerated way. This extends the HWPT-based design to more advanced virtualization features.
HW QUEUE, introduced by this series as a part of the vIOMMU infrastructure, represents a HW-accelerated queue/buffer for a VM to use exclusively, e.g.
 - NVIDIA's Virtual Command Queue
 - AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
each of which allows its IOMMU HW to directly access a queue memory owned by a guest VM and allows the guest OS to control the HW queue directly, avoiding VM Exit overheads and improving performance.
Introduce IOMMUFD_OBJ_HW_QUEUE and its pairing IOMMUFD_CMD_HW_QUEUE_ALLOC, allowing the VMM to forward IOMMU-specific queue info such as the queue base address and size.
Meanwhile, a guest-owned queue needs the guest kernel to control the queue by reading/writing its consumer and producer indexes, via MMIO accesses to the hardware registers. Introduce an mmap infrastructure for iommufd to support passing through a piece of MMIO region from the host physical address space to the guest physical address space. The mmap info (offset/length) used by an mmap syscall must be pre-allocated and returned to user space as output driver-data during an IOMMUFD_CMD_HW_QUEUE_ALLOC call. Thus, this requires driver-specific user data support in the vIOMMU allocation flow.
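To make that flow concrete, below is a rough user-space sketch of the sequence described above (allocate a HW queue, then mmap the driver-provided offset/length). The struct layout, field names, and the placeholder ioctl number are assumptions made for illustration only, not the uAPI added by this series:

#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Placeholder request number; the real one comes from the iommufd uAPI */
#define IOMMU_HW_QUEUE_ALLOC_SKETCH	0

/*
 * Hypothetical layout; only the general idea (queue base address, size,
 * and an mmap offset/length output) is taken from the cover letter above.
 */
struct hw_queue_alloc_sketch {
	uint32_t size;
	uint32_t viommu_id;
	uint32_t type;			/* e.g. a TEGRA241 CMDQV queue type */
	uint32_t __reserved;
	uint64_t base_addr;		/* guest PA of the queue memory */
	uint64_t length;		/* queue size in bytes */
	uint64_t out_mmap_offset;	/* mmap info pre-allocated by the driver */
	uint64_t out_mmap_length;
};

static void *alloc_and_map_hw_queue(int iommufd, struct hw_queue_alloc_sketch *cmd)
{
	cmd->size = sizeof(*cmd);
	if (ioctl(iommufd, IOMMU_HW_QUEUE_ALLOC_SKETCH, cmd) < 0) {
		perror("HW queue alloc");
		return MAP_FAILED;
	}

	/*
	 * Map the MMIO window pre-allocated by the driver, so the VMM can
	 * expose it to the guest physical address space for direct control
	 * of the queue's consumer/producer registers.
	 */
	return mmap(NULL, cmd->out_mmap_length, PROT_READ | PROT_WRITE,
		    MAP_SHARED, iommufd, cmd->out_mmap_offset);
}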
As a real-world use case, this series implements HW QUEUE support in the tegra241-cmdqv driver for VCMDQs on the NVIDIA Grace CPU. In other words, it is also the Tegra CMDQV series Part 2 (user-space support), reworked from the previous RFCv1: https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/ This enables the HW-accelerated feature for the NVIDIA Grace CPU. Compared to the standard SMMUv3 operating in nested translation mode and trapping the CMDQ for TLBI and ATC_INV commands, this gives a huge performance improvement: 70% to 90% reductions in invalidation time were measured by various DMA unmap tests running in a guest OS.
// Unmap latencies from "dma_map_benchmark -g @granule -t @threads",
// by toggling "/sys/kernel/debug/iommu/tegra241_cmdqv/bypass_vcmdq"
  @granule | @threads | bypass_vcmdq=1 | bypass_vcmdq=0
     4KB   |    1     |    35.7 us     |     5.3 us
    16KB   |    1     |    41.8 us     |     6.8 us
    64KB   |    1     |    68.9 us     |     9.9 us
   128KB   |    1     |   109.0 us     |    12.6 us
   256KB   |    1     |   187.1 us     |    18.0 us
     4KB   |    2     |    96.9 us     |     6.8 us
    16KB   |    2     |    97.8 us     |     7.5 us
    64KB   |    2     |   151.5 us     |    10.7 us
   128KB   |    2     |   257.8 us     |    12.7 us
   256KB   |    2     |   443.0 us     |    17.9 us
This is on Github: https://github.com/nicolinc/iommufd/commits/iommufd_hw_queue-v4
Pairing QEMU branch for testing: https://github.com/nicolinc/qemu/commits/wip/for_iommufd_hw_queue-v4
Changelog
v4
 * Rebase on v6.15-rc5
 * Add Reviewed-by from Vasant
 * Rename "vQUEUE" to "HW QUEUE"
 * Use "offset" and "length" for all mmap-related variables
 * [iommufd] Use u64 for guest PA
 * [iommufd] Fix typo in uAPI doc
 * [iommufd] Rename immap_id to offset
 * [iommufd] Drop the partial-size mmap support
 * [iommufd] Do not replace WARN_ON with WARN_ON_ONCE
 * [iommufd] Use "u64 base_addr" for queue base address
 * [iommufd] Use u64 base_pfn/num_pfns for immap structure
 * [iommufd] Correct the size passed in to mtree_alloc_range()
 * [iommufd] Add IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to viommu_ops
v3
 https://lore.kernel.org/all/cover.1746139811.git.nicolinc@nvidia.com/
 * Add Reviewed-by from Baolu, Pranjal, and Alok
 * Revise kdocs, uAPI docs, and commit logs
 * Rename "vCMDQ" back to "vQUEUE" for AMD cases
 * [tegra] Add tegra241_vcmdq_hw_flush_timeout()
 * [tegra] Rename vsmmu_alloc to alloc_vintf_user
 * [tegra] Use writel for SID replacement registers
 * [tegra] Move mmap removal call to vsmmu_destroy op
 * [tegra] Fix revert in tegra241_vintf_alloc_lvcmdq_user()
 * [iommufd] Replace "& ~PAGE_MASK" with PAGE_ALIGNED()
 * [iommufd] Add an object-type "owner" to immap structure
 * [iommufd] Drop the ictx input in the new for-driver APIs
 * [iommufd] Add iommufd_vma_ops to keep track of mmap lifecycle
 * [iommufd] Add viommu-based iommufd_viommu_alloc/destroy_mmap helpers
 * [iommufd] Rename iommufd_ctx_alloc/free_mmap to _iommufd_alloc/destroy_mmap
v2
 https://lore.kernel.org/all/cover.1745646960.git.nicolinc@nvidia.com/
 * Add Reviewed-by from Jason
 * [smmu] Fix vsmmu initial value
 * [smmu] Support impl for hw_info
 * [tegra] Rename "slot" to "vsid"
 * [tegra] Update kdocs and commit logs
 * [tegra] Map/unmap LVCMDQ dynamically
 * [tegra] Refcount the previous LVCMDQ
 * [tegra] Return -EEXIST if LVCMDQ exists
 * [tegra] Simplify VINTF cleanup routine
 * [tegra] Use vmid and s2_domain in vsmmu
 * [tegra] Rename "mmap_pgoff" to "immap_id"
 * [tegra] Add more addr and length validation
 * [iommufd] Add more narrative to mmap's kdoc
 * [iommufd] Add iommufd_struct_depend/undepend()
 * [iommufd] Rename vcmdq_free op to vcmdq_destroy
 * [iommufd] Fix bug in iommu_copy_struct_to_user()
 * [iommufd] Drop is_io from iommufd_ctx_alloc_mmap()
 * [iommufd] Test the queue memory for its contiguity
 * [iommufd] Return -ENXIO if address or length fails
 * [iommufd] Do not change @min_last in mock_viommu_alloc()
 * [iommufd] Generalize TEGRA241_VCMDQ data in core structure
 * [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC
 * [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping
v1
 https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/
Thanks Nicolin
Nicolin Chen (23):
  iommufd/viommu: Add driver-allocated vDEVICE support
  iommu: Pass in a driver-level user data structure to viommu_alloc op
  iommufd/viommu: Allow driver-specific user data for a vIOMMU object
  iommu: Add iommu_copy_struct_to_user helper
  iommufd/driver: Let iommufd_viommu_alloc helper save ictx to viommu->ictx
  iommufd/driver: Add iommufd_struct_destroy to revert iommufd_viommu_alloc
  iommufd/selftest: Support user_data in mock_viommu_alloc
  iommufd/selftest: Add coverage for viommu data
  iommufd: Abstract iopt_pin_pages and iopt_unpin_pages helpers
  iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct
  iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl
  iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers
  iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC
  iommufd: Add mmap interface
  iommufd/selftest: Add coverage for the new mmap interface
  Documentation: userspace-api: iommufd: Update HW QUEUE
  iommu/arm-smmu-v3-iommufd: Add vsmmu_alloc impl op
  iommu/arm-smmu-v3-iommufd: Support implementation-defined hw_info
  iommu/tegra241-cmdqv: Use request_threaded_irq
  iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf()
  iommu/tegra241-cmdqv: Do not statically map LVCMDQs
  iommu/tegra241-cmdqv: Add user-space use support
  iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h    |  25 +-
 drivers/iommu/iommufd/io_pagetable.h           |   8 +
 drivers/iommu/iommufd/iommufd_private.h        |  28 +-
 drivers/iommu/iommufd/iommufd_test.h           |  20 +
 include/linux/iommu.h                          |  43 +-
 include/linux/iommufd.h                        | 186 ++++++-
 include/uapi/linux/iommufd.h                   | 116 ++++-
 tools/testing/selftests/iommu/iommufd_utils.h  |  52 +-
 .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c      |  43 +-
 .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c     | 490 +++++++++++++++++-
 drivers/iommu/iommufd/device.c                 | 117 +----
 drivers/iommu/iommufd/driver.c                 |  94 ++++
 drivers/iommu/iommufd/io_pagetable.c           |  95 ++++
 drivers/iommu/iommufd/main.c                   |  80 ++-
 drivers/iommu/iommufd/selftest.c               | 133 ++++-
 drivers/iommu/iommufd/viommu.c                 | 121 ++++-
 tools/testing/selftests/iommu/iommufd.c        |  97 +++-
 .../selftests/iommu/iommufd_fail_nth.c         |  11 +-
 Documentation/userspace-api/iommufd.rst        |  12 +
 19 files changed, 1577 insertions(+), 194 deletions(-)
base-commit: 92a09c47464d040866cf2b4cd052bc60555185fb
To allow IOMMU drivers to allocate their own vDEVICE structures, move struct iommufd_vdevice to the public header and provide a pair of viommu ops.
The iommufd_vdevice_alloc_ioctl will prioritize the callback function from the viommu ops, i.e. a driver-allocated vDEVICE, over the core-allocated one.
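For illustration, a driver might wire up the new ops roughly as below. The my_* names and the HW-programming calls are hypothetical; only the op signatures and the iommufd_vdevice_alloc() macro come from the patch that follows:

#include <linux/iommufd.h>

struct my_vdevice {
	struct iommufd_vdevice core;	/* must be the first member */
	u32 hw_slot;			/* hypothetical per-device HW state */
};

static struct iommufd_vdevice *
my_vdevice_alloc(struct iommufd_viommu *viommu, struct device *dev, u64 virt_id)
{
	struct my_vdevice *vdev;

	vdev = iommufd_vdevice_alloc(viommu, struct my_vdevice, core);
	if (IS_ERR(vdev))
		return ERR_CAST(vdev);

	/* Program HW for the attached physical device using @virt_id */
	vdev->hw_slot = my_hw_program_slot(dev, virt_id);	/* hypothetical */
	return &vdev->core;
}

static void my_vdevice_destroy(struct iommufd_vdevice *core)
{
	struct my_vdevice *vdev = container_of(core, struct my_vdevice, core);

	my_hw_release_slot(vdev->hw_slot);			/* hypothetical */
	/* The iommufd core frees the vDEVICE memory after this op returns */
}

static const struct iommufd_viommu_ops my_viommu_ops = {
	.vdevice_alloc = my_vdevice_alloc,
	.vdevice_destroy = my_vdevice_destroy,
};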
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 drivers/iommu/iommufd/iommufd_private.h |  8 ------
 include/linux/iommufd.h                 | 38 +++++++++++++++++++++++++
 drivers/iommu/iommufd/viommu.c          |  9 +++++-
 3 files changed, 46 insertions(+), 9 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 80e8c76d25f2..5c69ac05c029 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -607,14 +607,6 @@ void iommufd_viommu_destroy(struct iommufd_object *obj); int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd); void iommufd_vdevice_destroy(struct iommufd_object *obj);
-struct iommufd_vdevice { - struct iommufd_object obj; - struct iommufd_ctx *ictx; - struct iommufd_viommu *viommu; - struct device *dev; - u64 id; /* per-vIOMMU virtual ID */ -}; - #ifdef CONFIG_IOMMUFD_TEST int iommufd_test(struct iommufd_ucmd *ucmd); void iommufd_selftest_destroy(struct iommufd_object *obj); diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 34b6e6ca4bfa..422eda95d19e 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -104,6 +104,14 @@ struct iommufd_viommu { unsigned int type; };
+struct iommufd_vdevice { + struct iommufd_object obj; + struct iommufd_ctx *ictx; + struct iommufd_viommu *viommu; + struct device *dev; + u64 id; /* per-vIOMMU virtual ID */ +}; + /** * struct iommufd_viommu_ops - vIOMMU specific operations * @destroy: Clean up all driver-specific parts of an iommufd_viommu. The memory @@ -120,6 +128,15 @@ struct iommufd_viommu { * array->entry_num to report the number of handled requests. * The data structure of the array entry must be defined in * include/uapi/linux/iommufd.h + * @vdevice_alloc: Allocate a vDEVICE object and init its driver-level structure + * or HW procedure. Note that the core-level structure is filled + * by the iommufd core after calling this op. @virt_id carries a + * per-vIOMMU virtual ID (refer to struct iommu_vdevice_alloc in + * include/uapi/linux/iommufd.h) for the driver to initialize HW + * for an attached physical device. + * @vdevice_destroy: Clean up all driver-specific parts of an iommufd_vdevice. + * The memory of the vDEVICE will be free-ed by iommufd core + * after calling this op */ struct iommufd_viommu_ops { void (*destroy)(struct iommufd_viommu *viommu); @@ -128,6 +145,10 @@ struct iommufd_viommu_ops { const struct iommu_user_data *user_data); int (*cache_invalidate)(struct iommufd_viommu *viommu, struct iommu_user_data_array *array); + struct iommufd_vdevice *(*vdevice_alloc)(struct iommufd_viommu *viommu, + struct device *dev, + u64 virt_id); + void (*vdevice_destroy)(struct iommufd_vdevice *vdev); };
#if IS_ENABLED(CONFIG_IOMMUFD) @@ -245,4 +266,21 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu, ret->member.ops = viommu_ops; \ ret; \ }) + +#define iommufd_vdevice_alloc(viommu, drv_struct, member) \ + ({ \ + drv_struct *ret; \ + \ + static_assert(__same_type(struct iommufd_viommu, *viommu)); \ + static_assert(__same_type(struct iommufd_vdevice, \ + ((drv_struct *)NULL)->member)); \ + static_assert(offsetof(drv_struct, member.obj) == 0); \ + ret = (drv_struct *)_iommufd_object_alloc( \ + viommu->ictx, sizeof(drv_struct), IOMMUFD_OBJ_VDEVICE);\ + if (!IS_ERR(ret)) { \ + ret->member.viommu = viommu; \ + ret->member.ictx = viommu->ictx; \ + } \ + ret; \ + }) #endif diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index 01df2b985f02..d4c7c5072e42 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -90,6 +90,9 @@ void iommufd_vdevice_destroy(struct iommufd_object *obj) container_of(obj, struct iommufd_vdevice, obj); struct iommufd_viommu *viommu = vdev->viommu;
+ if (viommu->ops && viommu->ops->vdevice_destroy) + viommu->ops->vdevice_destroy(vdev); + /* xa_cmpxchg is okay to fail if alloc failed xa_cmpxchg previously */ xa_cmpxchg(&viommu->vdevs, vdev->id, vdev, NULL, GFP_KERNEL); refcount_dec(&viommu->obj.users); @@ -124,7 +127,11 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) goto out_put_idev; }
- vdev = iommufd_object_alloc(ucmd->ictx, vdev, IOMMUFD_OBJ_VDEVICE); + if (viommu->ops && viommu->ops->vdevice_alloc) + vdev = viommu->ops->vdevice_alloc(viommu, idev->dev, virt_id); + else + vdev = iommufd_object_alloc(ucmd->ictx, vdev, + IOMMUFD_OBJ_VDEVICE); if (IS_ERR(vdev)) { rc = PTR_ERR(vdev); goto out_put_idev;
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:02 AM
+#define iommufd_vdevice_alloc(viommu, drv_struct, member) \
- ({ \
drv_struct *ret; \
\
static_assert(__same_type(struct iommufd_viommu,
*viommu)); \
static_assert(__same_type(struct iommufd_vdevice, \
((drv_struct *)NULL)->member));
\
static_assert(offsetof(drv_struct, member.obj) == 0); \
ret = (drv_struct *)_iommufd_object_alloc( \
viommu->ictx, sizeof(drv_struct),
IOMMUFD_OBJ_VDEVICE);\
if (!IS_ERR(ret)) { \
ret->member.viommu = viommu; \
ret->member.ictx = viommu->ictx; \
} \
ret; \
- })
viommu->ictx is set in patch5, so logically this is better put after that.
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Thu, May 15, 2025 at 05:42:46AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:02 AM
+#define iommufd_vdevice_alloc(viommu, drv_struct, member) \
- ({ \
drv_struct *ret; \
\
static_assert(__same_type(struct iommufd_viommu,
*viommu)); \
static_assert(__same_type(struct iommufd_vdevice, \
((drv_struct *)NULL)->member));
\
static_assert(offsetof(drv_struct, member.obj) == 0); \
ret = (drv_struct *)_iommufd_object_alloc( \
viommu->ictx, sizeof(drv_struct),
IOMMUFD_OBJ_VDEVICE);\
if (!IS_ERR(ret)) { \
ret->member.viommu = viommu; \
ret->member.ictx = viommu->ictx; \
} \
ret; \
- })
viommu->ictx is set in patch5, so logically this is better put after that.
Reviewed-by: Kevin Tian kevin.tian@intel.com
OK. Will rearrange a bit.
Thanks Nicolin
The new type of vIOMMU for tegra241-cmdqv allows a user space VM to exclusively use one of its virtual command queue HW resources. This requires user space to mmap the corresponding MMIO page from kernel space for direct HW control.
To forward the mmap info (offset and length), iommufd should add a driver-specific data structure to the IOMMUFD_CMD_VIOMMU_ALLOC ioctl, for the driver to output the info back to user space during the allocation.
Similar to the existing ioctls and their IOMMU handlers, add a user_data to the viommu_alloc op to bridge between iommufd and drivers.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
Reviewed-by: Pranjal Shrivastava praan@google.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h         | 3 ++-
 include/linux/iommu.h                               | 3 ++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 3 ++-
 drivers/iommu/iommufd/selftest.c                    | 8 ++++----
 drivers/iommu/iommufd/viommu.c                      | 2 +-
 5 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index dd1ad56ce863..6b8f0d20dac3 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -1062,7 +1062,8 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type); struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev, struct iommu_domain *parent, struct iommufd_ctx *ictx, - unsigned int viommu_type); + unsigned int viommu_type, + const struct iommu_user_data *user_data); int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state, struct arm_smmu_nested_domain *nested_domain); void arm_smmu_attach_commit_vmaster(struct arm_smmu_attach_state *state); diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 3a8d35d41fda..ba7add27e9a0 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -662,7 +662,8 @@ struct iommu_ops {
struct iommufd_viommu *(*viommu_alloc)( struct device *dev, struct iommu_domain *parent_domain, - struct iommufd_ctx *ictx, unsigned int viommu_type); + struct iommufd_ctx *ictx, unsigned int viommu_type, + const struct iommu_user_data *user_data);
const struct iommu_domain_ops *default_domain_ops; unsigned long pgsize_bitmap; diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c index e4fd8d522af8..66855cae775e 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c @@ -385,7 +385,8 @@ static const struct iommufd_viommu_ops arm_vsmmu_ops = { struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev, struct iommu_domain *parent, struct iommufd_ctx *ictx, - unsigned int viommu_type) + unsigned int viommu_type, + const struct iommu_user_data *user_data) { struct arm_smmu_device *smmu = iommu_get_iommu_dev(dev, struct arm_smmu_device, iommu); diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 18d9a216eb30..8b8ba4fb91cd 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -733,10 +733,10 @@ static struct iommufd_viommu_ops mock_viommu_ops = { .cache_invalidate = mock_viommu_cache_invalidate, };
-static struct iommufd_viommu *mock_viommu_alloc(struct device *dev, - struct iommu_domain *domain, - struct iommufd_ctx *ictx, - unsigned int viommu_type) +static struct iommufd_viommu * +mock_viommu_alloc(struct device *dev, struct iommu_domain *domain, + struct iommufd_ctx *ictx, unsigned int viommu_type, + const struct iommu_user_data *user_data) { struct mock_iommu_device *mock_iommu = iommu_get_iommu_dev(dev, struct mock_iommu_device, iommu_dev); diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index d4c7c5072e42..fffa57063c60 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -48,7 +48,7 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd) }
viommu = ops->viommu_alloc(idev->dev, hwpt_paging->common.domain, - ucmd->ictx, cmd->type); + ucmd->ictx, cmd->type, NULL); if (IS_ERR(viommu)) { rc = PTR_ERR(viommu); goto out_put_hwpt;
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:02 AM
The new type of vIOMMU for tegra241-cmdqv allows user space VM to use one of its virtual command queue HW resources exclusively. This requires user space to mmap the corresponding MMIO page from kernel space for direct HW control.
To forward the mmap info (offset and length), iommufd should add a driver specific data structure to the IOMMUFD_CMD_VIOMMU_ALLOC ioctl, for driver to output the info during the allocation back to user space.
Similar to the existing ioctls and their IOMMU handlers, add a user_data to viommu_alloc op to bridge between iommufd and drivers.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Reviewed-by: Lu Baolu baolu.lu@linux.intel.com Reviewed-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
The new type of vIOMMU for the tegra241-cmdqv driver needs driver-specific user data. So, add data_len/uptr to the iommu_viommu_alloc uAPI and pass it in via the viommu_alloc iommu op.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
Acked-by: Pranjal Shrivastava praan@google.com
Acked-by: Alok Tiwari alok.a.tiwari@oracle.com
Reviewed-by: Vasant Hegde vasant.hegde@amd.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 include/uapi/linux/iommufd.h   | 6 ++++++
 drivers/iommu/iommufd/viommu.c | 8 +++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index f29b6c44655e..cc90299a08d9 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -965,6 +965,9 @@ enum iommu_viommu_type { * @dev_id: The device's physical IOMMU will be used to back the virtual IOMMU * @hwpt_id: ID of a nesting parent HWPT to associate to * @out_viommu_id: Output virtual IOMMU ID for the allocated object + * @data_len: Length of the type specific data + * @__reserved: Must be 0 + * @data_uptr: User pointer to an array of driver-specific virtual IOMMU data * * Allocate a virtual IOMMU object, representing the underlying physical IOMMU's * virtualization support that is a security-isolated slice of the real IOMMU HW @@ -985,6 +988,9 @@ struct iommu_viommu_alloc { __u32 dev_id; __u32 hwpt_id; __u32 out_viommu_id; + __u32 data_len; + __u32 __reserved; + __aligned_u64 data_uptr; }; #define IOMMU_VIOMMU_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_ALLOC)
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index fffa57063c60..a65153458a26 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -17,6 +17,11 @@ void iommufd_viommu_destroy(struct iommufd_object *obj) int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd) { struct iommu_viommu_alloc *cmd = ucmd->cmd; + const struct iommu_user_data user_data = { + .type = cmd->type, + .uptr = u64_to_user_ptr(cmd->data_uptr), + .len = cmd->data_len, + }; struct iommufd_hwpt_paging *hwpt_paging; struct iommufd_viommu *viommu; struct iommufd_device *idev; @@ -48,7 +53,8 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd) }
viommu = ops->viommu_alloc(idev->dev, hwpt_paging->common.domain, - ucmd->ictx, cmd->type, NULL); + ucmd->ictx, cmd->type, + user_data.len ? &user_data : NULL); if (IS_ERR(viommu)) { rc = PTR_ERR(viommu); goto out_put_hwpt;
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:02 AM
- @data_len: Length of the type specific data
- @__reserved: Must be 0
- @data_uptr: User pointer to an array of driver-specific virtual IOMMU
data
I don't think there is 'an array'.
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Thu, May 15, 2025 at 05:45:40AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:02 AM
- @data_len: Length of the type specific data
- @__reserved: Must be 0
- @data_uptr: User pointer to an array of driver-specific virtual IOMMU
data
I don't think there is 'an array'.
Reviewed-by: Kevin Tian kevin.tian@intel.com
Ah, copy-n-paste mistake. Will fix.
Thanks! Nicolin
Similar to the iommu_copy_struct_from_user helper receiving data from user space, add an iommu_copy_struct_to_user helper to report output data back to the user space data pointer.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
Reviewed-by: Pranjal Shrivastava praan@google.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 include/linux/iommu.h | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h index ba7add27e9a0..02e9d5904836 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -562,6 +562,46 @@ iommu_copy_struct_from_full_user_array(void *kdst, size_t kdst_entry_size, return 0; }
+/** + * __iommu_copy_struct_to_user - Report iommu driver specific user space data + * @dst_data: Pointer to a struct iommu_user_data for user space data location + * @src_data: Pointer to an iommu driver specific user data that is defined in + * include/uapi/linux/iommufd.h + * @data_type: The data type of the @src_data. Must match with @dst_data.type + * @data_len: Length of current user data structure, i.e. sizeof(struct _src) + * @min_len: Initial length of user data structure for backward compatibility. + * This should be offsetofend using the last member in the user data + * struct that was initially added to include/uapi/linux/iommufd.h + */ +static inline int +__iommu_copy_struct_to_user(const struct iommu_user_data *dst_data, + void *src_data, unsigned int data_type, + size_t data_len, size_t min_len) +{ + if (WARN_ON(!dst_data || !src_data)) + return -EINVAL; + if (dst_data->type != data_type) + return -EINVAL; + if (dst_data->len < min_len || data_len < dst_data->len) + return -EINVAL; + return copy_struct_to_user(dst_data->uptr, dst_data->len, src_data, + data_len, NULL); +} + +/** + * iommu_copy_struct_to_user - Report iommu driver specific user space data + * @user_data: Pointer to a struct iommu_user_data for user space data location + * @ksrc: Pointer to an iommu driver specific user data that is defined in + * include/uapi/linux/iommufd.h + * @data_type: The data type of the @ksrc. Must match with @user_data->type + * @min_last: The last member of the data structure @ksrc points in the initial + * version. + * Return 0 for success, otherwise -error. + */ +#define iommu_copy_struct_to_user(user_data, ksrc, data_type, min_last) \ + __iommu_copy_struct_to_user(user_data, ksrc, data_type, sizeof(*ksrc), \ + offsetofend(typeof(*ksrc), min_last)) + /** * struct iommu_ops - iommu ops and capabilities * @capable: check capability
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:02 AM
Similar to the iommu_copy_struct_from_user helper receiving data from the user space, add an iommu_copy_struct_to_user helper to report output data back to the user space data pointer.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Reviewed-by: Lu Baolu baolu.lu@linux.intel.com Reviewed-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
When an IOMMU driver calls iommufd_viommu_alloc(), it must pass in an ictx pointer, as the underlying _iommufd_object_alloc() helper function requires it to allocate a new object. However, neither iommufd_viommu_alloc() nor its underlying _iommufd_object_alloc() saves the ictx in the allocated viommu object, although the viommu can hold an ictx pointer.
When the IOMMU driver wants to call another iommufd function with the allocated viommu, it could avoid passing in the ictx pointer again, if viommu->ictx were valid.
Save ictx to viommu->ictx in the iommufd_viommu_alloc(), in order to ease a new vIOMMU-based helper that would then get the ictx from viommu->ictx.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 include/linux/iommufd.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 422eda95d19e..dc6535e848df 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -262,8 +262,10 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu, static_assert(offsetof(drv_struct, member.obj) == 0); \ ret = (drv_struct *)_iommufd_object_alloc( \ ictx, sizeof(drv_struct), IOMMUFD_OBJ_VIOMMU); \ - if (!IS_ERR(ret)) \ + if (!IS_ERR(ret)) { \ ret->member.ops = viommu_ops; \ + ret->member.ictx = ictx; \ + } \ ret; \ })
On Thu, May 08, 2025 at 08:02:26PM -0700, Nicolin Chen wrote:
When an IOMMU driver calls iommufd_viommu_alloc(), it must pass in an ictx pointer as the underlying _iommufd_object_alloc() helper function requires that to allocate a new object. However, neither the iommufd_viommu_alloc() nor its underlying _iommufd_object_alloc() saves the ictx in the allocated viommu object, although viommu could hold an ictx pointer.
When the IOMMU driver wants to use another iommufd function passing in the allocated viommu, it could have avoided passing in the ictx pointer again, if viommu->ictx is valid.
Save ictx to viommu->ictx in the iommufd_viommu_alloc(), in order to ease a new vIOMMU-based helper that would then get the ictx from viommu->ictx.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
include/linux/iommufd.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
It is OK, but please think carefully if the ictx is actually needed. The reason most objects don't have an ictx in them is because their contexts are always inside an ioctl so they get the ictx from there.
We don't have a lot of viommu allocations so it isn't such a big deal, but just generally that is how it was built that the ictx comes from the ioctl not the object.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Jason
On Wed, May 14, 2025 at 02:06:37PM -0300, Jason Gunthorpe wrote:
On Thu, May 08, 2025 at 08:02:26PM -0700, Nicolin Chen wrote:
When an IOMMU driver calls iommufd_viommu_alloc(), it must pass in an ictx pointer as the underlying _iommufd_object_alloc() helper function requires that to allocate a new object. However, neither the iommufd_viommu_alloc() nor its underlying _iommufd_object_alloc() saves the ictx in the allocated viommu object, although viommu could hold an ictx pointer.
When the IOMMU driver wants to use another iommufd function passing in the allocated viommu, it could have avoided passing in the ictx pointer again, if viommu->ictx is valid.
Save ictx to viommu->ictx in the iommufd_viommu_alloc(), in order to ease a new vIOMMU-based helper that would then get the ictx from viommu->ictx.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
include/linux/iommufd.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
It is OK, but please think carefully if the ictx is actually needed. The reason most objects don't have an ictx in them is because their contexts are always inside an ioctl so they get the ictx from there.
We don't have a lot of viommu allocations so it isn't such a big deal, but just generally that is how it was built that the ictx comes from the ioctl not the object.
It's more about providing cleaner for-driver APIs.
Especially with the new ucmd changes, it can be just:

// in my_viommu_alloc()
	my_viommu = iommufd_viommu_alloc(ucmd, ...);
	iommufd_viommu_alloc_mmap(my_viommu, ...);
	iommufd_viommu_destroy_mmap(my_viommu, ...);
// in my_viommu_destroy()
	iommufd_viommu_destroy_mmap(my_viommu, ...);
Otherwise, they would be something like:

// in my_viommu_alloc()
	my_viommu = iommufd_viommu_alloc(ucmd, ...);
	iommufd_viommu_alloc_mmap(ucmd->ictx, my_viommu, ...);
	iommufd_viommu_destroy_mmap(ucmd->ictx, my_viommu, ...);
// in my_viommu_destroy()
	iommufd_viommu_destroy_mmap(my_viommu->ictx, my_viommu, ...);
That being said, I ended up doing something more in this patch for the ucmd rework and Kevin's comments:
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 98fe5a49c9606..ddd144031af5d 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -136,14 +136,6 @@ int iopt_pin_pages(struct io_pagetable *iopt, unsigned long iova, void iopt_unpin_pages(struct io_pagetable *iopt, unsigned long iova, unsigned long length);
-struct iommufd_ucmd { - struct iommufd_ctx *ictx; - void __user *ubuffer; - u32 user_size; - void *cmd; - struct iommufd_object *alloced_obj; -}; - int iommufd_vfio_ioctl(struct iommufd_ctx *ictx, unsigned int cmd, unsigned long arg);
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index 416cd281eb2b5..6c0a2a0885a32 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -60,9 +60,14 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd) goto out_put_hwpt; }
+ /* The iommufd_viommu_alloc helper saves ucmd->ictx in viommu->ictx */ + if (WARN_ON_ONCE(viommu->ictx != ucmd->ictx)) { + rc = -EINVAL; + goto out_put_hwpt; + } + xa_init(&viommu->vdevs); viommu->type = cmd->type; - viommu->ictx = ucmd->ictx; viommu->hwpt = hwpt_paging; refcount_inc(&viommu->hwpt->common.obj.users); INIT_LIST_HEAD(&viommu->veventqs); diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index a24b924cf6872..74d11c6779739 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -52,6 +52,14 @@ struct iommufd_object { unsigned int id; };
+struct iommufd_ucmd { + struct iommufd_ctx *ictx; + void __user *ubuffer; + u32 user_size; + void *cmd; + struct iommufd_object *alloced_obj; +}; + struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx, struct device *dev, u32 *id); void iommufd_device_unbind(struct iommufd_device *idev); @@ -252,8 +260,10 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu, static_assert(offsetof(drv_struct, member.obj) == 0); \ ret = (drv_struct *)_iommufd_object_alloc_ucmd( \ ucmd, sizeof(drv_struct), IOMMUFD_OBJ_VIOMMU); \ - if (!IS_ERR(ret)) \ + if (!IS_ERR(ret)) { \ ret->member.ops = viommu_ops; \ + ret->member.ictx = ucmd->ictx; \ + } \ ret; \ }) #endif
Thanks Nicolin
On Thu, May 15, 2025 at 07:05:45PM -0700, Nicolin Chen wrote:
Otherwise, they would be something like:

// in my_viommu_alloc()
	my_viommu = iommufd_viommu_alloc(ucmd, ...);
	iommufd_viommu_alloc_mmap(ucmd->ictx, my_viommu, ...);
	iommufd_viommu_destroy_mmap(ucmd->ictx, my_viommu, ...);
// in my_viommu_destroy()
	iommufd_viommu_destroy_mmap(my_viommu->ictx, my_viommu, ...);
Yes, this is how the other objects work..
That being said, I ended up doing something more in this patch for the ucmd rework and Kevin's comments:
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 98fe5a49c9606..ddd144031af5d 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -136,14 +136,6 @@ int iopt_pin_pages(struct io_pagetable *iopt, unsigned long iova, void iopt_unpin_pages(struct io_pagetable *iopt, unsigned long iova, unsigned long length); -struct iommufd_ucmd {
- struct iommufd_ctx *ictx;
- void __user *ubuffer;
- u32 user_size;
- void *cmd;
- struct iommufd_object *alloced_obj;
-};
You don't need to move this unless you are using inlines. Just use a forward declaration.
Jason
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:02 AM
When an IOMMU driver calls iommufd_viommu_alloc(), it must pass in an ictx pointer as the underlying _iommufd_object_alloc() helper function requires that to allocate a new object. However, neither the iommufd_viommu_alloc() nor its underlying _iommufd_object_alloc() saves the ictx in the allocated viommu object, although viommu could hold an ictx pointer.
When the IOMMU driver wants to use another iommufd function passing in the allocated viommu, it could have avoided passing in the ictx pointer again, if viommu->ictx is valid.
Save ictx to viommu->ictx in the iommufd_viommu_alloc(), in order to ease a new vIOMMU-based helper that would then get the ictx from viommu->ictx.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
An IOMMU driver that allocated a vIOMMU may want to revert the allocation if it encounters an internal error afterwards. So, there needs to be a destroy helper for drivers to use. For instance:
static my_viommu_alloc()
{
	...
	my_viommu = iommufd_viommu_alloc(viomm, struct my_viommu, core);
	...
	ret = init_my_viommu();
	if (ret) {
		/* Need to destroy the my_viommu->core */
		return ERR_PTR(ret);
	}
	return &my_viommu->core;
}
Move iommufd_object_abort() to the driver.c file and the public header, to introduce a common iommufd_struct_destroy() helper that can abort all kinds of driver structures, not confined to iommufd_viommu but also covering the new ones being added in the future.
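With that helper, the error path of the example above can then be written as follows (a sketch reusing the same hypothetical my_viommu names):

	ret = init_my_viommu();
	if (ret) {
		iommufd_struct_destroy(my_viommu, core);
		return ERR_PTR(ret);
	}
	return &my_viommu->core;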
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 drivers/iommu/iommufd/iommufd_private.h |  1 -
 include/linux/iommufd.h                 | 16 ++++++++++++++++
 drivers/iommu/iommufd/driver.c          | 14 ++++++++++++++
 drivers/iommu/iommufd/main.c            | 13 -------------
 4 files changed, 30 insertions(+), 14 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 5c69ac05c029..8d96aa514033 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -180,7 +180,6 @@ static inline void iommufd_put_object(struct iommufd_ctx *ictx, wake_up_interruptible_all(&ictx->destroy_wait); }
-void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj); void iommufd_object_abort_and_destroy(struct iommufd_ctx *ictx, struct iommufd_object *obj); void iommufd_object_finalize(struct iommufd_ctx *ictx, diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index dc6535e848df..f6f333eaaae3 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -213,6 +213,7 @@ static inline int iommufd_vfio_compat_set_no_iommu(struct iommufd_ctx *ictx) struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx, size_t size, enum iommufd_object_type type); +void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj); struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id); int iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu, @@ -228,6 +229,11 @@ _iommufd_object_alloc(struct iommufd_ctx *ictx, size_t size, return ERR_PTR(-EOPNOTSUPP); }
+static inline void iommufd_object_abort(struct iommufd_ctx *ictx, + struct iommufd_object *obj) +{ +} + static inline struct device * iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) { @@ -285,4 +291,14 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu, } \ ret; \ }) + +/* Helper for IOMMU driver to destroy structures created by allocators above */ +#define iommufd_struct_destroy(drv_struct, member) \ + ({ \ + static_assert(__same_type(struct iommufd_object, \ + drv_struct->member.obj)); \ + static_assert(offsetof(typeof(*drv_struct), member.obj) == 0); \ + iommufd_object_abort(drv_struct->member.ictx, \ + &drv_struct->member.obj); \ + }) #endif diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c index 922cd1fe7ec2..7980a09761c2 100644 --- a/drivers/iommu/iommufd/driver.c +++ b/drivers/iommu/iommufd/driver.c @@ -36,6 +36,20 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx, } EXPORT_SYMBOL_NS_GPL(_iommufd_object_alloc, "IOMMUFD");
+/* Undo _iommufd_object_alloc() if iommufd_object_finalize() was not called */ +void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj) +{ + XA_STATE(xas, &ictx->objects, obj->id); + void *old; + + xa_lock(&ictx->objects); + old = xas_store(&xas, NULL); + xa_unlock(&ictx->objects); + WARN_ON(old != XA_ZERO_ENTRY); + kfree(obj); +} +EXPORT_SYMBOL_NS_GPL(iommufd_object_abort, "IOMMUFD"); + /* Caller should xa_lock(&viommu->vdevs) to protect the return value */ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 3df468f64e7d..2b9ee9b4a424 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -51,19 +51,6 @@ void iommufd_object_finalize(struct iommufd_ctx *ictx, WARN_ON(old != XA_ZERO_ENTRY); }
-/* Undo _iommufd_object_alloc() if iommufd_object_finalize() was not called */ -void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj) -{ - XA_STATE(xas, &ictx->objects, obj->id); - void *old; - - xa_lock(&ictx->objects); - old = xas_store(&xas, NULL); - xa_unlock(&ictx->objects); - WARN_ON(old != XA_ZERO_ENTRY); - kfree(obj); -} - /* * Abort an object that has been fully initialized and needs destroy, but has * not been finalized.
On Thu, May 08, 2025 at 08:02:27PM -0700, Nicolin Chen wrote:
An IOMMU driver that allocated a vIOMMU may want to revert the allocation, if it encounters an internal error after the allocation. So, there needs a destroy helper for drivers to use. For instance:
static my_viommu_alloc() { ... my_viommu = iommufd_viommu_alloc(viomm, struct my_viommu, core); ... ret = init_my_viommu(); if (ret) { /* Need to destroy the my_viommu->core */ return ERR_PTR(ret); } return &my_viommu->core; }
Move iommufd_object_abort() to the driver.c file and the public header, to introduce common iommufd_struct_destroy() helper that will abort all kinds of driver structures, not confined to iommufd_viommu but also the new ones being added in the future.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Reviewed-by: Lu Baolu baolu.lu@linux.intel.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/iommufd_private.h | 1 - include/linux/iommufd.h | 16 ++++++++++++++++ drivers/iommu/iommufd/driver.c | 14 ++++++++++++++ drivers/iommu/iommufd/main.c | 13 ------------- 4 files changed, 30 insertions(+), 14 deletions(-)
One idea that struck me when I was looking at this was to copy the technique from rdma.
When an object is allocated we keep track of it in the struct iommufd_ucmd.
Then when the command is over the core code either aborts or finalizes the objects in the iommufd_ucmd. This way you don't have to expose abort and related to drivers.
Something like this:
diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c index 2d2695e2562d02..289dd19eec90f1 100644 --- a/drivers/iommu/iommufd/driver.c +++ b/drivers/iommu/iommufd/driver.c @@ -36,6 +36,16 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx, } EXPORT_SYMBOL_NS_GPL(_iommufd_object_alloc, "IOMMUFD");
+struct iommufd_object *_iommufd_object_alloc_ucmd(struct iommufd_ucmd *ucmd, + size_t size, + enum iommufd_object_type type) +{ + if (ucmd->alloced_obj) + return ERR_PTR(-EBUSY); + ucmd->alloced_obj = _iommufd_object_alloc(ucmd->ictx, size, type); + return ucmd->alloced_obj; +} + /* Undo _iommufd_object_alloc() if iommufd_object_finalize() was not called */ void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj) { diff --git a/drivers/iommu/iommufd/eventq.c b/drivers/iommu/iommufd/eventq.c index f39cf079734769..f109948a81ac8c 100644 --- a/drivers/iommu/iommufd/eventq.c +++ b/drivers/iommu/iommufd/eventq.c @@ -473,7 +473,7 @@ int iommufd_fault_alloc(struct iommufd_ucmd *ucmd) if (cmd->flags) return -EOPNOTSUPP;
- fault = __iommufd_object_alloc(ucmd->ictx, fault, IOMMUFD_OBJ_FAULT, + fault = __iommufd_object_alloc_ucmd(ucmd, fault, IOMMUFD_OBJ_FAULT, common.obj); if (IS_ERR(fault)) return PTR_ERR(fault); @@ -483,10 +483,8 @@ int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
fdno = iommufd_eventq_init(&fault->common, "[iommufd-pgfault]", ucmd->ictx, &iommufd_fault_fops); - if (fdno < 0) { - rc = fdno; - goto out_abort; - } + if (fdno < 0) + return fdno;
cmd->out_fault_id = fault->common.obj.id; cmd->out_fault_fd = fdno; @@ -494,7 +492,6 @@ int iommufd_fault_alloc(struct iommufd_ucmd *ucmd) rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd)); if (rc) goto out_put_fdno; - iommufd_object_finalize(ucmd->ictx, &fault->common.obj);
fd_install(fdno, fault->common.filep);
@@ -502,9 +499,6 @@ int iommufd_fault_alloc(struct iommufd_ucmd *ucmd) out_put_fdno: put_unused_fd(fdno); fput(fault->common.filep); -out_abort: - iommufd_object_abort_and_destroy(ucmd->ictx, &fault->common.obj); - return rc; }
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index c87326d7ecfccb..94cafae86d271c 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -152,6 +152,7 @@ struct iommufd_ucmd { void __user *ubuffer; u32 user_size; void *cmd; + struct iommufd_object *alloced_obj; };
int iommufd_vfio_ioctl(struct iommufd_ctx *ictx, unsigned int cmd, @@ -258,6 +259,15 @@ iommufd_object_put_and_try_destroy(struct iommufd_ctx *ictx, #define iommufd_object_alloc(ictx, ptr, type) \ __iommufd_object_alloc(ictx, ptr, type, obj)
+#define __iommufd_object_alloc_uctx(uctx, ptr, type, obj) \ + container_of(_iommufd_object_alloc_ucmd( \ + uctx, \ + sizeof(*(ptr)) + BUILD_BUG_ON_ZERO( \ + offsetof(typeof(*(ptr)), \ + obj) != 0), \ + type), \ + typeof(*(ptr)), obj) + /* * The IO Address Space (IOAS) pagetable is a virtual page table backed by the * io_pagetable object. It is a user controlled mapping of IOVA -> PFNs. The diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 1d7f3584aea0f7..0bc9c02f6bfc4f 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -408,6 +408,14 @@ static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd, if (ret) return ret; ret = op->execute(&ucmd); + + if (ucmd.alloced_obj) { + if (ret) + iommufd_object_abort_and_destroy(ictx, + ucmd.alloced_obj); + else + iommufd_object_finalize(ictx, ucmd.alloced_obj); + } return ret; }
On Wed, May 14, 2025 at 03:26:00PM -0300, Jason Gunthorpe wrote:
On Thu, May 08, 2025 at 08:02:27PM -0700, Nicolin Chen wrote:
An IOMMU driver that allocated a vIOMMU may want to revert the allocation, if it encounters an internal error after the allocation. So, there needs a destroy helper for drivers to use. For instance:
static my_viommu_alloc() { ... my_viommu = iommufd_viommu_alloc(viomm, struct my_viommu, core); ... ret = init_my_viommu(); if (ret) { /* Need to destroy the my_viommu->core */ return ERR_PTR(ret); } return &my_viommu->core; }
Move iommufd_object_abort() to the driver.c file and the public header, to introduce common iommufd_struct_destroy() helper that will abort all kinds of driver structures, not confined to iommufd_viommu but also the new ones being added in the future.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Reviewed-by: Lu Baolu baolu.lu@linux.intel.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/iommufd_private.h | 1 - include/linux/iommufd.h | 16 ++++++++++++++++ drivers/iommu/iommufd/driver.c | 14 ++++++++++++++ drivers/iommu/iommufd/main.c | 13 ------------- 4 files changed, 30 insertions(+), 14 deletions(-)
One idea that struck me when I was looking at this was to copy the technique from rdma.
When an object is allocated we keep track of it in the struct iommufd_ucmd.
Then when the command is over the core code either aborts or finalizes the objects in the iommufd_ucmd. This way you don't have to expose abort and related to drivers.
I see! Do you want this to apply to all objects or just driver-allocated ones?
We would need a bigger preparatory series to roll that out to all the allocators, and need to be careful with the existing abort() calls that intertwine with other steps like an unlock().
Thanks Nicolin
On Wed, May 14, 2025 at 12:21:37PM -0700, Nicolin Chen wrote:
Then when the command is over the core code either aborts or finalizes the objects in the iommufd_ucmd. This way you don't have to expose abort and related to drivers.
I see! Do you want this to apply to the all objects or just driver allocated ones?
I would do all the ones that can work that way easily
I think it would just be one patch, replace this patch with that one.
We would need a bigger preparatory series to roll out that to all the allocators, and need to be careful at the existing abort() that intertwines with other steps like an unlock().
Those cases with special locking couldn't use it.
Jason
On Thu, May 15, 2025 at 09:49:11AM -0300, Jason Gunthorpe wrote:
On Wed, May 14, 2025 at 12:21:37PM -0700, Nicolin Chen wrote:
Then when the command is over the core code either aborts or finalizes the objects in the iommufd_ucmd. This way you don't have to expose abort and related to drivers.
I see! Do you want this to apply to the all objects or just driver allocated ones?
I would do all the ones that can work that way easily
I think it would just be one patch, replace this patch with that one.
We would need a bigger preparatory series to roll out that to all the allocators, and need to be careful at the existing abort() that intertwines with other steps like an unlock().
Those cases with special locking couldn't use it.
OK. I will add a patch in v5, replacing this one.
Thanks Nicolin
Add a simple user_data for an input-to-output loopback test.
Reviewed-by: Pranjal Shrivastava praan@google.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 drivers/iommu/iommufd/iommufd_test.h | 13 +++++++++++++
 drivers/iommu/iommufd/selftest.c     | 19 +++++++++++++++++++
 2 files changed, 32 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index 1cd7e8394129..fbf9ecb35a13 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -227,6 +227,19 @@ struct iommu_hwpt_invalidate_selftest {
#define IOMMU_VIOMMU_TYPE_SELFTEST 0xdeadbeef
+/** + * struct iommu_viommu_selftest - vIOMMU data for Mock driver + * (IOMMU_VIOMMU_TYPE_SELFTEST) + * @in_data: Input random data from user space + * @out_data: Output data (matching @in_data) to user space + * + * Simply set @out_data=@in_data for a loopback test + */ +struct iommu_viommu_selftest { + __u32 in_data; + __u32 out_data; +}; + /* Should not be equal to any defined value in enum iommu_viommu_invalidate_data_type */ #define IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST 0xdeadbeef #define IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST_INVALID 0xdadbeef diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 8b8ba4fb91cd..033345f81df9 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -740,16 +740,35 @@ mock_viommu_alloc(struct device *dev, struct iommu_domain *domain, { struct mock_iommu_device *mock_iommu = iommu_get_iommu_dev(dev, struct mock_iommu_device, iommu_dev); + struct iommu_viommu_selftest data; struct mock_viommu *mock_viommu; + int rc;
if (viommu_type != IOMMU_VIOMMU_TYPE_SELFTEST) return ERR_PTR(-EOPNOTSUPP);
+ if (user_data) { + rc = iommu_copy_struct_from_user( + &data, user_data, IOMMU_VIOMMU_TYPE_SELFTEST, out_data); + if (rc) + return ERR_PTR(rc); + } + mock_viommu = iommufd_viommu_alloc(ictx, struct mock_viommu, core, &mock_viommu_ops); if (IS_ERR(mock_viommu)) return ERR_CAST(mock_viommu);
+ if (user_data) { + data.out_data = data.in_data; + rc = iommu_copy_struct_to_user( + user_data, &data, IOMMU_VIOMMU_TYPE_SELFTEST, out_data); + if (rc) { + iommufd_struct_destroy(mock_viommu, core); + return ERR_PTR(rc); + } + } + refcount_inc(&mock_iommu->users); return &mock_viommu->core; }
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:02 AM
Add a simple user_data for an input-to-output loopback test.
Reviewed-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
Extend the existing test_cmd/err_viommu_alloc helpers to accept optional user data. And add a TEST_F for a loopback test.
Reviewed-by: Pranjal Shrivastava praan@google.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 tools/testing/selftests/iommu/iommufd_utils.h | 21 ++++++++-----
 tools/testing/selftests/iommu/iommufd.c       | 30 ++++++++++++++-----
 .../selftests/iommu/iommufd_fail_nth.c        |  5 ++--
 3 files changed, 39 insertions(+), 17 deletions(-)
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h index 72f6636e5d90..a5d4cbd089ba 100644 --- a/tools/testing/selftests/iommu/iommufd_utils.h +++ b/tools/testing/selftests/iommu/iommufd_utils.h @@ -897,7 +897,8 @@ static int _test_cmd_trigger_iopf(int fd, __u32 device_id, __u32 pasid, pasid, fault_fd))
static int _test_cmd_viommu_alloc(int fd, __u32 device_id, __u32 hwpt_id, - __u32 type, __u32 flags, __u32 *viommu_id) + __u32 flags, __u32 type, void *data, + __u32 data_len, __u32 *viommu_id) { struct iommu_viommu_alloc cmd = { .size = sizeof(cmd), @@ -905,6 +906,8 @@ static int _test_cmd_viommu_alloc(int fd, __u32 device_id, __u32 hwpt_id, .type = type, .dev_id = device_id, .hwpt_id = hwpt_id, + .data_uptr = (uint64_t)data, + .data_len = data_len, }; int ret;
@@ -916,13 +919,15 @@ static int _test_cmd_viommu_alloc(int fd, __u32 device_id, __u32 hwpt_id, return 0; }
-#define test_cmd_viommu_alloc(device_id, hwpt_id, type, viommu_id) \ - ASSERT_EQ(0, _test_cmd_viommu_alloc(self->fd, device_id, hwpt_id, \ - type, 0, viommu_id)) -#define test_err_viommu_alloc(_errno, device_id, hwpt_id, type, viommu_id) \ - EXPECT_ERRNO(_errno, \ - _test_cmd_viommu_alloc(self->fd, device_id, hwpt_id, \ - type, 0, viommu_id)) +#define test_cmd_viommu_alloc(device_id, hwpt_id, type, data, data_len, \ + viommu_id) \ + ASSERT_EQ(0, _test_cmd_viommu_alloc(self->fd, device_id, hwpt_id, 0, \ + type, data, data_len, viommu_id)) +#define test_err_viommu_alloc(_errno, device_id, hwpt_id, type, data, \ + data_len, viommu_id) \ + EXPECT_ERRNO(_errno, \ + _test_cmd_viommu_alloc(self->fd, device_id, hwpt_id, 0, \ + type, data, data_len, viommu_id))
static int _test_cmd_vdevice_alloc(int fd, __u32 viommu_id, __u32 idev_id, __u64 virt_id, __u32 *vdev_id) diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 1a8e85afe9aa..666a91d53195 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -2688,7 +2688,7 @@ FIXTURE_SETUP(iommufd_viommu)
/* Allocate a vIOMMU taking refcount of the parent hwpt */ test_cmd_viommu_alloc(self->device_id, self->hwpt_id, - IOMMU_VIOMMU_TYPE_SELFTEST, + IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0, &self->viommu_id);
/* Allocate a regular nested hwpt */ @@ -2727,24 +2727,27 @@ TEST_F(iommufd_viommu, viommu_negative_tests) if (self->device_id) { /* Negative test -- invalid hwpt (hwpt_id=0) */ test_err_viommu_alloc(ENOENT, device_id, 0, - IOMMU_VIOMMU_TYPE_SELFTEST, NULL); + IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0, + NULL);
/* Negative test -- not a nesting parent hwpt */ test_cmd_hwpt_alloc(device_id, ioas_id, 0, &hwpt_id); test_err_viommu_alloc(EINVAL, device_id, hwpt_id, - IOMMU_VIOMMU_TYPE_SELFTEST, NULL); + IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0, + NULL); test_ioctl_destroy(hwpt_id);
/* Negative test -- unsupported viommu type */ test_err_viommu_alloc(EOPNOTSUPP, device_id, self->hwpt_id, - 0xdead, NULL); + 0xdead, NULL, 0, NULL); EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, self->hwpt_id)); EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, self->viommu_id)); } else { test_err_viommu_alloc(ENOENT, self->device_id, self->hwpt_id, - IOMMU_VIOMMU_TYPE_SELFTEST, NULL); + IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0, + NULL); } }
@@ -2791,6 +2794,20 @@ TEST_F(iommufd_viommu, viommu_alloc_nested_iopf) } }
+TEST_F(iommufd_viommu, viommu_alloc_with_data) +{ + struct iommu_viommu_selftest data = { + .in_data = 0xbeef, + }; + + if (self->device_id) { + test_cmd_viommu_alloc(self->device_id, self->hwpt_id, + IOMMU_VIOMMU_TYPE_SELFTEST, &data, + sizeof(data), &self->viommu_id); + assert(data.out_data == data.in_data); + } +} + TEST_F(iommufd_viommu, vdevice_alloc) { uint32_t viommu_id = self->viommu_id; @@ -3105,8 +3122,7 @@ TEST_F(iommufd_device_pasid, pasid_attach)
/* Allocate a regular nested hwpt based on viommu */ test_cmd_viommu_alloc(self->device_id, parent_hwpt_id, - IOMMU_VIOMMU_TYPE_SELFTEST, - &viommu_id); + IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0, &viommu_id); test_cmd_hwpt_alloc_nested(self->device_id, viommu_id, IOMMU_HWPT_ALLOC_PASID, &nested_hwpt_id[2], diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c index e11ec4b121fc..f7ccf1822108 100644 --- a/tools/testing/selftests/iommu/iommufd_fail_nth.c +++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c @@ -688,8 +688,9 @@ TEST_FAIL_NTH(basic_fail_nth, device) IOMMU_HWPT_DATA_NONE, 0, 0)) return -1;
- if (_test_cmd_viommu_alloc(self->fd, idev_id, hwpt_id, - IOMMU_VIOMMU_TYPE_SELFTEST, 0, &viommu_id)) + if (_test_cmd_viommu_alloc(self->fd, idev_id, hwpt_id, 0, + IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0, + &viommu_id)) return -1;
if (_test_cmd_vdevice_alloc(self->fd, viommu_id, idev_id, 0, &vdev_id))
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:02 AM
Extend the existing test_cmd/err_viommu_alloc helpers to accept optional user data. And add a TEST_F for a loopback test.
Reviewed-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
The new HW QUEUE object will be added for HW to access the guest queue for the HW-accelerated virtualization feature. It needs to ensure that the guest memory pages are pinned while HW accesses them and that they are contiguous in physical address space.
This is very similar to the existing iommufd_access_pin_pages() that outputs the pinned page list for the caller to test its contiguity.
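For instance, a contiguity check on such a pinned page list could look like the sketch below (a hypothetical helper for illustration only, not a function from this series):

#include <linux/mm.h>

/*
 * Hypothetical illustration: verify that a list of pinned pages is
 * physically contiguous, which is the property the HW QUEUE allocator
 * needs before handing the queue memory to the IOMMU HW.
 */
static bool pages_are_contiguous(struct page **pages, unsigned long npages)
{
	unsigned long i;

	for (i = 1; i < npages; i++) {
		if (page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1)
			return false;
	}
	return true;
}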
Move that code from iommufd_access_pin/unpin_pages() and related functions into a pair of iopt helpers that can be shared with the HW QUEUE allocator.
Rename check_area_prot() to align with the existing iopt_area helpers, and inline it in the header since iommufd_access_rw() still uses it.
Reviewed-by: Pranjal Shrivastava praan@google.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 drivers/iommu/iommufd/io_pagetable.h    |   8 ++
 drivers/iommu/iommufd/iommufd_private.h |   6 ++
 drivers/iommu/iommufd/device.c          | 117 ++----------------------
 drivers/iommu/iommufd/io_pagetable.c    |  95 +++++++++++++++++++
 4 files changed, 117 insertions(+), 109 deletions(-)
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h index 10c928a9a463..4288a2b1a90f 100644 --- a/drivers/iommu/iommufd/io_pagetable.h +++ b/drivers/iommu/iommufd/io_pagetable.h @@ -114,6 +114,14 @@ static inline unsigned long iopt_area_iova_to_index(struct iopt_area *area, return iopt_area_start_byte(area, iova) / PAGE_SIZE; }
+static inline bool iopt_area_check_prot(struct iopt_area *area, + unsigned int flags) +{ + if (flags & IOMMUFD_ACCESS_RW_WRITE) + return area->iommu_prot & IOMMU_WRITE; + return area->iommu_prot & IOMMU_READ; +} + #define __make_iopt_iter(name) \ static inline struct iopt_##name *iopt_##name##_iter_first( \ struct io_pagetable *iopt, unsigned long start, \ diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 8d96aa514033..79160b039bc7 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -130,6 +130,12 @@ int iopt_cut_iova(struct io_pagetable *iopt, unsigned long *iovas, void iopt_enable_large_pages(struct io_pagetable *iopt); int iopt_disable_large_pages(struct io_pagetable *iopt);
+int iopt_pin_pages(struct io_pagetable *iopt, unsigned long iova, + unsigned long length, struct page **out_pages, + unsigned int flags); +void iopt_unpin_pages(struct io_pagetable *iopt, unsigned long iova, + unsigned long length); + struct iommufd_ucmd { struct iommufd_ctx *ictx; void __user *ubuffer; diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 2111bad72c72..a5c6be164254 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -1240,58 +1240,17 @@ void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova, void iommufd_access_unpin_pages(struct iommufd_access *access, unsigned long iova, unsigned long length) { - struct iopt_area_contig_iter iter; - struct io_pagetable *iopt; - unsigned long last_iova; - struct iopt_area *area; - - if (WARN_ON(!length) || - WARN_ON(check_add_overflow(iova, length - 1, &last_iova))) - return; - - mutex_lock(&access->ioas_lock); + guard(mutex)(&access->ioas_lock); /* * The driver must be doing something wrong if it calls this before an * iommufd_access_attach() or after an iommufd_access_detach(). */ - if (WARN_ON(!access->ioas_unpin)) { - mutex_unlock(&access->ioas_lock); + if (WARN_ON(!access->ioas_unpin)) return; - } - iopt = &access->ioas_unpin->iopt; - - down_read(&iopt->iova_rwsem); - iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) - iopt_area_remove_access( - area, iopt_area_iova_to_index(area, iter.cur_iova), - iopt_area_iova_to_index( - area, - min(last_iova, iopt_area_last_iova(area)))); - WARN_ON(!iopt_area_contig_done(&iter)); - up_read(&iopt->iova_rwsem); - mutex_unlock(&access->ioas_lock); + iopt_unpin_pages(&access->ioas_unpin->iopt, iova, length); } EXPORT_SYMBOL_NS_GPL(iommufd_access_unpin_pages, "IOMMUFD");
-static bool iopt_area_contig_is_aligned(struct iopt_area_contig_iter *iter) -{ - if (iopt_area_start_byte(iter->area, iter->cur_iova) % PAGE_SIZE) - return false; - - if (!iopt_area_contig_done(iter) && - (iopt_area_start_byte(iter->area, iopt_area_last_iova(iter->area)) % - PAGE_SIZE) != (PAGE_SIZE - 1)) - return false; - return true; -} - -static bool check_area_prot(struct iopt_area *area, unsigned int flags) -{ - if (flags & IOMMUFD_ACCESS_RW_WRITE) - return area->iommu_prot & IOMMU_WRITE; - return area->iommu_prot & IOMMU_READ; -} - /** * iommufd_access_pin_pages() - Return a list of pages under the iova * @access: IOAS access to act on @@ -1315,76 +1274,16 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova, unsigned long length, struct page **out_pages, unsigned int flags) { - struct iopt_area_contig_iter iter; - struct io_pagetable *iopt; - unsigned long last_iova; - struct iopt_area *area; - int rc; - /* Driver's ops don't support pin_pages */ if (IS_ENABLED(CONFIG_IOMMUFD_TEST) && WARN_ON(access->iova_alignment != PAGE_SIZE || !access->ops->unmap)) return -EINVAL;
- if (!length) - return -EINVAL; - if (check_add_overflow(iova, length - 1, &last_iova)) - return -EOVERFLOW; - - mutex_lock(&access->ioas_lock); - if (!access->ioas) { - mutex_unlock(&access->ioas_lock); + guard(mutex)(&access->ioas_lock); + if (!access->ioas) return -ENOENT; - } - iopt = &access->ioas->iopt; - - down_read(&iopt->iova_rwsem); - iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) { - unsigned long last = min(last_iova, iopt_area_last_iova(area)); - unsigned long last_index = iopt_area_iova_to_index(area, last); - unsigned long index = - iopt_area_iova_to_index(area, iter.cur_iova); - - if (area->prevent_access || - !iopt_area_contig_is_aligned(&iter)) { - rc = -EINVAL; - goto err_remove; - } - - if (!check_area_prot(area, flags)) { - rc = -EPERM; - goto err_remove; - } - - rc = iopt_area_add_access(area, index, last_index, out_pages, - flags); - if (rc) - goto err_remove; - out_pages += last_index - index + 1; - } - if (!iopt_area_contig_done(&iter)) { - rc = -ENOENT; - goto err_remove; - } - - up_read(&iopt->iova_rwsem); - mutex_unlock(&access->ioas_lock); - return 0; - -err_remove: - if (iova < iter.cur_iova) { - last_iova = iter.cur_iova - 1; - iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) - iopt_area_remove_access( - area, - iopt_area_iova_to_index(area, iter.cur_iova), - iopt_area_iova_to_index( - area, min(last_iova, - iopt_area_last_iova(area)))); - } - up_read(&iopt->iova_rwsem); - mutex_unlock(&access->ioas_lock); - return rc; + return iopt_pin_pages(&access->ioas->iopt, iova, length, out_pages, + flags); } EXPORT_SYMBOL_NS_GPL(iommufd_access_pin_pages, "IOMMUFD");
@@ -1431,7 +1330,7 @@ int iommufd_access_rw(struct iommufd_access *access, unsigned long iova, goto err_out; }
- if (!check_area_prot(area, flags)) { + if (!iopt_area_check_prot(area, flags)) { rc = -EPERM; goto err_out; } diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c index 8a790e597e12..4dfe14dcbb51 100644 --- a/drivers/iommu/iommufd/io_pagetable.c +++ b/drivers/iommu/iommufd/io_pagetable.c @@ -1472,3 +1472,98 @@ int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt, up_write(&iopt->iova_rwsem); return rc; } + +static bool iopt_area_contig_is_aligned(struct iopt_area_contig_iter *iter) +{ + if (iopt_area_start_byte(iter->area, iter->cur_iova) % PAGE_SIZE) + return false; + + if (!iopt_area_contig_done(iter) && + (iopt_area_start_byte(iter->area, iopt_area_last_iova(iter->area)) % + PAGE_SIZE) != (PAGE_SIZE - 1)) + return false; + return true; +} + +int iopt_pin_pages(struct io_pagetable *iopt, unsigned long iova, + unsigned long length, struct page **out_pages, + unsigned int flags) +{ + struct iopt_area_contig_iter iter; + unsigned long last_iova; + struct iopt_area *area; + int rc; + + if (!length) + return -EINVAL; + if (check_add_overflow(iova, length - 1, &last_iova)) + return -EOVERFLOW; + + down_read(&iopt->iova_rwsem); + iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) { + unsigned long last = min(last_iova, iopt_area_last_iova(area)); + unsigned long last_index = iopt_area_iova_to_index(area, last); + unsigned long index = + iopt_area_iova_to_index(area, iter.cur_iova); + + if (area->prevent_access || + !iopt_area_contig_is_aligned(&iter)) { + rc = -EINVAL; + goto err_remove; + } + + if (!iopt_area_check_prot(area, flags)) { + rc = -EPERM; + goto err_remove; + } + + rc = iopt_area_add_access(area, index, last_index, out_pages, + flags); + if (rc) + goto err_remove; + out_pages += last_index - index + 1; + } + if (!iopt_area_contig_done(&iter)) { + rc = -ENOENT; + goto err_remove; + } + + up_read(&iopt->iova_rwsem); + return 0; + +err_remove: + if (iova < iter.cur_iova) { + last_iova = iter.cur_iova - 1; + iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) + iopt_area_remove_access( + area, + iopt_area_iova_to_index(area, iter.cur_iova), + iopt_area_iova_to_index( + area, min(last_iova, + iopt_area_last_iova(area)))); + } + up_read(&iopt->iova_rwsem); + return rc; +} + +void iopt_unpin_pages(struct io_pagetable *iopt, unsigned long iova, + unsigned long length) +{ + struct iopt_area_contig_iter iter; + unsigned long last_iova; + struct iopt_area *area; + + if (WARN_ON(!length) || + WARN_ON(check_add_overflow(iova, length - 1, &last_iova))) + return; + + down_read(&iopt->iova_rwsem); + iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) + iopt_area_remove_access( + area, iopt_area_iova_to_index(area, iter.cur_iova), + iopt_area_iova_to_index( + area, + min(last_iova, iopt_area_last_iova(area)))); + WARN_ON(!iopt_area_contig_done(&iter)); + up_read(&iopt->iova_rwsem); +}
On Thu, May 08, 2025 at 08:02:30PM -0700, Nicolin Chen wrote:
The new HW QUEUE object will be added for HW to access the guest queue for HW-accelerated virtualization feature. It needs to ensure the guest memory pages are pinned when HW accesses them and they are contiguous in physical address space.
This is very like the existing iommufd_access_pin_pages() that outputs the pinned page list for the caller to test its contiguity.
Move those code from iommufd_access_pin/unpin_pages() and related function for a pair of iopt helpers that can be shared with the HW QUEUE allocator.
Rename check_area_prot() to align with the existing iopt_area helpers, and inline it to the header since iommufd_access_rw() still uses it.
Reviewed-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/io_pagetable.h | 8 ++ drivers/iommu/iommufd/iommufd_private.h | 6 ++ drivers/iommu/iommufd/device.c | 117 ++---------------------- drivers/iommu/iommufd/io_pagetable.c | 95 +++++++++++++++++++ 4 files changed, 117 insertions(+), 109 deletions(-)
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Jason
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
The new HW QUEUE object will be added for HW to access the guest queue for HW-accelerated virtualization feature. It needs to ensure the guest memory pages are pinned when HW accesses them and they are contiguous in physical address space.
'pin' is the generic requirement but contiguity is vendor specific. Let's make it clear.
This is very like the existing iommufd_access_pin_pages() that outputs the pinned page list for the caller to test its contiguity.
Move those code from iommufd_access_pin/unpin_pages() and related function for a pair of iopt helpers that can be shared with the HW QUEUE allocator.
Rename check_area_prot() to align with the existing iopt_area helpers, and inline it to the header since iommufd_access_rw() still uses it.
Reviewed-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
Add IOMMUFD_OBJ_HW_QUEUE with an iommufd_hw_queue structure, representing a HW-accelerated type of an IOMMU's physical queue that can be passed through to a user space VM for direct hardware control, such as:
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
Introduce an allocator iommufd_hw_queue_alloc(), and add a pair of viommu ops for iommufd to forward user space ioctls to IOMMU drivers.
Given that the first user of this HW QUEUE (tegra241-cmdqv) needs the queue memory to be physically contiguous, add a flags property to iommufd_viommu_ops and an IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA flag to let a driver indicate this, so that the core will validate the physical pages of a given guest queue.
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com Reviewed-by: Pranjal Shrivastava praan@google.com Reviewed-by: Vasant Hegde vasant.hegde@amd.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- include/linux/iommufd.h | 45 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+)
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index f6f333eaaae3..0b5bea7286b8 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -37,6 +37,7 @@ enum iommufd_object_type { IOMMUFD_OBJ_VIOMMU, IOMMUFD_OBJ_VDEVICE, IOMMUFD_OBJ_VEVENTQ, + IOMMUFD_OBJ_HW_QUEUE, #ifdef CONFIG_IOMMUFD_TEST IOMMUFD_OBJ_SELFTEST, #endif @@ -112,6 +113,18 @@ struct iommufd_vdevice { u64 id; /* per-vIOMMU virtual ID */ };
+struct iommufd_hw_queue { + struct iommufd_object obj; + struct iommufd_ctx *ictx; + struct iommufd_viommu *viommu; + u64 base_addr; /* in guest physical address space */ + size_t length; +}; + +enum iommufd_viommu_flags { + IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA = 1 << 0, +}; + /** * struct iommufd_viommu_ops - vIOMMU specific operations * @destroy: Clean up all driver-specific parts of an iommufd_viommu. The memory @@ -137,8 +150,18 @@ struct iommufd_vdevice { * @vdevice_destroy: Clean up all driver-specific parts of an iommufd_vdevice. * The memory of the vDEVICE will be free-ed by iommufd core * after calling this op + * @hw_queue_alloc: Allocate a HW QUEUE object for a HW-accelerated queue given + * the @type (must be defined in include/uapi/linux/iommufd.h) + * for the @viommu. @index carries the logical HW QUEUE ID per + * @viommu in a guest VM, for a multi-queue case; @addr carries + * the guest physical base address of the queue memory; @length + * carries the size of the queue + * @hw_queue_destroy: Clean up all driver-specific parts of an iommufd_hw_queue. + * The memory of the HW QUEUE will be free-ed by iommufd core + * after calling this op */ struct iommufd_viommu_ops { + u32 flags; void (*destroy)(struct iommufd_viommu *viommu); struct iommu_domain *(*alloc_domain_nested)( struct iommufd_viommu *viommu, u32 flags, @@ -149,6 +172,10 @@ struct iommufd_viommu_ops { struct device *dev, u64 virt_id); void (*vdevice_destroy)(struct iommufd_vdevice *vdev); + struct iommufd_hw_queue *(*hw_queue_alloc)( + struct iommufd_viommu *viommu, unsigned int type, u32 index, + u64 base_addr, size_t length); + void (*hw_queue_destroy)(struct iommufd_hw_queue *hw_queue); };
#if IS_ENABLED(CONFIG_IOMMUFD) @@ -292,6 +319,24 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu, ret; \ })
+#define iommufd_hw_queue_alloc(viommu, drv_struct, member) \ + ({ \ + drv_struct *ret; \ + \ + static_assert(__same_type(struct iommufd_viommu, *viommu)); \ + static_assert(__same_type(struct iommufd_hw_queue, \ + ((drv_struct *)NULL)->member)); \ + static_assert(offsetof(drv_struct, member.obj) == 0); \ + ret = (drv_struct *)_iommufd_object_alloc( \ + viommu->ictx, sizeof(drv_struct), \ + IOMMUFD_OBJ_HW_QUEUE); \ + if (!IS_ERR(ret)) { \ + ret->member.viommu = viommu; \ + ret->member.ictx = viommu->ictx; \ + } \ + ret; \ + }) + /* Helper for IOMMU driver to destroy structures created by allocators above */ #define iommufd_struct_destroy(drv_struct, member) \ ({ \
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
Add IOMMUFD_OBJ_HW_QUEUE with an iommufd_hw_queue structure, representing a HW-accelerated queue type of IOMMU's physical queue that can be passed through to a user space VM for direct hardware control, such as:
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
Introduce an allocator iommufd_hw_queue_alloc(). And add a pair of viommu ops for iommufd to forward user space ioctls to IOMMU drivers.
Given that the first user of this HW QUEUE (tegra241-cmdqv) will need to ensure the queue memory to be physically contiguous, add a flag property in iommufd_viommu_ops and IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to allow driver to flag it so that the core will validate the physical pages of a given guest queue.
'READS' is confusing here. What about xxx_CONTIG_PAS?
- @hw_queue_alloc: Allocate a HW QUEUE object for a HW-accelerated queue given the @type (must be defined in include/uapi/linux/iommufd.h) for the @viommu. @index carries the logical HW QUEUE ID per @viommu in a guest VM, for a multi-queue case; @addr carries the guest physical base address of the queue memory;
s/@addr/@base_addr/
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Thu, May 15, 2025 at 05:58:41AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
Add IOMMUFD_OBJ_HW_QUEUE with an iommufd_hw_queue structure, representing a HW-accelerated queue type of IOMMU's physical queue that can be passed through to a user space VM for direct hardware control, such as:
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
Introduce an allocator iommufd_hw_queue_alloc(). And add a pair of viommu ops for iommufd to forward user space ioctls to IOMMU drivers.
Given that the first user of this HW QUEUE (tegra241-cmdqv) will need to ensure the queue memory to be physically contiguous, add a flag property in iommufd_viommu_ops and IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to allow driver to flag it so that the core will validate the physical pages of a given guest queue.
'READS' is confusing here. What about xxx_CONTIG_PAS?
Sure.
- @hw_queue_alloc: Allocate a HW QUEUE object for a HW-accelerated queue given the @type (must be defined in include/uapi/linux/iommufd.h) for the @viommu. @index carries the logical HW QUEUE ID per @viommu in a guest VM, for a multi-queue case; @addr carries the guest physical base address of the queue memory;
s/@addr/@base_addr/
Reviewed-by: Kevin Tian kevin.tian@intel.com
Thanks! Nicolin
On Thu, May 15, 2025 at 05:58:41AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
Add IOMMUFD_OBJ_HW_QUEUE with an iommufd_hw_queue structure, representing a HW-accelerated queue type of IOMMU's physical queue that can be passed through to a user space VM for direct hardware control, such as:
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
Introduce an allocator iommufd_hw_queue_alloc(). And add a pair of viommu ops for iommufd to forward user space ioctls to IOMMU drivers.
Given that the first user of this HW QUEUE (tegra241-cmdqv) will need to ensure the queue memory to be physically contiguous, add a flag property in iommufd_viommu_ops and IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to allow driver to flag it so that the core will validate the physical pages of a given guest queue.
'READS' is confusing here. What about xxx_CONTIG_PAS?
Combining Jason's first comments here: https://lore.kernel.org/linux-iommu/20250515160620.GJ382960@nvidia.com/
So, pinning should be optional too. And I think it would be unlikely to have a case where HW needs contiguous physical pages yet doesn't require pinning them, right?
So, we need a flag that indicates both checks. Yet, "xxx_CONTIG_PAS" doesn't sound very fitting, compared to this "IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA".
Perhaps we should just add some comments to clarify a bit. Or do you have a better name?
Thanks Nicolin
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 16, 2025 10:30 AM
On Thu, May 15, 2025 at 05:58:41AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
Add IOMMUFD_OBJ_HW_QUEUE with an iommufd_hw_queue structure, representing a HW-accelerated queue type of IOMMU's physical queue that can be passed through to a user space VM for direct hardware control, such as:
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
Introduce an allocator iommufd_hw_queue_alloc(). And add a pair of viommu ops for iommufd to forward user space ioctls to IOMMU drivers.
Given that the first user of this HW QUEUE (tegra241-cmdqv) will need to ensure the queue memory to be physically contiguous, add a flag property in iommufd_viommu_ops and IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to allow driver to flag it so that the core will validate the physical pages of a given guest queue.
'READS' is confusing here. What about xxx_CONTIG_PAS?
Combining Jason's first comments here: https://lore.kernel.org/linux-iommu/20250515160620.GJ382960@nvidia.com/
So, pinning should be optional too. And I think there would be unlikely a case where HW needs contiguous physical pages while not requiring to pin the pages, right?
So, we need an flag that could indicate to do both tests. Yet, "xxx_CONTIG_PAS" doesn't sound very fitting, compared to this "IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA".
Perhaps, we should just add some comments to clarify a bit. Or do you have some better naming?
let's wait until that open question is closed, i.e. whether we still let the core manage it and whether AMD requires pinning even when IOVA is used.
On Thu, May 08, 2025 at 08:02:31PM -0700, Nicolin Chen wrote:
+#define iommufd_hw_queue_alloc(viommu, drv_struct, member) \
- ({ \
drv_struct *ret; \
\
static_assert(__same_type(struct iommufd_viommu, *viommu)); \
static_assert(__same_type(struct iommufd_hw_queue, \
((drv_struct *)NULL)->member)); \
static_assert(offsetof(drv_struct, member.obj) == 0); \
ret = (drv_struct *)_iommufd_object_alloc( \
viommu->ictx, sizeof(drv_struct), \
IOMMUFD_OBJ_HW_QUEUE); \
if (!IS_ERR(ret)) { \
ret->member.viommu = viommu; \
ret->member.ictx = viommu->ictx; \
} \
ret; \
- })
This should just call
__iommufd_object_alloc(viommu->ictx, ret, IOMMUFD_OBJ_HW_QUEUE, member.obj)
And all the casting and asserts should be in that macro, move it to this header
/* Helper for IOMMU driver to destroy structures created by allocators above */ #define iommufd_struct_destroy(drv_struct, member) \ ({ \
This is abort, not destroy; the names should be consistent. But looking more at the series, I think it will be better to do the little rework I suggested and not give this function to the driver in the first place.
Jason
On Thu, May 15, 2025 at 12:39:03PM -0300, Jason Gunthorpe wrote:
On Thu, May 08, 2025 at 08:02:31PM -0700, Nicolin Chen wrote:
+#define iommufd_hw_queue_alloc(viommu, drv_struct, member) \
- ({ \
drv_struct *ret; \
\
static_assert(__same_type(struct iommufd_viommu, *viommu)); \
static_assert(__same_type(struct iommufd_hw_queue, \
((drv_struct *)NULL)->member)); \
static_assert(offsetof(drv_struct, member.obj) == 0); \
ret = (drv_struct *)_iommufd_object_alloc( \
viommu->ictx, sizeof(drv_struct), \
IOMMUFD_OBJ_HW_QUEUE); \
if (!IS_ERR(ret)) { \
ret->member.viommu = viommu; \
ret->member.ictx = viommu->ictx; \
} \
ret; \
- })
This should just call
__iommufd_object_alloc(viommu->ictx, ret, IOMMUFD_OBJ_HW_QUEUE, member.obj)
And all the casting and asserts should be in that macro, move it to this header
Ack.
/* Helper for IOMMU driver to destroy structures created by allocators above */ #define iommufd_struct_destroy(drv_struct, member) \ ({ \
This is abort not destroy, the names should be consistent. But looking more at the series I think it will be better to do the little rework I suggested and not give this function to the driver in the first place.
Yea, it will be replaced in v5.
Thanks Nicolin
Introduce a new IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl for user space to allocate a HW QUEUE object for a vIOMMU-specific HW-accelerated queue, e.g.:
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
This is a vIOMMU-based ioctl. Simply increase the refcount of the vIOMMU.
Reviewed-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_private.h | 2 + include/uapi/linux/iommufd.h | 42 ++++++++++ drivers/iommu/iommufd/main.c | 6 ++ drivers/iommu/iommufd/viommu.c | 104 ++++++++++++++++++++++++ 4 files changed, 154 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 79160b039bc7..36a4e060982f 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -611,6 +611,8 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd); void iommufd_viommu_destroy(struct iommufd_object *obj); int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd); void iommufd_vdevice_destroy(struct iommufd_object *obj); +int iommufd_hw_queue_alloc_ioctl(struct iommufd_ucmd *ucmd); +void iommufd_hw_queue_destroy(struct iommufd_object *obj);
#ifdef CONFIG_IOMMUFD_TEST int iommufd_test(struct iommufd_ucmd *ucmd); diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index cc90299a08d9..001e2deb5a2d 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -56,6 +56,7 @@ enum { IOMMUFD_CMD_VDEVICE_ALLOC = 0x91, IOMMUFD_CMD_IOAS_CHANGE_PROCESS = 0x92, IOMMUFD_CMD_VEVENTQ_ALLOC = 0x93, + IOMMUFD_CMD_HW_QUEUE_ALLOC = 0x94, };
/** @@ -1147,4 +1148,45 @@ struct iommu_veventq_alloc { __u32 __reserved; }; #define IOMMU_VEVENTQ_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VEVENTQ_ALLOC) + +/** + * enum iommu_hw_queue_type - HW Queue Type + * @IOMMU_HW_QUEUE_TYPE_DEFAULT: Reserved for future use + */ +enum iommu_hw_queue_type { + IOMMU_HW_QUEUE_TYPE_DEFAULT = 0, +}; + +/** + * struct iommu_hw_queue_alloc - ioctl(IOMMU_HW_QUEUE_ALLOC) + * @size: sizeof(struct iommu_hw_queue_alloc) + * @flags: Must be 0 + * @viommu_id: Virtual IOMMU ID to associate the HW queue with + * @type: One of enum iommu_hw_queue_type + * @index: The logical index to the HW queue per virtual IOMMU for a multi-queue + * model + * @out_hw_queue_id: The ID of the new HW queue + * @base_addr: Base address of the queue memory in guest physical address space + * @length: Length of the queue memory in the guest physical address space + * + * Allocate a HW queue object for a vIOMMU-specific HW-accelerated queue, which + * allows HW to access a guest queue memory described by @base_addr and @length. + * Upon success, the underlying physical pages of the guest queue memory will be + * pinned to prevent VMM from unmapping them in the IOAS until the HW queue gets + * destroyed. + * + * A vIOMMU can allocate multiple queues, but it must use a different @index to + * separate each allocation, e.g. HW queue0, HW queue1, ... + */ +struct iommu_hw_queue_alloc { + __u32 size; + __u32 flags; + __u32 viommu_id; + __u32 type; + __u32 index; + __u32 out_hw_queue_id; + __aligned_u64 base_addr; + __aligned_u64 length; +}; +#define IOMMU_HW_QUEUE_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HW_QUEUE_ALLOC) #endif diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 2b9ee9b4a424..10410e2f710a 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -292,6 +292,7 @@ union ucmd_buffer { struct iommu_destroy destroy; struct iommu_fault_alloc fault; struct iommu_hw_info info; + struct iommu_hw_queue_alloc hw_queue; struct iommu_hwpt_alloc hwpt; struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap; struct iommu_hwpt_invalidate cache; @@ -334,6 +335,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = { struct iommu_fault_alloc, out_fault_fd), IOCTL_OP(IOMMU_GET_HW_INFO, iommufd_get_hw_info, struct iommu_hw_info, __reserved), + IOCTL_OP(IOMMU_HW_QUEUE_ALLOC, iommufd_hw_queue_alloc_ioctl, + struct iommu_hw_queue_alloc, length), IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc, __reserved), IOCTL_OP(IOMMU_HWPT_GET_DIRTY_BITMAP, iommufd_hwpt_get_dirty_bitmap, @@ -490,6 +493,9 @@ static const struct iommufd_object_ops iommufd_object_ops[] = { [IOMMUFD_OBJ_FAULT] = { .destroy = iommufd_fault_destroy, }, + [IOMMUFD_OBJ_HW_QUEUE] = { + .destroy = iommufd_hw_queue_destroy, + }, [IOMMUFD_OBJ_HWPT_PAGING] = { .destroy = iommufd_hwpt_paging_destroy, .abort = iommufd_hwpt_paging_abort, diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index a65153458a26..ef41aec0b5bf 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -170,3 +170,107 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) iommufd_put_object(ucmd->ictx, &viommu->obj); return rc; } + +void iommufd_hw_queue_destroy(struct iommufd_object *obj) +{ + struct iommufd_hw_queue *hw_queue = + container_of(obj, struct iommufd_hw_queue, obj); + struct iommufd_viommu *viommu = hw_queue->viommu; + + if (viommu->ops->hw_queue_destroy) + viommu->ops->hw_queue_destroy(hw_queue); + 
iopt_unpin_pages(&viommu->hwpt->ioas->iopt, hw_queue->base_addr, + hw_queue->length); + refcount_dec(&viommu->obj.users); +} + +int iommufd_hw_queue_alloc_ioctl(struct iommufd_ucmd *ucmd) +{ + struct iommu_hw_queue_alloc *cmd = ucmd->cmd; + struct iommufd_hw_queue *hw_queue; + struct iommufd_hwpt_paging *hwpt; + struct iommufd_viommu *viommu; + struct page **pages; + int max_npages, i; + u64 end; + int rc; + + if (cmd->flags || cmd->type == IOMMU_HW_QUEUE_TYPE_DEFAULT) + return -EOPNOTSUPP; + if (!cmd->base_addr || !cmd->length) + return -EINVAL; + if (check_add_overflow(cmd->base_addr, cmd->length - 1, &end)) + return -EOVERFLOW; + + max_npages = DIV_ROUND_UP(cmd->length, PAGE_SIZE); + pages = kcalloc(max_npages, sizeof(*pages), GFP_KERNEL); + if (!pages) + return -ENOMEM; + + viommu = iommufd_get_viommu(ucmd, cmd->viommu_id); + if (IS_ERR(viommu)) { + rc = PTR_ERR(viommu); + goto out_free; + } + hwpt = viommu->hwpt; + + if (!viommu->ops || !viommu->ops->hw_queue_alloc) { + rc = -EOPNOTSUPP; + goto out_put_viommu; + } + + /* Quick test on the base address */ + if (!iommu_iova_to_phys(hwpt->common.domain, cmd->base_addr)) { + rc = -ENXIO; + goto out_put_viommu; + } + + /* + * The underlying physical pages must be pinned to prevent them from + * being unmapped (via IOMMUFD_CMD_IOAS_UNMAP) during the life cycle + * of the HW QUEUE object. + */ + rc = iopt_pin_pages(&hwpt->ioas->iopt, cmd->base_addr, cmd->length, + pages, 0); + if (rc) + goto out_put_viommu; + + if (viommu->ops->flags & IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA) { + /* Validate if the underlying physical pages are contiguous */ + for (i = 1; i < max_npages && pages[i]; i++) { + if (page_to_pfn(pages[i]) == + page_to_pfn(pages[i - 1]) + 1) + continue; + rc = -EFAULT; + goto out_unpin; + } + } + + hw_queue = viommu->ops->hw_queue_alloc(viommu, cmd->type, cmd->index, + cmd->base_addr, cmd->length); + if (IS_ERR(hw_queue)) { + rc = PTR_ERR(hw_queue); + goto out_unpin; + } + + hw_queue->viommu = viommu; + refcount_inc(&viommu->obj.users); + hw_queue->ictx = ucmd->ictx; + hw_queue->length = cmd->length; + hw_queue->base_addr = cmd->base_addr; + cmd->out_hw_queue_id = hw_queue->obj.id; + rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd)); + if (rc) + iommufd_object_abort_and_destroy(ucmd->ictx, &hw_queue->obj); + else + iommufd_object_finalize(ucmd->ictx, &hw_queue->obj); + goto out_put_viommu; + +out_unpin: + iopt_unpin_pages(&hwpt->ioas->iopt, cmd->base_addr, cmd->length); +out_put_viommu: + iommufd_put_object(ucmd->ictx, &viommu->obj); +out_free: + kfree(pages); + return rc; +}
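[ For illustration only, a rough user-space sketch of driving the new ioctl; the helper name and the caller-provided values are assumptions, not part of the patch: ]

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int example_hw_queue_alloc(int iommufd, __u32 viommu_id, __u32 type,
                                  __u32 index, __u64 queue_gpa, __u64 queue_len,
                                  __u32 *out_hw_queue_id)
{
        struct iommu_hw_queue_alloc cmd = {
                .size = sizeof(cmd),
                .viommu_id = viommu_id,
                .type = type,           /* a driver-defined type, not ..._DEFAULT */
                .index = index,         /* logical queue index within this vIOMMU */
                .base_addr = queue_gpa, /* guest physical base of the queue memory */
                .length = queue_len,
        };

        if (ioctl(iommufd, IOMMU_HW_QUEUE_ALLOC, &cmd))
                return -errno;

        *out_hw_queue_id = cmd.out_hw_queue_id;
        return 0;
}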
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
+/**
- struct iommu_hw_queue_alloc - ioctl(IOMMU_HW_QUEUE_ALLOC)
- @size: sizeof(struct iommu_hw_queue_alloc)
- @flags: Must be 0
- @viommu_id: Virtual IOMMU ID to associate the HW queue with
- @type: One of enum iommu_hw_queue_type
- @index: The logical index to the HW queue per virtual IOMMU for a multi-queue model
I'm thinking of an alternative way w/o having the user assign the index and w/o allowing the driver to poke at object dependencies (next patch).
Let's say the index is internally assigned by the driver, so this cmd is just for allocating a hw queue and it's up to the driver to decide the allocation policy, e.g. in ascending order.
Introduce a new flag in viommu_ops to indicate to the core that the new hw queue should hold a reference to the previous hw queue.
The core maintains a last_queue field in the viommu. Upon a successful return from @hw_queue_alloc(), the core increments the users refcnt of last_queue, records the dependency in the iommufd_hw_queue struct, and updates viommu->last_queue.
Then the destroy order is naturally guaranteed.
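[ Roughly, the suggestion amounts to something like the following in the core allocation path; the flag name and the prev/last_queue fields are made up for illustration only: ]

static void example_hw_queue_chain(struct iommufd_viommu *viommu,
                                   struct iommufd_hw_queue *hw_queue)
{
        if (!(viommu->ops->flags & IOMMUFD_VIOMMU_FLAG_QUEUE_DEPENDS_PREV))
                return;

        hw_queue->prev = viommu->last_queue;    /* hypothetical fields */
        if (hw_queue->prev)
                refcount_inc(&hw_queue->prev->obj.users);
        viommu->last_queue = hw_queue;
}

[ iommufd_hw_queue_destroy() would then do the matching refcount_dec() on the recorded previous queue, which is what forces destruction in reverse order of allocation. ]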
- @out_hw_queue_id: The ID of the new HW queue
- @base_addr: Base address of the queue memory in guest physical address space
- @length: Length of the queue memory in the guest physical address space
length is agnostic to address space.
+int iommufd_hw_queue_alloc_ioctl(struct iommufd_ucmd *ucmd) +{
- struct iommu_hw_queue_alloc *cmd = ucmd->cmd;
- struct iommufd_hw_queue *hw_queue;
- struct iommufd_hwpt_paging *hwpt;
- struct iommufd_viommu *viommu;
- struct page **pages;
- int max_npages, i;
- u64 end;
- int rc;
- if (cmd->flags || cmd->type == IOMMU_HW_QUEUE_TYPE_DEFAULT)
return -EOPNOTSUPP;
- if (!cmd->base_addr || !cmd->length)
return -EINVAL;
- if (check_add_overflow(cmd->base_addr, cmd->length - 1, &end))
return -EOVERFLOW;
- max_npages = DIV_ROUND_UP(cmd->length, PAGE_SIZE);
what about [base_addr, base_addr+length) spanning two pages but 'length' is smaller than the size of one page?
- pages = kcalloc(max_npages, sizeof(*pages), GFP_KERNEL);
- if (!pages)
return -ENOMEM;
this could be moved to right before iopt_pin_pages().
- viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
- if (IS_ERR(viommu)) {
rc = PTR_ERR(viommu);
goto out_free;
- }
- hwpt = viommu->hwpt;
- if (!viommu->ops || !viommu->ops->hw_queue_alloc) {
rc = -EOPNOTSUPP;
goto out_put_viommu;
- }
- /* Quick test on the base address */
- if (!iommu_iova_to_phys(hwpt->common.domain, cmd->base_addr))
{
rc = -ENXIO;
goto out_put_viommu;
- }
this check is redundant. Actually it's not future-proof, as it assumes that S2 must be pinned before the user attempts to call this cmd. But what if one day iommufd supports an unpinned S2 (if a device is 100% PRI faultable)? Then this path will be broken.
- /*
* The underlying physical pages must be pinned to prevent them
from
* being unmapped (via IOMMUFD_CMD_IOAS_UNMAP) during the
life cycle
* of the HW QUEUE object.
*/
- rc = iopt_pin_pages(&hwpt->ioas->iopt, cmd->base_addr, cmd-
length,
pages, 0);
- if (rc)
goto out_put_viommu;
- if (viommu->ops->flags &
IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA) {
/* Validate if the underlying physical pages are contiguous */
for (i = 1; i < max_npages && pages[i]; i++) {
if (page_to_pfn(pages[i]) ==
page_to_pfn(pages[i - 1]) + 1)
continue;
rc = -EFAULT;
goto out_unpin;
}
- }
- hw_queue = viommu->ops->hw_queue_alloc(viommu, cmd->type,
cmd->index,
cmd->base_addr, cmd->length);
- if (IS_ERR(hw_queue)) {
rc = PTR_ERR(hw_queue);
goto out_unpin;
- }
- hw_queue->viommu = viommu;
- refcount_inc(&viommu->obj.users);
- hw_queue->ictx = ucmd->ictx;
viommu/ictx are already set by iommufd_hw_queue_alloc().
- hw_queue->length = cmd->length;
- hw_queue->base_addr = cmd->base_addr;
- cmd->out_hw_queue_id = hw_queue->obj.id;
- rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
- if (rc)
iommufd_object_abort_and_destroy(ucmd->ictx,
&hw_queue->obj);
- else
iommufd_object_finalize(ucmd->ictx, &hw_queue->obj);
- goto out_put_viommu;
+out_unpin:
- iopt_unpin_pages(&hwpt->ioas->iopt, cmd->base_addr, cmd-
length);
+out_put_viommu:
- iommufd_put_object(ucmd->ictx, &viommu->obj);
+out_free:
- kfree(pages);
- return rc;
+}
2.43.0
On Thu, May 15, 2025 at 06:30:27AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
+/**
- struct iommu_hw_queue_alloc - ioctl(IOMMU_HW_QUEUE_ALLOC)
- @size: sizeof(struct iommu_hw_queue_alloc)
- @flags: Must be 0
- @viommu_id: Virtual IOMMU ID to associate the HW queue with
- @type: One of enum iommu_hw_queue_type
- @index: The logical index to the HW queue per virtual IOMMU for a
multi-queue
model
I'm thinking of an alternative way w/o having the user to assign index and allowing the driver to poke object dependency (next patch).
Let's say the index is internally assigned by the driver. so this cmd is just for allowing a hw queue and it's the driver to decide the allocation policy, e.g. in ascending order.
Introduce a new flag in viommu_ops to indicate to core that the new hw queue should hold a reference to the previous hw queue.
core maintains a last_queue field in viommu. Upon success return from @hw_queue_alloc() the core increments the users refcnt of last_queue, records the dependency in iommufd_hw_queue struct, and update viommu->last_queue.
Then the destroy order is naturally guaranteed.
I have thought about that too. It's nice that the core can easily maintain the dependency for the driver.
But there would still need to be an out_index to mark each dynamically allocated queue, so the VMM would know where it should map the queue.
For example, if the VMM wants to allocate a queue at its own index=1 without allocating index=0 first, the kernel cannot fail that, as the VMM doesn't provide the index. The only option left for the kernel would be to output the allocated queue with index=0 and then hope the VMM validates it, which doesn't sound safe.
- @out_hw_queue_id: The ID of the new HW queue
- @base_addr: Base address of the queue memory in guest physical
address space
- @length: Length of the queue memory in the guest physical address
space
length is agnostic to address space.
Ack.
* @length: Length of the queue memory
+int iommufd_hw_queue_alloc_ioctl(struct iommufd_ucmd *ucmd) +{
- struct iommu_hw_queue_alloc *cmd = ucmd->cmd;
- struct iommufd_hw_queue *hw_queue;
- struct iommufd_hwpt_paging *hwpt;
- struct iommufd_viommu *viommu;
- struct page **pages;
- int max_npages, i;
- u64 end;
- int rc;
- if (cmd->flags || cmd->type == IOMMU_HW_QUEUE_TYPE_DEFAULT)
return -EOPNOTSUPP;
- if (!cmd->base_addr || !cmd->length)
return -EINVAL;
- if (check_add_overflow(cmd->base_addr, cmd->length - 1, &end))
return -EOVERFLOW;
- max_npages = DIV_ROUND_UP(cmd->length, PAGE_SIZE);
what about [base_addr, base_addr+length) spanning two pages but 'length' is smaller than the size of one page?
Ah, right! Probably should be something like:
offset = cmd->base_addr & ~PAGE_MASK;
max_npages = DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE);
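[ A concrete example with 4KiB pages: base_addr = 0x1ff0 and length = 0x40 give offset = 0xff0, so DIV_ROUND_UP(0xff0 + 0x40, 0x1000) = 2 pages, whereas DIV_ROUND_UP(length, PAGE_SIZE) alone would undercount it as 1. ]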
- pages = kcalloc(max_npages, sizeof(*pages), GFP_KERNEL);
- if (!pages)
return -ENOMEM;
this could be moved to right before iopt_pin_pages().
Ack.
- viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
- if (IS_ERR(viommu)) {
rc = PTR_ERR(viommu);
goto out_free;
- }
- hwpt = viommu->hwpt;
- if (!viommu->ops || !viommu->ops->hw_queue_alloc) {
rc = -EOPNOTSUPP;
goto out_put_viommu;
- }
- /* Quick test on the base address */
- if (!iommu_iova_to_phys(hwpt->common.domain, cmd->base_addr))
{
rc = -ENXIO;
goto out_put_viommu;
- }
this check is redundant. Actually it's not future proof, assuming that S2 must be pinned before the user attempts to call this cmd. But what if one day iommufd supports unpinned S2 (if a device is 100% PRI faultable) then this path will be broken.
OK. Let's drop it.
- hw_queue = viommu->ops->hw_queue_alloc(viommu, cmd->type,
cmd->index,
cmd->base_addr, cmd->length);
- if (IS_ERR(hw_queue)) {
rc = PTR_ERR(hw_queue);
goto out_unpin;
- }
- hw_queue->viommu = viommu;
- refcount_inc(&viommu->obj.users);
- hw_queue->ictx = ucmd->ictx;
viommu/ictx are already set by iommufd_hw_queue_alloc().
OK. We'd need to be careful if someday there is a core-allocated hw_queue that doesn't call iommufd_hw_queue_alloc(). Maybe I can put a note here.
Thanks Nicolin
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 16, 2025 2:45 AM
On Thu, May 15, 2025 at 06:30:27AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
+/**
- struct iommu_hw_queue_alloc - ioctl(IOMMU_HW_QUEUE_ALLOC)
- @size: sizeof(struct iommu_hw_queue_alloc)
- @flags: Must be 0
- @viommu_id: Virtual IOMMU ID to associate the HW queue with
- @type: One of enum iommu_hw_queue_type
- @index: The logical index to the HW queue per virtual IOMMU for a
multi-queue
model
I'm thinking of an alternative way w/o having the user to assign index and allowing the driver to poke object dependency (next patch).
Let's say the index is internally assigned by the driver. so this cmd is just for allowing a hw queue and it's the driver to decide the allocation policy, e.g. in ascending order.
Introduce a new flag in viommu_ops to indicate to core that the new hw queue should hold a reference to the previous hw queue.
core maintains a last_queue field in viommu. Upon success return from @hw_queue_alloc() the core increments the users refcnt of last_queue, records the dependency in iommufd_hw_queue struct, and update viommu->last_queue.
Then the destroy order is naturally guaranteed.
I have thought about that too. It's nice that the core can easily maintain the dependency for the driver.
But there would still need an out_index to mark each dynamically allocated queue. So VMM would know where it should map the queue.
For example, if VMM wants to allocate a queue at its own index=1 without allocating index=0 first, kernel cannot fail that as VMM doesn't provide the index. The only way left for kernel would be to output the allocated queue with index=0 and then wish VMM can validate it, which doesn't sound safe..
The VMM's index is virtual; it could be mapped to whatever queue object is created, at the VMM's own disposal.
the uAPI just requires VMM to remember a sequential list of allocated queue objects and destroy them in reverse order of allocation, instead of in the reverse order of virtual indexes.
On Fri, May 16, 2025 at 02:49:44AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 16, 2025 2:45 AM
On Thu, May 15, 2025 at 06:30:27AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
+/**
- struct iommu_hw_queue_alloc - ioctl(IOMMU_HW_QUEUE_ALLOC)
- @size: sizeof(struct iommu_hw_queue_alloc)
- @flags: Must be 0
- @viommu_id: Virtual IOMMU ID to associate the HW queue with
- @type: One of enum iommu_hw_queue_type
- @index: The logical index to the HW queue per virtual IOMMU for a
multi-queue
model
I'm thinking of an alternative way w/o having the user to assign index and allowing the driver to poke object dependency (next patch).
Let's say the index is internally assigned by the driver. so this cmd is just for allowing a hw queue and it's the driver to decide the allocation policy, e.g. in ascending order.
Introduce a new flag in viommu_ops to indicate to core that the new hw queue should hold a reference to the previous hw queue.
core maintains a last_queue field in viommu. Upon success return from @hw_queue_alloc() the core increments the users refcnt of last_queue, records the dependency in iommufd_hw_queue struct, and update viommu->last_queue.
Then the destroy order is naturally guaranteed.
I have thought about that too. It's nice that the core can easily maintain the dependency for the driver.
But there would still need an out_index to mark each dynamically allocated queue. So VMM would know where it should map the queue.
For example, if VMM wants to allocate a queue at its own index=1 without allocating index=0 first, kernel cannot fail that as VMM doesn't provide the index. The only way left for kernel would be to output the allocated queue with index=0 and then wish VMM can validate it, which doesn't sound safe..
VMM's index is virtual which could be mapped to whatever queue object created at its own disposal.
the uAPI just requires VMM to remember a sequential list of allocated queue objects and destroy them in reverse order of allocation, instead of in the reverse order of virtual indexes.
But that's not going to work for VCMDQ.
VINTF mmaps only a single page that controls multiple queues. And all queues have to be mapped correctly between HW and VM indexes. Otherwise, it won't work if VMM maps:
HW-level VINTF1 LVCMDQ0 <==> VM-level VINTF0 LVCMDQ1
HW-level VINTF1 LVCMDQ1 <==> VM-level VINTF0 LVCMDQ0
So, one way or another, the kernel has to ensure a static mapping of the indexes. And I think it's safer if the VMM tells the kernel which index to allocate.
Thanks Nicolin
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 16, 2025 11:17 AM
On Fri, May 16, 2025 at 02:49:44AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 16, 2025 2:45 AM
On Thu, May 15, 2025 at 06:30:27AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
+/**
- struct iommu_hw_queue_alloc - ioctl(IOMMU_HW_QUEUE_ALLOC)
- @size: sizeof(struct iommu_hw_queue_alloc)
- @flags: Must be 0
- @viommu_id: Virtual IOMMU ID to associate the HW queue with
- @type: One of enum iommu_hw_queue_type
- @index: The logical index to the HW queue per virtual IOMMU for
a
multi-queue
model
I'm thinking of an alternative way w/o having the user to assign index and allowing the driver to poke object dependency (next patch).
Let's say the index is internally assigned by the driver. so this cmd is just for allowing a hw queue and it's the driver to decide the allocation policy, e.g. in ascending order.
Introduce a new flag in viommu_ops to indicate to core that the new hw queue should hold a reference to the previous hw queue.
core maintains a last_queue field in viommu. Upon success return from @hw_queue_alloc() the core increments the users refcnt of last_queue, records the dependency in iommufd_hw_queue struct, and update viommu->last_queue.
Then the destroy order is naturally guaranteed.
I have thought about that too. It's nice that the core can easily maintain the dependency for the driver.
But there would still need an out_index to mark each dynamically allocated queue. So VMM would know where it should map the queue.
For example, if VMM wants to allocate a queue at its own index=1 without allocating index=0 first, kernel cannot fail that as VMM doesn't provide the index. The only way left for kernel would be to output the allocated queue with index=0 and then wish VMM can validate it, which doesn't sound safe..
VMM's index is virtual which could be mapped to whatever queue object created at its own disposal.
the uAPI just requires VMM to remember a sequential list of allocated queue objects and destroy them in reverse order of allocation, instead of in the reverse order of virtual indexes.
But that's not going to work for VCMDQ.
VINTF mmaps only a single page that controls multiple queues. And all queues have to be mapped correctly between HW and VM indexes. Otherwise, it won't work if VMM maps:
HW-level VINTF1 LVCMDQ0 <==> VM-level VINTF0 LVCMDQ1
HW-level VINTF1 LVCMDQ1 <==> VM-level VINTF0 LVCMDQ0
So, one way or another, kernel has to ensure the static mappings of the indexes. And I think it's safer in the way that VMM tells what index to allocate..
Okay, that's a valid point.
But hey, we are already adding various restrictions to the uAPI about dependency, contiguity, etc. which the VMM should conform to. What's the harm in further saying that the VMM should allocate virtual indexes in ascending order along with hw queue allocation?
On Fri, May 16, 2025 at 03:52:16AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 16, 2025 11:17 AM
On Fri, May 16, 2025 at 02:49:44AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 16, 2025 2:45 AM
On Thu, May 15, 2025 at 06:30:27AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
+/**
- struct iommu_hw_queue_alloc - ioctl(IOMMU_HW_QUEUE_ALLOC)
- @size: sizeof(struct iommu_hw_queue_alloc)
- @flags: Must be 0
- @viommu_id: Virtual IOMMU ID to associate the HW queue with
- @type: One of enum iommu_hw_queue_type
- @index: The logical index to the HW queue per virtual IOMMU for
a
multi-queue
model
I'm thinking of an alternative way w/o having the user to assign index and allowing the driver to poke object dependency (next patch).
Let's say the index is internally assigned by the driver. so this cmd is just for allowing a hw queue and it's the driver to decide the allocation policy, e.g. in ascending order.
Introduce a new flag in viommu_ops to indicate to core that the new hw queue should hold a reference to the previous hw queue.
core maintains a last_queue field in viommu. Upon success return from @hw_queue_alloc() the core increments the users refcnt of last_queue, records the dependency in iommufd_hw_queue struct, and update viommu->last_queue.
Then the destroy order is naturally guaranteed.
I have thought about that too. It's nice that the core can easily maintain the dependency for the driver.
But there would still need an out_index to mark each dynamically allocated queue. So VMM would know where it should map the queue.
For example, if VMM wants to allocate a queue at its own index=1 without allocating index=0 first, kernel cannot fail that as VMM doesn't provide the index. The only way left for kernel would be to output the allocated queue with index=0 and then wish VMM can validate it, which doesn't sound safe..
VMM's index is virtual which could be mapped to whatever queue object created at its own disposal.
the uAPI just requires VMM to remember a sequential list of allocated queue objects and destroy them in reverse order of allocation, instead of in the reverse order of virtual indexes.
But that's not going to work for VCMDQ.
VINTF mmaps only a single page that controls multiple queues. And all queues have to be mapped correctly between HW and VM indexes. Otherwise, it won't work if VMM maps:
HW-level VINTF1 LVCMDQ0 <==> VM-level VINTF0 LVCMDQ1
HW-level VINTF1 LVCMDQ1 <==> VM-level VINTF0 LVCMDQ0
So, one way or another, kernel has to ensure the static mappings of the indexes. And I think it's safer in the way that VMM tells what index to allocate..
Okay, that's a valid point.
But hey, we are already adding various restrictions to the uAPI about dependency, contiguity, etc. which the VMM should conform to. What hurts if we further say that the VMM should allocate virtual index in an ascending order along with hw queue allocation?
You mean adding another flag to manage the dependency in the core, right?
I talked with Jason offline when adding that depend API. He didn't want it to be in the core, saying that is a driver thing.
But that was before we added pin and contiguity, which he isn't really happy to have in the core either.
So, yea, I think you have a point here..
@Jason?
Thanks Nicolin
On Thu, May 08, 2025 at 08:02:32PM -0700, Nicolin Chen wrote:
+/**
- struct iommu_hw_queue_alloc - ioctl(IOMMU_HW_QUEUE_ALLOC)
- @size: sizeof(struct iommu_hw_queue_alloc)
- @flags: Must be 0
- @viommu_id: Virtual IOMMU ID to associate the HW queue with
- @type: One of enum iommu_hw_queue_type
- @index: The logical index to the HW queue per virtual IOMMU for a multi-queue model
- @out_hw_queue_id: The ID of the new HW queue
- @base_addr: Base address of the queue memory in guest physical address space
- @length: Length of the queue memory in the guest physical address space
- Allocate a HW queue object for a vIOMMU-specific HW-accelerated queue, which allows HW to access a guest queue memory described by @base_addr and @length. Upon success, the underlying physical pages of the guest queue memory will be pinned to prevent VMM from unmapping them in the IOAS until the HW queue gets destroyed.
Do we have a way to make the pinning optional?
As I understand AMD's system the iommu HW itself translates the base_addr through the S2 page table automatically, so it doesn't need pinned memory and physical addresses but just the IOVA.
Perhaps for this reason the pinning should be done with a function call from the driver?
+struct iommu_hw_queue_alloc {
- __u32 size;
- __u32 flags;
- __u32 viommu_id;
- __u32 type;
- __u32 index;
- __u32 out_hw_queue_id;
- __aligned_u64 base_addr;
base addr should probably be called nesting_parent_iova to match how we described the viommu hwpt:
* @hwpt_id: ID of a nesting parent HWPT to associate to
- /*
* The underlying physical pages must be pinned to prevent them from
* being unmapped (via IOMMUFD_CMD_IOAS_UNMAP) during the life cycle
* of the HW QUEUE object.
*/
- rc = iopt_pin_pages(&hwpt->ioas->iopt, cmd->base_addr, cmd->length,
pages, 0);
I don't think this actually works like this without an unmap callback. unmap will break:
iommufd_access_notify_unmap(iopt, area_first, length);
/* Something is not responding to unmap requests. */
tries++;
if (WARN_ON(tries > 100))
        return -EDEADLOCK;
If it can't shoot down the pinning.
Why did this need to change away from just a normal access? Those ops and the unmap callback are not optional things.
What vcmdq would do in the unmap callback is a question I'm not sure of.
This is more reason to put the pin/access in the driver so it can provide an unmap callback that can fix it up. I think this should have been done just by using the normal access mechanism, maybe with a simplifying wrapper for in-driver use, i.e. no need for patch #9.
Jason
On Thu, May 15, 2025 at 01:06:20PM -0300, Jason Gunthorpe wrote:
On Thu, May 08, 2025 at 08:02:32PM -0700, Nicolin Chen wrote:
+/**
- struct iommu_hw_queue_alloc - ioctl(IOMMU_HW_QUEUE_ALLOC)
- @size: sizeof(struct iommu_hw_queue_alloc)
- @flags: Must be 0
- @viommu_id: Virtual IOMMU ID to associate the HW queue with
- @type: One of enum iommu_hw_queue_type
- @index: The logical index to the HW queue per virtual IOMMU for a multi-queue
model
- @out_hw_queue_id: The ID of the new HW queue
- @base_addr: Base address of the queue memory in guest physical address space
- @length: Length of the queue memory in the guest physical address space
- Allocate a HW queue object for a vIOMMU-specific HW-accelerated queue, which
- allows HW to access a guest queue memory described by @base_addr and @length.
- Upon success, the underlying physical pages of the guest queue memory will be
- pinned to prevent VMM from unmapping them in the IOAS until the HW queue gets
- destroyed.
Do we have way to make the pinning optional?
As I understand AMD's system the iommu HW itself translates the base_addr through the S2 page table automatically, so it doesn't need pinned memory and physical addresses but just the IOVA.
Right. That's why we invented a flag, and it should probably be extended to cover the pin step as well.
Perhaps for this reason the pinning should be done with a function call from the driver?
But the whole point of doing it in the core was to avoid exposing the entire set of iopt-related structures/functions to the driver, which would otherwise hugely increase the size of the driver.o?
+struct iommu_hw_queue_alloc {
- __u32 size;
- __u32 flags;
- __u32 viommu_id;
- __u32 type;
- __u32 index;
- __u32 out_hw_queue_id;
- __aligned_u64 base_addr;
base addr should probably be called nesting_parent_iova to match how we described the viommu hwpt:
- @hwpt_id: ID of a nesting parent HWPT to associate to
Ack.
- /*
* The underlying physical pages must be pinned to prevent them from
* being unmapped (via IOMMUFD_CMD_IOAS_UNMAP) during the life cycle
* of the HW QUEUE object.
*/
- rc = iopt_pin_pages(&hwpt->ioas->iopt, cmd->base_addr, cmd->length,
pages, 0);
I don't think this actually works like this without an unmap callback. unmap will break:
iommufd_access_notify_unmap(iopt, area_first, length);
/* Something is not responding to unmap requests. */
tries++;
if (WARN_ON(tries > 100))
        return -EDEADLOCK;
If it can't shoot down the pinning.
Hmm, I thought we want the unmap to fail until the VMM releases the HW QUEUE first? In what case does the VMM want to unmap while still holding the queue pages?
Why did this need to change away from just a normal access? That ops and unmap callback are not optional things.
What vcmdq would do in the unmap callback is question I'm not sure of..
This is more reason to put the pin/access in the driver so it can provide an unmap callback that can fix it up.
As there are two types of "access" here: you mean iommufd_access, i.e. the vcmdq driver should hold an iommufd_access like an emulated vfio device driver does?
I think this should have been done just by using the normal access mechanism, maybe with a simplifying wrapper for in-driver use. ie no need for patch #9
Still, the driver.o file will be very large, unlike VFIO that just depends on CONFIG_IOMMUFD?
Thanks Nicolin
On Thu, May 15, 2025 at 11:16:45AM -0700, Nicolin Chen wrote:
As I understand AMD's system the iommu HW itself translates the base_addr through the S2 page table automatically, so it doesn't need pinned memory and physical addresses but just the IOVA.
Right. That's why we invented a flag, and it should be probably extended to cover the pin step as well.
Yes, no pin
Perhaps for this reason the pinning should be done with a function call from the driver?
But the whole point of doing in the core was to avoid the entire iopt related structure/function from being exposed to the driver, which would otherwise hugely increase the size of the driver.o?
Ugh, yes, but also, maybe we need to figure something else out for this. Pass down a function pointers struct to the driver or something like that?
I don't think this actually works like this without an unmap callback. unmap will break:
iommufd_access_notify_unmap(iopt, area_first, length);
/* Something is not responding to unmap requests. */
tries++;
if (WARN_ON(tries > 100))
        return -EDEADLOCK;
If it can't shoot down the pinning.
Hmm, I thought we want the unmap to fail until VMM releases the HW QUEUE first? In what case, does VMM wants to unmap while holding the queue pages?
Well, if that is what we want to do then this needs to be revised somehow..
This is more reason to put the pin/access in the driver so it can provide an unmap callback that can fix it up.
As there are two types of "access" here.. you mean iommufd_access, i.e. the vcmdq driver should hold an iommufd_access like an emulated vfio device driver does?
Yes.
Jason
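For reference, a rough sketch of what "holding an iommufd_access like an emulated vfio device driver" could look like from a HW queue driver. Only the iommufd_access_*() calls and struct iommufd_access_ops come from the existing include/linux/iommufd.h API; the my_hw_queue_* names are made up for illustration and error unwind is omitted:

/*
 * Hypothetical sketch only: the driver registers an access with an unmap
 * callback so iommufd can notify it, then pins the guest queue memory.
 */
static void my_hw_queue_unmap(void *data, unsigned long iova,
			      unsigned long length)
{
	/* The driver would quiesce the HW queue and drop the pinned pages */
}

static const struct iommufd_access_ops my_hw_queue_access_ops = {
	.needs_pin_pages = 1,
	.unmap = my_hw_queue_unmap,
};

static int my_hw_queue_pin(struct iommufd_ctx *ictx, u32 ioas_id,
			   unsigned long base_iova, unsigned long length,
			   struct page **pages, void *drvdata)
{
	struct iommufd_access *access;
	u32 access_id;
	int rc;

	access = iommufd_access_create(ictx, &my_hw_queue_access_ops, drvdata,
				       &access_id);
	if (IS_ERR(access))
		return PTR_ERR(access);

	rc = iommufd_access_attach(access, ioas_id);
	if (rc)
		return rc;

	/* Pin the guest-owned queue memory for the HW QUEUE lifetime */
	return iommufd_access_pin_pages(access, base_iova, length, pages,
					IOMMUFD_ACCESS_RW_WRITE);
}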
On Thu, May 15, 2025 at 03:59:38PM -0300, Jason Gunthorpe wrote:
On Thu, May 15, 2025 at 11:16:45AM -0700, Nicolin Chen wrote:
I don't think this actually works like this without an unmap callback. unmap will break:
iommufd_access_notify_unmap(iopt, area_first, length);
/* Something is not responding to unmap requests. */
tries++;
if (WARN_ON(tries > 100))
	return -EDEADLOCK;
If it can't shoot down the pinning.
Hmm, I thought we want the unmap to fail until the VMM releases the HW QUEUE first? In what case would the VMM want to unmap while still holding the queue pages?
Well, if that is what we want to do then this needs to be revised somehow..
Yea, unless we have a strong reason to allow unmap while holding the HW queue.
I think we could set a new flag:
enum {
	IOMMUFD_ACCESS_RW_READ = 0,
	IOMMUFD_ACCESS_RW_WRITE = 1 << 0,
	/* Set if the caller is in a kthread then rw will use kthread_use_mm() */
	IOMMUFD_ACCESS_RW_KTHREAD = 1 << 1,
+	IOMMUFD_ACCESS_NO_UNMAP = 1 << 3,

	/* Only for use by selftest */
	__IOMMUFD_ACCESS_RW_SLOW_PATH = 1 << 2,
};
and reject iopt_unmap_iova_range().
Thanks Nicolin
On Thu, May 15, 2025 at 01:32:48PM -0700, Nicolin Chen wrote:
On Thu, May 15, 2025 at 03:59:38PM -0300, Jason Gunthorpe wrote:
On Thu, May 15, 2025 at 11:16:45AM -0700, Nicolin Chen wrote:
I don't think this actually works like this without an unmap callback. unmap will break:
iommufd_access_notify_unmap(iopt, area_first, length);
/* Something is not responding to unmap requests. */
tries++;
if (WARN_ON(tries > 100))
	return -EDEADLOCK;
If it can't shoot down the pinning.
Hmm, I thought we want the unmap to fail until the VMM releases the HW QUEUE first? In what case would the VMM want to unmap while still holding the queue pages?
Well, if that is what we want to do then this needs to be revised somehow..
Yea, unless we have a strong reason to allow unmap while holding the HW queue.
I think we could set a new flag:
enum {
	IOMMUFD_ACCESS_RW_READ = 0,
	IOMMUFD_ACCESS_RW_WRITE = 1 << 0,
	/* Set if the caller is in a kthread then rw will use kthread_use_mm() */
	IOMMUFD_ACCESS_RW_KTHREAD = 1 << 1,
+	IOMMUFD_ACCESS_NO_UNMAP = 1 << 3,

	/* Only for use by selftest */
	__IOMMUFD_ACCESS_RW_SLOW_PATH = 1 << 2,
};
and reject iopt_unmap_iova_range().
Okay, it would need a patch for this too. I think we wanted to limit this no_unmap behavior though. Linking it to a deliberate action that the user took to create a vqueue with a user-provided address seems reasonable.
I would probably put the flag out of the public header though, just to prevent abuse from mdev drivers.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, May 16, 2025 12:06 AM
Do we have way to make the pinning optional?
As I understand AMD's system the iommu HW itself translates the base_addr through the S2 page table automatically, so it doesn't need pinned memory and physical addresses but just the IOVA.
Though using an IOVA could eliminate pinning conceptually, implementation-wise an IOMMU may not tolerate translation errors in its accesses to guest queues, on the assumption that the S2 is pinned.
@Vasant, can you help confirm?
On Fri, May 16, 2025 at 02:42:32AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, May 16, 2025 12:06 AM
Do we have way to make the pinning optional?
As I understand AMD's system the iommu HW itself translates the base_addr through the S2 page table automatically, so it doesn't need pinned memory and physical addresses but just the IOVA.
Though using an IOVA could eliminate pinning conceptually, implementation-wise an IOMMU may not tolerate translation errors in its accesses to guest queues, on the assumption that the S2 is pinned.
Yes, but the entire S2 is pinned today. This isn't about transient unmap..
If the VMM decides to unmap the memory, e.g. with hot-unplug or something, then I'd fully expect the IOMMU to take a fault and forward the error to the guest. The guest made a mistake by putting the queue in memory that was hot-unplugged.
Jason
NVIDIA Virtual Command Queue is one of the iommufd users exposing vIOMMU features to user space VMs. Its hardware has a strict rule when mapping and unmapping multiple global CMDQVs to/from a VM-owned VINTF, requiring mappings in ascending order and unmappings in descending order.
The tegra241-cmdqv driver can apply the rule for a mapping in the LVCMDQ allocation handler. However, it can't do the same for an unmapping, since user space could issue destroy calls in a random order that breaks the rule, while the destroy op at the driver level can't reject a destroy call as it returns void.
Add iommufd_hw_queue_depend/undepend for-driver helpers, allowing the LVCMDQ allocator to refcount_inc() a sibling LVCMDQ object and the LVCMDQ destroyer to refcount_dec(), so that the iommufd core will help block an out-of-order destroy call that breaks the rule.
This is a bit of a compromise, because a driver might end up abusing the API in a way that deadlocks the objects. So restrict the API to a dependency between two driver-allocated objects of the same type, as iommufd is unlikely to build any core-level dependency in this case. And encourage the use of the macro version, which currently supports the HW QUEUE objects only.
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com Reviewed-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- include/linux/iommufd.h | 44 ++++++++++++++++++++++++++++++++++ drivers/iommu/iommufd/driver.c | 28 ++++++++++++++++++++++ 2 files changed, 72 insertions(+)
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 0b5bea7286b8..ddca0d2835df 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -241,6 +241,10 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx, size_t size, enum iommufd_object_type type); void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj); +int _iommufd_object_depend(struct iommufd_object *obj_dependent, + struct iommufd_object *obj_depended); +void _iommufd_object_undepend(struct iommufd_object *obj_dependent, + struct iommufd_object *obj_depended); struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id); int iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu, @@ -261,6 +265,18 @@ static inline void iommufd_object_abort(struct iommufd_ctx *ictx, { }
+static inline int _iommufd_object_depend(struct iommufd_object *obj_dependent, + struct iommufd_object *obj_depended) +{ + return -EOPNOTSUPP; +} + +static inline void +_iommufd_object_undepend(struct iommufd_object *obj_dependent, + struct iommufd_object *obj_depended) +{ +} + static inline struct device * iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) { @@ -346,4 +362,32 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu, iommufd_object_abort(drv_struct->member.ictx, \ &drv_struct->member.obj); \ }) + +/* + * Helpers for IOMMU driver to build/destroy a dependency between two sibling + * structures created by one of the allocators above + */ +#define iommufd_hw_queue_depend(dependent, depended, member) \ + ({ \ + static_assert(__same_type(struct iommufd_hw_queue, \ + dependent->member)); \ + static_assert(offsetof(typeof(*dependent), member.obj) == 0); \ + static_assert(__same_type(struct iommufd_hw_queue, \ + depended->member)); \ + static_assert(offsetof(typeof(*depended), member.obj) == 0); \ + _iommufd_object_depend(&dependent->member.obj, \ + &depended->member.obj); \ + }) + +#define iommufd_hw_queue_undepend(dependent, depended, member) \ + ({ \ + static_assert(__same_type(struct iommufd_hw_queue, \ + dependent->member)); \ + static_assert(offsetof(typeof(*dependent), member.obj) == 0); \ + static_assert(__same_type(struct iommufd_hw_queue, \ + depended->member)); \ + static_assert(offsetof(typeof(*depended), member.obj) == 0); \ + _iommufd_object_undepend(&dependent->member.obj, \ + &depended->member.obj); \ + }) #endif diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c index 7980a09761c2..0bcf0438d255 100644 --- a/drivers/iommu/iommufd/driver.c +++ b/drivers/iommu/iommufd/driver.c @@ -50,6 +50,34 @@ void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj) } EXPORT_SYMBOL_NS_GPL(iommufd_object_abort, "IOMMUFD");
+/* Driver should use a per-structure helper in include/linux/iommufd.h */ +int _iommufd_object_depend(struct iommufd_object *obj_dependent, + struct iommufd_object *obj_depended) +{ + /* Reject self dependency that dead locks */ + if (obj_dependent == obj_depended) + return -EINVAL; + /* Only support dependency between two objects of the same type */ + if (obj_dependent->type != obj_depended->type) + return -EINVAL; + + refcount_inc(&obj_depended->users); + return 0; +} +EXPORT_SYMBOL_NS_GPL(_iommufd_object_depend, "IOMMUFD"); + +/* Driver should use a per-structure helper in include/linux/iommufd.h */ +void _iommufd_object_undepend(struct iommufd_object *obj_dependent, + struct iommufd_object *obj_depended) +{ + if (WARN_ON_ONCE(obj_dependent == obj_depended || + obj_dependent->type != obj_depended->type)) + return; + + refcount_dec(&obj_depended->users); +} +EXPORT_SYMBOL_NS_GPL(_iommufd_object_undepend, "IOMMUFD"); + /* Caller should xa_lock(&viommu->vdevs) to protect the return value */ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id)
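For illustration, a minimal driver-side usage sketch of these helpers. struct my_hw_queue and its functions are hypothetical; the selftest later in this series does the equivalent for the mock driver:

/* Sketch only: enforce ascending alloc / descending destroy via the helpers */
struct my_hw_queue {
	struct iommufd_hw_queue core;	/* must be the first member */
	struct my_hw_queue *prev;
};

/* At allocation time: block destroying @prev before @hw_queue */
static int my_hw_queue_chain(struct my_hw_queue *hw_queue,
			     struct my_hw_queue *prev)
{
	int rc = iommufd_hw_queue_depend(hw_queue, prev, core);

	if (rc)
		return rc;
	hw_queue->prev = prev;
	return 0;
}

/* At destroy time: release the hold on @prev */
static void my_hw_queue_unchain(struct my_hw_queue *hw_queue)
{
	if (hw_queue->prev)
		iommufd_hw_queue_undepend(hw_queue, hw_queue->prev, core);
}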
On Thu, May 08, 2025 at 08:02:33PM -0700, Nicolin Chen wrote:
+/*
- Helpers for IOMMU driver to build/destroy a dependency between two sibling
- structures created by one of the allocators above
- */
+#define iommufd_hw_queue_depend(dependent, depended, member) \
- ({ \
static_assert(__same_type(struct iommufd_hw_queue, \
dependent->member)); \
static_assert(offsetof(typeof(*dependent), member.obj) == 0); \
static_assert(__same_type(struct iommufd_hw_queue, \
depended->member)); \
static_assert(offsetof(typeof(*depended), member.obj) == 0); \
_iommufd_object_depend(&dependent->member.obj, \
&depended->member.obj); \
- })
This doesn't need the offsetof == 0 checks, it isn't an allocator. And you want to check that the two structs have the same type:
static_assert(__same_type(struct iommufd_hw_queue, dependent->member));
static_assert(__same_type(typeof(*dependent), *depended));
Jason
On Thu, May 15, 2025 at 01:12:54PM -0300, Jason Gunthorpe wrote:
On Thu, May 08, 2025 at 08:02:33PM -0700, Nicolin Chen wrote:
+/*
- Helpers for IOMMU driver to build/destroy a dependency between two sibling
- structures created by one of the allocators above
- */
+#define iommufd_hw_queue_depend(dependent, depended, member) \
- ({ \
static_assert(__same_type(struct iommufd_hw_queue, \
dependent->member)); \
static_assert(offsetof(typeof(*dependent), member.obj) == 0); \
static_assert(__same_type(struct iommufd_hw_queue, \
depended->member)); \
static_assert(offsetof(typeof(*depended), member.obj) == 0); \
_iommufd_object_depend(&dependent->member.obj, \
&depended->member.obj); \
- })
This doesn't need the offsetof == 0 checks, it isn't an allocator. And you want to check that the two structs have the same type:
static_assert(__same_type(struct iommufd_hw_queue, dependent->member));
static_assert(__same_type(typeof(*dependent), *depended));
Ack. I also added ictx comparison:
+#define iommufd_hw_queue_depend(dependent, depended, member)                \
+	({                                                                   \
+		int ret = -EINVAL;                                           \
+                                                                            \
+		static_assert(__same_type(struct iommufd_hw_queue,           \
+					  dependent->member));               \
+		static_assert(__same_type(typeof(*dependent), *depended));   \
+		if (!WARN_ON_ONCE(dependent->member.ictx !=                  \
+				  depended->member.ictx))                    \
+			ret = _iommufd_object_depend(&dependent->member.obj, \
+						     &depended->member.obj); \
+		ret;                                                         \
+	})
+
+#define iommufd_hw_queue_undepend(dependent, depended, member)                 \
+	({                                                                      \
+		static_assert(__same_type(struct iommufd_hw_queue,              \
+					  dependent->member));                  \
+		static_assert(__same_type(typeof(*dependent), *depended));      \
+		WARN_ON_ONCE(dependent->member.ictx != depended->member.ictx); \
+		_iommufd_object_undepend(&dependent->member.obj,               \
+					 &depended->member.obj);               \
+	})
Thanks Nicolin
Some simple tests for IOMMUFD_CMD_HW_QUEUE_ALLOC infrastructure covering the new iommufd_hw_queue_depend/undepend() helpers.
Since iommufd_hw_queue_alloc_ioctl() will call iommu_iova_to_phys() for input address validation, a negative test might intentionally pass in a bad input. So drop the WARN_ON() in mock_domain_iova_to_phys().
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_test.h | 3 + tools/testing/selftests/iommu/iommufd_utils.h | 31 ++++++++ drivers/iommu/iommufd/selftest.c | 73 ++++++++++++++++++- tools/testing/selftests/iommu/iommufd.c | 60 +++++++++++++++ .../selftests/iommu/iommufd_fail_nth.c | 6 ++ 5 files changed, 172 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index fbf9ecb35a13..51cd744a354f 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -265,4 +265,7 @@ struct iommu_viommu_event_selftest { __u32 virt_id; };
+#define IOMMU_HW_QUEUE_TYPE_SELFTEST 0xdeadbeef +#define IOMMU_TEST_HW_QUEUE_MAX 2 + #endif diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h index a5d4cbd089ba..463116fcfa9c 100644 --- a/tools/testing/selftests/iommu/iommufd_utils.h +++ b/tools/testing/selftests/iommu/iommufd_utils.h @@ -956,6 +956,37 @@ static int _test_cmd_vdevice_alloc(int fd, __u32 viommu_id, __u32 idev_id, _test_cmd_vdevice_alloc(self->fd, viommu_id, idev_id, \ virt_id, vdev_id))
+static int _test_cmd_hw_queue_alloc(int fd, __u32 viommu_id, __u32 type, + __u32 idx, __u64 base_addr, __u64 length, + __u32 *hw_queue_id) +{ + struct iommu_hw_queue_alloc cmd = { + .size = sizeof(cmd), + .viommu_id = viommu_id, + .type = type, + .index = idx, + .base_addr = base_addr, + .length = length, + }; + int ret; + + ret = ioctl(fd, IOMMU_HW_QUEUE_ALLOC, &cmd); + if (ret) + return ret; + if (hw_queue_id) + *hw_queue_id = cmd.out_hw_queue_id; + return 0; +} + +#define test_cmd_hw_queue_alloc(viommu_id, type, idx, base_addr, len, out_qid) \ + ASSERT_EQ(0, _test_cmd_hw_queue_alloc(self->fd, viommu_id, type, idx, \ + base_addr, len, out_qid)) +#define test_err_hw_queue_alloc(_errno, viommu_id, type, idx, base_addr, len, \ + out_qid) \ + EXPECT_ERRNO(_errno, \ + _test_cmd_hw_queue_alloc(self->fd, viommu_id, type, idx, \ + base_addr, len, out_qid)) + static int _test_cmd_veventq_alloc(int fd, __u32 viommu_id, __u32 type, __u32 *veventq_id, __u32 *veventq_fd) { diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 033345f81df9..692c29ccdc46 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -148,6 +148,7 @@ to_mock_nested(struct iommu_domain *domain) struct mock_viommu { struct iommufd_viommu core; struct mock_iommu_domain *s2_parent; + struct mock_hw_queue *hw_queue[IOMMU_TEST_HW_QUEUE_MAX]; };
static inline struct mock_viommu *to_mock_viommu(struct iommufd_viommu *viommu) @@ -155,6 +156,19 @@ static inline struct mock_viommu *to_mock_viommu(struct iommufd_viommu *viommu) return container_of(viommu, struct mock_viommu, core); }
+struct mock_hw_queue { + struct iommufd_hw_queue core; + struct mock_viommu *mock_viommu; + struct mock_hw_queue *prev; + u16 index; +}; + +static inline struct mock_hw_queue * +to_mock_hw_queue(struct iommufd_hw_queue *hw_queue) +{ + return container_of(hw_queue, struct mock_hw_queue, core); +} + enum selftest_obj_type { TYPE_IDEV, }; @@ -570,7 +584,8 @@ static phys_addr_t mock_domain_iova_to_phys(struct iommu_domain *domain,
WARN_ON(iova % MOCK_IO_PAGE_SIZE); ent = xa_load(&mock->pfns, iova / MOCK_IO_PAGE_SIZE); - WARN_ON(!ent); + if (!ent) + return 0; return (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE; }
@@ -727,10 +742,66 @@ static int mock_viommu_cache_invalidate(struct iommufd_viommu *viommu, return rc; }
+/* Test iommufd_hw_queue_depend/undepend() */ +static struct iommufd_hw_queue * +mock_hw_queue_alloc(struct iommufd_viommu *viommu, unsigned int type, u32 index, + u64 base_addr, size_t length) +{ + struct mock_viommu *mock_viommu = to_mock_viommu(viommu); + struct mock_hw_queue *mock_hw_queue, *prev = NULL; + int rc; + + if (type != IOMMU_HW_QUEUE_TYPE_SELFTEST) + return ERR_PTR(-EOPNOTSUPP); + if (index >= IOMMU_TEST_HW_QUEUE_MAX) + return ERR_PTR(-EINVAL); + if (mock_viommu->hw_queue[index]) + return ERR_PTR(-EEXIST); + if (index) { + prev = mock_viommu->hw_queue[index - 1]; + if (!prev) + return ERR_PTR(-EIO); + } + + mock_hw_queue = + iommufd_hw_queue_alloc(viommu, struct mock_hw_queue, core); + if (IS_ERR(mock_hw_queue)) + return ERR_CAST(mock_hw_queue); + + if (prev) { + rc = iommufd_hw_queue_depend(mock_hw_queue, prev, core); + if (rc) + goto free_hw_queue; + } + mock_hw_queue->prev = prev; + mock_hw_queue->mock_viommu = mock_viommu; + mock_viommu->hw_queue[index] = mock_hw_queue; + + return &mock_hw_queue->core; +free_hw_queue: + iommufd_struct_destroy(mock_hw_queue, core); + return ERR_PTR(rc); +} + +static void mock_hw_queue_destroy(struct iommufd_hw_queue *hw_queue) +{ + struct mock_hw_queue *mock_hw_queue = to_mock_hw_queue(hw_queue); + struct mock_viommu *mock_viommu = mock_hw_queue->mock_viommu; + + mock_viommu->hw_queue[mock_hw_queue->index] = NULL; + if (mock_hw_queue->prev) + iommufd_hw_queue_undepend(mock_hw_queue, mock_hw_queue->prev, + core); + + /* iommufd core frees mock_hw_queue and hw_queue */ +} + static struct iommufd_viommu_ops mock_viommu_ops = { .destroy = mock_viommu_destroy, .alloc_domain_nested = mock_viommu_alloc_domain_nested, .cache_invalidate = mock_viommu_cache_invalidate, + .hw_queue_alloc = mock_hw_queue_alloc, + .hw_queue_destroy = mock_hw_queue_destroy, };
static struct iommufd_viommu * diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 666a91d53195..bfdb3a3f1d23 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -3031,6 +3031,66 @@ TEST_F(iommufd_viommu, vdevice_cache) } }
+TEST_F(iommufd_viommu, hw_queue) +{ + uint32_t viommu_id = self->viommu_id; + __u64 iova = MOCK_APERTURE_START; + uint32_t hw_queue_id[2]; + + if (viommu_id) { + /* Fail IOMMU_HW_QUEUE_TYPE_DEFAULT */ + test_err_hw_queue_alloc(EOPNOTSUPP, viommu_id, + IOMMU_HW_QUEUE_TYPE_DEFAULT, 0, iova, + PAGE_SIZE, &hw_queue_id[0]); + /* Fail queue addr and length */ + test_err_hw_queue_alloc(EINVAL, viommu_id, + IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, 0, + PAGE_SIZE, &hw_queue_id[0]); + test_err_hw_queue_alloc(EINVAL, viommu_id, + IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova, + 0, &hw_queue_id[0]); + test_err_hw_queue_alloc(EOVERFLOW, viommu_id, + IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, + ~(uint64_t)0, PAGE_SIZE, + &hw_queue_id[0]); + /* Fail missing iova */ + test_err_hw_queue_alloc(ENXIO, viommu_id, + IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova, + PAGE_SIZE, &hw_queue_id[0]); + + /* Map iova */ + test_ioctl_ioas_map(buffer, PAGE_SIZE, &iova); + + /* Fail index=1 and =MAX; must start from index=0 */ + test_err_hw_queue_alloc(EIO, viommu_id, + IOMMU_HW_QUEUE_TYPE_SELFTEST, 1, iova, + PAGE_SIZE, &hw_queue_id[0]); + test_err_hw_queue_alloc(EINVAL, viommu_id, + IOMMU_HW_QUEUE_TYPE_SELFTEST, + IOMMU_TEST_HW_QUEUE_MAX, iova, + PAGE_SIZE, &hw_queue_id[0]); + + /* Allocate index=0 */ + test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, + 0, iova, PAGE_SIZE, &hw_queue_id[0]); + /* Fail duplicate */ + test_err_hw_queue_alloc(EEXIST, viommu_id, + IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova, + PAGE_SIZE, &hw_queue_id[0]); + + /* Allocate index=1 */ + test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, + 1, iova, PAGE_SIZE, &hw_queue_id[1]); + /* Fail to destroy, due to dependency */ + EXPECT_ERRNO(EBUSY, + _test_ioctl_destroy(self->fd, hw_queue_id[0])); + + /* Destroy in descending order */ + test_ioctl_destroy(hw_queue_id[1]); + test_ioctl_destroy(hw_queue_id[0]); + } +} + FIXTURE(iommufd_device_pasid) { int fd; diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c index f7ccf1822108..41c685bbd252 100644 --- a/tools/testing/selftests/iommu/iommufd_fail_nth.c +++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c @@ -634,6 +634,7 @@ TEST_FAIL_NTH(basic_fail_nth, device) uint32_t idev_id; uint32_t hwpt_id; uint32_t viommu_id; + uint32_t hw_queue_id; uint32_t vdev_id; __u64 iova;
@@ -696,6 +697,11 @@ TEST_FAIL_NTH(basic_fail_nth, device) if (_test_cmd_vdevice_alloc(self->fd, viommu_id, idev_id, 0, &vdev_id)) return -1;
+ if (_test_cmd_hw_queue_alloc(self->fd, viommu_id, + IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova, + PAGE_SIZE, &hw_queue_id)) + return -1; + if (_test_ioctl_fault_alloc(self->fd, &fault_id, &fault_fd)) return -1; close(fault_fd);
For a vIOMMU passing through HW resources to user space (VMs), allowing a VM to control the passed-through HW directly by accessing its hardware registers, add an mmap infrastructure to map the physical MMIO pages to user space.
Maintain a maple tree per ictx as a translation table managing mmappable regions, mapping an allocated for-user mmap offset to an iommufd_mmap struct, which stores the real PFN range for a remap_pfn_range() call.
To allow an IOMMU driver to add and delete mmappable regions to/from the maple tree, add iommufd_viommu_alloc/destroy_mmap helpers.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_private.h | 11 +++++ include/linux/iommufd.h | 39 ++++++++++++++++ drivers/iommu/iommufd/driver.c | 52 +++++++++++++++++++++ drivers/iommu/iommufd/main.c | 61 +++++++++++++++++++++++++ 4 files changed, 163 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 36a4e060982f..c87326d7ecfc 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -7,6 +7,7 @@ #include <linux/iommu.h> #include <linux/iommufd.h> #include <linux/iova_bitmap.h> +#include <linux/maple_tree.h> #include <linux/rwsem.h> #include <linux/uaccess.h> #include <linux/xarray.h> @@ -44,6 +45,7 @@ struct iommufd_ctx { struct xarray groups; wait_queue_head_t destroy_wait; struct rw_semaphore ioas_creation_lock; + struct maple_tree mt_mmap;
struct mutex sw_msi_lock; struct list_head sw_msi_list; @@ -55,6 +57,15 @@ struct iommufd_ctx { struct iommufd_ioas *vfio_ioas; };
+/* Entry for iommufd_ctx::mt_mmap */ +struct iommufd_mmap { + struct iommufd_object *owner; + + /* Physical range for remap_pfn_range() */ + unsigned long base_pfn; + unsigned long num_pfns; +}; + /* * The IOVA to PFN map. The map automatically copies the PFNs into multiple * domains and permits sharing of PFNs between io_pagetable instances. This diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index ddca0d2835df..47af130f4212 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -245,6 +245,10 @@ int _iommufd_object_depend(struct iommufd_object *obj_dependent, struct iommufd_object *obj_depended); void _iommufd_object_undepend(struct iommufd_object *obj_dependent, struct iommufd_object *obj_depended); +int _iommufd_alloc_mmap(struct iommufd_ctx *ictx, struct iommufd_object *owner, + phys_addr_t base, size_t length, unsigned long *offset); +void _iommufd_destroy_mmap(struct iommufd_ctx *ictx, + struct iommufd_object *owner, unsigned long offset); struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id); int iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu, @@ -277,6 +281,20 @@ _iommufd_object_undepend(struct iommufd_object *obj_dependent, { }
+static inline int _iommufd_alloc_mmap(struct iommufd_ctx *ictx, + struct iommufd_object *owner, + phys_addr_t base, size_t length, + unsigned long *offset) +{ + return -EOPNOTSUPP; +} + +static inline void _iommufd_destroy_mmap(struct iommufd_ctx *ictx, + struct iommufd_object *owner, + unsigned long offset) +{ +} + static inline struct device * iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) { @@ -390,4 +408,25 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu, _iommufd_object_undepend(&dependent->member.obj, \ &depended->member.obj); \ }) + +/* + * Helpers for IOMMU driver to alloc/destroy an mmapable area for a structure. + * Driver should report the @out_offset and @length to user space for an mmap() + */ +#define iommufd_viommu_alloc_mmap(viommu, member, base, length, out_offset) \ + ({ \ + static_assert(__same_type(struct iommufd_viommu, \ + viommu->member)); \ + static_assert(offsetof(typeof(*viommu), member.obj) == 0); \ + _iommufd_alloc_mmap(viommu->member.ictx, &viommu->member.obj, \ + base, length, out_offset); \ + }) +#define iommufd_viommu_destroy_mmap(viommu, member, offset) \ + ({ \ + static_assert(__same_type(struct iommufd_viommu, \ + viommu->member)); \ + static_assert(offsetof(typeof(*viommu), member.obj) == 0); \ + _iommufd_destroy_mmap(viommu->member.ictx, \ + &viommu->member.obj, offset); \ + }) #endif diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c index 0bcf0438d255..2d2695e2562d 100644 --- a/drivers/iommu/iommufd/driver.c +++ b/drivers/iommu/iommufd/driver.c @@ -78,6 +78,58 @@ void _iommufd_object_undepend(struct iommufd_object *obj_dependent, } EXPORT_SYMBOL_NS_GPL(_iommufd_object_undepend, "IOMMUFD");
+/* + * Allocate an @offset to return to user space to use for an mmap() syscall + * + * Driver should use a per-structure helper in include/linux/iommufd.h + */ +int _iommufd_alloc_mmap(struct iommufd_ctx *ictx, struct iommufd_object *owner, + phys_addr_t base, size_t length, unsigned long *offset) +{ + struct iommufd_mmap *immap; + unsigned long startp; + int num_pfns, rc; + + if (WARN_ON_ONCE(!offset)) + return -EINVAL; + if (!PAGE_ALIGNED(base)) + return -EINVAL; + if (!length || !PAGE_ALIGNED(length)) + return -EINVAL; + num_pfns = length >> PAGE_SHIFT; + + immap = kzalloc(sizeof(*immap), GFP_KERNEL); + if (!immap) + return -ENOMEM; + immap->owner = owner; + immap->base_pfn = base >> PAGE_SHIFT; + immap->num_pfns = length >> PAGE_SHIFT; + + rc = mtree_alloc_range(&ictx->mt_mmap, &startp, immap, immap->num_pfns, + 0, U32_MAX >> PAGE_SHIFT, GFP_KERNEL); + if (rc < 0) { + kfree(immap); + return rc; + } + + /* mmap() syscall will right-shift the offset in vma->vm_pgoff */ + *offset = startp << PAGE_SHIFT; + return 0; +} +EXPORT_SYMBOL_NS_GPL(_iommufd_alloc_mmap, "IOMMUFD"); + +/* Driver should use a per-structure helper in include/linux/iommufd.h */ +void _iommufd_destroy_mmap(struct iommufd_ctx *ictx, + struct iommufd_object *owner, unsigned long offset) +{ + struct iommufd_mmap *immap; + + immap = mtree_erase(&ictx->mt_mmap, offset >> PAGE_SHIFT); + WARN_ON_ONCE(!immap || immap->owner != owner); + kfree(immap); +} +EXPORT_SYMBOL_NS_GPL(_iommufd_destroy_mmap, "IOMMUFD"); + /* Caller should xa_lock(&viommu->vdevs) to protect the return value */ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 10410e2f710a..1d7f3584aea0 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -213,6 +213,7 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp) xa_init_flags(&ictx->objects, XA_FLAGS_ALLOC1 | XA_FLAGS_ACCOUNT); xa_init(&ictx->groups); ictx->file = filp; + mt_init_flags(&ictx->mt_mmap, MT_FLAGS_ALLOC_RANGE); init_waitqueue_head(&ictx->destroy_wait); mutex_init(&ictx->sw_msi_lock); INIT_LIST_HEAD(&ictx->sw_msi_list); @@ -410,11 +411,71 @@ static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd, return ret; }
+static void iommufd_fops_vma_open(struct vm_area_struct *vma) +{ + struct iommufd_mmap *immap = vma->vm_private_data; + + refcount_inc(&immap->owner->users); +} + +static void iommufd_fops_vma_close(struct vm_area_struct *vma) +{ + struct iommufd_mmap *immap = vma->vm_private_data; + + refcount_dec(&immap->owner->users); +} + +static const struct vm_operations_struct iommufd_vma_ops = { + .open = iommufd_fops_vma_open, + .close = iommufd_fops_vma_close, +}; + +/* + * Kernel driver must first use the for-driver helpers to register an mmappable + * MMIO region to the iommufd core to allocate an offset. Then, it should report + * to user space this offset and the length of the MMIO region for mmap syscall, + * via a prior IOMMU_VIOMMU_ALLOC ioctl. + */ +static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct iommufd_ctx *ictx = filp->private_data; + size_t length = vma->vm_end - vma->vm_start; + struct iommufd_mmap *immap; + int rc; + + if (!PAGE_ALIGNED(length)) + return -EINVAL; + if (!(vma->vm_flags & VM_SHARED)) + return -EINVAL; + if (vma->vm_flags & VM_EXEC) + return -EPERM; + + /* vma->vm_pgoff carries an index to an mtree entry (immap) */ + immap = mtree_load(&ictx->mt_mmap, vma->vm_pgoff); + if (!immap) + return -ENXIO; + if (length >> PAGE_SHIFT != immap->num_pfns) + return -ENXIO; + + vma->vm_pgoff = 0; + vma->vm_private_data = immap; + vma->vm_ops = &iommufd_vma_ops; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO); + + rc = remap_pfn_range(vma, vma->vm_start, immap->base_pfn, length, + vma->vm_page_prot); + if (!rc) /* vm_ops.open won't be called for mmap itself. */ + refcount_inc(&immap->owner->users); + return rc; +} + static const struct file_operations iommufd_fops = { .owner = THIS_MODULE, .open = iommufd_fops_open, .release = iommufd_fops_release, .unlocked_ioctl = iommufd_fops_ioctl, + .mmap = iommufd_fops_mmap, };
/**
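As a rough driver-side sketch of the new helpers. struct my_viommu and the my_viommu_* functions are hypothetical; the selftest and tegra241-cmdqv patches later in this series are the real users:

struct my_viommu {
	struct iommufd_viommu core;	/* must be the first member */
	unsigned long mmap_offset;
};

/* Register one page of MMIO; the returned offset goes back to user space */
static int my_viommu_expose_mmio(struct my_viommu *my_viommu,
				 phys_addr_t mmio_pa)
{
	return iommufd_viommu_alloc_mmap(my_viommu, core, mmio_pa, PAGE_SIZE,
					 &my_viommu->mmap_offset);
}

/* Undo it in the viommu destroy path */
static void my_viommu_unexpose_mmio(struct my_viommu *my_viommu)
{
	iommufd_viommu_destroy_mmap(my_viommu, core, my_viommu->mmap_offset);
}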
Hi Nicolin,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 92a09c47464d040866cf2b4cd052bc60555185fb]
url: https://github.com/intel-lab-lkp/linux/commits/Nicolin-Chen/iommufd-viommu-A... base: 92a09c47464d040866cf2b4cd052bc60555185fb patch link: https://lore.kernel.org/r/ee9ee287264fd75eb4fc64a63f20d03e9ba18161.174675763... patch subject: [PATCH v4 14/23] iommufd: Add mmap interface config: i386-buildonly-randconfig-003-20250509 (https://download.01.org/0day-ci/archive/20250509/202505092119.UALKhnIX-lkp@i...) compiler: clang version 20.1.2 (https://github.com/llvm/llvm-project 58df0ef89dd64126512e4ee27b4ac3fd8ddf6247) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250509/202505092119.UALKhnIX-lkp@i...)
If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot lkp@intel.com | Closes: https://lore.kernel.org/oe-kbuild-all/202505092119.UALKhnIX-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/iommu/iommufd/driver.c:91:6: warning: variable 'num_pfns' set but not used [-Wunused-but-set-variable]
91 | int num_pfns, rc; | ^ 1 warning generated.
vim +/num_pfns +91 drivers/iommu/iommufd/driver.c
80 81 /* 82 * Allocate an @offset to return to user space to use for an mmap() syscall 83 * 84 * Driver should use a per-structure helper in include/linux/iommufd.h 85 */ 86 int _iommufd_alloc_mmap(struct iommufd_ctx *ictx, struct iommufd_object *owner, 87 phys_addr_t base, size_t length, unsigned long *offset) 88 { 89 struct iommufd_mmap *immap; 90 unsigned long startp;
91 int num_pfns, rc;
92 93 if (WARN_ON_ONCE(!offset)) 94 return -EINVAL; 95 if (!PAGE_ALIGNED(base)) 96 return -EINVAL; 97 if (!length || !PAGE_ALIGNED(length)) 98 return -EINVAL; 99 num_pfns = length >> PAGE_SHIFT; 100 101 immap = kzalloc(sizeof(*immap), GFP_KERNEL); 102 if (!immap) 103 return -ENOMEM; 104 immap->owner = owner; 105 immap->base_pfn = base >> PAGE_SHIFT; 106 immap->num_pfns = length >> PAGE_SHIFT; 107 108 rc = mtree_alloc_range(&ictx->mt_mmap, &startp, immap, immap->num_pfns, 109 0, U32_MAX >> PAGE_SHIFT, GFP_KERNEL); 110 if (rc < 0) { 111 kfree(immap); 112 return rc; 113 } 114 115 /* mmap() syscall will right-shift the offset in vma->vm_pgoff */ 116 *offset = startp << PAGE_SHIFT; 117 return 0; 118 } 119 EXPORT_SYMBOL_NS_GPL(_iommufd_alloc_mmap, "IOMMUFD"); 120
On Thu, May 08, 2025 at 08:02:35PM -0700, Nicolin Chen wrote:
+int _iommufd_alloc_mmap(struct iommufd_ctx *ictx, struct iommufd_object *owner,
phys_addr_t base, size_t length, unsigned long *offset)
+{
...
- int num_pfns, rc;
...
- num_pfns = length >> PAGE_SHIFT;
I forgot to clean up this num_pfns, as "length >> PAGE_SHIFT" is already stored into immap->num_pfns.
Will drop this in v5.
Thanks Nicolin
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
+/*
+ * Kernel driver must first use the for-driver helpers to register an mmappable
+ * MMIO region to the iommufd core to allocate an offset. Then, it should report
+ * to user space this offset and the length of the MMIO region for mmap syscall,
+ * via a prior IOMMU_VIOMMU_ALLOC ioctl.
+ */
this comment better suits _iommufd_alloc_mmap()
+static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma) +{
- struct iommufd_ctx *ictx = filp->private_data;
- size_t length = vma->vm_end - vma->vm_start;
- struct iommufd_mmap *immap;
- int rc;
- if (!PAGE_ALIGNED(length))
return -EINVAL;
- if (!(vma->vm_flags & VM_SHARED))
return -EINVAL;
- if (vma->vm_flags & VM_EXEC)
return -EPERM;
- /* vma->vm_pgoff carries an index to an mtree entry (immap) */
- immap = mtree_load(&ictx->mt_mmap, vma->vm_pgoff);
- if (!immap)
return -ENXIO;
- if (length >> PAGE_SHIFT != immap->num_pfns)
return -ENXIO;
- vma->vm_pgoff = 0;
- vma->vm_private_data = immap;
- vma->vm_ops = &iommufd_vma_ops;
- vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
- vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND |
VM_DONTDUMP | VM_IO);
- rc = remap_pfn_range(vma, vma->vm_start, immap->base_pfn,
length,
vma->vm_page_prot);
- if (!rc) /* vm_ops.open won't be called for mmap itself. */
refcount_inc(&immap->owner->users);
- return rc;
let's add some words for this refcnt thing in the commit msg.
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Thu, May 08, 2025 at 08:02:35PM -0700, Nicolin Chen wrote:
+int _iommufd_alloc_mmap(struct iommufd_ctx *ictx, struct iommufd_object *owner,
phys_addr_t base, size_t length, unsigned long *offset)
+{
- struct iommufd_mmap *immap;
- unsigned long startp;
- int num_pfns, rc;
- if (WARN_ON_ONCE(!offset))
return -EINVAL;
We don't need checks like this, just let it oops
- if (!PAGE_ALIGNED(base))
return -EINVAL;
- if (!length || !PAGE_ALIGNED(length))
return -EINVAL;
- num_pfns = length >> PAGE_SHIFT;
- immap = kzalloc(sizeof(*immap), GFP_KERNEL);
- if (!immap)
return -ENOMEM;
- immap->owner = owner;
- immap->base_pfn = base >> PAGE_SHIFT;
'base' is a confusing name for the mmio address. Call it mmio_pfn or something
+static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma) +{
- struct iommufd_ctx *ictx = filp->private_data;
- size_t length = vma->vm_end - vma->vm_start;
- struct iommufd_mmap *immap;
- int rc;
- if (!PAGE_ALIGNED(length))
return -EINVAL;
- if (!(vma->vm_flags & VM_SHARED))
return -EINVAL;
- if (vma->vm_flags & VM_EXEC)
return -EPERM;
- /* vma->vm_pgoff carries an index to an mtree entry (immap) */
- immap = mtree_load(&ictx->mt_mmap, vma->vm_pgoff);
- if (!immap)
return -ENXIO;
- if (length >> PAGE_SHIFT != immap->num_pfns)
return -ENXIO;
This needs to validate that vm_pgoff is at the start of the immap or num_pfns is the wrong thing to validate length against.
length >> PAGE_SHIFT will truncate non-zero low bits, so it will not check the length properly.
- vma->vm_pgoff = 0;
- vma->vm_private_data = immap;
- vma->vm_ops = &iommufd_vma_ops;
- vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
- vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO);
remap_pfn_range already sets these vm_flags
- rc = remap_pfn_range(vma, vma->vm_start, immap->base_pfn, length,
vma->vm_page_prot);
This should be io_remap_pfn_range() if it is MMIO
- if (!rc) /* vm_ops.open won't be called for mmap itself. */
refcount_inc(&immap->owner->users);
- return rc;
Success oriented flow
Jason
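Putting these comments together, one possible shape for the function would be the sketch below. It is untested: it switches to io_remap_pfn_range(), drops the redundant vm_flags_set(), compares the length without truncation, and uses a success-oriented flow, but it does not yet address whether vm_pgoff must be the start index of the immap, which is still being discussed below:

static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct iommufd_ctx *ictx = filp->private_data;
	size_t length = vma->vm_end - vma->vm_start;
	struct iommufd_mmap *immap;
	int rc;

	if (!PAGE_ALIGNED(length))
		return -EINVAL;
	if (!(vma->vm_flags & VM_SHARED))
		return -EINVAL;
	if (vma->vm_flags & VM_EXEC)
		return -EPERM;

	/* vma->vm_pgoff carries an index to an mtree entry (immap) */
	immap = mtree_load(&ictx->mt_mmap, vma->vm_pgoff);
	if (!immap)
		return -ENXIO;
	/* Reject a partial or oversized mapping of the registered region */
	if (length != immap->num_pfns << PAGE_SHIFT)
		return -ENXIO;

	vma->vm_pgoff = 0;
	vma->vm_private_data = immap;
	vma->vm_ops = &iommufd_vma_ops;
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

	rc = io_remap_pfn_range(vma, vma->vm_start, immap->base_pfn, length,
				vma->vm_page_prot);
	if (rc)
		return rc;

	/* vm_ops.open won't be called for mmap itself. */
	refcount_inc(&immap->owner->users);
	return 0;
}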
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, May 16, 2025 12:47 AM
On Thu, May 08, 2025 at 08:02:35PM -0700, Nicolin Chen wrote:
- /* vma->vm_pgoff carries an index to an mtree entry (immap) */
- immap = mtree_load(&ictx->mt_mmap, vma->vm_pgoff);
- if (!immap)
return -ENXIO;
- if (length >> PAGE_SHIFT != immap->num_pfns)
return -ENXIO;
This needs to validate that vm_pgoff is at the start of the immap or num_pfns is the wrong thing to validate length against.
vm_pgoff is the index into the mtree. If it's wrong, mtree_load() will fail already?
Extend the loopback test to a new mmap page.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_test.h | 4 +++ drivers/iommu/iommufd/selftest.c | 41 ++++++++++++++++++++++--- tools/testing/selftests/iommu/iommufd.c | 7 +++++ 3 files changed, 48 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index 51cd744a354f..8fc618b2bcf9 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -232,12 +232,16 @@ struct iommu_hwpt_invalidate_selftest { * (IOMMU_VIOMMU_TYPE_SELFTEST) * @in_data: Input random data from user space * @out_data: Output data (matching @in_data) to user space + * @out_mmap_offset: The offset argument for mmap syscall + * @out_mmap_length: The length argument for mmap syscall * * Simply set @out_data=@in_data for a loopback test */ struct iommu_viommu_selftest { __u32 in_data; __u32 out_data; + __aligned_u64 out_mmap_offset; + __aligned_u64 out_mmap_length; };
/* Should not be equal to any defined value in enum iommu_viommu_invalidate_data_type */ diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 692c29ccdc46..36e7fc0059dc 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -149,6 +149,9 @@ struct mock_viommu { struct iommufd_viommu core; struct mock_iommu_domain *s2_parent; struct mock_hw_queue *hw_queue[IOMMU_TEST_HW_QUEUE_MAX]; + + unsigned long mmap_offset; + u32 *page; /* Mmap page to test u32 type of in_data */ };
static inline struct mock_viommu *to_mock_viommu(struct iommufd_viommu *viommu) @@ -647,9 +650,14 @@ static void mock_viommu_destroy(struct iommufd_viommu *viommu) { struct mock_iommu_device *mock_iommu = container_of( viommu->iommu_dev, struct mock_iommu_device, iommu_dev); + struct mock_viommu *mock_viommu = to_mock_viommu(viommu);
if (refcount_dec_and_test(&mock_iommu->users)) complete(&mock_iommu->complete); + if (mock_viommu->mmap_offset) + iommufd_viommu_destroy_mmap(mock_viommu, core, + mock_viommu->mmap_offset); + free_page((unsigned long)mock_viommu->page);
/* iommufd core frees mock_viommu and viommu */ } @@ -831,17 +839,42 @@ mock_viommu_alloc(struct device *dev, struct iommu_domain *domain, return ERR_CAST(mock_viommu);
if (user_data) { + mock_viommu->page = + (u32 *)__get_free_page(GFP_KERNEL | __GFP_ZERO); + if (!mock_viommu->page) { + rc = -ENOMEM; + goto err_destroy_struct; + } + + rc = iommufd_viommu_alloc_mmap(mock_viommu, core, + __pa(mock_viommu->page), + PAGE_SIZE, + &mock_viommu->mmap_offset); + if (rc) + goto err_free_page; + + /* For loopback tests on both the page and out_data */ + *mock_viommu->page = data.in_data; data.out_data = data.in_data; + data.out_mmap_length = PAGE_SIZE; + data.out_mmap_offset = mock_viommu->mmap_offset; rc = iommu_copy_struct_to_user( user_data, &data, IOMMU_VIOMMU_TYPE_SELFTEST, out_data); - if (rc) { - iommufd_struct_destroy(mock_viommu, core); - return ERR_PTR(rc); - } + if (rc) + goto err_destroy_mmap; }
refcount_inc(&mock_iommu->users); return &mock_viommu->core; + +err_destroy_mmap: + iommufd_viommu_destroy_mmap(mock_viommu, core, + mock_viommu->mmap_offset); +err_free_page: + free_page((unsigned long)mock_viommu->page); +err_destroy_struct: + iommufd_struct_destroy(mock_viommu, core); + return ERR_PTR(rc); }
static const struct iommu_ops mock_ops = { diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index bfdb3a3f1d23..38ad71efb338 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -2799,12 +2799,19 @@ TEST_F(iommufd_viommu, viommu_alloc_with_data) struct iommu_viommu_selftest data = { .in_data = 0xbeef, }; + uint32_t *test;
if (self->device_id) { test_cmd_viommu_alloc(self->device_id, self->hwpt_id, IOMMU_VIOMMU_TYPE_SELFTEST, &data, sizeof(data), &self->viommu_id); assert(data.out_data == data.in_data); + test = mmap(NULL, data.out_mmap_length, PROT_READ | PROT_WRITE, + MAP_SHARED, self->fd, data.out_mmap_offset); + assert(test && *test == data.in_data); + EXPECT_ERRNO(EBUSY, + _test_ioctl_destroy(self->fd, self->viommu_id)); + munmap(test, data.out_mmap_length); } }
With the introduction of the new object and its infrastructure, update the doc to reflect that.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- Documentation/userspace-api/iommufd.rst | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst index b0df15865dec..03f7510384d2 100644 --- a/Documentation/userspace-api/iommufd.rst +++ b/Documentation/userspace-api/iommufd.rst @@ -124,6 +124,17 @@ Following IOMMUFD objects are exposed to userspace: used to allocate a vEVENTQ. Each vIOMMU can support multiple types of vEVENTS, but is confined to one vEVENTQ per vEVENTQ type.
+- IOMMUFD_OBJ_HW_QUEUE, representing a hardware accelerated queue, as a subset + of IOMMU's virtualization features, for the IOMMU HW to directly read or write + the virtual queue memory owned by a guest OS. This HW-acceleration feature can + allow VM to work with the IOMMU HW directly without a VM Exit, so as to reduce + overhead from the hypercalls. Along with the HW QUEUE object, iommufd provides + user space an mmap interface for VMM to mmap a physical MMIO region from the + host physical address space to the guest physical address space, allowing the + guest OS to directly control the allocated HW QUEUE. Thus, when allocating a + HW QUEUE, the VMM must request a pair of mmap info (offset/length) and pass in + exactly to an mmap syscall via its offset and length arguments. + All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
The diagrams below show relationships between user-visible objects and kernel @@ -270,6 +281,7 @@ User visible objects are backed by following datastructures: - iommufd_viommu for IOMMUFD_OBJ_VIOMMU. - iommufd_vdevice for IOMMUFD_OBJ_VDEVICE. - iommufd_veventq for IOMMUFD_OBJ_VEVENTQ. +- iommufd_hw_queue for IOMMUFD_OBJ_HW_QUEUE.
Several terminologies when looking at these datastructures:
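As a minimal user-space sketch of the flow described above (the offset/length pair comes back in the driver-specific output data of the allocation ioctl; the function and parameter names here are assumptions):

#include <stdint.h>
#include <sys/mman.h>

/* Map the HW QUEUE MMIO region reported by the allocation ioctl */
static void *map_hw_queue_mmio(int iommufd, uint64_t offset, uint64_t length)
{
	void *va = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED,
			iommufd, offset);

	return va == MAP_FAILED ? NULL : va;
}

Both the offset and the length must be used exactly as reported, since the kernel side rejects a partial or mismatched mapping.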
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
With the introduction of the new object and its infrastructure, update the doc to reflect that.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Thu, May 08, 2025 at 08:02:37PM -0700, Nicolin Chen wrote:
With the introduction of the new object and its infrastructure, update the doc to reflect that.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Documentation/userspace-api/iommufd.rst | 12 ++++++++++++ 1 file changed, 12 insertions(+)
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Jason
An impl driver might want to allocate its own type of vIOMMU object or the standard IOMMU_VIOMMU_TYPE_ARM_SMMUV3 by setting up its own SW/HW bits, as the tegra241-cmdqv driver will add IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV.
Add a vsmmu_alloc op and prioritize it in arm_vsmmu_alloc().
Reviewed-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 6 ++++++ .../iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 17 +++++++++++------ 2 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index 6b8f0d20dac3..a5835af72417 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -16,6 +16,7 @@ #include <linux/sizes.h>
struct arm_smmu_device; +struct arm_smmu_domain;
/* MMIO registers */ #define ARM_SMMU_IDR0 0x0 @@ -720,6 +721,11 @@ struct arm_smmu_impl_ops { int (*init_structures)(struct arm_smmu_device *smmu); struct arm_smmu_cmdq *(*get_secondary_cmdq)( struct arm_smmu_device *smmu, struct arm_smmu_cmdq_ent *ent); + struct arm_vsmmu *(*vsmmu_alloc)( + struct arm_smmu_device *smmu, + struct arm_smmu_domain *smmu_domain, struct iommufd_ctx *ictx, + unsigned int viommu_type, + const struct iommu_user_data *user_data); };
/* An SMMUv3 instance */ diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c index 66855cae775e..b316d1df043f 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c @@ -392,10 +392,7 @@ struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev, iommu_get_iommu_dev(dev, struct arm_smmu_device, iommu); struct arm_smmu_master *master = dev_iommu_priv_get(dev); struct arm_smmu_domain *s2_parent = to_smmu_domain(parent); - struct arm_vsmmu *vsmmu; - - if (viommu_type != IOMMU_VIOMMU_TYPE_ARM_SMMUV3) - return ERR_PTR(-EOPNOTSUPP); + struct arm_vsmmu *vsmmu = ERR_PTR(-EOPNOTSUPP);
if (!(smmu->features & ARM_SMMU_FEAT_NESTING)) return ERR_PTR(-EOPNOTSUPP); @@ -423,8 +420,16 @@ struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev, !(smmu->features & ARM_SMMU_FEAT_S2FWB)) return ERR_PTR(-EOPNOTSUPP);
- vsmmu = iommufd_viommu_alloc(ictx, struct arm_vsmmu, core, - &arm_vsmmu_ops); + /* Prioritize the impl that may support IOMMU_VIOMMU_TYPE_ARM_SMMUV3 */ + if (master->smmu->impl_ops && master->smmu->impl_ops->vsmmu_alloc) + vsmmu = master->smmu->impl_ops->vsmmu_alloc( + master->smmu, s2_parent, ictx, viommu_type, user_data); + if (PTR_ERR(vsmmu) == -EOPNOTSUPP) { + /* Otherwise, allocate an IOMMU_VIOMMU_TYPE_ARM_SMMUV3 here */ + if (viommu_type == IOMMU_VIOMMU_TYPE_ARM_SMMUV3) + vsmmu = iommufd_viommu_alloc(ictx, struct arm_vsmmu, + core, &arm_vsmmu_ops); + } if (IS_ERR(vsmmu)) return ERR_CAST(vsmmu);
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
An impl driver might want to allocate its own type of vIOMMU object or the standard IOMMU_VIOMMU_TYPE_ARM_SMMUV3 by setting up its own SW/HW bits, as the tegra241-cmdqv driver will add IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV.
Add a vsmmu_alloc op and prioritize it in arm_vsmmu_alloc().
Reviewed-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Thu, May 08, 2025 at 08:02:38PM -0700, Nicolin Chen wrote:
An impl driver might want to allocate its own type of vIOMMU object or the standard IOMMU_VIOMMU_TYPE_ARM_SMMUV3 by setting up its own SW/HW bits, as the tegra241-cmdqv driver will add IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV.
Add a vsmmu_alloc op and prioritize it in arm_vsmmu_alloc().
Reviewed-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 6 ++++++ .../iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 17 +++++++++++------ 2 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index 6b8f0d20dac3..a5835af72417 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -16,6 +16,7 @@ #include <linux/sizes.h> struct arm_smmu_device; +struct arm_smmu_domain; /* MMIO registers */ #define ARM_SMMU_IDR0 0x0 @@ -720,6 +721,11 @@ struct arm_smmu_impl_ops { int (*init_structures)(struct arm_smmu_device *smmu); struct arm_smmu_cmdq *(*get_secondary_cmdq)( struct arm_smmu_device *smmu, struct arm_smmu_cmdq_ent *ent);
- struct arm_vsmmu *(*vsmmu_alloc)(
struct arm_smmu_device *smmu,
struct arm_smmu_domain *smmu_domain, struct iommufd_ctx *ictx,
unsigned int viommu_type,
const struct iommu_user_data *user_data);
};
I think you should put the supported viommu type here in the ops struct and match it here:
- /* Prioritize the impl that may support IOMMU_VIOMMU_TYPE_ARM_SMMUV3 */
- if (master->smmu->impl_ops && master->smmu->impl_ops->vsmmu_alloc)
vsmmu = master->smmu->impl_ops->vsmmu_alloc(
master->smmu, s2_parent, ictx, viommu_type, user_data);
instead of the EOPNOTSUPP dance. Either the impl_ops supports the requested viommu as an extension or we are running in the normal mode?
Is there a reason to allocate a different viommu if the userspace does not enable the implementation specific features?
Jason
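For illustration, the suggestion could look roughly like the sketch below, assuming a new "viommu_type" field in arm_smmu_impl_ops and a hypothetical helper name (neither exists in this series):

static struct arm_vsmmu *
arm_vsmmu_dispatch(struct arm_smmu_master *master,
		   struct arm_smmu_domain *s2_parent, struct iommufd_ctx *ictx,
		   unsigned int viommu_type,
		   const struct iommu_user_data *user_data)
{
	const struct arm_smmu_impl_ops *impl_ops = master->smmu->impl_ops;

	/* The impl declares the single viommu type it implements */
	if (impl_ops && impl_ops->vsmmu_alloc &&
	    viommu_type == impl_ops->viommu_type)
		return impl_ops->vsmmu_alloc(master->smmu, s2_parent, ictx,
					     viommu_type, user_data);
	/* Otherwise only the standard type is supported here */
	if (viommu_type != IOMMU_VIOMMU_TYPE_ARM_SMMUV3)
		return ERR_PTR(-EOPNOTSUPP);
	return iommufd_viommu_alloc(ictx, struct arm_vsmmu, core,
				    &arm_vsmmu_ops);
}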
On Thu, May 15, 2025 at 02:19:02PM -0300, Jason Gunthorpe wrote:
On Thu, May 08, 2025 at 08:02:38PM -0700, Nicolin Chen wrote:
An impl driver might want to allocate its own type of vIOMMU object or the standard IOMMU_VIOMMU_TYPE_ARM_SMMUV3 by setting up its own SW/HW bits, as the tegra241-cmdqv driver will add IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV.
Add a vsmmu_alloc op and prioritize it in arm_vsmmu_alloc().
Reviewed-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 6 ++++++ .../iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 17 +++++++++++------ 2 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index 6b8f0d20dac3..a5835af72417 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -16,6 +16,7 @@ #include <linux/sizes.h> struct arm_smmu_device; +struct arm_smmu_domain; /* MMIO registers */ #define ARM_SMMU_IDR0 0x0 @@ -720,6 +721,11 @@ struct arm_smmu_impl_ops { int (*init_structures)(struct arm_smmu_device *smmu); struct arm_smmu_cmdq *(*get_secondary_cmdq)( struct arm_smmu_device *smmu, struct arm_smmu_cmdq_ent *ent);
- struct arm_vsmmu *(*vsmmu_alloc)(
struct arm_smmu_device *smmu,
struct arm_smmu_domain *smmu_domain, struct iommufd_ctx *ictx,
unsigned int viommu_type,
const struct iommu_user_data *user_data);
};
I think you should put the supported viommu type here in the ops struct and match it here:
OK. A single type per impl might be enough for now, so it can be a static one.
- /* Prioritize the impl that may support IOMMU_VIOMMU_TYPE_ARM_SMMUV3 */
- if (master->smmu->impl_ops && master->smmu->impl_ops->vsmmu_alloc)
vsmmu = master->smmu->impl_ops->vsmmu_alloc(
master->smmu, s2_parent, ictx, viommu_type, user_data);
instead of the EOPNOTSUPP dance. Either the impl_ops supports the requested viommu as an extension or we are running in the normal mode?
I think we can only do normal mode if the requested viommu is the normal SMMUV3 type, i.e. we still need to reject a type that is neither CMDQV nor SMMUV3, right?
Is there a reason to allocate a different viommu if the userspace does not enable the implementation specific features?
Hmm, what is this different viommu?
If VMM doesn't want VCMDQ, it should go with the normal SMMUV3 type.
Thanks Nicolin
Repurpose the @__reserved field in struct iommu_hw_info_arm_smmuv3 to an HW implementation-defined field @impl.
This will be used by the Tegra241 CMDQV implementation on top of a standard ARM SMMUv3 IOMMU. The @impl field will only be valid if @flags has an implementation-defined flag set.
Thus, at the driver level, add a hw_info impl op that will return such a flag and fill the impl field.
Reviewed-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 + include/uapi/linux/iommufd.h | 4 ++-- .../iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 17 ++++++++++++++--- 3 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index a5835af72417..bab7a9ce1283 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -726,6 +726,7 @@ struct arm_smmu_impl_ops { struct arm_smmu_domain *smmu_domain, struct iommufd_ctx *ictx, unsigned int viommu_type, const struct iommu_user_data *user_data); + u32 (*hw_info)(struct arm_smmu_device *smmu, u32 *impl); };
/* An SMMUv3 instance */ diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index 001e2deb5a2d..28ab42a85268 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -554,7 +554,7 @@ struct iommu_hw_info_vtd { * (IOMMU_HW_INFO_TYPE_ARM_SMMUV3) * * @flags: Must be set to 0 - * @__reserved: Must be 0 + * @impl: Must be 0 * @idr: Implemented features for ARM SMMU Non-secure programming interface * @iidr: Information about the implementation and implementer of ARM SMMU, * and architecture version supported @@ -585,7 +585,7 @@ struct iommu_hw_info_vtd { */ struct iommu_hw_info_arm_smmuv3 { __u32 flags; - __u32 __reserved; + __u32 impl; __u32 idr[6]; __u32 iidr; __u32 aidr; diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c index b316d1df043f..8ea9a022e444 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c @@ -10,6 +10,7 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type) { struct arm_smmu_master *master = dev_iommu_priv_get(dev); + struct arm_smmu_device *smmu = master->smmu; struct iommu_hw_info_arm_smmuv3 *info; u32 __iomem *base_idr; unsigned int i; @@ -18,15 +19,25 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type) if (!info) return ERR_PTR(-ENOMEM);
- base_idr = master->smmu->base + ARM_SMMU_IDR0; + base_idr = smmu->base + ARM_SMMU_IDR0; for (i = 0; i <= 5; i++) info->idr[i] = readl_relaxed(base_idr + i); - info->iidr = readl_relaxed(master->smmu->base + ARM_SMMU_IIDR); - info->aidr = readl_relaxed(master->smmu->base + ARM_SMMU_AIDR); + info->iidr = readl_relaxed(smmu->base + ARM_SMMU_IIDR); + info->aidr = readl_relaxed(smmu->base + ARM_SMMU_AIDR);
*length = sizeof(*info); *type = IOMMU_HW_INFO_TYPE_ARM_SMMUV3;
+ if (smmu->impl_ops && smmu->impl_ops->hw_info) { + u32 flags, impl; + + flags = smmu->impl_ops->hw_info(smmu, &impl); + if (flags) { + info->impl = impl; + info->flags |= flags; + } + } + return info; }
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
Repurpose the @__reserved field in the struct iommu_hw_info_arm_smmuv3, to an HW implementation-defined field @impl.
This will be used by Tegra241 CMDQV implementation on top of a standard ARM SMMUv3 IOMMU. The @impl will be only valid if @flags is set with an implementation-defined flag.
Thus in the driver-level, add an hw_info impl op that will return such a flag and fill the impl field.
Reviewed-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Thu, May 08, 2025 at 08:02:39PM -0700, Nicolin Chen wrote:
Repurpose the @__reserved field in the struct iommu_hw_info_arm_smmuv3, to an HW implementation-defined field @impl.
It would be nicer to have a tegra/cmdq specific struct and a way for iommu_hw_info to select it. 'impl' isn't going to scale very well if something else wants to use this.
We have out_data_type but we could have an input sub_data_type too. 0 means what we have today, then a simple enum to select another info struct.
@@ -726,6 +726,7 @@ struct arm_smmu_impl_ops { struct arm_smmu_domain *smmu_domain, struct iommufd_ctx *ictx, unsigned int viommu_type, const struct iommu_user_data *user_data);
- u32 (*hw_info)(struct arm_smmu_device *smmu, u32 *impl);
};
Then the dispatch here is just having a sub-struct enum # in arm_smmu_impl_ops for dispatching.
Jason
On Thu, May 15, 2025 at 02:17:06PM -0300, Jason Gunthorpe wrote:
On Thu, May 08, 2025 at 08:02:39PM -0700, Nicolin Chen wrote:
Repurpose the @__reserved field in the struct iommu_hw_info_arm_smmuv3, to an HW implementation-defined field @impl.
It would be nicer to have a tegra/cmdq specific struct and a way for iommu_hw_info to select it. 'impl' isn't going to scale very well if something else wants to use this.
We have out_data_type but we could have an input sub_data_type too. 0 means what we have today, then a simple enum to select another info struct.
So, there will be two hw_info calls back to back right?
Should the first call return out_data_type=CMDQV while returning the arm_smmu_v3 hw_info data? Otherwise, VMM wouldn't know what to set in the input sub_data_type of the 2nd ioctl?
Thanks Nicolin
On Thu, May 15, 2025 at 11:52:05AM -0700, Nicolin Chen wrote:
On Thu, May 15, 2025 at 02:17:06PM -0300, Jason Gunthorpe wrote:
On Thu, May 08, 2025 at 08:02:39PM -0700, Nicolin Chen wrote:
Repurpose the @__reserved field in the struct iommu_hw_info_arm_smmuv3, to an HW implementation-defined field @impl.
It would be nicer to have a tegra/cmdq specific struct and a way for iommu_hw_info to select it. 'impl' isn't going to scale very well if something else wants to use this.
We have out_data_type but we could have an input sub_data_type too. 0 means what we have today, then a simple enum to select another info struct.
So, there will be two hw_info calls back to back right?
yes
Should the first call return out_data_type=CMDQV while returning the arm_smmu_v3 hw_info data? Otherwise, VMM wouldn't know what to set in the input sub_data_type of the 2nd ioctl?
No, either set a flag in the smmu_v3 hw_info, as you were doing here, or just have the vmm probe it. Given the VMM is likely to be told to run in vCMDQ mode on the command line try-and-fail doesn't sound so bad.
And I guess we don't need a "sub type" just a "requested type" where 0 means return the best one and non-zero means return a specific one or fail with EOPNOTSUPP.
Jason
On Thu, May 15, 2025 at 03:56:29PM -0300, Jason Gunthorpe wrote:
On Thu, May 15, 2025 at 11:52:05AM -0700, Nicolin Chen wrote:
On Thu, May 15, 2025 at 02:17:06PM -0300, Jason Gunthorpe wrote:
On Thu, May 08, 2025 at 08:02:39PM -0700, Nicolin Chen wrote:
Should the first call return out_data_type=CMDQV while returning the arm_smmu_v3 hw_info data? Otherwise, VMM wouldn't know what to set in the input sub_data_type of the 2nd ioctl?
No, either set a flag in the smmu_v3 hw_info, as you were doing here, or just have the vmm probe it. Given the VMM is likely to be told to run in vCMDQ mode on the command line try-and-fail doesn't sound so bad.
And I guess we don't need a "sub type" just a "requested type" where 0 means return the best one and non-zero means return a specific one or fail with EOPNOTSUPP.
OK. I think this would work:
hw_info (req_type=0)     => out_data_type=SMMU_V3, flags=HAS_CMDQV
hw_info (req_type=CMDQV) => out_data_type=CMDQV, flags=0
Or, would it be simpler by having a sub_data_uptr:
hw_info => out_data_type=SMMU_V3, sub_data_type=CMDQV,
           data_uptr=iommu_hw_info_arm_smmuv3,
           sub_data_uptr=iommu_hw_info_tegra241_cmdqv ?
Thanks Nicolin
On Thu, May 15, 2025 at 12:21:17PM -0700, Nicolin Chen wrote:
On Thu, May 15, 2025 at 03:56:29PM -0300, Jason Gunthorpe wrote:
On Thu, May 15, 2025 at 11:52:05AM -0700, Nicolin Chen wrote:
On Thu, May 15, 2025 at 02:17:06PM -0300, Jason Gunthorpe wrote:
On Thu, May 08, 2025 at 08:02:39PM -0700, Nicolin Chen wrote:
Should the first call return out_data_type=CMDQV while returning the arm_smmu_v3 hw_info data? Otherwise, VMM wouldn't know what to set in the input sub_data_type of the 2nd ioctl?
No, either set a flag in the smmu_v3 hw_info, as you were doing here, or just have the vmm probe it. Given the VMM is likely to be told to run in vCMDQ mode on the command line try-and-fail doesn't sound so bad.
And I guess we don't need a "sub type" just a "requested type" where 0 means return the best one and non-zero means return a specific one or fail with EOPNOTSUPP.
OK. I think this would work: hw_info (req_type=0) => out_data_type=SMMU_V3, flags=HAS_CMDQV hw_info (req_type=CMDQV) => out_data_type=CMDQV, flags=0
Yeah
Or, would it be simpler by having a sub_data_uptr: hw_info => out_data_type=SMMU_V3, sub_data_type=CMDQV, data_uptr=iommu_hw_info_arm_smmuv3, sub_data_uptr=iommu_hw_info_tegra241_cmdqv ?
I think the former is simpler to code, you can just add the req_type to the signatures and if the driver comes back with a type != req_type the core code will return EOPNOTSUPP
Then just match the impl_ops for req_type in the arm driver.
Finally we end up with only one ioctl enum number space for the types, which seems appealing.
Jason
On Thu, May 15, 2025 at 04:23:29PM -0300, Jason Gunthorpe wrote:
On Thu, May 15, 2025 at 12:21:17PM -0700, Nicolin Chen wrote:
On Thu, May 15, 2025 at 03:56:29PM -0300, Jason Gunthorpe wrote:
On Thu, May 15, 2025 at 11:52:05AM -0700, Nicolin Chen wrote:
On Thu, May 15, 2025 at 02:17:06PM -0300, Jason Gunthorpe wrote:
On Thu, May 08, 2025 at 08:02:39PM -0700, Nicolin Chen wrote:
Should the first call return out_data_type=CMDQV while returning the arm_smmu_v3 hw_info data? Otherwise, VMM wouldn't know what to set in the input sub_data_type of the 2nd ioctl?
No, either set a flag in the smmu_v3 hw_info, as you were doing here, or just have the vmm probe it. Given the VMM is likely to be told to run in vCMDQ mode on the command line try-and-fail doesn't sound so bad.
And I guess we don't need a "sub type" just a "requested type" where 0 means return the best one and non-zero means return a specific one or fail with EOPNOTSUPP.
OK. I think this would work: hw_info (req_type=0) => out_data_type=SMMU_V3, flags=HAS_CMDQV hw_info (req_type=CMDQV) => out_data_type=CMDQV, flags=0
Yeah
Or, would it be simpler by having a sub_data_uptr: hw_info => out_data_type=SMMU_V3, sub_data_type=CMDQV, data_uptr=iommu_hw_info_arm_smmuv3, sub_data_uptr=iommu_hw_info_tegra241_cmdqv ?
I think the former is simpler to code, you can just add the req_type to the signatures and if the driver comes back with a type != req_type the core code will return EOPNOTSUPP
OK.
Maybe just turn the out_data_type to be bidirectional?
Then we would only need to update the docs: /** * enum iommu_hw_info_type - IOMMU Hardware Info Types - * @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware - * info + * @IOMMU_HW_INFO_TYPE_NONE: (for output) used by the drivers that do not report + * hardware info + * @IOMMU_HW_INFO_TYPE_DEFAULT: (for input) Used to request the default type * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type + * @IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV: Subtype of ARM SMMUv3 for Tegra241 CMDQV */ enum iommu_hw_info_type { IOMMU_HW_INFO_TYPE_NONE = 0, + IOMMU_HW_INFO_TYPE_DEFAULT = 0, IOMMU_HW_INFO_TYPE_INTEL_VTD = 1, IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2, + IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV =3, };
- * @out_data_type: Output the iommu hardware info type as defined in the enum - * iommu_hw_info_type. + * @data_type: Bidirectional property. + * Input the requested iommu hardware info type as defined in the + * enum iommu_hw_info_type. Requesting IOMMU_HW_INFO_TYPE_DEFAULT + * lets kernel pick the default type to output, otherwise kernel + * will validate the input type and may reject with -EOPNOTSUPP. + * Output the supported iommu hardware info type as defined in the + * same enum iommu_hw_info_type
And similarly in the iommu API kdoc too.
Finally we end up with only one ioctl enum number space for the types, which seems appealing.
Yea. Avoiding a sub enum is nicer.
Thanks Nicolin
On Thu, May 15, 2025 at 01:17:47PM -0700, Nicolin Chen wrote:
I think the former is simpler to code, you can just add the req_type to the signatures and if the driver comes back with a type != req_type the core code will return EOPNOTSUPP
OK.
Maybe just turn the out_data_type to be bidirectional?
You can do that, but you need to add a flag to turn it on. We didn't validate that it was 0 on input before, so we can't start now.
Jason
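To make the probing flow discussed above concrete, below is a minimal user-space sketch of the two back-to-back hw_info calls. The req_type semantics and the type values only mirror the proposal in this thread, and get_hw_info() is a hypothetical stand-in for the IOMMU_GET_HW_INFO ioctl, not the final uAPI.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed type values, mirroring the enum discussed above */
enum hw_info_type {
	HW_INFO_TYPE_DEFAULT		= 0,
	HW_INFO_TYPE_ARM_SMMUV3		= 2,
	HW_INFO_TYPE_TEGRA241_CMDQV	= 3,
};

/* Hypothetical stand-in for ioctl(IOMMU_GET_HW_INFO) with a req_type input */
static int get_hw_info(uint32_t req_type, uint32_t *out_type)
{
	/* Pretend the device is an SMMUv3 with CMDQV support */
	if (req_type == HW_INFO_TYPE_DEFAULT)
		*out_type = HW_INFO_TYPE_ARM_SMMUV3;
	else if (req_type == HW_INFO_TYPE_TEGRA241_CMDQV)
		*out_type = HW_INFO_TYPE_TEGRA241_CMDQV;
	else
		return -EOPNOTSUPP;
	return 0;
}

int main(void)
{
	uint32_t out_type;

	/* 1st call: req_type=0 lets the kernel return the default info type */
	if (get_hw_info(HW_INFO_TYPE_DEFAULT, &out_type) ||
	    out_type != HW_INFO_TYPE_ARM_SMMUV3)
		return 1;

	/* 2nd call: request CMDQV explicitly; -EOPNOTSUPP means not present */
	if (get_hw_info(HW_INFO_TYPE_TEGRA241_CMDQV, &out_type))
		printf("no CMDQV, fall back to plain vSMMU\n");
	else
		printf("CMDQV present, can expose vCMDQ to the guest\n");
	return 0;
}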
A vEVENT can be reported only from a threaded IRQ context. Change to using request_threaded_irq to support that.
Acked-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c index dd7d030d2e89..ba029f7d24ce 100644 --- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c +++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c @@ -824,8 +824,9 @@ __tegra241_cmdqv_probe(struct arm_smmu_device *smmu, struct resource *res, cmdqv->dev = smmu->impl_dev;
if (cmdqv->irq > 0) { - ret = request_irq(irq, tegra241_cmdqv_isr, 0, "tegra241-cmdqv", - cmdqv); + ret = request_threaded_irq(irq, NULL, tegra241_cmdqv_isr, + IRQF_ONESHOT, "tegra241-cmdqv", + cmdqv); if (ret) { dev_err(cmdqv->dev, "failed to request irq (%d): %d\n", cmdqv->irq, ret);
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
A vEVENT can be reported only from a threaded IRQ context. Change to using request_threaded_irq to support that.
Acked-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Thu, May 08, 2025 at 08:02:40PM -0700, Nicolin Chen wrote:
A vEVENT can be reported only from a threaded IRQ context. Change to using request_threaded_irq to support that.
Acked-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Jason
The current flow of tegra241_cmdqv_remove_vintf() is:
1. For each LVCMDQ, tegra241_vintf_remove_lvcmdq():
   a. Disable the LVCMDQ HW
   b. Release the LVCMDQ SW resource
2. For current VINTF, tegra241_vintf_hw_deinit():
   c. Disable all LVCMDQ HWs
   d. Disable VINTF HW
Obviously, steps 1.a and 2.c are redundant.
Since tegra241_vintf_hw_deinit() disables all of its LVCMDQ HWs, the flow in tegra241_cmdqv_remove_vintf() can be simplified by calling that first:
1. For current VINTF, tegra241_vintf_hw_deinit():
   a. Disable all LVCMDQ HWs
   b. Disable VINTF HW
2. Release all LVCMDQ SW resources

Drop tegra241_vintf_remove_lvcmdq(), and make tegra241_vintf_free_lvcmdq() the new step 2.
Acked-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 13 +++---------- 1 file changed, 3 insertions(+), 10 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c index ba029f7d24ce..8d418c131b1b 100644 --- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c +++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c @@ -628,24 +628,17 @@ static int tegra241_cmdqv_init_vintf(struct tegra241_cmdqv *cmdqv, u16 max_idx,
/* Remove Helpers */
-static void tegra241_vintf_remove_lvcmdq(struct tegra241_vintf *vintf, u16 lidx) -{ - tegra241_vcmdq_hw_deinit(vintf->lvcmdqs[lidx]); - tegra241_vintf_free_lvcmdq(vintf, lidx); -} - static void tegra241_cmdqv_remove_vintf(struct tegra241_cmdqv *cmdqv, u16 idx) { struct tegra241_vintf *vintf = cmdqv->vintfs[idx]; u16 lidx;
+ tegra241_vintf_hw_deinit(vintf); + /* Remove LVCMDQ resources */ for (lidx = 0; lidx < vintf->cmdqv->num_lvcmdqs_per_vintf; lidx++) if (vintf->lvcmdqs[lidx]) - tegra241_vintf_remove_lvcmdq(vintf, lidx); - - /* Remove VINTF resources */ - tegra241_vintf_hw_deinit(vintf); + tegra241_vintf_free_lvcmdq(vintf, lidx);
dev_dbg(cmdqv->dev, "VINTF%u: deallocated\n", vintf->idx); tegra241_cmdqv_deinit_vintf(cmdqv, idx);
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
The current flow of tegra241_cmdqv_remove_vintf() is:
- For each LVCMDQ, tegra241_vintf_remove_lvcmdq(): a. Disable the LVCMDQ HW b. Release the LVCMDQ SW resource
- For current VINTF, tegra241_vintf_hw_deinit(): c. Disable all LVCMDQ HWs d. Disable VINTF HW
Obviously, the step 1.a and the step 2.c are redundant.
Since tegra241_vintf_hw_deinit() disables all of its LVCMDQ HWs, it could simplify the flow in tegra241_cmdqv_remove_vintf() by calling that first:
- For current VINTF, tegra241_vintf_hw_deinit(): a. Disable all LVCMDQ HWs b. Disable VINTF HW
- Release all LVCMDQ SW resources
Drop tegra241_vintf_remove_lvcmdq(), and move tegra241_vintf_free_lvcmdq() as the new step 2.
Acked-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Thu, May 08, 2025 at 08:02:41PM -0700, Nicolin Chen wrote:
The current flow of tegra241_cmdqv_remove_vintf() is:
- For each LVCMDQ, tegra241_vintf_remove_lvcmdq(): a. Disable the LVCMDQ HW b. Release the LVCMDQ SW resource
- For current VINTF, tegra241_vintf_hw_deinit(): c. Disable all LVCMDQ HWs d. Disable VINTF HW
Obviously, the step 1.a and the step 2.c are redundant.
Since tegra241_vintf_hw_deinit() disables all of its LVCMDQ HWs, it could simplify the flow in tegra241_cmdqv_remove_vintf() by calling that first:
- For current VINTF, tegra241_vintf_hw_deinit(): a. Disable all LVCMDQ HWs b. Disable VINTF HW
- Release all LVCMDQ SW resources
Drop tegra241_vintf_remove_lvcmdq(), and move tegra241_vintf_free_lvcmdq() as the new step 2.
Acked-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 13 +++---------- 1 file changed, 3 insertions(+), 10 deletions(-)
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Jason
To simplify the mappings from global VCMDQs to VINTFs' LVCMDQs, the design chose to do static allocations and mappings in the global reset function.
However, with the user-owned VINTF support, it exposes a security concern: if a user space VM only wants one LVCMDQ for a VINTF, statically mapping two or more LVCMDQs creates a hidden VCMDQ that user space could abuse for a DoS attack, by writing random data to it and overwhelming the kernel with unhandleable IRQs.
Thus, to support the user-owned VINTF feature, an LVCMDQ mapping has to be done dynamically.

HW allows pre-assigning global VCMDQs in the CMDQ_ALLOC registers, without finalizing the mappings by keeping CMDQV_CMDQ_ALLOCATED=0. So, add a pair of map/unmap helpers that simply set/clear that bit.
Delay the LVCMDQ mappings to tegra241_vintf_hw_init(), and the unmappings to tegra241_vintf_hw_deinit().
However, the dynamic LVCMDQ mapping/unmapping can complicate the timing of calling tegra241_vcmdq_hw_init/deinit(), which write the LVCMDQ address space, i.e. they require the LVCMDQ to be mapped. Highlight that with a note at the top of each of them.
Acked-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 37 +++++++++++++++++-- 1 file changed, 33 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c index 8d418c131b1b..869c90b660c1 100644 --- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c +++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c @@ -351,6 +351,7 @@ tegra241_cmdqv_get_cmdq(struct arm_smmu_device *smmu,
/* HW Reset Functions */
+/* This function is for LVCMDQ, so @vcmdq must not be unmapped yet */ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq) { char header[64], *h = lvcmdq_error_header(vcmdq, header, 64); @@ -379,6 +380,7 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq) dev_dbg(vcmdq->cmdqv->dev, "%sdeinited\n", h); }
+/* This function is for LVCMDQ, so @vcmdq must be mapped prior */ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq) { char header[64], *h = lvcmdq_error_header(vcmdq, header, 64); @@ -404,16 +406,42 @@ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq) return 0; }
+/* Unmap a global VCMDQ from the pre-assigned LVCMDQ */ +static void tegra241_vcmdq_unmap_lvcmdq(struct tegra241_vcmdq *vcmdq) +{ + u32 regval = readl(REG_CMDQV(vcmdq->cmdqv, CMDQ_ALLOC(vcmdq->idx))); + char header[64], *h = lvcmdq_error_header(vcmdq, header, 64); + + writel(regval & ~CMDQV_CMDQ_ALLOCATED, + REG_CMDQV(vcmdq->cmdqv, CMDQ_ALLOC(vcmdq->idx))); + dev_dbg(vcmdq->cmdqv->dev, "%sunmapped\n", h); +} + static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf) { - u16 lidx; + u16 lidx = vintf->cmdqv->num_lvcmdqs_per_vintf;
- for (lidx = 0; lidx < vintf->cmdqv->num_lvcmdqs_per_vintf; lidx++) - if (vintf->lvcmdqs && vintf->lvcmdqs[lidx]) + /* HW requires to unmap LVCMDQs in descending order */ + while (lidx--) { + if (vintf->lvcmdqs && vintf->lvcmdqs[lidx]) { tegra241_vcmdq_hw_deinit(vintf->lvcmdqs[lidx]); + tegra241_vcmdq_unmap_lvcmdq(vintf->lvcmdqs[lidx]); + } + } vintf_write_config(vintf, 0); }
+/* Map a global VCMDQ to the pre-assigned LVCMDQ */ +static void tegra241_vcmdq_map_lvcmdq(struct tegra241_vcmdq *vcmdq) +{ + u32 regval = readl(REG_CMDQV(vcmdq->cmdqv, CMDQ_ALLOC(vcmdq->idx))); + char header[64], *h = lvcmdq_error_header(vcmdq, header, 64); + + writel(regval | CMDQV_CMDQ_ALLOCATED, + REG_CMDQV(vcmdq->cmdqv, CMDQ_ALLOC(vcmdq->idx))); + dev_dbg(vcmdq->cmdqv->dev, "%smapped\n", h); +} + static int tegra241_vintf_hw_init(struct tegra241_vintf *vintf, bool hyp_own) { u32 regval; @@ -441,8 +469,10 @@ static int tegra241_vintf_hw_init(struct tegra241_vintf *vintf, bool hyp_own) */ vintf->hyp_own = !!(VINTF_HYP_OWN & readl(REG_VINTF(vintf, CONFIG)));
+ /* HW requires to map LVCMDQs in ascending order */ for (lidx = 0; lidx < vintf->cmdqv->num_lvcmdqs_per_vintf; lidx++) { if (vintf->lvcmdqs && vintf->lvcmdqs[lidx]) { + tegra241_vcmdq_map_lvcmdq(vintf->lvcmdqs[lidx]); ret = tegra241_vcmdq_hw_init(vintf->lvcmdqs[lidx]); if (ret) { tegra241_vintf_hw_deinit(vintf); @@ -476,7 +506,6 @@ static int tegra241_cmdqv_hw_reset(struct arm_smmu_device *smmu) for (lidx = 0; lidx < cmdqv->num_lvcmdqs_per_vintf; lidx++) { regval = FIELD_PREP(CMDQV_CMDQ_ALLOC_VINTF, idx); regval |= FIELD_PREP(CMDQV_CMDQ_ALLOC_LVCMDQ, lidx); - regval |= CMDQV_CMDQ_ALLOCATED; writel_relaxed(regval, REG_CMDQV(cmdqv, CMDQ_ALLOC(qidx++))); }
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
To simplify the mappings from global VCMDQs to VINTFs' LVCMDQs, the design chose to do static allocations and mappings in the global reset function.
However, with the user-owned VINTF support, it exposes a security concern: if user space VM only wants one LVCMDQ for a VINTF, statically mapping two or more LVCMDQs creates a hidden VCMDQ that user space could DoS attack by writing random stuff to overwhelm the kernel with unhandleable IRQs.
Thus, to support the user-owned VINTF feature, a LVCMDQ mapping has to be done dynamically.
HW allows pre-assigning global VCMDQs in the CMDQ_ALLOC registers, without finalizing the mappings by keeping CMDQV_CMDQ_ALLOCATED=0. So, add a pair of map/unmap helper that simply sets/clears that bit.
Delay the LVCMDQ mappings to tegra241_vintf_hw_init(), and the unmappings to tegra241_vintf_hw_deinit().
I don't know the specifics of tegra241-cmdqv, but the current description is a bit misleading. For the native case, tegra241_vintf_hw_init() is called from reset, so the mapping is still enabled in that path. For the user-owned case, tegra241_vcmdq_map_lvcmdq() is instead called from tegra241_vintf_alloc_lvcmdq_user(), not tegra241_vintf_hw_init().
so nothing is actually delayed in this patch.
On Thu, May 15, 2025 at 08:20:44AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
To simplify the mappings from global VCMDQs to VINTFs' LVCMDQs, the design chose to do static allocations and mappings in the global reset function.
However, with the user-owned VINTF support, it exposes a security concern: if user space VM only wants one LVCMDQ for a VINTF, statically mapping two or more LVCMDQs creates a hidden VCMDQ that user space could DoS attack by writing random stuff to overwhelm the kernel with unhandleable IRQs.
Thus, to support the user-owned VINTF feature, a LVCMDQ mapping has to be done dynamically.
HW allows pre-assigning global VCMDQs in the CMDQ_ALLOC registers, without finalizing the mappings by keeping CMDQV_CMDQ_ALLOCATED=0. So, add a pair of map/unmap helper that simply sets/clears that bit.
Delay the LVCMDQ mappings to tegra241_vintf_hw_init(), and the unmappings to tegra241_vintf_hw_deinit().
I don't know the specifics of tegra241-cmdqv. But the current description is a bit misleading. for native tegra241_vintf_hw_init() is called from reset so the mapping is still enabled in that path. for user-owned then tegra241_vcmdq_map_lvcmdq() is called from tegra241_vintf_alloc_lvcmdq_user() instead of tegra241_vintf_hw_init().
so nothing is actually delayed in this patch.
It is delayed from tegra241_cmdqv_hw_reset() to later functions.
The in-kernel VINTF0 is not fundamentally delayed, as you mentioned, because it needs to be allocated/mapped in probe().

The user-space VINTFs were previously allocated/mapped in probe(); now that is moved to the user space handler, i.e. there is no mapping until user space asks for one.

Perhaps I can expand this "delay" statement for clarification.
Thanks Nicolin
The CMDQV HW supports user-space use for virtualization cases. It allows the VM to issue guest-level TLBI or ATC_INV commands directly to the queue and executes them without a VMEXIT, as HW will replace the VMID field in a TLBI command and the SID field in an ATC_INV command with the preset VMID and SID.
This is built upon the vIOMMU infrastructure by allowing VMM to allocate a VINTF (as a vIOMMU object) and assign VCMDQs (HW QUEUE objs) to the VINTF.
So firstly, replace the standard vSMMU model with the VINTF implementation but reuse the standard cache_invalidate op (for unsupported commands) and the standard alloc_domain_nested op (for standard nested STE).
Each VINTF has two 64KB MMIO pages (128B per logical VCMDQ): - Page0 (directly accessed by guest) has all the control and status bits. - Page1 (trapped by VMM) has guest-owned queue memory location/size info.
VMM should trap the emulated VINTF0's page1 of the guest VM for the guest-level VCMDQ location/size info and forward that to the kernel to translate to a physical memory location to program the VCMDQ HW during an allocation call. Then, it should mmap the assigned VINTF's page0 to the VINTF0 page0 of the guest VM. This allows the guest OS to read and write the guest-owned VINTF's page0 for direct control of the VCMDQ HW.
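For illustration, a minimal VMM-side sketch (not QEMU code) of the page0 mapping step: it assumes the iommufd file descriptor is in iommufd_fd and that out_vintf_mmap_offset/out_vintf_mmap_length came back from the vIOMMU allocation ioctl.

#include <stdint.h>
#include <sys/mman.h>

/*
 * Sketch only: mmap the assigned VINTF's page0 through the iommufd FD,
 * using the offset/length reported by the allocation call, so the VMM
 * can back the guest's emulated VINTF0 page0 with it (while page1 stays
 * trapped and emulated).
 */
static void *map_vintf_page0(int iommufd_fd, uint64_t out_vintf_mmap_offset,
			     uint64_t out_vintf_mmap_length)
{
	void *page0 = mmap(NULL, out_vintf_mmap_length,
			   PROT_READ | PROT_WRITE, MAP_SHARED,
			   iommufd_fd, out_vintf_mmap_offset);

	return page0 == MAP_FAILED ? NULL : page0;
}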
For ATC invalidation commands that hold an SID, it requires all devices to register their virtual SIDs to the SID_MATCH registers and their physical SIDs to the pairing SID_REPLACE registers, so that HW can use those as a lookup table to replace those virtual SIDs with the correct physical SIDs. Thus, implement the driver-allocated vDEVICE op with a tegra241_vintf_sid structure to allocate SID_REPLACE and to program the SIDs accordingly.
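As a purely illustrative model (not driver or HW code) of the SID_MATCH/SID_REPLACE lookup described above: each registered pair maps one virtual SID to one physical SID, and a command carrying a matching virtual SID gets its SID rewritten.

#include <stdbool.h>
#include <stdint.h>

/* Toy model of the per-VINTF SID replacement table */
struct sid_pair {
	uint32_t vsid;		/* value programmed into SID_MATCH */
	uint32_t psid;		/* value programmed into SID_REPLACE */
	bool valid;
};

/* Return the physical SID for @vsid, or 0 if no pair matches in this model */
static uint32_t sid_replace(const struct sid_pair *tbl, int n, uint32_t vsid)
{
	for (int i = 0; i < n; i++)
		if (tbl[i].valid && tbl[i].vsid == vsid)
			return tbl[i].psid;
	return 0;
}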
This enables the HW accelerated feature for NVIDIA Grace CPU. Compared to the standard SMMUv3 operating in the nested translation mode trapping CMDQ for TLBI and ATC_INV commands, this gives a huge performance improvement: 70% to 90% reductions of invalidation time were measured by various DMA unmap tests running in a guest OS.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 15 + include/uapi/linux/iommufd.h | 51 ++- .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 6 +- .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 413 +++++++++++++++++- 4 files changed, 476 insertions(+), 9 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index bab7a9ce1283..d3f18a286447 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -1000,6 +1000,14 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, struct arm_smmu_cmdq *cmdq, u64 *cmds, int n, bool sync);
+static inline phys_addr_t +arm_smmu_domain_ipa_to_pa(struct arm_smmu_domain *smmu_domain, u64 ipa) +{ + if (WARN_ON_ONCE(smmu_domain->stage != ARM_SMMU_DOMAIN_S2)) + return 0; + return iommu_iova_to_phys(&smmu_domain->domain, ipa); +} + #ifdef CONFIG_ARM_SMMU_V3_SVA bool arm_smmu_sva_supported(struct arm_smmu_device *smmu); bool arm_smmu_master_sva_supported(struct arm_smmu_master *master); @@ -1076,9 +1084,16 @@ int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state, void arm_smmu_attach_commit_vmaster(struct arm_smmu_attach_state *state); void arm_smmu_master_clear_vmaster(struct arm_smmu_master *master); int arm_vmaster_report_event(struct arm_smmu_vmaster *vmaster, u64 *evt); +struct iommu_domain * +arm_vsmmu_alloc_domain_nested(struct iommufd_viommu *viommu, u32 flags, + const struct iommu_user_data *user_data); +int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu, + struct iommu_user_data_array *array); #else #define arm_smmu_hw_info NULL #define arm_vsmmu_alloc NULL +#define arm_vsmmu_alloc_domain_nested NULL +#define arm_vsmmu_cache_invalidate NULL
static inline int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state, diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index 28ab42a85268..a8a97f43ecc4 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -549,12 +549,26 @@ struct iommu_hw_info_vtd { __aligned_u64 ecap_reg; };
+/** + * enum iommu_hw_info_arm_smmuv3_flags - Flags for ARM SMMUv3 hw_info + * @IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV: Tegra241 implementation with + * CMDQV support; @impl is valid + */ +enum iommu_hw_info_arm_smmuv3_flags { + IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV = 1 << 0, +}; + /** * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information * (IOMMU_HW_INFO_TYPE_ARM_SMMUV3) * - * @flags: Must be set to 0 - * @impl: Must be 0 + * @flags: Combination of enum iommu_hw_info_arm_smmuv3_flags + * @impl: Implementation-defined bits when the following flags are set: + * - IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV + * Bits[15:12] - Log2 of the total number of SID replacements + * Bits[11:08] - Log2 of the total number of VINTFs per vIOMMU + * Bits[07:04] - Log2 of the total number of VCMDQs per vIOMMU + * Bits[03:00] - Version number for the CMDQ-V HW * @idr: Implemented features for ARM SMMU Non-secure programming interface * @iidr: Information about the implementation and implementer of ARM SMMU, * and architecture version supported @@ -952,10 +966,28 @@ struct iommu_fault_alloc { * enum iommu_viommu_type - Virtual IOMMU Type * @IOMMU_VIOMMU_TYPE_DEFAULT: Reserved for future use * @IOMMU_VIOMMU_TYPE_ARM_SMMUV3: ARM SMMUv3 driver specific type + * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension for SMMUv3 */ enum iommu_viommu_type { IOMMU_VIOMMU_TYPE_DEFAULT = 0, IOMMU_VIOMMU_TYPE_ARM_SMMUV3 = 1, + IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV = 2, +}; + +/** + * struct iommu_viommu_tegra241_cmdqv - NVIDIA Tegra241 CMDQV Virtual Interface + * (IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV) + * @out_vintf_mmap_offset: mmap offset argument for VINTF's page0 + * @out_vintf_mmap_length: mmap length argument for VINTF's page0 + * + * Both @out_vintf_mmap_offset and @out_vintf_mmap_length are reported by kernel + * for user space to mmap the VINTF page0 from the host physical address space + * to the guest physical address space so that a guest kernel can directly R/W + * access to the VINTF page0 in order to control its virtual command queues. + */ +struct iommu_viommu_tegra241_cmdqv { + __aligned_u64 out_vintf_mmap_offset; + __aligned_u64 out_vintf_mmap_length; };
/** @@ -1152,9 +1184,24 @@ struct iommu_veventq_alloc { /** * enum iommu_hw_queue_type - HW Queue Type * @IOMMU_HW_QUEUE_TYPE_DEFAULT: Reserved for future use + * @IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension for + * ARM SMMUv3 */ enum iommu_hw_queue_type { IOMMU_HW_QUEUE_TYPE_DEFAULT = 0, + /* + * TEGRA241_CMDQV requirements (otherwise, allocation will fail) + * - alloc starts from the lowest @index=0 in ascending order + * - destroy starts from the last allocated @index in descending order + * - @base_addr must be aligned to @length in bytes and mapped in IOAS + * - @length must be a power of 2, with a minimum 32 bytes and a maximum + * 2 ^ idr[1].CMDQS * 16 bytes (use GET_HW_INFO call to read idr[1] + * from struct iommu_hw_info_arm_smmuv3) + * - suggest to back the queue memory with contiguous physical pages or + * a single huge page with alignment of the queue size, limit vSMMU's + * IDR1.CMDQS to the huge page size divided by 16 bytes + */ + IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV = 1, };
/** diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c index 8ea9a022e444..e88adf00ec40 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c @@ -227,7 +227,7 @@ static int arm_smmu_validate_vste(struct iommu_hwpt_arm_smmuv3 *arg, return 0; }
-static struct iommu_domain * +struct iommu_domain * arm_vsmmu_alloc_domain_nested(struct iommufd_viommu *viommu, u32 flags, const struct iommu_user_data *user_data) { @@ -338,8 +338,8 @@ static int arm_vsmmu_convert_user_cmd(struct arm_vsmmu *vsmmu, return 0; }
-static int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu, - struct iommu_user_data_array *array) +int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu, + struct iommu_user_data_array *array) { struct arm_vsmmu *vsmmu = container_of(viommu, struct arm_vsmmu, core); struct arm_smmu_device *smmu = vsmmu->smmu; diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c index 869c90b660c1..4f3ecc265d6f 100644 --- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c +++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c @@ -8,7 +8,9 @@ #include <linux/dma-mapping.h> #include <linux/interrupt.h> #include <linux/iommu.h> +#include <linux/iommufd.h> #include <linux/iopoll.h> +#include <uapi/linux/iommufd.h>
#include <acpi/acpixf.h>
@@ -26,8 +28,10 @@ #define CMDQV_EN BIT(0)
#define TEGRA241_CMDQV_PARAM 0x0004 +#define CMDQV_NUM_SID_PER_VM_LOG2 GENMASK(15, 12) #define CMDQV_NUM_VINTF_LOG2 GENMASK(11, 8) #define CMDQV_NUM_VCMDQ_LOG2 GENMASK(7, 4) +#define CMDQV_VER GENMASK(3, 0)
#define TEGRA241_CMDQV_STATUS 0x0008 #define CMDQV_ENABLED BIT(0) @@ -53,6 +57,9 @@ #define VINTF_STATUS GENMASK(3, 1) #define VINTF_ENABLED BIT(0)
+#define TEGRA241_VINTF_SID_MATCH(s) (0x0040 + 0x4*(s)) +#define TEGRA241_VINTF_SID_REPLACE(s) (0x0080 + 0x4*(s)) + #define TEGRA241_VINTF_LVCMDQ_ERR_MAP_64(m) \ (0x00C0 + 0x8*(m)) #define LVCMDQ_ERR_MAP_NUM_64 2 @@ -114,16 +121,20 @@ MODULE_PARM_DESC(bypass_vcmdq,
/** * struct tegra241_vcmdq - Virtual Command Queue + * @core: Embedded iommufd_hw_queue structure * @idx: Global index in the CMDQV * @lidx: Local index in the VINTF * @enabled: Enable status * @cmdqv: Parent CMDQV pointer * @vintf: Parent VINTF pointer + * @prev: Previous LVCMDQ to depend on * @cmdq: Command Queue struct * @page0: MMIO Page0 base address * @page1: MMIO Page1 base address */ struct tegra241_vcmdq { + struct iommufd_hw_queue core; + u16 idx; u16 lidx;
@@ -131,22 +142,29 @@ struct tegra241_vcmdq {
struct tegra241_cmdqv *cmdqv; struct tegra241_vintf *vintf; + struct tegra241_vcmdq *prev; struct arm_smmu_cmdq cmdq;
void __iomem *page0; void __iomem *page1; }; +#define hw_queue_to_vcmdq(v) container_of(v, struct tegra241_vcmdq, core)
/** * struct tegra241_vintf - Virtual Interface + * @vsmmu: Embedded arm_vsmmu structure * @idx: Global index in the CMDQV * @enabled: Enable status * @hyp_own: Owned by hypervisor (in-kernel) * @cmdqv: Parent CMDQV pointer * @lvcmdqs: List of logical VCMDQ pointers * @base: MMIO base address + * @mmap_offset: Offset argument for mmap() syscall + * @sids: Stream ID replacement resources */ struct tegra241_vintf { + struct arm_vsmmu vsmmu; + u16 idx;
bool enabled; @@ -156,6 +174,24 @@ struct tegra241_vintf { struct tegra241_vcmdq **lvcmdqs;
void __iomem *base; + unsigned long mmap_offset; + + struct ida sids; +}; +#define viommu_to_vintf(v) container_of(v, struct tegra241_vintf, vsmmu.core) + +/** + * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement + * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID + * @vintf: Parent VINTF pointer + * @sid: Physical Stream ID + * @idx: Replacement index in the VINTF + */ +struct tegra241_vintf_sid { + struct iommufd_vdevice core; + struct tegra241_vintf *vintf; + u32 sid; + u8 idx; };
/** @@ -163,10 +199,12 @@ struct tegra241_vintf { * @smmu: SMMUv3 device * @dev: CMDQV device * @base: MMIO base address + * @base_phys: MMIO physical base address, for mmap * @irq: IRQ number * @num_vintfs: Total number of VINTFs * @num_vcmdqs: Total number of VCMDQs * @num_lvcmdqs_per_vintf: Number of logical VCMDQs per VINTF + * @num_sids_per_vintf: Total number of SID replacements per VINTF * @vintf_ids: VINTF id allocator * @vintfs: List of VINTFs */ @@ -175,12 +213,14 @@ struct tegra241_cmdqv { struct device *dev;
void __iomem *base; + phys_addr_t base_phys; int irq;
/* CMDQV Hardware Params */ u16 num_vintfs; u16 num_vcmdqs; u16 num_lvcmdqs_per_vintf; + u16 num_sids_per_vintf;
struct ida vintf_ids;
@@ -351,6 +391,29 @@ tegra241_cmdqv_get_cmdq(struct arm_smmu_device *smmu,
/* HW Reset Functions */
+/* + * When a guest-owned VCMDQ is disabled, if the guest did not enqueue a CMD_SYNC + * following an ATC_INV command at the end of the guest queue while this ATC_INV + * is timed out, the TIMEOUT will not be reported until this VCMDQ gets assigned + * to the next VM, which will be a false alarm potentially causing some unwanted + * behavior in the new VM. Thus, a guest-owned VCMDQ must flush the TIMEOUT when + * it gets disabled. This can be done by just issuing a CMD_SYNC to SMMU CMDQ. + */ +static void tegra241_vcmdq_hw_flush_timeout(struct tegra241_vcmdq *vcmdq) +{ + struct arm_smmu_device *smmu = &vcmdq->cmdqv->smmu; + u64 cmd_sync[CMDQ_ENT_DWORDS] = {}; + + cmd_sync[0] = FIELD_PREP(CMDQ_0_OP, CMDQ_OP_CMD_SYNC) | + FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_NONE); + + /* + * It does not hurt to insert another CMD_SYNC, taking advantage of the + * arm_smmu_cmdq_issue_cmdlist() that waits for the CMD_SYNC completion. + */ + arm_smmu_cmdq_issue_cmdlist(smmu, &smmu->cmdq, cmd_sync, 1, true); +} + /* This function is for LVCMDQ, so @vcmdq must not be unmapped yet */ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq) { @@ -364,6 +427,8 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq) readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERROR)), readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, CONS))); } + tegra241_vcmdq_hw_flush_timeout(vcmdq); + writel_relaxed(0, REG_VCMDQ_PAGE0(vcmdq, PROD)); writel_relaxed(0, REG_VCMDQ_PAGE0(vcmdq, CONS)); writeq_relaxed(0, REG_VCMDQ_PAGE1(vcmdq, BASE)); @@ -380,6 +445,12 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq) dev_dbg(vcmdq->cmdqv->dev, "%sdeinited\n", h); }
+/* This function is for LVCMDQ, so @vcmdq must be mapped prior */ +static void _tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq) +{ + writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE)); +} + /* This function is for LVCMDQ, so @vcmdq must be mapped prior */ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq) { @@ -390,7 +461,7 @@ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq) tegra241_vcmdq_hw_deinit(vcmdq);
/* Configure and enable VCMDQ */ - writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE)); + _tegra241_vcmdq_hw_init(vcmdq);
ret = vcmdq_write_config(vcmdq, VCMDQ_EN); if (ret) { @@ -420,6 +491,7 @@ static void tegra241_vcmdq_unmap_lvcmdq(struct tegra241_vcmdq *vcmdq) static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf) { u16 lidx = vintf->cmdqv->num_lvcmdqs_per_vintf; + int sidx;
/* HW requires to unmap LVCMDQs in descending order */ while (lidx--) { @@ -429,6 +501,10 @@ static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf) } } vintf_write_config(vintf, 0); + for (sidx = 0; sidx < vintf->cmdqv->num_sids_per_vintf; sidx++) { + writel(0, REG_VINTF(vintf, SID_MATCH(sidx))); + writel(0, REG_VINTF(vintf, SID_REPLACE(sidx))); + } }
/* Map a global VCMDQ to the pre-assigned LVCMDQ */ @@ -457,7 +533,8 @@ static int tegra241_vintf_hw_init(struct tegra241_vintf *vintf, bool hyp_own) * whether enabling it here or not, as !HYP_OWN cmdq HWs only support a * restricted set of supported commands. */ - regval = FIELD_PREP(VINTF_HYP_OWN, hyp_own); + regval = FIELD_PREP(VINTF_HYP_OWN, hyp_own) | + FIELD_PREP(VINTF_VMID, vintf->vsmmu.vmid); writel(regval, REG_VINTF(vintf, CONFIG));
ret = vintf_write_config(vintf, regval | VINTF_EN); @@ -584,7 +661,9 @@ static void tegra241_vintf_free_lvcmdq(struct tegra241_vintf *vintf, u16 lidx)
dev_dbg(vintf->cmdqv->dev, "%sdeallocated\n", lvcmdq_error_header(vcmdq, header, 64)); - kfree(vcmdq); + /* Guest-owned VCMDQ is free-ed with hw_queue by iommufd core */ + if (vcmdq->vintf->hyp_own) + kfree(vcmdq); }
static struct tegra241_vcmdq * @@ -671,7 +750,11 @@ static void tegra241_cmdqv_remove_vintf(struct tegra241_cmdqv *cmdqv, u16 idx)
dev_dbg(cmdqv->dev, "VINTF%u: deallocated\n", vintf->idx); tegra241_cmdqv_deinit_vintf(cmdqv, idx); - kfree(vintf); + if (!vintf->hyp_own) + ida_destroy(&vintf->sids); + /* Guest-owned VINTF is free-ed with viommu by iommufd core */ + if (vintf->hyp_own) + kfree(vintf); }
static void tegra241_cmdqv_remove(struct arm_smmu_device *smmu) @@ -699,10 +782,34 @@ static void tegra241_cmdqv_remove(struct arm_smmu_device *smmu) put_device(cmdqv->dev); /* smmu->impl_dev */ }
+static struct arm_vsmmu * +tegra241_cmdqv_alloc_vintf_user(struct arm_smmu_device *smmu, + struct arm_smmu_domain *smmu_domain, + struct iommufd_ctx *ictx, unsigned int type, + const struct iommu_user_data *user_data); + +static u32 tegra241_cmdqv_hw_info(struct arm_smmu_device *smmu, u32 *impl) +{ + struct tegra241_cmdqv *cmdqv = + container_of(smmu, struct tegra241_cmdqv, smmu); + u32 regval = readl_relaxed(REG_CMDQV(cmdqv, PARAM)); + + *impl = FIELD_GET(CMDQV_VER, regval); + *impl |= FIELD_PREP(CMDQV_NUM_VCMDQ_LOG2, + ilog2(cmdqv->num_lvcmdqs_per_vintf)); + /* One VINTF per VM is enough */ + *impl |= FIELD_PREP(CMDQV_NUM_VINTF_LOG2, ilog2(1)); + *impl |= FIELD_PREP(CMDQV_NUM_SID_PER_VM_LOG2, + ilog2(cmdqv->num_sids_per_vintf)); + return IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV; +} + static struct arm_smmu_impl_ops tegra241_cmdqv_impl_ops = { .get_secondary_cmdq = tegra241_cmdqv_get_cmdq, .device_reset = tegra241_cmdqv_hw_reset, .device_remove = tegra241_cmdqv_remove, + .vsmmu_alloc = tegra241_cmdqv_alloc_vintf_user, + .hw_info = tegra241_cmdqv_hw_info, };
/* Probe Functions */ @@ -844,6 +951,7 @@ __tegra241_cmdqv_probe(struct arm_smmu_device *smmu, struct resource *res, cmdqv->irq = irq; cmdqv->base = base; cmdqv->dev = smmu->impl_dev; + cmdqv->base_phys = res->start;
if (cmdqv->irq > 0) { ret = request_threaded_irq(irq, NULL, tegra241_cmdqv_isr, @@ -860,6 +968,8 @@ __tegra241_cmdqv_probe(struct arm_smmu_device *smmu, struct resource *res, cmdqv->num_vintfs = 1 << FIELD_GET(CMDQV_NUM_VINTF_LOG2, regval); cmdqv->num_vcmdqs = 1 << FIELD_GET(CMDQV_NUM_VCMDQ_LOG2, regval); cmdqv->num_lvcmdqs_per_vintf = cmdqv->num_vcmdqs / cmdqv->num_vintfs; + cmdqv->num_sids_per_vintf = + 1 << FIELD_GET(CMDQV_NUM_SID_PER_VM_LOG2, regval);
cmdqv->vintfs = kcalloc(cmdqv->num_vintfs, sizeof(*cmdqv->vintfs), GFP_KERNEL); @@ -913,3 +1023,298 @@ struct arm_smmu_device *tegra241_cmdqv_probe(struct arm_smmu_device *smmu) put_device(smmu->impl_dev); return ERR_PTR(-ENODEV); } + +/* User space VINTF and VCMDQ Functions */ + +static int tegra241_vcmdq_hw_init_user(struct tegra241_vcmdq *vcmdq) +{ + char header[64]; + + /* Configure the vcmdq only; User space does the enabling */ + _tegra241_vcmdq_hw_init(vcmdq); + + dev_dbg(vcmdq->cmdqv->dev, "%sinited at host PA 0x%llx size 0x%lx\n", + lvcmdq_error_header(vcmdq, header, 64), + vcmdq->cmdq.q.q_base & VCMDQ_ADDR, + 1UL << (vcmdq->cmdq.q.q_base & VCMDQ_LOG2SIZE)); + return 0; +} + +static struct iommufd_hw_queue * +tegra241_vintf_alloc_lvcmdq_user(struct iommufd_viommu *viommu, + unsigned int type, u32 lidx, u64 base_addr, + size_t length) +{ + struct tegra241_vintf *vintf = viommu_to_vintf(viommu); + struct tegra241_cmdqv *cmdqv = vintf->cmdqv; + struct arm_smmu_device *smmu = &cmdqv->smmu; + struct tegra241_vcmdq *vcmdq, *prev = NULL; + u32 log2size, max_n_shift; + phys_addr_t q_base; + char header[64]; + int ret; + + if (type != IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV) + return ERR_PTR(-EOPNOTSUPP); + if (lidx >= cmdqv->num_lvcmdqs_per_vintf) + return ERR_PTR(-EINVAL); + if (vintf->lvcmdqs[lidx]) + return ERR_PTR(-EEXIST); + + /* + * HW requires to map LVCMDQs in ascending order, so reject if the + * previous lvcmdqs is not allocated yet. + */ + if (lidx) { + prev = vintf->lvcmdqs[lidx - 1]; + if (!prev) + return ERR_PTR(-EIO); + } + + /* + * @length must be a power of 2, in range of + * [ 32, 2 ^ (idr[1].CMDQS + CMDQ_ENT_SZ_SHIFT) ] + */ + max_n_shift = FIELD_GET(IDR1_CMDQS, + readl_relaxed(smmu->base + ARM_SMMU_IDR1)); + if (!is_power_of_2(length) || length < 32 || + length > (1 << (max_n_shift + CMDQ_ENT_SZ_SHIFT))) + return ERR_PTR(-EINVAL); + log2size = ilog2(length) - CMDQ_ENT_SZ_SHIFT; + + /* @base_addr must be aligned to @length */ + if (base_addr & ~VCMDQ_ADDR || base_addr & (length - 1)) + return ERR_PTR(-EINVAL); + + /* @base_addr must be mapped in the s2_parent domain */ + q_base = arm_smmu_domain_ipa_to_pa(vintf->vsmmu.s2_parent, base_addr); + if (!q_base) + return ERR_PTR(-ENXIO); + + vcmdq = iommufd_hw_queue_alloc(viommu, struct tegra241_vcmdq, core); + if (!vcmdq) + return ERR_PTR(-ENOMEM); + + /* + * HW requires to unmap LVCMDQs in descending order, so destroy() must + * follow this rule. Set a dependency on its previous LVCMDQ so iommufd + * core will help enforce it. 
+ */ + if (prev) { + ret = iommufd_hw_queue_depend(vcmdq, prev, core); + if (ret) + goto free_vcmdq; + } + vcmdq->prev = prev; + + ret = tegra241_vintf_init_lvcmdq(vintf, lidx, vcmdq); + if (ret) + goto undepend_vcmdq; + + dev_dbg(cmdqv->dev, "%sallocated\n", + lvcmdq_error_header(vcmdq, header, 64)); + + tegra241_vcmdq_map_lvcmdq(vcmdq); + + vcmdq->cmdq.q.q_base = q_base & VCMDQ_ADDR; + vcmdq->cmdq.q.q_base |= log2size; + + ret = tegra241_vcmdq_hw_init_user(vcmdq); + if (ret) + goto unmap_lvcmdq; + vintf->lvcmdqs[lidx] = vcmdq; + + return &vcmdq->core; + +unmap_lvcmdq: + tegra241_vcmdq_unmap_lvcmdq(vcmdq); + tegra241_vintf_deinit_lvcmdq(vintf, lidx); +undepend_vcmdq: + if (vcmdq->prev) + iommufd_hw_queue_undepend(vcmdq, vcmdq->prev, core); +free_vcmdq: + iommufd_struct_destroy(vcmdq, core); + return ERR_PTR(ret); +} + +static void +tegra241_vintf_destroy_lvcmdq_user(struct iommufd_hw_queue *hw_queue) +{ + struct tegra241_vcmdq *vcmdq = hw_queue_to_vcmdq(hw_queue); + + tegra241_vcmdq_hw_deinit(vcmdq); + tegra241_vcmdq_unmap_lvcmdq(vcmdq); + tegra241_vintf_free_lvcmdq(vcmdq->vintf, vcmdq->lidx); + if (vcmdq->prev) + iommufd_hw_queue_undepend(vcmdq, vcmdq->prev, core); + + /* IOMMUFD core frees the memory of vcmdq and hw_queue */ +} + +static void tegra241_cmdqv_destroy_vintf_user(struct iommufd_viommu *viommu) +{ + struct tegra241_vintf *vintf = viommu_to_vintf(viommu); + + if (vintf->mmap_offset) + iommufd_viommu_destroy_mmap(vintf, vsmmu.core, + vintf->mmap_offset); + tegra241_cmdqv_remove_vintf(vintf->cmdqv, vintf->idx); + + /* IOMMUFD core frees the memory of vintf and viommu */ +} + +static struct iommufd_vdevice * +tegra241_vintf_alloc_vdevice(struct iommufd_viommu *viommu, struct device *dev, + u64 virt_sid) +{ + struct arm_smmu_master *master = dev_iommu_priv_get(dev); + struct tegra241_vintf *vintf = viommu_to_vintf(viommu); + struct arm_smmu_stream *stream = &master->streams[0]; + struct tegra241_vintf_sid *vsid; + int sidx; + + if (virt_sid > UINT_MAX) + return ERR_PTR(-EINVAL); + + vsid = iommufd_vdevice_alloc(viommu, struct tegra241_vintf_sid, core); + if (!vsid) + return ERR_PTR(-ENOMEM); + + WARN_ON_ONCE(master->num_streams != 1); + + /* Find an empty pair of SID_REPLACE and SID_MATCH */ + sidx = ida_alloc_max(&vintf->sids, vintf->cmdqv->num_sids_per_vintf - 1, + GFP_KERNEL); + if (sidx < 0) { + iommufd_struct_destroy(vsid, core); + return ERR_PTR(sidx); + } + + writel(stream->id, REG_VINTF(vintf, SID_REPLACE(sidx))); + writel(virt_sid << 1 | 0x1, REG_VINTF(vintf, SID_MATCH(sidx))); + dev_dbg(vintf->cmdqv->dev, + "VINTF%u: allocated SID_REPLACE%d for pSID=%x, vSID=%x\n", + vintf->idx, sidx, stream->id, (u32)virt_sid); + + vsid->idx = sidx; + vsid->vintf = vintf; + vsid->sid = stream->id; + + return &vsid->core; +} + +static void tegra241_vintf_destroy_vdevice(struct iommufd_vdevice *vdev) +{ + struct tegra241_vintf_sid *vsid = + container_of(vdev, struct tegra241_vintf_sid, core); + struct tegra241_vintf *vintf = vsid->vintf; + + writel(0, REG_VINTF(vintf, SID_MATCH(vsid->idx))); + writel(0, REG_VINTF(vintf, SID_REPLACE(vsid->idx))); + ida_free(&vintf->sids, vsid->idx); + dev_dbg(vintf->cmdqv->dev, + "VINTF%u: deallocated SID_REPLACE%d for pSID=%x\n", vintf->idx, + vsid->idx, vsid->sid); + + /* IOMMUFD core frees the memory of vsid and vdev */ +} + +static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = { + .destroy = tegra241_cmdqv_destroy_vintf_user, + .alloc_domain_nested = arm_vsmmu_alloc_domain_nested, + .cache_invalidate = arm_vsmmu_cache_invalidate, + 
.vdevice_alloc = tegra241_vintf_alloc_vdevice, + .vdevice_destroy = tegra241_vintf_destroy_vdevice, + .hw_queue_alloc = tegra241_vintf_alloc_lvcmdq_user, + .hw_queue_destroy = tegra241_vintf_destroy_lvcmdq_user, +}; + +static struct arm_vsmmu * +tegra241_cmdqv_alloc_vintf_user(struct arm_smmu_device *smmu, + struct arm_smmu_domain *s2_parent, + struct iommufd_ctx *ictx, unsigned int type, + const struct iommu_user_data *user_data) +{ + struct tegra241_cmdqv *cmdqv = + container_of(smmu, struct tegra241_cmdqv, smmu); + struct iommu_viommu_tegra241_cmdqv data; + struct tegra241_vintf *vintf; + phys_addr_t page0_base; + int ret; + + if (type != IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV) + return ERR_PTR(-EOPNOTSUPP); + if (!user_data) + return ERR_PTR(-EINVAL); + + ret = iommu_copy_struct_from_user(&data, user_data, + IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV, + out_vintf_mmap_length); + if (ret) + return ERR_PTR(ret); + + vintf = iommufd_viommu_alloc(ictx, struct tegra241_vintf, vsmmu.core, + &tegra241_cmdqv_viommu_ops); + if (!vintf) + return ERR_PTR(-ENOMEM); + + ret = tegra241_cmdqv_init_vintf(cmdqv, cmdqv->num_vintfs - 1, vintf); + if (ret < 0) { + dev_err(cmdqv->dev, "no more available vintf\n"); + goto free_vintf; + } + + vintf->vsmmu.smmu = smmu; + vintf->vsmmu.s2_parent = s2_parent; + /* FIXME Move VMID allocation from the S2 domain allocation to here */ + vintf->vsmmu.vmid = s2_parent->s2_cfg.vmid; + + /* + * Initialize the user-owned VINTF without a LVCMDQ, because it has to + * wait for the allocation of a user-owned LVCMDQ, for security reason. + * It is different than the kernel-owned VINTF0, which had pre-assigned + * and pre-allocated global VCMDQs that would be mapped to the LVCMDQs + * by the tegra241_vintf_hw_init() call. + */ + ret = tegra241_vintf_hw_init(vintf, false); + if (ret) + goto deinit_vintf; + + vintf->lvcmdqs = kcalloc(cmdqv->num_lvcmdqs_per_vintf, + sizeof(*vintf->lvcmdqs), GFP_KERNEL); + if (!vintf->lvcmdqs) { + ret = -ENOMEM; + goto hw_deinit_vintf; + } + + page0_base = cmdqv->base_phys + TEGRA241_VINTFi_PAGE0(vintf->idx); + ret = iommufd_viommu_alloc_mmap(vintf, vsmmu.core, page0_base, SZ_64K, + &vintf->mmap_offset); + if (ret) + goto hw_deinit_vintf; + + data.out_vintf_mmap_length = SZ_64K; + data.out_vintf_mmap_offset = vintf->mmap_offset; + ret = iommu_copy_struct_to_user(user_data, &data, + IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV, + out_vintf_mmap_length); + if (ret) + goto free_mmap; + + ida_init(&vintf->sids); + + dev_dbg(cmdqv->dev, "VINTF%u: allocated with vmid (%d)\n", vintf->idx, + vintf->vsmmu.vmid); + + return &vintf->vsmmu; + +free_mmap: + iommufd_viommu_destroy_mmap(vintf, vsmmu.core, vintf->mmap_offset); +hw_deinit_vintf: + tegra241_vintf_hw_deinit(vintf); +deinit_vintf: + tegra241_cmdqv_deinit_vintf(cmdqv, vintf->idx); +free_vintf: + iommufd_struct_destroy(vintf, vsmmu.core); + return ERR_PTR(ret); +}
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
/**
- struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware
information
(IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
- @flags: Must be set to 0
- @impl: Must be 0
- @flags: Combination of enum iommu_hw_info_arm_smmuv3_flags
- @impl: Implementation-defined bits when the following flags are set:
- IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV
Bits[15:12] - Log2 of the total number of SID replacements
Bits[11:08] - Log2 of the total number of VINTFs per vIOMMU
Bits[07:04] - Log2 of the total number of VCMDQs per vIOMMU
Bits[03:00] - Version number for the CMDQ-V HW
hmm throughout this series I drew an equation between VINTF and vIOMMU. Not sure how multiple VINTFs can be represented w/o introducing more objects. Do we want to keep such info here?
* - suggest to back the queue memory with contiguous physical
pages or
* a single huge page with alignment of the queue size, limit
vSMMU's
* IDR1.CMDQS to the huge page size divided by 16 bytes
*/
- IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV = 1,
Not sure about the last sentence. 'limit' refers to a certain action which the user should perform?
- ret = tegra241_vintf_init_lvcmdq(vintf, lidx, vcmdq);
- if (ret)
goto undepend_vcmdq;
- dev_dbg(cmdqv->dev, "%sallocated\n",
lvcmdq_error_header(vcmdq, header, 64));
- tegra241_vcmdq_map_lvcmdq(vcmdq);
- vcmdq->cmdq.q.q_base = q_base & VCMDQ_ADDR;
- vcmdq->cmdq.q.q_base |= log2size;
- ret = tegra241_vcmdq_hw_init_user(vcmdq);
- if (ret)
goto unmap_lvcmdq;
- vintf->lvcmdqs[lidx] = vcmdq;
this is already done in tegra241_vintf_init_lvcmdq().
On Thu, May 15, 2025 at 08:27:17AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
/**
- struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware
information
(IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
- @flags: Must be set to 0
- @impl: Must be 0
- @flags: Combination of enum iommu_hw_info_arm_smmuv3_flags
- @impl: Implementation-defined bits when the following flags are set:
- IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV
Bits[15:12] - Log2 of the total number of SID replacements
Bits[11:08] - Log2 of the total number of VINTFs per vIOMMU
Bits[07:04] - Log2 of the total number of VCMDQs per vIOMMU
Bits[03:00] - Version number for the CMDQ-V HW
hmm throughout this series I drew an equation between VINTF and vIOMMU. Not sure how multiple VINTFs can be represented w/o introducing more objects. Do we want to keep such info here?
You are right that VINTF=vIOMMU. This is a per-SMMU-instance ioctl. So, each VM should only have one VINTF/vIOMMU per SMMU instance.

For the multi-VINTF (multi-vIOMMU) case, there need to be more SMMUs backing the passthrough devices assigned to the VM.

What exactly is the concern with keeping this info here?
* - suggest to back the queue memory with contiguous physical
pages or
* a single huge page with alignment of the queue size, limit
vSMMU's
* IDR1.CMDQS to the huge page size divided by 16 bytes
*/
- IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV = 1,
Not sure about the last sentence. 'limit' refers to a certain action which the user should perform?
Yes, set the vSMMU's IDR1.CMDQS field to at most log2(huge page size / 16 bytes), e.g. if using one 2MB huge page backing the queue memory, VMM should set IDR1.CMDQS no higher than 17: 2MB = (1 << 17) * 16B

Certainly, it can be set lower than 17. So it's an upper "limit".

Or is there a better word in your mind that would be less confusing?
- ret = tegra241_vintf_init_lvcmdq(vintf, lidx, vcmdq);
- if (ret)
goto undepend_vcmdq;
- dev_dbg(cmdqv->dev, "%sallocated\n",
lvcmdq_error_header(vcmdq, header, 64));
- tegra241_vcmdq_map_lvcmdq(vcmdq);
- vcmdq->cmdq.q.q_base = q_base & VCMDQ_ADDR;
- vcmdq->cmdq.q.q_base |= log2size;
- ret = tegra241_vcmdq_hw_init_user(vcmdq);
- if (ret)
goto unmap_lvcmdq;
- vintf->lvcmdqs[lidx] = vcmdq;
this is already done in tegra241_vintf_init_lvcmdq().
Oh, will drop that.
Thanks Nicolin
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 16, 2025 1:14 AM
On Thu, May 15, 2025 at 08:27:17AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
/**
- struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware
information
(IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
- @flags: Must be set to 0
- @impl: Must be 0
- @flags: Combination of enum iommu_hw_info_arm_smmuv3_flags
- @impl: Implementation-defined bits when the following flags are set:
- IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV
Bits[15:12] - Log2 of the total number of SID replacements
Bits[11:08] - Log2 of the total number of VINTFs per vIOMMU
Bits[07:04] - Log2 of the total number of VCMDQs per vIOMMU
Bits[03:00] - Version number for the CMDQ-V HW
hmm throughout this series I drew an equation between VINTF and vIOMMU. Not sure how multiple VINTFs can be represented w/o introducing more objects. Do we want to keep such info here?
You are right that VINTF=vIOMMU. This is a per SMMU instance ioctl. So, each VM should only have one VTINF/vIOMMU per SMMU instance.
For multi-VINTF (multi-vIOMMU) case, there needs to be more SMMUs backing passthrough devices being assigned to the VM.
What exactly the concern of keeping this info here?
First, you agreed that VINTF=vIOMMU, then "total number of VINTFs per vIOMMU" doesn't make sense as it's fixed to 1 in concept.
Then, each VM can only get one VINTF/vIOMMU per SMMU instance, and this ioctl is per SMMU instance. This also implies that only one VINTF can be reported in the ioctl.
In the multi-VINTF case, the VM should get one VINTF per ioctl from each SMMU backing the passthrough devices.
Then what is the point of " Bits[11:08] - Log2 of the total number of VINTFs per vIOMMU "?
* - suggest to back the queue memory with contiguous physical
pages or
* a single huge page with alignment of the queue size, limit
vSMMU's
* IDR1.CMDQS to the huge page size divided by 16 bytes
*/
- IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV = 1,
Not sure about the last sentence. 'limit' refers to a certain action which the user should perform?
Yes, set vSMMU's IDR1.CMDQS field up to the huge page size divided by 16 bytes, e.g. if using one 2MB huge page backing the queue memory, VMM should set IDR1.CMDQS no higher than 17: 2MB = (1 << 17) * 16B
Certainly, it can set to lower than 17. So it's an upper "limit".
Or any better word in your mind that can be less confusing?
No. I misread 'vSMMU' as 'SMMU'.
On Fri, May 16, 2025 at 04:00:58AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 16, 2025 1:14 AM
On Thu, May 15, 2025 at 08:27:17AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, May 9, 2025 11:03 AM
/**
- struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware
information
(IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
- @flags: Must be set to 0
- @impl: Must be 0
- @flags: Combination of enum iommu_hw_info_arm_smmuv3_flags
- @impl: Implementation-defined bits when the following flags are set:
- IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV
Bits[15:12] - Log2 of the total number of SID replacements
Bits[11:08] - Log2 of the total number of VINTFs per vIOMMU
Bits[07:04] - Log2 of the total number of VCMDQs per vIOMMU
Bits[03:00] - Version number for the CMDQ-V HW
hmm throughout this series I drew an equation between VINTF and vIOMMU. Not sure how multiple VINTFs can be represented w/o introducing more objects. Do we want to keep such info here?
You are right that VINTF=vIOMMU. This is a per SMMU instance ioctl. So, each VM should only have one VTINF/vIOMMU per SMMU instance.
For multi-VINTF (multi-vIOMMU) case, there needs to be more SMMUs backing passthrough devices being assigned to the VM.
What exactly the concern of keeping this info here?
First, you agreed that VINTF=vIOMMU, then "total number of VINTFs per vIOMMU" doesn't make sense as it's fixed to 1 in concept.
Then, each VM can only get one VINTF/vIOMMU per SMMU instance, and this ioctl is per SMMU instance. This also implies that only one VINTF can be reported in the ioctl.
In multi-VINTF case, the VM should get 1VINTF per ioctl from each SMMU backing passthrough devices.
Then what is the point of " Bits[11:08] - Log2 of the total number of VINTFs per vIOMMU "?
It was not there in the previous version. Pranjal pointed out that it was missing a field.
You are right that this field must be set to 0, indicating that only a single VINTF is allowed.

Given Jason's comments about this impl rework (to a data structure), I think I will just drop this number of VINTFs. Instead, I will add a comment saying that VMM should set this field to 0, i.e. only provide the VM one VINTF per SMMU.
Thanks Nicolin
Add a new vEVENTQ type for VINTFs that are assigned to user space. Simply report the two 64-bit LVCMDQ_ERR_MAP register values.
Reviewed-by: Alok Tiwari alok.a.tiwari@oracle.com Reviewed-by: Pranjal Shrivastava praan@google.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- include/uapi/linux/iommufd.h | 15 +++++++++++++ .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 22 +++++++++++++++++++ 2 files changed, 37 insertions(+)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index a8a97f43ecc4..b566f12d99f3 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -1114,10 +1114,12 @@ struct iommufd_vevent_header { * enum iommu_veventq_type - Virtual Event Queue Type * @IOMMU_VEVENTQ_TYPE_DEFAULT: Reserved for future use * @IOMMU_VEVENTQ_TYPE_ARM_SMMUV3: ARM SMMUv3 Virtual Event Queue + * @IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension IRQ */ enum iommu_veventq_type { IOMMU_VEVENTQ_TYPE_DEFAULT = 0, IOMMU_VEVENTQ_TYPE_ARM_SMMUV3 = 1, + IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV = 2, };
/** @@ -1141,6 +1143,19 @@ struct iommu_vevent_arm_smmuv3 { __aligned_le64 evt[4]; };
+/** + * struct iommu_vevent_tegra241_cmdqv - Tegra241 CMDQV IRQ + * (IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV) + * @lvcmdq_err_map: 128-bit logical vcmdq error map, little-endian. + * (Refer to register LVCMDQ_ERR_MAPs per VINTF ) + * + * The 128-bit register value from HW exclusively reflect the error bits for a + * Virtual Interface represented by a vIOMMU object. Read and report directly. + */ +struct iommu_vevent_tegra241_cmdqv { + __aligned_le64 lvcmdq_err_map[2]; +}; + /** * struct iommu_veventq_alloc - ioctl(IOMMU_VEVENTQ_ALLOC) * @size: sizeof(struct iommu_veventq_alloc) diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c index 4f3ecc265d6f..ba0aed82fece 100644 --- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c +++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c @@ -292,6 +292,20 @@ static inline int vcmdq_write_config(struct tegra241_vcmdq *vcmdq, u32 regval)
/* ISR Functions */
+static void tegra241_vintf_user_handle_error(struct tegra241_vintf *vintf) +{ + struct iommufd_viommu *viommu = &vintf->vsmmu.core; + struct iommu_vevent_tegra241_cmdqv vevent_data; + int i; + + for (i = 0; i < LVCMDQ_ERR_MAP_NUM_64; i++) + vevent_data.lvcmdq_err_map[i] = + readq_relaxed(REG_VINTF(vintf, LVCMDQ_ERR_MAP_64(i))); + + iommufd_viommu_report_event(viommu, IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV, + &vevent_data, sizeof(vevent_data)); +} + static void tegra241_vintf0_handle_error(struct tegra241_vintf *vintf) { int i; @@ -337,6 +351,14 @@ static irqreturn_t tegra241_cmdqv_isr(int irq, void *devid) vintf_map &= ~BIT_ULL(0); }
+ /* Handle other user VINTFs and their LVCMDQs */ + while (vintf_map) { + unsigned long idx = __ffs64(vintf_map); + + tegra241_vintf_user_handle_error(cmdqv->vintfs[idx]); + vintf_map &= ~BIT_ULL(idx); + } + return IRQ_HANDLED; }